WO2020080972A1 - Method of speech separation and pauses - Google Patents

Method of speech separation and pauses

Info

Publication number
WO2020080972A1
Authority
WO
WIPO (PCT)
Prior art keywords
sliding window
value
noise
values
components
Prior art date
Application number
PCT/RU2019/000516
Other languages
French (fr)
Inventor
Vladimir Aleksandrovich BELOGUROV
Vladimir Alekseevich ZOLOTAREV
Original Assignee
Joint-Stock Company "Concern "Sozvezdie"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Joint-Stock Company "Concern "Sozvezdie"
Publication of WO2020080972A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04Q SELECTING
    • H04Q1/00 Details of selecting apparatus or arrangements
    • H04Q1/18 Electrical details
    • H04Q1/30 Signalling arrangements; Manipulation of signalling currents
    • H04Q1/44 Signalling arrangements; Manipulation of signalling currents using alternate current
    • H04Q1/444 Signalling arrangements; Manipulation of signalling currents using alternate current with voice-band signalling frequencies
    • H04Q1/46 Signalling arrangements; Manipulation of signalling currents using alternate current with voice-band signalling frequencies comprising means for distinguishing between a signalling current of predetermined frequency and a complex current containing that frequency, e.g. speech current
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Definitions

  • Threshold values are determined at the stage of development by experiment or by the mathematical modeling based on providing a given level of efficiency of the problem solution of speech separation and pauses.
  • the optimal average value by which the value of the right border of the “sliding window” is reduced can be determined at the development stage by experiment or by mathematical modeling, so as to provide a given level of efficiency in separating speech and pauses.
  • N_sp - number of harmonic noise components used for its representation.
  • the signal is presented as a set of harmonic oscillations with random values of amplitudes (U_Si) and phases (φ_Si), which are distributed according to the normal (amplitude) and uniform (phase) laws; the initial phases of the signal components are set so that for any pair of harmonics the phase difference does not exceed a predetermined value.
  • Table 1 presents the results of modeling the process of determining the probability of making a decision on the appearance of a speech signal in its absence in a single displacement of the“sliding window” (Ri ti ).
  • N_th - threshold value of the ratio of the maximum number of harmonic components to the total number of components
  • the probability of making a decision on the appearance of a speech signal in its absence over 200 offset steps of the “sliding window” is calculated by the formula (with a “sliding window” offset step of 5 ms, the total duration of two hundred steps is 1 s)
  • Table 2 uses the same designations as Table 1.
  • the probability of correct decision-making on the appearance of a speech signal in its presence is equal to 1.
  • the initial value by which the value of the right border of the “sliding window” is reduced is set to zero
  • the step change of this value is set to 1 ms.
  • the value of the “sliding window” offset step is 5 ms
  • the value by which the right border of the “sliding window” is reduced, close to the optimal one, is 8 ms
  • the average error of determining the time of appearance of the speech signal is about ±2.5 ms.
  • The block diagram of the device implementing the proposed method is shown in Fig. 2, where the following are indicated:
  • LFA low frequency amplifier
  • ADC analog-to-digital converter
  • the device contains EAD 1, LPF 2, LFA 3, ADC 4 and CD 5 connected in series; the output of CD 5 is the output of the declared device, and the EAD 1 input is the input of the device.
  • the device works as follows.
  • Noise or an additive signal and noise mixture coming from the output of EAD 1 is filtered in LPF 2, the passband of which is matched to the band of the speech signal; the noise or mixture is then amplified in LFA 3 and fed to the ADC 4 input. The samples of noise or of the signal and noise mixture formed in ADC 4 are transferred in digital form to the CD 5 input.
  • In CD 5, the received samples of noise or of the signal and noise mixture are processed according to the algorithm given above.
  • the processing result is a digital decision on the presence or absence of a speech signal, for example:
  • When the decision on the presence of the speech signal is made, the device output also receives the time of occurrence of the speech signal.
  • the method of determining the time of occurrence of the speech signal is given above.
  • EAD 1 can be, for example, microphones or laryngophones.
  • LFA 3 can be implemented, for example, on an OP467GS chip from Analog Devices.
  • ADC 4 can be implemented, for example, on an ADS8422 chip from Texas Instruments.
  • Computer device 5 can be made in the form of a programmable logic integrated circuit (PLIC) and implemented, for example, on the XC2V3000-6FG676I chip from Xilinx.
  • PLIC programmable logic integrated circuit
  • the declared method can be implemented by the described device and makes it possible to separate speech and pauses with high efficiency by comparing with a threshold the calculated ratio of the maximum number of harmonic components of the signal or noise whose phase differences do not exceed the specified value to the total number of components of the signal or noise.
  • Fig. 1 shows an illustrative example explaining the algorithm.
  • Fig. 2 shows a structural diagram of the device that implements the proposed method.

Abstract

A method of separating speech and pauses is described. A "sliding window" is set so that only noise is present in it, and is then moved by the offset step. For each position of the "sliding window", spectral analysis determines the amplitudes, frequencies and phases of the harmonic components of the noise or of the mixture of noise and signal. For every position of the "sliding window" except the first, the following is done: the envelope noise amplitudes are calculated for the current position of the "sliding window" using the results of the spectral analysis performed for its previous position; the calculated noise amplitude values are subtracted from the samples taken for the current position of the "sliding window"; the total number of components is determined; for each harmonic, the number of paired phase differences between that harmonic and the other harmonics that do not exceed the specified value is determined; and the largest of these numbers is found. The ratio of this number to the total number of harmonics is then found. A speech signal is considered present when this ratio exceeds the threshold value.

Description

METHOD OF SPEECH SEPARATION AND PAUSES
Field of Technology
The invention belongs to the field of voice information transmission and can be used in communication and loudspeaker devices.
Previous Technology State
A device for detecting acoustic signals in communication channels is known from patent RU 2171549, H04Q 1/46. The invention relates to telecommunications, in particular to automatic means of receiving tones in multichannel systems, and can be used, for example, to detect acoustic signals (AS) in telephone channels. Its operation is based on calculating a number of decision statistics, which serve as distinctive features for distinguishing informational AS from channel noise and stray speech signals. The decision statistics are the estimate of the signal power in the information frequency band, the distribution of the input signal energy over the frequency band, and the irregularity of the envelope of the input signal after bandpass filtering. To make the final decision on the presence of an AS in the communication channel, secondary processing is used, based on applying the majority rule to a series of primary decisions.
The disadvantage of the known device is its low efficiency in solving the problem of speech separation and pauses.
Also known is a device for detecting tonal signals in communication channels, described in patent RU 2214051, H04B 3/46, H04Q 1/457, H04M 1/50. The invention relates to telecommunications, in particular to automatic means of receiving tones in multichannel systems, and can be used to detect acoustic signals in telephone channels. The known technical solution is not very effective at separating speech and pauses in the presence of acoustic noise.
The closest analogue in terms of technical substance is the method of speech and pause separation described in the book “Digital Processing of Speech Signals”, L.R. Rabiner, R.W. Schafer, translated from English, edited by M.V. Nazarov and Yu.N. Prokhorov, Moscow, Radio and Communication, 1981, pp. 123-126, which is taken as the prototype.
The prototype method is as follows.
The signal received by the device is sampled within the time interval set for its analysis and stored in memory for further processing. The processed signal consists of an interval that contains only noise, about 100 ms long, and an interval that contains an additive mixture of speech signal and noise (hereinafter, the signal and noise mixture).
The main parameters are the number of zero crossings within 10 ms and the average value function calculated using a 10 ms window. The samples of the noise-only interval are used to calculate the mean values and variances of the weighted sum of the absolute sample amplitudes and of the number of zero crossings (the noise statistics).
Taking into account these statistics and the maximum mean value, the thresholds for the average number of zero crossings (ANZC) and for the signal energy are calculated. A fragment of the waveform is found in which the trajectory of the mean value of signal energy (MVSE) exceeds the upper threshold. It is assumed that the beginning and the end of the word lie outside this fragment.
Then, moving backwards along the time axis from the moment when the MVSE first exceeded the upper threshold, the moment when the MVSE first falls below the lower threshold is determined (point N1). This moment is taken as the provisional beginning of the word. The provisional end of the word (point N2) is determined in the same way. The next step is to move to the left of point N1 (to the right of point N2) and compare the number of zero crossings with the threshold calculated from the initial noise-only segment. If the number of zero crossings exceeds the threshold three or more times, the beginning of the word is moved to the place where the number of zero crossings first exceeded the threshold. Otherwise, point N1 is taken as the beginning of the word. A similar procedure is performed for point N2.
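The first stage of the prototype method can be sketched as follows. This is a minimal illustration only, not the procedure from the cited book: the function names, the non-overlapping framing and the use of the mean absolute amplitude as the energy measure are assumptions, and the zero-crossing refinement of point N1 is omitted.

```python
def frame_features(samples, frame_len):
    """Return (mean_abs, zero_crossings) for each non-overlapping frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_abs = sum(abs(x) for x in frame) / frame_len
        # Count sign changes between consecutive samples of the frame.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        features.append((mean_abs, crossings))
    return features


def provisional_start(features, upper, lower):
    """Frame index of point N1, or None if no frame exceeds `upper`.

    Walks back from the first frame whose mean magnitude exceeds `upper`
    to the nearest earlier frame whose mean magnitude is below `lower`.
    """
    above = next((i for i, (m, _) in enumerate(features) if m > upper), None)
    if above is None:
        return None
    n1 = above
    while n1 > 0 and features[n1][0] >= lower:
        n1 -= 1
    return n1
```

With a 10 ms frame, `frame_len` would be the sampling rate divided by 100; the thresholds `upper` and `lower` would come from the noise statistics described above.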
The disadvantage of the prototype method is its low accuracy in determining the moment of appearance of the speech signal and the high probability of erroneous decisions about the appearance of the signal in the presence of acoustic noise.
Disclosure of the Invention
The objective of the proposed method is to improve the accuracy of determining the moment of appearance of a speech signal and increase the probability value of the correct decision about the appearance of a speech signal in the presence of acoustic noise.
To solve this problem, in the method of separating speech and pauses, in which noise or a mixture of speech signal and noise entering the system is sampled and recorded in memory for further processing over the entire analysis interval, consisting of an interval that does not contain a speech signal and an interval that contains a mixture of speech signal and noise, according to the invention a “sliding window”, an interval of a given duration, is formed so that only noise is present in the “sliding window”;
the frequencies, phases and amplitudes of the harmonic noise components are determined by spectral analysis;
the “sliding window” is shifted by the offset step, the value of which is determined in advance;
the envelope noise values are calculated for the current position of the “sliding window” using the results of the spectral analysis performed for the previous position of the “sliding window”, and the calculated noise amplitude values are subtracted from the sequence of samples taken for the current position of the “sliding window”;
the obtained values are compared with a threshold, the value of which is determined in advance; if none of the values exceeds the threshold, it is considered that the noise has not changed, the “sliding window” is shifted by the offset step and the described procedure is repeated;
otherwise, the frequencies, phases and amplitudes of the harmonic components are determined by spectral analysis of the values obtained by subtracting the calculated noise amplitude values from the samples taken for the current position of the “sliding window”;
the total number of harmonic components is determined and, for each harmonic, the number of paired phase differences between this harmonic and the other harmonics that do not exceed the specified value, and the maximum of the numbers thus found is determined;
the ratio of the found maximum number of harmonics, for which the paired phase differences do not exceed the specified value, to the total number of components is calculated;
the calculated ratio of the maximum number of harmonics to the total number of components is compared with a threshold value, which is determined in advance;
if the calculated ratio of the maximum number of components to their total number does not exceed the threshold value, it is considered that there is no speech signal in the “sliding window”;
in this case, the detection of the speech signal continues according to the described algorithm until, after the next shift of the “sliding window”, the calculated ratio of the maximum number of harmonic components to their total number exceeds the threshold value; it is then considered that a speech signal is present in the “sliding window”, and the time of its appearance is set equal to the position of the right border of the “sliding window” reduced by a predetermined value.
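The final decision and the onset-time rule can be illustrated with a short sketch. It assumes that the ratio of the maximum number of phase-coherent harmonics to the total number of components has already been computed for each window position; the function name `detect_onset`, the millisecond units and the convention that the first window starts at t = 0 are illustrative assumptions, not part of the described method.

```python
def detect_onset(ratios, window_len_ms, step_ms, ratio_threshold, correction_ms):
    """Return the estimated speech onset time in ms, or None.

    ratios[k] is the ratio computed for the k-th position of the sliding
    window (position 0 is assumed to start at t = 0). Speech is declared
    at the first position whose ratio exceeds `ratio_threshold`; the onset
    time is the right border of that window minus `correction_ms`.
    """
    for k, ratio in enumerate(ratios):
        if ratio > ratio_threshold:
            right_border_ms = k * step_ms + window_len_ms
            return right_border_ms - correction_ms
    return None
```

For example, with a hypothetical 20 ms window, a 5 ms offset step and an 8 ms correction, a threshold crossing at the fourth window position gives 3 * 5 + 20 - 8 = 27 ms.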
Implementation Option of the Invention
The proposed method of speech and pause separation by analyzing the phase values of the frequency components of noise and signal is implemented as follows.
Signals coming from the output of the electroacoustic device (EAD) are passed through the lowpass filter (LPF), amplified in the low-frequency amplifier (LFA), sampled by the analog-to-digital converter (ADC) and stored in the computer memory for further processing.
The voice signal is detected and the position of its start is determined as follows.
A “sliding window” is formed: an interval of the specified duration, the initial position of which is set so that only noise is present in the “sliding window”.
The duration of the interval considered to contain only noise and the duration of the “sliding window” are determined at the development stage by experiment or by mathematical modeling, so as to provide a given level of efficiency in separating speech and pauses. This is understood as providing the maximum probability of a correct decision on the appearance of the speech signal in the presence of acoustic noise, provided that the false alarm probability (a decision that a speech signal is present when it is absent) is no higher than the specified level.
The frequencies, phases and amplitudes of the harmonic noise components are determined by spectral analysis, for example, by the method of spectral analysis of multi-frequency periodic signals represented by digital samples, described in the book “Functional Control and Diagnosis of Electrical Systems and Devices by Digital Samples of Instantaneous Values of Current and Voltage”, edited by E.I. Goldstein, Tomsk: “Printing Manufactory”, 2003, pp. 92-94.
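As an illustration of what this step produces, the sketch below estimates harmonic components with a plain discrete Fourier transform. This is a substituted technique, not the method from the cited book, and it is only exact for components that fall on DFT bin frequencies; the function name and the `min_amplitude` cut-off are assumptions.

```python
import cmath
import math


def harmonic_components(samples, sample_rate, min_amplitude):
    """Estimate (frequency_hz, amplitude, phase_rad) triples via a direct DFT.

    Only bins whose amplitude is at least `min_amplitude` are reported.
    The phase is referenced to the first sample of the window and follows
    the model A * cos(2*pi*f*t + phase).
    """
    n = len(samples)
    components = []
    for k in range(1, n // 2):
        # DFT bin k computed directly; O(n^2) overall, fine for short windows.
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        amplitude = 2.0 * abs(acc) / n
        if amplitude >= min_amplitude:
            components.append((k * sample_rate / n, amplitude, cmath.phase(acc)))
    return components
```

The returned triples are the inputs assumed by the later steps: extrapolating the noise to the next window position and comparing the phases of the components.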
The “sliding window” is shifted by the offset step, the value of which is determined in advance.
The offset step is determined at the development stage by experiment or by mathematical modeling, so as to provide a given level of efficiency in separating speech and pauses.
Envelope noise readings are calculated, using the results of the spectral analysis performed for the previous “sliding window” position, for the moments in time at which the samples of the current “sliding window” position were taken.
The calculated readings are subtracted from the sequence of readings taken for the current “sliding window” position.
The obtained values are compared with a threshold, the value of which is determined in advance. If none of the values exceeds the threshold, it is considered that the noise has not changed; the “sliding window” is shifted by the offset step and the described procedure is repeated.
This threshold value is determined at the development stage by experiment or by mathematical modeling, so as to provide a given level of efficiency in separating speech and pauses.
Otherwise, the frequencies, phases and amplitudes of the harmonic components are determined by spectral analysis of the values obtained by subtracting the calculated noise amplitude values from the samples taken for the current “sliding window” position.
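The noise prediction, subtraction and threshold test described above can be sketched as follows. The sketch assumes the noise is modeled as a sum of cosines A * cos(2*pi*f*t + phase) with phases referenced to the first sample of the previous window, and that the threshold is applied to the absolute value of each residual sample; both are assumptions, as are the function names.

```python
import math


def predict_noise(components, t_offset, sample_rate, count):
    """Extrapolate the harmonic noise model to the current window.

    `components` holds (frequency_hz, amplitude, phase_rad) triples found
    for the previous window; `t_offset` is the time in seconds from the
    first sample of the previous window to the first sample of the
    current one (the offset step).
    """
    return [
        sum(a * math.cos(2 * math.pi * f * (t_offset + i / sample_rate) + p)
            for f, a, p in components)
        for i in range(count)
    ]


def noise_changed(window, components, t_offset, sample_rate, threshold):
    """Subtract the predicted noise and test the residual against the threshold.

    Returns (residual, changed); `changed` is True if any residual sample
    exceeds `threshold` in absolute value.
    """
    predicted = predict_noise(components, t_offset, sample_rate, len(window))
    residual = [x - n for x, n in zip(window, predicted)]
    return residual, any(abs(r) > threshold for r in residual)
```

When `changed` is False the window is simply shifted again; when it is True the residual is what the next spectral analysis would be applied to.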
Determine the maximum number of harmonic components for which the phase differences do not exceed the specified value according to the following algorithm:
1. The harmonic components are randomly numbered;
2. For the component with the first number determine the values of phase differences of the component and all other components, find the number of components, for which the difference of phase values does not exceed the specified value - Nci ;
3. The procedure according to item 2 of the algorithm is repeated for all remaining components;
4. From the found values of the number of components (NCi), determine the component with the highest number of components.
5. The process is complete.
An illustrative example of how the algorithm works is shown in Fig. 1.
The ratio of the found maximum number of harmonic components to the total number of harmonic components is calculated.
This ratio is compared with a threshold value, which is determined in advance.
This threshold value, and the value that the phase difference of harmonic components must not exceed, are determined at the development stage, by experiment or by mathematical modeling, so as to provide a given level of efficiency in separating speech and pauses.
If the calculated ratio of the maximum number of harmonic components to the total number of components does not exceed the threshold value, it is considered that there is no speech signal in the “sliding window”.
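The counting algorithm and the ratio test can be sketched as below. Two points are assumptions of this sketch rather than statements of the text: phase differences are taken with wrap-around at 2π, and the reference component is not counted among its own neighbors (counting it as well is the other plausible reading). All function names are hypothetical.

```python
import math


def phase_distance(a, b):
    """Smallest absolute difference between two phases in radians (wrap-around assumed)."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def max_coherent_count(phases, max_phase_diff):
    """For each component, count the other components whose phase lies within
    max_phase_diff of it, and return the largest such count."""
    best = 0
    for i, ref in enumerate(phases):
        count = sum(
            1
            for j, p in enumerate(phases)
            if j != i and phase_distance(ref, p) <= max_phase_diff
        )
        best = max(best, count)
    return best


def speech_present(phases, max_phase_diff, ratio_threshold):
    """Decide that speech is present when the ratio exceeds the threshold."""
    if not phases:
        return False
    ratio = max_coherent_count(phases, max_phase_diff) / len(phases)
    return ratio > ratio_threshold
```

The double loop costs a number of operations proportional to the square of the number of components, which is acceptable for the handful of harmonics found in a short window.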
In this case, the process of detecting the appearance of a speech signal continues according to the described algorithm, namely:
- the “sliding window” is shifted by the offset step, the value of which is determined in advance;
- the values of the frequencies, phases and amplitudes of the harmonic components are determined by spectral analysis;
- the samples of the noise envelope are calculated from the results of the spectral analysis performed for the previous “sliding window” position, for the time instants at which the samples of the current “sliding window” position were taken;
- the calculated samples are subtracted from the sequence of samples taken for the current “sliding window” position;
- the obtained values are compared with the threshold, the value of which is determined in advance; if none of the values exceeds the threshold, it is considered that the noise has not changed, the “sliding window” is shifted by the offset step and the described procedure is repeated;
- otherwise, the frequencies, phases and amplitudes of the harmonic components are determined by spectral analysis of the values obtained by subtracting the calculated noise amplitude values from the samples taken for the current “sliding window” position;
- the maximum number of components for which the phase differences do not exceed the specified value is determined according to the algorithm described above;
- the ratio of the found maximum number of harmonic components to the total number of components determined by spectral analysis is calculated;
- the found ratio is compared with the threshold value, which is determined in advance;
- if the calculated ratio does not exceed the threshold value, it is considered that there is no speech signal in the “sliding window”, and the process of detecting the appearance of the speech signal continues according to the described algorithm until, at some subsequent shift of the “sliding window”, the calculated ratio exceeds the threshold value;
- in that case, it is considered that a speech signal is present in the “sliding window”, and the time of its appearance is set equal to the value of the right border of the “sliding window” reduced by a predetermined value.
Threshold values are determined at the development stage, by experiment or by mathematical modeling, so as to provide a given level of efficiency in separating speech and pauses.
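The outer loop of the procedure, including the correction of the right border, can be sketched as follows. The per-window decision is deliberately left to a caller-supplied callable, `window_has_speech`, which is hypothetical and stands for the residual and phase analysis described above; the function and parameter names are likewise assumptions.

```python
def find_speech_onset(samples, sample_rate, window_ms, step_ms,
                      correction_ms, window_has_speech):
    """Slide a window over the samples and return the estimated onset time
    in milliseconds, or None if no window is classified as speech.

    window_has_speech: hypothetical callable taking the samples of one
    window position and returning True or False.
    """
    window_len = int(round(window_ms * sample_rate / 1000.0))
    step_len = int(round(step_ms * sample_rate / 1000.0))
    if window_len <= 0 or step_len <= 0:
        raise ValueError("window and step must span at least one sample")
    start = 0
    while start + window_len <= len(samples):
        if window_has_speech(samples[start:start + window_len]):
            # Onset estimate: right border of the window minus the correction.
            right_border_ms = 1000.0 * (start + window_len) / sample_rate
            return right_border_ms - correction_ms
        start += step_len
    return None
```

Keeping the detector as a parameter makes it possible to test the window bookkeeping separately from the signal analysis.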
The optimal average value by which the right border of the “sliding window” is reduced cannot be obtained analytically, since at present there are no analytical expressions linking this value to the target function, namely the efficiency of separating speech and pauses.
Therefore, this optimal average value can be determined at the development stage, by experiment or by mathematical modeling, so as to provide a given level of efficiency in separating speech and pauses.
Below are the results of modeling the process of decision-making on the presence of a speech signal using the MATLAB system.
In the modeling, acoustic noise is represented as a set of harmonic oscillations with random amplitudes ($U_{Pi}$) and phases ($\varphi_{Pi}$), which are distributed according to the normal (amplitude) and uniform (phase) laws (see, e.g., “Fundamentals of the Theory of Radio Engineering Systems”, textbook, V.I. Borisov, V.M. Zinchuk, A.E. Limarev, N.P. Mukhin, edited by V.I. Borisov, Voronezh Research Institute of Communications, 2004, p. 51):
Figure imgf000010_0001
where: $\omega_i$ - frequency of the i-th noise component;
$\varphi_{Pi}$ - phase of the i-th noise component;
$U_{Pi}$ - amplitude of the i-th noise component;
$N_{sp}$ - number of harmonic noise components used for its representation.
The signal is represented as a set of harmonic oscillations with random amplitudes ($U_{Si}$) and phases ($\varphi_{Si}$), which are distributed according to the normal (amplitude) and uniform (phase) laws, and the initial phases of the signal components are set so that for any pair of harmonics the phase difference does not exceed a predetermined value.
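A minimal sketch of how such noise and signal realizations might be generated is given below. It is written in Python rather than MATLAB. The cosine form of each component, the use of the absolute value to keep amplitudes non-negative, and all function and parameter names are assumptions of the sketch. Restricting the phases to an interval of width `phase_spread` guarantees that any pair of phases differs by at most that width.

```python
import math
import random


def random_harmonics(frequencies_hz, amp_mean, amp_std, phase_spread, rng):
    """Draw one (frequency_hz, amplitude, phase_rad) tuple per frequency.

    Amplitudes follow a normal law (absolute value taken, an assumption);
    phases are uniform on [0, phase_spread]. Use phase_spread = 2*pi for
    noise and a narrower interval for the signal.
    """
    components = []
    for f in frequencies_hz:
        amplitude = abs(rng.gauss(amp_mean, amp_std))
        phase = rng.uniform(0.0, phase_spread)
        components.append((f, amplitude, phase))
    return components


def synthesize(components, sample_rate, n_samples):
    """Sum the harmonic components into a sampled waveform (cosine form assumed)."""
    return [
        sum(
            a * math.cos(2.0 * math.pi * f * i / sample_rate + p)
            for f, a, p in components
        )
        for i in range(n_samples)
    ]
```

For example, a signal whose phases span 10% of the phase range would use `phase_spread = 0.1 * 2 * math.pi`, and the additive mixture is the sample-wise sum of the synthesized signal and noise.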
The following input data were used in the modeling process:
- number of realizations - 10^6;
- duration of the interval with only noise - 1000 ms;
- duration of the “sliding window” - 15 ms;
- offset step value of the “sliding window” - 5 ms.
Averaging was performed over the realizations.
Table 1 presents the modeling results for the probability of deciding that a speech signal has appeared when it is absent, for a single shift of the “sliding window” ($P_{it1}$).
Table 1
Figure imgf000011_0001
The following symbols are used in Table 1:
Nth - threshold value of the ratio of the maximum number of harmonic components to the total number of components;
Rp - the value that the phase difference of harmonic components must not exceed, as a percentage of the phase range.
The probability of deciding that a speech signal has appeared when it is absent, over 200 offset steps of the “sliding window” (with an offset step of 5 ms, two hundred steps last 1 s in total), is calculated by the formula

$$P_{it} = 1 - (1 - P_{it1})^{200}, \qquad (2)$$

where $P_{it1}$ is the probability of deciding that a speech signal has appeared when it is absent, in one offset step of the “sliding window”.
The results of calculating the probability of deciding that a speech signal has appeared when it is absent, over 200 offset steps of the “sliding window”, are presented in Table 2.
Table 2
Figure imgf000012_0001
Table 2 uses the same designations as Table 1.
It follows from the analysis of the data in Table 2 that, with a phase difference of 10% of the phase range and a threshold value of the ratio of the maximum number of harmonic components to the total number of components equal to 0.8, the probability of a false alarm over an analysis time of 1 second does not exceed 4·10^-4 for any number of harmonic noise components.
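The relation between the single-step and the 200-step false alarm probabilities in formula (2) can be checked numerically. The sketch below uses a hypothetical single-step value of 2·10^-6, chosen only for illustration because the entries of Table 1 are not reproduced here; the function name is likewise an assumption.

```python
import math


def false_alarm_probability(p_single, n_steps=200):
    """Probability of at least one false alarm in n_steps independent shifts,
    i.e. 1 - (1 - p_single) ** n_steps, evaluated in a numerically stable way."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if p_single == 1.0:
        return 1.0
    return -math.expm1(n_steps * math.log1p(-p_single))


# Hypothetical single-step probability, for illustration only.
p_total = false_alarm_probability(2e-6)
```

For small single-step probabilities the result is close to, and slightly below, the product of the step count and the single-step probability, so the hypothetical value above gives a total just under 4·10^-4. The formula treats the decisions at successive shifts as independent.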
Since the initial phase values for the signal components are set so that the phase difference does not exceed a predetermined value, in this case 10% of the phase change range, the probability of correct decision-making on the appearance of a speech signal in its presence is equal to 1.
The optimal value by which the right border of the “sliding window” is reduced when calculating the time of occurrence of the speech signal, once a decision on its presence has been made, was sought by the direct search method. The initial value of this reduction was set to zero and its increment to 1 ms.
In the optimization procedure, the position of the “sliding window” relative to the moment of appearance of the speech signal was treated as a random value with a uniform distribution. The optimization showed that, for a “sliding window” offset step of 5 ms, the near-optimal value by which the right border of the “sliding window” is reduced is 8 ms, and the average error in determining the time of appearance of the speech signal is about ± 2.5 ms.
The block diagram of the device implementing the proposed method is shown in Fig. 2, where the following are indicated:
1 - Electro-acoustic device (EAD);
2 - lowpass filter (LPF);
3 - low frequency amplifier (LFA);
4 - analog-to-digital converter (ADC);
5 - computer device (CD).
The device contains EAD 1, LPF 2, LFA 3, ADC 4 and CD 5 connected in series; the EAD 1 input is the input of the device, and the CD 5 output is the output of the claimed device.
The device works as follows.
Noise, or an additive mixture of signal and noise, coming from the output of EAD 1 is filtered by LPF 2, the passband of which is matched to the band of the speech signal; it is then amplified in LFA 3 and fed to the ADC 4 input. The samples of the noise or of the signal and noise mixture formed in ADC 4 are transferred in digital form to the CD 5 input.
In CD 5, the received samples of the noise or of the signal and noise mixture are processed according to the algorithm given above.
The processing result is a digital decision on the presence or absence of a speech signal, for example:
1 - the signal is present;
0 - no signal.
When a decision on the presence of the speech signal is made, the device output also provides the time of occurrence of the speech signal. The method of determining this time is given above.
The results of modeling the detection process of a speech signal and determination of accuracy of the speech signal position depending on the number of frequency components of noise, threshold value of the ratio of the maximum number of harmonic components to the total number of components and the value of the phase difference, which should not exceed the phase difference of harmonic components, are given in Tables 1 and 2, respectively.
EAD 1 can be, for example, microphones or laryngophones.
LFA 3 can be implemented, for example, on an OP467GS chip from Analog Devices.
ADC 4 can be implemented, for example, on an ADS8422 chip from Texas Instruments.
Computer device 5 can be made in the form of a programmable logic integrated circuit (PLIC) and implemented, for example, on the XC2V3000-6FG676I chip from Xilinx.
Thus, the claimed method can be implemented by the described device and makes it possible to separate speech and pauses with high efficiency by comparing with a threshold value the calculated ratio of the maximum number of harmonic components of the signal or noise for which the difference in phase values does not exceed the specified value to the total number of components of the signal or noise.
Brief Description of the Drawings
Fig. 1 shows an illustrative example explaining the algorithm.
Fig. 2 shows a structural diagram of the device that implements the proposed method.
Industrial Applicability
The claimed method can be implemented by the described device and makes it possible to separate speech and pauses with high efficiency by comparing with a threshold value the calculated ratio of the maximum number of harmonic components of the signal or noise for which the difference in phase values does not exceed the specified value to the total number of components of the signal or noise.
EAD 1 can be, for example, microphones or laryngophones.
LFA 3 can be implemented, for example, on an OP467GS chip from Analog Devices.
ADC 4 can be implemented, for example, on an ADS8422 chip from Texas Instruments.
Computer device 5 can be made in the form of a programmable logic integrated circuit (PLIC) and implemented, for example, on the XC2V3000-6FG676I chip from Xilinx.
Sources of Information
1. US 2016/0189730 A1, 30.06.2016. Speech separation method and system;
2. US 2007/0021958 A1, 25.01.2007. Robust separation of speech signals in a noisy environment;
3. US 2011/0307251 A1, 15.12.2011. Sound Source Separation Using Spatial Filtering and Regularization Phases;
4. US 2015/0066486 A1, 05.03.2015. Methods and systems for improved signal decomposition;
5. US 5319736 A, 07.06.1994. System for separating speech from background noise;
6. WO 2002/007151 A2, 24.01.2002. Method and apparatus for removing noise from speech signals;
7. RU 2163032 C2, 10.02.2001. Adaptive audio filtering system to improve speech intelligibility in the presence of noise.

Claims

A method of separating speech and pauses, in which, over the whole analysis interval, consisting of an interval that contains no speech signal and an interval that contains a mixture of speech signal and noise, the noise or the mixture of speech signal and noise entering the system is discretized and recorded in memory for further processing, characterized in that a “sliding window”, an interval of a given duration, is formed so that only noise is present in the “sliding window”; the values of the frequencies, phases and amplitudes of the harmonic components of the noise are determined by spectral analysis; the “sliding window” is shifted by the offset step, the value of which is determined in advance; the values of the samples of the noise envelope are calculated for the current position of the “sliding window”, using the results of the spectral analysis performed for the previous position of the “sliding window”; the calculated noise amplitude values are subtracted from the sequence of samples taken for the current position of the “sliding window”; the values obtained are compared with a threshold, the value of which is determined in advance; if none of the values exceeds the threshold, it is considered that the noise has not changed, the “sliding window” is shifted by the offset step and the described procedure is repeated; otherwise, using the values obtained by subtracting the calculated noise amplitude values from the samples taken for the current position of the “sliding window”, the values of the frequencies, phases and amplitudes of the harmonic components are determined by spectral analysis; the total number of harmonic components is determined and, for each harmonic, the number of pairwise phase differences between this harmonic and the remaining harmonics that do not exceed a predetermined value, and the maximum of the values thus found is determined; the ratio of the found maximum number of harmonics, for which the pairwise phase differences do not exceed the predetermined value, to the total number of components is calculated; the calculated ratio of the maximum number of harmonics to the total number of components is compared with a threshold value, which is determined in advance; if the calculated ratio of the maximum number of components to their total number does not exceed the threshold, it is considered that the speech signal is absent in the “sliding window”; in this case, the process of detecting the appearance of the speech signal continues according to the described algorithm until, at the next shift of the “sliding window”, the calculated ratio of the maximum number of harmonic components to their total number exceeds the threshold value, in which case it is considered that the speech signal is present in the “sliding window”, and the time of its appearance is set equal to the value of the right border of the “sliding window” reduced by a specified value.
PCT/RU2019/000516 2018-10-15 2019-07-23 Method of speech separation and pauses WO2020080972A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2018136230 2018-10-15
RU2018136230A RU2680735C1 (en) 2018-10-15 2018-10-15 Method of separation of speech and pauses by analysis of the values of phases of frequency components of noise and signal

Publications (1)

Publication Number Publication Date
WO2020080972A1 true WO2020080972A1 (en) 2020-04-23

Family

ID=65479270

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2019/000516 WO2020080972A1 (en) 2018-10-15 2019-07-23 Method of speech separation and pauses

Country Status (2)

Country Link
RU (1) RU2680735C1 (en)
WO (1) WO2020080972A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5319736A (en) * 1989-12-06 1994-06-07 National Research Council Of Canada System for separating speech from background noise
US20070021958A1 (en) * 2005-07-22 2007-01-25 Erik Visser Robust separation of speech signals in a noisy environment
US20110307251A1 (en) * 2010-06-15 2011-12-15 Microsoft Corporation Sound Source Separation Using Spatial Filtering and Regularization Phases
US20140316771A1 (en) * 2012-05-04 2014-10-23 Kaonyx Labs LLC Systems and methods for source signal separation
US20150066486A1 (en) * 2013-08-28 2015-03-05 Accusonus S.A. Methods and systems for improved signal decomposition
US20160189730A1 (en) * 2014-12-30 2016-06-30 Iflytek Co., Ltd. Speech separation method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0852052B1 (en) * 1995-09-14 2001-06-13 Ericsson Inc. System for adaptively filtering audio signals to enhance speech intelligibility in noisy environmental conditions
US20020039425A1 (en) * 2000-07-19 2002-04-04 Burnett Gregory C. Method and apparatus for removing noise from electronic signals

Also Published As

Publication number Publication date
RU2680735C1 (en) 2019-02-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19874077

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19874077

Country of ref document: EP

Kind code of ref document: A1