WO2020080972A1

WO2020080972A1 - Method of speech separation and pauses

Info

Publication number: WO2020080972A1
Application number: PCT/RU2019/000516
Authority: WO
Inventors: Vladimir Aleksandrovich BELOGUROV; Vladimir Alekseevich ZOLOTAREV
Original assignee: Joint-Stock Company "Concern "Sozvezdie"
Priority date: 2018-10-15
Filing date: 2019-07-23
Publication date: 2020-04-23
Also published as: RU2680735C1

Abstract

A method for separating speech and pauses is described. A "sliding window" is positioned so that it contains only noise and is then moved by an offset step. For each position of the "sliding window", a spectral analysis method determines the amplitudes, frequencies and phases of the harmonic components of the noise or of the mixture of noise and signal. For every position of the "sliding window" except the first, the noise envelope amplitudes for the current position are calculated from the results of the spectral analysis performed for the previous position, and the calculated noise amplitude values are subtracted from the samples taken for the current position. The total number of harmonic components is determined; for each harmonic, the number of paired phase differences between that harmonic and the other harmonics that do not exceed a specified value is counted, and the largest of these counts is found. The ratio of this largest count to the total number of harmonics is calculated. A speech signal is considered present when this ratio exceeds a threshold value.

Description

METHOD OF SPEECH SEPARATION AND PAUSES

Field of Technology

The invention belongs to the field of voice information transmission and can be used in communication and loudspeaker devices.

Previous Technology State

A device for detecting acoustic signals in communication channels is known from patent RU 2171549, H04Q 1/46. That invention relates to telecommunications, in particular to automatic means of receiving tones in multichannel systems, and can be used, for example, to detect acoustic signals (AS) in telephone channels. Its operation is based on calculating a number of decision statistics, which serve as distinguishing features for recognizing information acoustic signals against channel noise and stray speech signals. The decision statistics are the estimate of the signal power in the information frequency band, the distribution of the input signal energy over the frequency band, and the irregularity of the envelope of the input signal after bandpass filtering. The final decision on the presence of an AS in the communication channel is made by secondary processing that applies a majority rule to a series of primary decisions.

The disadvantage of the known device is its low efficiency in solving the problem of speech separation and pauses.

Also known is a device for detecting tonal signals in communication channels, described in patent RU 2214051, H04B 3/46, H04Q 1/457, H04M 1/50. That invention relates to telecommunications, in particular to automatic means of receiving tones in multichannel systems, and can be used to detect acoustic signals in telephone channels. The known technical solution is not very effective in solving the problem of speech separation and pauses in the presence of acoustic noise.

The closest analogue in technical substance is the method of speech separation and pauses described in the book “Digital Processing of Speech Signals”, L.R. Rabiner, R.W. Schafer, translated from English, edited by M.V. Nazarov and Yu.N. Prokhorov, Moscow, Radio and Communication, 1981, pp. 123-126, which is taken as the prototype.

The prototype method is as follows.

The signal received by the device is sampled within the time interval set for its analysis and stored in memory for further processing. The processed signal consists of an interval that contains only noise, the duration of this interval is about 100 ms, and an interval that contains an additive mixture of speech signal and noise (hereinafter - signal and noise mixture).

The main parameters are the number of zero crossings per 10 ms and the average magnitude function computed over a 10 ms window. The samples of the noise-only interval are used to calculate the mean and variance of the weighted sum of the absolute sample amplitudes and of the number of zero crossings (the noise statistics).
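The two prototype features can be sketched as follows. This is a minimal illustration, not the procedure from the cited book: the function names, the 8 kHz sampling rate, the non-overlapping frames and the unweighted (rectangular) average are all assumptions made here.

```python
import numpy as np

def frame_features(x, fs=8000, frame_ms=10):
    """Return per-frame average magnitude and zero-crossing count."""
    n = int(fs * frame_ms / 1000)                 # samples per frame
    n_frames = len(x) // n
    frames = np.asarray(x, dtype=float)[:n_frames * n].reshape(n_frames, n)
    avg_mag = np.abs(frames).mean(axis=1)         # mean absolute value per frame
    neg = np.signbit(frames)
    # A zero crossing is a sign change between neighbouring samples of a frame.
    zero_cross = (neg[:, 1:] != neg[:, :-1]).sum(axis=1)
    return avg_mag, zero_cross

def noise_statistics(x, fs=8000, noise_ms=100):
    """Mean and variance of both features over the leading noise-only interval."""
    n_noise = int(fs * noise_ms / 1000)
    avg_mag, zero_cross = frame_features(x[:n_noise], fs)
    return {
        "mag_mean": float(avg_mag.mean()), "mag_var": float(avg_mag.var()),
        "zc_mean": float(zero_cross.mean()), "zc_var": float(zero_cross.var()),
    }
```

The thresholds of the prototype would then be derived from these statistics; how exactly is described in the cited book and is not reproduced here.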

Taking into account the values of these characteristics and the maximum mean value, the thresholds for the average number of zero crossings (ANZC) and for the signal energy are calculated. A segment of the waveform is found in which the trajectory of the mean value of signal energy (MVSE) exceeds the upper threshold. It is assumed that the beginning and the end of the word lie outside this segment.

Then, moving backwards along the time axis from the moment when the MVSE first exceeded the upper threshold, the moment when the MVSE first falls below the lower threshold is determined (point N1). This moment is selected as the presumed beginning of the word. The presumed end of the word (point N2) is determined in the same way. The next step is to move to the left of point N1 (to the right of point N2) and compare the number of zero crossings with the threshold calculated from the initial noise segment. If the number of zero crossings exceeds the threshold three or more times, the beginning of the word is moved to the place where the zero-crossing curve first exceeded the threshold. Otherwise, point N1 is taken as the beginning of the word. A similar procedure is performed for point N2.

The disadvantages of the prototype method are insufficient accuracy in determining the moment of appearance of the speech signal and a high probability of erroneous decisions about the appearance of the signal in the presence of acoustic noise.

Disclosure of the Invention

The objective of the proposed method is to improve the accuracy of determining the moment of appearance of a speech signal and increase the probability value of the correct decision about the appearance of a speech signal in the presence of acoustic noise.

To solve this problem, in the method of separating speech and pauses, in which noise or a mixture of speech signal and noise entering the system is sampled over the entire analysis interval, consisting of an interval that contains no speech signal and an interval that contains a mixture of speech signal and noise, and is recorded in memory for further processing, according to the invention a “sliding window”, an interval of a given duration, is formed so that only noise is present in the “sliding window”;

the frequencies, phases and amplitudes of the harmonic noise components are determined by a spectral analysis method;

the “sliding window” is shifted by the offset step, the value of which is determined in advance;

the noise envelope values for the current position of the “sliding window” are calculated from the results of the spectral analysis performed for the previous position of the “sliding window”, and the calculated noise amplitude values are subtracted from the sequence of samples taken for the current position of the “sliding window”;

the obtained values are compared with a threshold, the value of which is determined in advance; if none of the values exceeds the threshold, the noise is considered unchanged, the “sliding window” is shifted by the offset step and the described procedure is repeated;

otherwise, the frequencies, phases and amplitudes of the harmonic components are determined by spectral analysis of the values obtained by subtracting the calculated noise amplitude values from the samples taken for the current “sliding window” position;

the total number of harmonic components is determined; for each harmonic, the number of paired phase differences between this harmonic and the other harmonics that do not exceed the specified value is determined, and the maximum of the numbers thus found is determined;

the ratio of this maximum number of harmonics, for which the paired phase differences do not exceed the specified value, to the total number of components is calculated;

the calculated ratio of the maximum number of harmonics to the total number of components is compared with a threshold value, which is determined in advance;

if the calculated ratio of the maximum number of components to their total number does not exceed the threshold value, it is considered that the “sliding window” contains no speech signal;

in this case, the process of detecting the speech signal continues according to the described algorithm until, at the next shift of the “sliding window”, the calculated ratio of the maximum number of harmonic components to their total number exceeds the threshold value; it is then considered that a speech signal is present in the “sliding window”, and the time of its appearance is set equal to the value of the right border of the “sliding window” reduced by a predetermined value.

Implementation Option of the Invention

The proposed method of speech and pause separation by analyzing the phase values of the frequency components of noise and signal is implemented as follows.

Signals from the output of the electroacoustic device (EAD) are passed through a lowpass filter (LPF), amplified in a low-frequency amplifier (LFA), sampled by an analog-to-digital converter (ADC) and stored in computer memory for further processing.

The voice signal is detected and the position of its start is determined as follows.

A “sliding window”, an interval of the specified duration, is formed, and its initial position is set so that only noise is present in the “sliding window”.

The duration of the interval that is considered to contain only noise and the duration of the “sliding window” are determined at the development stage, by experiment or by mathematical modeling, so as to provide a given level of efficiency in solving the problem of speech separation and pauses. Efficiency is understood here as the maximum probability of a correct decision on the appearance of the speech signal in the presence of acoustic noise, provided that the false alarm probability (a decision that a speech signal is present when it is absent) is no higher than a specified level.

The frequencies, phases and amplitudes of the harmonic noise components are determined by a spectral analysis method, for example by the method of spectral analysis of multi-frequency periodic signals represented by digital samples, described in the book “Functional Control and Diagnosis of Electrical Systems and Devices by Digital Samples of Instantaneous Values of Current and Voltage”, edited by E.I. Goldstein, Tomsk: “Printing Manufactory”, 2003, pp. 92-94.

The “sliding window” is shifted by the offset step, the value of which is determined in advance.

The offset step value is determined at the development stage, by experiment or by mathematical modeling, so as to provide a given level of efficiency in solving the problem of speech separation and pauses.

The noise envelope samples are calculated from the results of the spectral analysis performed for the previous “sliding window” position, for the time instants at which the samples of the current “sliding window” position were taken.

The calculated samples are subtracted from the sequence of samples taken for the current “sliding window” position.
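The prediction and subtraction step can be sketched as follows. The source relies on the spectral method from the cited book by Goldstein; the sketch below swaps in a plain FFT with a relative amplitude floor, which is exact only when the noise harmonics fall on FFT bins. All function names and the `rel_floor` parameter are assumptions made for illustration.

```python
import numpy as np

def analyze_harmonics(x, fs, rel_floor=0.1):
    """Estimate (freqs, amps, phases) of the dominant harmonics of x via an FFT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spec = np.fft.rfft(x)
    amps = 2.0 * np.abs(spec) / n          # amplitude A of A*cos(2*pi*f*t + phi)
    amps[0] /= 2.0                         # the DC bin is not doubled
    if n % 2 == 0:
        amps[-1] /= 2.0                    # neither is the Nyquist bin
    keep = amps > rel_floor * amps.max()   # drop weak bins (assumed criterion)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs[keep], amps[keep], np.angle(spec)[keep]

def predict_noise(freqs, amps, phases, t):
    """Evaluate the harmonic sum at times t, in seconds from the analysed window start."""
    t = np.asarray(t, dtype=float)[:, None]
    return (amps * np.cos(2 * np.pi * freqs * t + phases)).sum(axis=1)

def residual_after_noise_subtraction(prev_win, cur_win, fs, step_s):
    """Subtract the noise predicted from the previous window from the current window."""
    freqs, amps, phases = analyze_harmonics(prev_win, fs)
    # The current window starts step_s seconds after the previous one.
    t_cur = step_s + np.arange(len(cur_win)) / fs
    return np.asarray(cur_win, dtype=float) - predict_noise(freqs, amps, phases, t_cur)
```

If the noise is stationary and well described by the estimated harmonics, the residual stays near zero; any larger residual signals that something other than the previous noise has entered the window.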

The obtained values are compared with a threshold, the value of which is determined in advance. If none of the values exceeds the threshold, the noise is considered unchanged, the “sliding window” is shifted by the offset step and the described procedure is repeated.

This threshold value is determined at the stage of development by experiment or by the mathematical modeling based on providing a given level of efficiency of the problem solution of speech separation and pauses.

Otherwise, the frequencies, phases and amplitudes of the harmonic components are determined by spectral analysis of the values obtained by subtracting the calculated noise amplitude values from the samples taken for the current “sliding window” position.

The maximum number of harmonic components for which the phase differences do not exceed the specified value is determined according to the following algorithm:

1. The harmonic components are numbered in arbitrary order;

2. For the component with the first number, the phase differences between this component and all other components are determined, and the number of components for which the phase difference does not exceed the specified value (N_Ci) is found;

3. The procedure according to item 2 of the algorithm is repeated for all remaining components;

4. From the found numbers of components (N_Ci), the largest value is determined.

5. The process is complete.
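Steps 1 to 5 can be sketched in a few lines. The function name is hypothetical, and two points are assumptions not stated in the source: phase differences are wrapped to the range from -pi to pi before comparison, and a harmonic is not counted as close to itself.

```python
import numpy as np

def max_phase_aligned_count(phases, max_diff):
    """Largest N_Ci: for each harmonic, count the other harmonics whose phase
    differs from it by no more than max_diff (radians), then take the maximum."""
    phases = np.asarray(phases, dtype=float)
    if phases.size == 0:
        return 0
    # Pairwise differences, wrapped to [-pi, pi] (assumption).
    diff = np.abs(np.angle(np.exp(1j * (phases[:, None] - phases[None, :]))))
    close = diff <= max_diff
    np.fill_diagonal(close, False)         # exclude the harmonic itself (assumption)
    return int(close.sum(axis=1).max())
```

For example, `max_phase_aligned_count([0.0, 0.1, 0.2, 3.0], 0.25)` counts two neighbours for each of the first three harmonics and none for the last.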

An illustrative example of how the algorithm works is shown in Fig. 1.

Calculate the ratio of the found maximum value of the number of harmonics to the total number of harmonic components.

Compare the found value of the ratio of the maximum number of harmonic components to the total number of components with the threshold value, the value of which is determined in advance.

This threshold value and the value that the phase difference of harmonic components must not exceed are determined at the development stage, by experiment or by mathematical modeling, so as to provide a given level of efficiency in solving the problem of speech separation and pauses.

If the calculated value of the ratio of the maximum number of harmonic components to the total number of components does not exceed the threshold value, it is considered that there is no speech signal in the“sliding window”.
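The ratio test can be sketched as a single decision function. The name `speech_present` and its parameters are hypothetical; as before, wrapping the phase differences and excluding each harmonic from its own count are assumptions.

```python
import numpy as np

def speech_present(phases, max_diff, ratio_threshold):
    """Return True when the largest count of phase-aligned harmonics, divided by
    the total number of harmonics, exceeds ratio_threshold."""
    phases = np.asarray(phases, dtype=float)
    total = phases.size
    if total == 0:
        return False
    diff = np.abs(np.angle(np.exp(1j * (phases[:, None] - phases[None, :]))))
    close = diff <= max_diff
    np.fill_diagonal(close, False)         # a harmonic is not compared with itself
    ratio = close.sum(axis=1).max() / total
    return bool(ratio > ratio_threshold)
```

With ten harmonics whose phases all lie within 10% of the phase range (`max_diff = 0.1 * 2 * np.pi`), the ratio is 0.9 and a threshold of 0.8 is exceeded; with phases spread evenly over the range it is not.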

In this case, the process of detecting the appearance of a speech signal continues according to the described algorithm, namely:

- the “sliding window” is shifted by the offset step, the value of which is determined in advance;

- the frequencies, phases and amplitudes of the harmonic components are determined by a spectral analysis method;

- the noise envelope samples are calculated from the results of the spectral analysis performed for the previous “sliding window” position, for the time instants at which the samples of the current “sliding window” position were taken;

- the calculated samples are subtracted from the sequence of samples taken for the current “sliding window” position;

- the obtained values are compared with the threshold, the value of which is determined in advance; if none of the values exceeds the threshold, the noise is considered unchanged, the “sliding window” is shifted by the offset step and the described procedure is repeated;

- the maximum number of components for which the phase differences do not exceed the specified value is determined according to the algorithm described above;

- the ratio of the found maximum number of harmonic components to the total number of components determined by the spectral analysis method is calculated;

- the found ratio of the maximum number of harmonic components to the total number of components is compared with the threshold value, the value of which is determined in advance;

- if the calculated ratio of the maximum number of harmonic components to the total number of components does not exceed the threshold value, it is considered that there is no speech signal in the “sliding window”, and the process of detecting the appearance of the speech signal continues according to the described algorithm until, at the next shift of the “sliding window”, the calculated ratio exceeds the threshold value;

- in this case, it is considered that a speech signal is present in the “sliding window”, and the time of its appearance is set equal to the value of the right border of the “sliding window” reduced by a predetermined value.

The threshold values are determined at the development stage, by experiment or by mathematical modeling, so as to provide a given level of efficiency in solving the problem of speech separation and pauses.
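The whole detection loop listed above can be sketched as follows. This is an illustrative outline under several assumptions: an FFT with a relative amplitude floor stands in for the spectral analysis method of the source, phases are taken relative to the start of each window, and all names and parameters (`detect_onset`, `resid_thr`, `back_s` and so on) are hypothetical.

```python
import numpy as np

def harmonics(x, fs, rel_floor=0.1):
    """FFT-based estimate of (freqs, amps, phases) of the dominant harmonics."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spec = np.fft.rfft(x)
    amps = 2.0 * np.abs(spec) / n
    amps[0] /= 2.0                                   # DC bin is not doubled
    if n % 2 == 0:
        amps[-1] /= 2.0                              # nor the Nyquist bin
    keep = amps > rel_floor * amps.max()
    return np.fft.rfftfreq(n, d=1.0 / fs)[keep], amps[keep], np.angle(spec)[keep]

def aligned_ratio(phases, max_diff):
    """Largest count of phase-aligned other harmonics divided by the total count."""
    if phases.size == 0:
        return 0.0
    diff = np.abs(np.angle(np.exp(1j * (phases[:, None] - phases[None, :]))))
    close = diff <= max_diff
    np.fill_diagonal(close, False)
    return close.sum(axis=1).max() / phases.size

def detect_onset(x, fs, win_s, step_s, resid_thr, max_diff, ratio_thr, back_s):
    """Return the estimated speech onset time in seconds, or None if none is found."""
    x = np.asarray(x, dtype=float)
    n_win, n_step = int(round(win_s * fs)), int(round(step_s * fs))
    start = 0                                        # first window: noise only
    f, a, p = harmonics(x[:n_win], fs)
    while start + n_step + n_win <= len(x):
        start += n_step                              # shift by the offset step
        cur = x[start:start + n_win]
        t = (n_step + np.arange(n_win)) / fs         # time from previous window start
        resid = cur - (a * np.cos(2 * np.pi * f * t[:, None] + p)).sum(axis=1)
        if np.any(np.abs(resid) > resid_thr):        # the noise model no longer fits
            _, _, p_res = harmonics(resid, fs)
            if aligned_ratio(p_res, max_diff) > ratio_thr:
                return (start + n_win) / fs - back_s # right border minus reduction
        f, a, p = harmonics(cur, fs)                 # model for the next position
    return None
```

The function returns as soon as the ratio test succeeds, so the noise model is only ever refreshed from windows in which no speech was declared.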

The optimal average value by which the value of the right border of the “sliding window” is reduced cannot be obtained by the analytical method, since at present there are no analytical expressions linking this value and the target function - the efficiency of solving the problem of speech separation and pauses.

Therefore, the optimal average value by which the value of the right border of the“sliding window” is reduced can be determined at the stage of development by experiment or by mathematical modeling based on the condition of providing a given level of efficiency of the solution of the problem of speech separation and pauses.

Below are the results of modeling the process of decision-making on the presence of a speech signal using the MATLAB system.

Acoustic noise in the modeling is represented as a set of harmonic oscillations with random amplitudes (U_Pi) and phases (φ_Pi), distributed according to the normal (amplitude) and uniform (phase) laws (see, e.g., “Fundamentals of the Theory of Radio Engineering Systems”, textbook, V.I. Borisov, V.M. Zinchuk, A.E. Limarev, N.P. Mukhin, edited by V.I. Borisov, Voronezh Research Institute of Communications, 2004, p. 51),
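A form of expression (1) consistent with this description and with the symbol list that follows would be the one below. The cosine basis and the symbol n(t) for the noise are assumptions introduced here, not taken from the source.

```latex
n(t) = \sum_{i=1}^{N_{sp}} U_{Pi} \cos\left(\omega_{Pi}\, t + \varphi_{Pi}\right) \qquad (1)
```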

where: ω_Pi - the frequency of the i-th noise component;

φ_Pi - the phase of the i-th noise component;

U_Pi - the amplitude of the i-th noise component;

N_sp - the number of harmonic noise components used for its representation.

The signal is represented as a set of harmonic oscillations with random amplitudes (U_Si) and phases (φ_Si), distributed according to the normal (amplitude) and uniform (phase) laws, and the initial phases of the signal components are set so that for any pair of harmonics the phase difference does not exceed a predetermined value.

The following input data were used in the modeling process:

- number of implementations - 10⁶;

- duration of the interval with only noise - 1000 ms;

- duration of the“sliding window” - 15 ms;

- offset step value of the“sliding window” - 5 ms.

Averaging was done by number of implementations.
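A Monte Carlo estimate of the single-step false alarm probability can be sketched as follows. The source used MATLAB; this Python sketch is not that model. It draws only the uniformly distributed noise phases (the amplitudes do not enter the phase test), skips the spectral analysis step, and uses hypothetical names and a much smaller number of trials than the 10⁶ realizations listed above.

```python
import numpy as np

def false_alarm_rate(n_harmonics, phase_frac, ratio_thr, n_trials=10_000, seed=0):
    """Fraction of noise-only trials in which the phase-alignment ratio exceeds
    ratio_thr. phase_frac is the allowed phase difference as a fraction of the
    full 2*pi phase range (0.1 corresponds to 10%)."""
    rng = np.random.default_rng(seed)
    max_diff = phase_frac * 2 * np.pi
    phases = rng.uniform(-np.pi, np.pi, size=(n_trials, n_harmonics))
    # Wrapped pairwise phase differences for every trial.
    diff = np.abs(np.angle(np.exp(1j * (phases[:, :, None] - phases[:, None, :]))))
    close = diff <= max_diff
    idx = np.arange(n_harmonics)
    close[:, idx, idx] = False                 # a harmonic is not compared with itself
    ratio = close.sum(axis=2).max(axis=1) / n_harmonics
    return float(np.mean(ratio > ratio_thr))
```

Calling `false_alarm_rate(10, 0.1, 0.8)` gives an estimate for ten noise harmonics, a 10% phase tolerance and a ratio threshold of 0.8; very small probabilities need far more trials than the default to be resolved.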

Table 1 presents the results of modeling the probability of deciding that a speech signal has appeared when it is absent, for a single offset step of the “sliding window” (P_fa1).

Table 1

The following symbols are used in Table 1:

N_th - the threshold value of the ratio of the maximum number of harmonic components to the total number of components;

R_p - the value that the phase difference of harmonic components must not exceed, as a percentage of the phase range.

The probability of deciding that a speech signal has appeared when it is absent over 200 offset steps of the “sliding window” (with an offset step of 5 ms, the total duration of two hundred steps is 1 s) is calculated by the formula

P_fa = 1 - (1 - P_fa1)^200, (2)

where P_fa1 is the probability of deciding that a speech signal has appeared when it is absent in one offset step of the “sliding window”.

The results of calculating the probability of deciding that a speech signal has appeared when it is absent over 200 offset steps of the “sliding window” are presented in Table 2.

Table 2

Table 2 uses the same designations as Table 1.
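Formula (2) can be evaluated directly. The function name and argument names below are hypothetical; the default of 200 steps follows the text.

```python
def false_alarm_over_steps(p_single, n_steps=200):
    """Probability of at least one false alarm in n_steps independent offset
    steps, given the single-step false alarm probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_steps
```

The formula treats the decisions at successive offset steps as independent. For small single-step probabilities the result is close to `n_steps * p_single`.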

It follows from the analysis of the data in Table 2 that, with a phase difference of 10% of the phase range and a threshold value of the ratio of the maximum number of harmonic components to the total number of components equal to 0.8, the false alarm probability over an analysis time of 1 second does not exceed 4·10⁻⁴ for any number of harmonic noise components.

Since the initial phase values for the signal components are set so that the phase difference does not exceed a predetermined value, in this case 10% of the phase change range, the probability of correct decision-making on the appearance of a speech signal in its presence is equal to 1.

The search for the optimal value by which the value of the right border of the“sliding window” is reduced, when calculating the time of occurrence of the speech signal, when making a decision about its presence, was carried out by the direct search method. In this case, the initial value by which the value of the right border of the“sliding window” is reduced is set to zero, the step change of this value is set to 1 ms.

When performing the optimization procedure, the position of the “sliding window” relative to the moment of appearance of the speech signal was treated as a random value with a uniform distribution. The optimization showed that, with a “sliding window” offset step of 5 ms, the near-optimal value by which the right border of the “sliding window” is reduced is 8 ms, and the average error in determining the time of appearance of the speech signal is about ±2.5 ms.
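The direct search can be sketched on a toy model. The search itself (start at zero, step of 1 ms) follows the text. The detection model is purely hypothetical: it assumes that a window detects speech once it contains at least `min_speech_ms` of it, a parameter that does not appear in the source.

```python
import numpy as np

def detection_lags(step_ms, min_speech_ms, n_onsets=1000):
    """Lag (ms) between the true onset and the right border of the first window
    that detects it, for onsets spread uniformly over one offset step."""
    onsets = (np.arange(n_onsets) + 0.5) * step_ms / n_onsets
    # Right borders lie on the grid 0, step, 2*step, ...; the first detecting
    # border is the first one at least min_speech_ms after the onset (assumed).
    k = np.ceil((onsets + min_speech_ms) / step_ms)
    return k * step_ms - onsets

def direct_search_reduction(lags, search_step_ms=1.0, max_ms=50.0):
    """Try reductions 0, 1, 2, ... ms and keep the one with the smallest mean
    absolute timing error."""
    best_r, best_err = 0.0, float(np.mean(np.abs(lags)))
    r = search_step_ms
    while r <= max_ms:
        err = float(np.mean(np.abs(lags - r)))
        if err < best_err:
            best_r, best_err = r, err
        r += search_step_ms
    return best_r, best_err
```

A call such as `direct_search_reduction(detection_lags(5.0, min_speech_ms))` returns the best reduction on the 1 ms grid together with its mean absolute error; the outcome depends entirely on the assumed detection model.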

A block diagram of the device implementing the proposed method is shown in Fig. 2, where the following are indicated:

1 - electroacoustic device (EAD);

2 - lowpass filter (LPF);

3 - low frequency amplifier (LFA);

4 - analog-to-digital converter (ADC);

5 - computer device (CD).

The device contains EAD 1, LPF 2, LFA 3, ADC 4 and CD 5 connected in series; the input of EAD 1 is the input of the device, and the output of CD 5 is the output of the device.

The device works as follows.

Noise or an additive mixture of signal and noise coming from the output of EAD 1 is filtered by LPF 2, whose passband is matched to the band of the speech signal, then amplified in LFA 3 and fed to the input of ADC 4. The samples of noise or of the signal and noise mixture formed in ADC 4 are transferred in digital form to the input of CD 5.

In CD 5, the received samples of noise or signal and noise mixture are processed according to the algorithm given above.

The processing result is a digital decision on the presence or absence of a speech signal, for example:

1 - the signal is present;

0 - no signal.

When the decision that a speech signal is present is made, the time of its occurrence is also delivered to the device output. The method of determining this time is given above.

The results of modeling the detection of a speech signal and the accuracy of determining its position, depending on the number of frequency components of the noise, the threshold value of the ratio of the maximum number of harmonic components to the total number of components, and the value that the phase difference of harmonic components must not exceed, are given in Tables 1 and 2, respectively.

EAD 1 can be, for example, microphones or laryngophones.

LFA 3 can be implemented, for example, on an OP467GS chip from Analog Devices.

ADC 4 can be implemented, for example, on an ADS8422 chip from Texas Instruments.

Computer device 5 can be made in the form of a programmable logic integrated circuit (PLIC) and implemented, for example, on the XC2V3000-6FG676I chip from Xilinx.

Thus, the declared method can be implemented by the described device and makes it possible to solve the problem of speech separation and pauses with high efficiency by comparing with a threshold value the calculated ratio of the maximum number of harmonic components of the signal or noise, for which the difference in phase values does not exceed the specified value, to the total number of components of the signal or noise.

Brief Drawings Description

Fig. 1 shows an illustrative example explaining the algorithm.

Fig. 2 shows a structural diagram of the device that implements the proposed method.

Industrial Applicability

The declared method can be implemented by the device described above and makes it possible to solve the problem of speech separation and pauses with high efficiency by comparing with a threshold value the calculated ratio of the maximum number of harmonic components of the signal or noise, for which the difference in phase values does not exceed the specified value, to the total number of components of the signal or noise.

EAD 1 can be, for example, microphones or laryngophones.

LFA 3 can be implemented, for example, on an OP467GS chip from Analog Devices.

ADC 4 can be implemented, for example, on an ADS8422 chip from Texas Instruments.

Sources of Information

1. US 2016/0189730 A1, 30.06.2016. Speech separation method and system;

2. US 2007/0021958 A1, 25.01.2007. Robust separation of speech signals in a noisy environment;

3. US 2011/0307251 A1, 15.12.2011. Sound Source Separation Using Spatial Filtering and Regularization Phases;

4. US 2015/0066486 A1, 05.03.2015. Methods and systems for improved signal decomposition;

5. US 5319736 A, 07.06.1994. System for separating speech from background noise;

6. WO 2002/007151 A2, 24.01.2002. Method and apparatus for removing noise from speech signals;

7. RU 2163032 C2, 10.02.2001. Adaptive audio filtering system to improve speech intelligibility in the presence of noise.

Claims

A method of separating speech and pauses, in which noise or a mixture of speech signal and noise entering the system is sampled over the entire analysis interval, consisting of an interval that contains no speech signal and an interval that contains a mixture of speech signal and noise, and is recorded in memory for further processing, characterized in that:

a “sliding window”, an interval of a given duration, is formed so that only noise is present in the “sliding window”;

the frequencies, phases and amplitudes of the harmonic components of the noise are determined by a spectral analysis method;

the “sliding window” is shifted by the offset step, the value of which is determined in advance;

the noise envelope samples for the current position of the “sliding window” are calculated from the results of the spectral analysis performed for the previous position of the “sliding window”, and the calculated noise amplitude values are subtracted from the sequence of samples taken for the current position of the “sliding window”;

the values obtained are compared with a threshold, the value of which is determined in advance; if none of the values exceeds the threshold, the noise is considered unchanged, the “sliding window” is shifted by the offset step and the described procedure is repeated;

otherwise, the frequencies, phases and amplitudes of the harmonic components are determined by the spectral analysis method from the values obtained by subtracting the calculated noise amplitude values from the samples taken for the current position of the “sliding window”;

the total number of harmonic components is determined; for each harmonic, the number of paired phase differences between this harmonic and the remaining harmonics that do not exceed a predetermined value is determined, and the maximum of the numbers thus found is determined;

the ratio of the found maximum number of harmonics, for which the paired phase differences do not exceed the predetermined value, to the total number of components is calculated;

the calculated ratio of the maximum number of harmonics to the total number of components is compared with a threshold value, which is determined in advance;

if the calculated ratio of the maximum number of components to their total number does not exceed the threshold, the speech signal is considered absent in the “sliding window”;

in this case, the process of detecting the appearance of the speech signal continues according to the described algorithm until, at the next shift of the “sliding window”, the calculated ratio of the maximum number of harmonic components to their total number exceeds the threshold value; it is then considered that the speech signal is present in the “sliding window”, and the time of its appearance is set equal to the value of the right border of the “sliding window” reduced by a specified value.