CN111933175B

CN111933175B - Active voice detection method and system based on noise scene recognition

Info

Publication number: CN111933175B
Application number: CN202010783583.5A
Authority: CN
Inventors: 田野; 王磊
Original assignee: Third Research Institute Of China Electronics Technology Group Corp; Beijing Zhongdian Huisheng Technology Co ltd
Current assignee: Third Research Institute Of China Electronics Technology Group Corp; Beijing Zhongdian Huisheng Technology Co ltd
Priority date: 2020-08-06
Filing date: 2020-08-06
Publication date: 2023-10-24
Anticipated expiration: 2040-08-06
Also published as: CN111933175A

Abstract

The invention discloses an active voice detection method based on noise scene recognition. Preferred features for the noise classification task are extracted from an audio signal, and the feature values are input into a noise type classifier to recognize the noise type in the audio signal; according to the noise type, the preferred features and the classifier suitable for the voice and noise classification task are determined; the preferred features for the voice and noise classification task are then extracted from the audio signal and input into the voice noise classifier to judge whether a voice signal exists in the audio signal. The invention also discloses an active voice detection system based on noise scene recognition. Before classifying noisy speech and noise signals, the disclosed method detects and identifies the current noise type, selects the most discriminative feature combination for that specific noise type, and allows model parameters to be designed for that noise type, thereby ensuring the effectiveness and stability of the whole detection process under different noise types.

Description

Active voice detection method and system based on noise scene recognition

Technical Field

The invention relates to the technical field of voice data processing, in particular to an active voice detection method and system based on noise scene recognition.

Background

A speech signal often contains pauses and interruptions. These "silent" segments, overlaid with environmental noise, form signal segments that contain no useful speech information; they interfere with speech signal processing and occupy considerable data transmission resources. The goal of active speech detection (voice activity detection, VAD) is to detect the actual speech segments in the signal and remove these "silent" portions, thereby reducing the burden of subsequent speech signal processing. VAD is therefore widely used in speech coding, speaker recognition, automatic speech recognition, abnormal sound detection and other systems.

In view of the wide demand for active speech detection, researchers have in recent years proposed many detection methods, which can be divided into unsupervised and supervised methods. Unsupervised methods center on feature design and the formulation of threshold rules; typical features include short-time energy, short-time zero-crossing rate and spectral entropy. Their performance degrades markedly in noisy environments, so they are usually combined with a noise reduction algorithm. Supervised methods treat active speech detection as a classification problem between speech signals and noise signals; by learning noise data in advance, they perform better than unsupervised methods in noisy environments.

Supervised methods mainly comprise two stages: feature extraction and classifier design. For feature extraction, in order to distinguish the acoustic characteristics of noise and speech effectively, researchers extract high-dimensional features from different angles, such as energy, zero-crossing rate, Mel-frequency cepstral coefficient (MFCC), fuzzy entropy, autocorrelation coefficient and wavelet coefficient features, and combine them to fuse multi-angle feature information. Although such feature combinations can distinguish noise from speech under a specific noise type, in practical applications the noise type varies over time, so a feature combination designed for the general case rarely shows stable discriminative ability in a dynamic noise scene. The high dimensionality of the features also burdens the subsequent classifier.

For classifier design, in order to build a two-class model of noise and speech signals, the prior art usually adopts detection methods based on MFCC features and a support vector machine (SVM), on fuzzy entropy features and an SVM, or on a multilayer perceptron (MLP). For classifier selection, with the continuing development of machine learning, ensemble learning and deep learning methods have been developed in recent years to overcome the limited data modeling capability of a single classifier, improving the generalization capability of the classifier model by increasing the breadth and depth of modeling. For the classifier modeling strategy, supervised methods model the difference between noise and speech signals of a specific type during training, so that speech signals can be detected in an audio stream.

These classifier designs, classifier selections and modeling strategies work well under a single noise type. However, because noise types vary, the modeling parameters suited to different noise characteristics differ, so it is difficult to train a single classifier model that discriminates well under many different noise types.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide an active speech detection method based on noise scene recognition. Before classifying noisy speech and noise signals, the method detects and recognizes the current noise type, converting a dynamic noise environment into a limited set of noise environments. The most discriminative feature combination can then be selected from the high-dimensional features for each specific noise type, and model parameters can be designed for that noise type, ensuring the effectiveness and stability of the whole detection process under different noise types. The invention constructs a noise type classifier and a voice noise classifier: for noise type identification, a noise clustering and classification method based on t-SNE and random forests is provided; for distinguishing noisy speech from noise signals, a feature selection and classifier construction method based on random forests is provided.

A second object of the present invention is to provide an active speech detection system based on noise scene recognition, which is easy to implement and convenient to debug.

The first technical scheme adopted by the invention is as follows: an active voice detection method based on noise scene recognition comprises the following steps:

S1: extracting preferred features for the noise classification task from an audio signal, and inputting the preferred feature values into a noise type classifier to identify the noise type in the audio signal;

S2: determining, according to the noise type, the preferred features and the classifier suitable for the voice and noise classification task;

S3: extracting the feature values of the preferred features for the voice and noise classification task from the audio signal, inputting them into the voice noise classifier, and judging whether a voice signal exists in the audio signal.
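The control flow of steps S1 to S3 can be sketched as a two-stage dispatch. This is a minimal illustration, not the patented implementation: the function `detect_speech`, the toy feature `mean_abs` and the thresholds in `models` are all hypothetical stand-ins for the trained feature extractors and random forest classifiers described later.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Extractor = Callable[[Sequence[float]], Features]


def detect_speech(
    frame: Sequence[float],
    noise_features: Extractor,
    noise_classifier: Callable[[Features], str],
    per_noise_models: Dict[str, Tuple[Extractor, Callable[[Features], bool]]],
) -> Tuple[str, bool]:
    """Return (noise_type, speech_present) for one audio frame."""
    # S1: identify the noise type from noise-oriented preferred features.
    noise_type = noise_classifier(noise_features(frame))
    # S2: pick the feature extractor and classifier tuned for that noise type.
    extract, classify = per_noise_models[noise_type]
    # S3: decide speech vs. noise with the selected model.
    return noise_type, classify(extract(frame))


# Toy stand-ins (hypothetical): one feature, fixed thresholds per noise type.
def mean_abs(frame: Sequence[float]) -> Features:
    return [sum(abs(v) for v in frame) / len(frame)]


models = {
    "white": (mean_abs, lambda f: f[0] > 0.5),
    "other": (mean_abs, lambda f: f[0] > 0.2),
}
result = detect_speech([0.3, -0.3, 0.3, -0.3], mean_abs, lambda f: "other", models)
```

Keeping one (extractor, classifier) pair per noise type is what lets each speech/noise model use its own preferred feature subset.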

Preferably, the noise type classifier is constructed by t-SNE cluster analysis and random forest method.

Preferably, the noise type classifier is constructed by:

S1-1: constructing a noise signal library containing multiple types of noise signals;

S1-2: extracting the feature values of a plurality of audio features of each noise signal in the noise signal library using a time-frequency domain signal processing method;

S1-3: performing cluster analysis on the noise signals in the noise signal library with the t-SNE method, based on the feature values of the audio features;

S1-4: selecting a plurality of noise classification preferred features from the plurality of audio features with a random forest method;

S1-5: training a noise type classification model with a random forest method, based on the noise classification preferred features.

Preferably, the audio features include several or all of zero-crossing rate, MFCC, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral rolloff, harmonic ratio, fundamental frequency, frequency-domain energy, bandwidth, and wavelet components.

Preferably, the noise classification preferred features comprise one or more of the spectral centroid, wavelet singular value, wavelet energy and spectral rolloff features.

Preferably, the speech noise classifier is constructed by using a random forest method.

Preferably, the speech noise classifier is constructed by:

S3-1: adding different types of noise signals to clean speech to obtain the noisy speech and noise signals corresponding to each noise type;

S3-2: extracting the feature values of a plurality of audio features of each noisy speech signal and its corresponding noise signal using a time-frequency domain signal processing method;

S3-3: selecting, with a random forest method, the preferred features for the voice and noise classification task under each noise type from the plurality of audio features, based on the noisy speech signals and noise signals corresponding to that noise type;

S3-4: training, with a random forest method, a noisy speech and noise classification model for each noise type, based on the preferred features for the voice and noise classification task under that noise type.

Preferably, the noise types include white noise, car interior noise, fighter noise, and other noise.

The second technical scheme adopted by the invention is as follows: an active speech detection system based on noise scene recognition, comprising:

a first feature extraction unit, configured to extract preferred features for the noise classification task from the audio signal;

a noise classification identification unit, configured to identify the noise type in the audio signal through a noise type classifier according to the preferred feature values for the noise classification task;

a model selection unit, configured to determine, according to the noise type, the preferred features and the classifier suitable for the voice and noise classification task of the audio signal;

a second feature extraction unit, configured to extract from the audio signal the feature values of the preferred features for the voice and noise classification task;

and a voice detection unit, configured to judge whether a voice signal exists in the audio signal through the voice noise classifier according to the feature values of the preferred features for the voice and noise classification task.

Preferably, the noise type classifier is constructed by t-SNE cluster analysis and random forest method, and the voice noise classifier is constructed by adopting random forest method.

The beneficial effects of the technical scheme are that:

(1) The noise types and noise intensities in application scenes of active voice detection are complex and changeable, and existing detection methods rarely consider dynamic noise conditions. The invention therefore provides an effective method for active voice detection in a dynamic noise environment, ensuring the accuracy of voice detection under different noise types and noise intensities.

(2) A single audio feature can hardly represent the audio characteristics fully in active voice detection, and speech signals are non-stationary in noisy scenes. The invention therefore provides a time-frequency domain feature extraction method based on MFCC, wavelet decomposition, singular value decomposition and other methods, mining the characteristic information of the audio signal from multiple views.

(3) In a dynamic noise environment the noise type and intensity change, so features and classification models designed for a general scene rarely show stable and effective detection capability. The invention therefore constructs a noise type classifier based on t-SNE cluster analysis and random forest classification: N noise signals are clustered into M (M ≤ N) noise types with different characteristics by the t-SNE visual clustering method, and feature selection and classifier training are then performed on the M noise types with a random forest. In real-time voice detection, the dynamic, open noise environment can thus be converted into specific noise scenes for processing, further ensuring the accuracy of active voice detection.

(4) To overcome the limited modeling capability of a single classifier, the random forest method from ensemble learning is applied. Because the features that separate noisy speech from noise differ across noise types, random forest feature selection is used to select the most discriminative features under each type of noise, and a corresponding noisy speech and noise classification model is trained on the selected feature combination. Because the signal characteristics of different noise types are fully considered during modeling, the method copes effectively with dynamic noise environments and obtains stable voice detection capability.

(5) The analysis result of the test data verifies the effectiveness of the voice detection under the dynamic noise environment condition, and has good practical engineering application value.

(6) The analysis result of the test data verifies that the classification recognition accuracy of the method provided by the invention is higher than that of SVM, MLP and other methods.

Drawings

FIG. 1 is a flow chart of an active speech detection method based on noise scene recognition according to the present invention;

FIG. 2 is a block diagram of a training and use flow of a noise type classification model;

FIG. 3 is a block diagram of a training and use flow of noisy speech and noise classification models;

FIG. 4 is a graph of a visual result of clustering characteristic analysis of 6 kinds of noise based on t-SNE;

FIG. 5 is a graph of recognition accuracy corresponding to feature importance ranking and accumulated features for noise classification;

FIG. 6 is a diagram of the separability of features before, during, and after feature importance ranking for noise classification;

FIG. 7 is a graph of the correspondence between parameter values and classification accuracy in the training of the noise type classification model;

FIG. 8 is a graph of a test result confusion matrix for a noise type classifier;

FIG. 9 is a graph comparing recognition results of various classifiers under different noise environments and different signal to noise ratios;

FIG. 10 is a schematic diagram of an active speech detection system based on noise scene recognition according to the present invention.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following detailed description of the embodiments and the accompanying drawings are provided to illustrate the principles of the invention and are not intended to limit the scope of the invention, i.e. the invention is not limited to the preferred embodiments described, which is defined by the claims.

In the description of the present invention, it is to be noted that, unless otherwise indicated, the meaning of "plurality" means two or more; the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance; the specific meaning of the above terms in the present invention can be understood as appropriate by those of ordinary skill in the art.

Example 1

As shown in fig. 1, the embodiment discloses an active voice detection method based on noise scene recognition, which includes the following steps:

S1: Preferred features for the noise classification task are extracted from the audio signal, and the preferred feature values are input to a noise type classifier to identify the noise type in the audio signal.

As shown in fig. 2, the noise type classifier is constructed by:

For the task of distinguishing noise types, in order to capture discriminative information among different noise signals from multiple angles, 37-dimensional time-frequency domain features are extracted from them, including zero-crossing rate, MFCC, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral rolloff, harmonic ratio, fundamental frequency, frequency-domain energy, bandwidth and wavelet component features. The wavelet component features are the 8-dimensional energy features and 6-dimensional singular value features extracted from the wavelet components obtained by wavelet decomposition of the audio signal.

Specifically, in the feature calculation, the audio signal is decomposed into 8 wavelet components by a three-layer wavelet decomposition method, and the energy $E_{3j}$ of each component is taken as a feature, calculated as follows:

$$E_{3j} = \sum_{k=1}^{n} \left| x_{jk} \right|^{2}$$

where $S_{3j}$ is the reconstructed signal of the $j$-th component at the third layer, and $x_{jk}$ ($j = 0, 1, \dots, 7$; $k = 1, 2, \dots, n$) is the amplitude of the $k$-th discrete point of $S_{3j}$.

Meanwhile, a matrix formed by 8 wavelet components after wavelet decomposition is subjected to singular value decomposition, and the first 6 singular values are taken as characteristics.
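The 14 wavelet component features (8 energies plus 6 singular values) can be sketched as follows. Several choices here are assumptions rather than details from the text: the orthonormal Haar filter, a hand-written wavelet packet split to obtain 8 components from 3 layers, and the use of the packet coefficients in place of the reconstructed signals (for an orthonormal filter the component energies are the same either way). The names `haar_packet` and `wavelet_features` are hypothetical.

```python
import numpy as np


def haar_packet(x, levels=3):
    """Split x into 2**levels wavelet-packet coefficient arrays (orthonormal Haar)."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nxt = []
        for c in nodes:
            c = c[: len(c) // 2 * 2]                 # drop a trailing sample if odd
            even, odd = c[0::2], c[1::2]
            nxt.append((even + odd) / np.sqrt(2.0))  # low-pass branch
            nxt.append((even - odd) / np.sqrt(2.0))  # high-pass branch
        nodes = nxt
    return nodes


def wavelet_features(frame):
    """Return 8 component energies followed by the first 6 singular values."""
    comps = haar_packet(frame, levels=3)
    energies = np.array([np.sum(c ** 2) for c in comps])   # sum of squared amplitudes
    matrix = np.vstack(comps)                               # 8 rows, one per component
    singular_values = np.linalg.svd(matrix, compute_uv=False)[:6]
    return np.concatenate([energies, singular_values])


# One 20 ms frame at 8 kHz (160 samples) of synthetic noise, for illustration only.
frame = np.random.default_rng(0).normal(size=160)
feats = wavelet_features(frame)
```

With a 160-sample frame each of the 8 components has 20 coefficients, so the stacked matrix is 8 by 20 and always yields at least 6 singular values.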

The t-SNE method is a probability-based subspace embedding method. It embeds original data distributed in a high-dimensional space into a low-dimensional subspace and describes the similarity between pairs of data points with conditional probabilities instead of Euclidean distances, so that the local neighborhood structure of the data in the low-dimensional space stays as consistent as possible with that of the original high-dimensional data while the global clustering structure of the original data is also retained. Taking advantage of this ability to preserve high-dimensional global clustering structure in a low-dimensional space, the clustering relations among the various noise signals are analyzed visually, so that noises with similar time-frequency domain characteristics are grouped into one class for classification and identification, improving identification accuracy.

In the high-dimensional space, the similarity between a pair of data points $x_j$ and $x_i$ is the conditional probability $p_{j|i}$, which represents the probability that point $x_i$ selects point $x_j$ as its neighbor: a large $p_{j|i}$ means the two points are close neighbors, and a small one means they are far apart. Likewise, in the low-dimensional space, the conditional probability $q_{j|i}$ represents the similarity of the mapped pair $y_j$ and $y_i$. The core aim of embedding the high-dimensional data into the low-dimensional space is therefore to find the low-dimensional representation that minimizes the deviation between $q_{j|i}$ and $p_{j|i}$.

In the t-SNE algorithm, the degree of match between the conditional probabilities in the high- and low-dimensional spaces is measured by the Kullback-Leibler (K-L) divergence. To solve the crowding of low-dimensional data caused by the asymmetry of the K-L divergence, the probability distribution between data point pairs is modeled with a Gaussian distribution in the high-dimensional space and with the heavy-tailed t distribution in the low-dimensional space. This "stretching" mechanism drives the low-dimensional features toward "similar points gather, dissimilar points separate", which reduces the "stacking" that occurs when high-dimensional data are mapped to the low-dimensional space, improves the separability of different classes of data, and preserves the global clustering structure of the high-dimensional data as far as possible while preserving its local structure.
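To make the objective concrete, the sketch below computes Gaussian similarities in the high-dimensional space, Student-t similarities in the low-dimensional space, and the K-L divergence between them. It is a simplification under stated assumptions: a single fixed bandwidth `sigma` is used (real t-SNE tunes one per point from a perplexity setting), both distributions are row-normalized conditionals, and no gradient descent is performed. The data are synthetic and the function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(z):
    diff = z[:, None, :] - z[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def conditional_p(x, sigma=1.0):
    """Row i holds the probability that x_i picks each x_j as its neighbor (Gaussian)."""
    logits = -pairwise_sq_dists(x) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def conditional_q(y):
    """Low-dimensional similarities from a heavy-tailed Student-t kernel (1 d.o.f.)."""
    w = 1.0 / (1.0 + pairwise_sq_dists(y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum(axis=1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))


rng = np.random.default_rng(0)
# Two synthetic 5-D clusters standing in for two noise types.
x = np.vstack([rng.normal(0.0, 0.5, (10, 5)), rng.normal(10.0, 0.5, (10, 5))])
# A 2-D layout that keeps the clusters apart, and one that interleaves them.
y_good = np.vstack([rng.normal(0.0, 0.5, (10, 2)), rng.normal(10.0, 0.5, (10, 2))])
y_bad = y_good[np.arange(20).reshape(2, 10).T.ravel()]

p = conditional_p(x)
kl_good = kl_divergence(p, conditional_q(y_good))
kl_bad = kl_divergence(p, conditional_q(y_bad))
```

The layout that keeps same-cluster points together scores the lower divergence, which is the quantity a t-SNE optimizer drives down.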

In the feature selection method based on OOB data classification accuracy, the importance of each feature dimension is measured by the change in classification accuracy on the OOB data before and after that feature is perturbed. The specific process is as follows:

Based on a training data set of N samples, K classification decision trees are constructed, with the training data of each decision tree drawn at random from the total training data set by Bootstrap sampling. The importance of the $i$-th feature can then be calculated as follows:

(1) Denote the OOB data corresponding to the $k$-th decision tree $T_k$ as $D_k$;

(2) Classify the test data $D_k$ with decision tree $T_k$, and record the number of correctly recognized samples as $R_k$;

(3) Perturb the values of feature $X_i$ in the test data $D_k$, classify the perturbed test data $D_{k,i}$ with decision tree $T_k$, and record the number of correctly recognized samples as $R_{k,i}$;

(4) Repeat steps (1) to (3) for $k = 1, 2, \dots, K$, recording $R_k$ and $R_{k,i}$ each time;

(5) The importance of feature $X_i$ can then be calculated from these results:
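The formula for step (5) is not reproduced in this text. A minimal sketch of steps (1) to (5) follows, assuming the importance is the average drop in correctly recognized OOB samples, $\frac{1}{K}\sum_{k=1}^{K}(R_k - R_{k,i})$, and assuming that "disturbing" a feature means randomly permuting its values. It uses scikit-learn's `DecisionTreeClassifier` as the base tree; the function name `oob_importance` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def oob_importance(X, y, n_trees=50, seed=0):
    """Average drop in correct OOB predictions when each feature is permuted."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    total = np.zeros(d)
    for k in range(n_trees):
        boot = rng.integers(0, n, size=n)            # Bootstrap sample for tree T_k
        oob = np.setdiff1d(np.arange(n), boot)       # step (1): OOB data D_k
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=k)
        tree.fit(X[boot], y[boot])
        r_k = np.sum(tree.predict(X[oob]) == y[oob])        # step (2): R_k
        for i in range(d):
            X_perm = X[oob].copy()
            X_perm[:, i] = rng.permutation(X_perm[:, i])    # step (3): disturb X_i
            r_ki = np.sum(tree.predict(X_perm) == y[oob])   # R_{k,i}
            total[i] += r_k - r_ki
    return total / n_trees                           # step (5), assumed averaging


# Synthetic check: only feature 0 carries class information.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
scores = oob_importance(X_demo, y_demo, n_trees=20)
```

Sorting the features by `scores` in descending order gives the kind of importance ranking that the feature selection step relies on.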

Extracting audio features with different time-frequency domain methods describes the audio more comprehensively from multiple views. However, while multi-view features provide richer audio information, their high dimensionality and complex data structure increase the burden on the subsequent classification algorithm, and correlations among the high-dimensional data introduce redundant information that can obscure the effective features. The invention therefore adopts a feature selection method based on random forests. The selected low-dimensional features comprise only the 7 most important features, which removes redundant features while still describing the separability between different noise signals and between speech and noise signals, thereby improving the classification accuracy between different types of noise signals and between speech and noise signals.

S1-5: Based on the noise classification preferred features, a noise type classification model is trained with a random forest method. For the noise classification task, a noise type classification model (a random forest classifier) is trained to form the noise type classifier: the noises are grouped into classes by the noise clustering, and the features with better separability are selected from the frequency domain features by the feature selection to take part in model training and verification testing. For the different noise types, a training data set for the noise type classification model is constructed from the well-separable low-dimensional features, and the noise type classification model is trained on this training data set to form the noise type classifier.

Random forest (RF) is an ensemble learning method that adopts the Bagging strategy. It is an ensemble classifier composed of multiple decision-tree base classifiers, and the final classification result is determined jointly by the votes of all decision trees. Multiple weak classifiers are thus integrated into one strong classifier, giving better classification performance than a single decision tree.

The training of the noise type classification model comprises the following specific steps:

(1) building training data sets

The training data set of each decision tree is drawn at random, in a certain proportion, from the total training data set of N samples by the Bootstrap resampling method. The data left over after each draw are called out-of-bag data (OOB data) and serve as the test data for that decision tree; the performance of each classifier is evaluated by its test error on its OOB data.

(2) Selecting optimal characteristic combination for node branching

During the training of each decision tree, features of some dimensions are randomly drawn from all the feature data, and the optimal feature combination is selected for node branching according to the principle of maximizing Gini gain.

(3) Voting

The output results of all decision trees are collected and put to a vote, and the category with the most votes is the final decision result of the model.
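Steps (2) and (3) can be illustrated with two small helpers: one scoring a candidate split by Gini gain, and one taking the majority vote over tree outputs. This is a sketch only; the single-threshold split on one feature and the names `gini`, `gini_gain` and `majority_vote` are illustrative assumptions, not details taken from the text.

```python
from collections import Counter

import numpy as np


def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))


def gini_gain(feature, labels, threshold):
    """Impurity reduction from splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0                      # degenerate split separates nothing
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def majority_vote(tree_outputs):
    """Final decision: the class predicted by the most trees."""
    return Counter(tree_outputs).most_common(1)[0][0]


feature = np.array([0.1, 0.2, 0.8, 0.9])
labels = np.array([0, 0, 1, 1])
gain = gini_gain(feature, labels, threshold=0.5)   # perfectly separating split
winner = majority_vote(["white", "f16", "white"])
```

At each node a tree would evaluate such gains only over its randomly drawn feature subset and branch on the candidate with the largest gain.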

The performance evaluation indexes of the trained noise type classification model, namely the noise type classifier, comprise the precision rate (PR), the recall rate (RR), the F1 score (F1-score) and the accuracy rate (ACC), which are defined as follows:

$$\mathrm{PR} = \frac{TP}{TP + FP}, \qquad \mathrm{RR} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \cdot \mathrm{PR} \cdot \mathrm{RR}}{\mathrm{PR} + \mathrm{RR}}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP represents the number of positive samples identified as positive samples, FP represents the number of negative samples identified as positive samples, TN represents the number of negative samples identified as negative samples, and FN represents the number of positive samples identified as negative samples.
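As a worked example of the four indexes, the sketch below computes them from confusion counts using the standard definitions, with FN taken as positive samples identified as negative. The function name `classification_metrics` and the counts are illustrative and are not results from the invention.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision (PR), recall (RR), F1 score and accuracy (ACC) from confusion counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    rr = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rr / (pr + rr) if pr + rr else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"PR": pr, "RR": rr, "F1": f1, "ACC": acc}


# Illustrative counts: 8 hits, 2 false alarms, 9 correct rejections, 1 miss.
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
```

For a multi-class noise type classifier, the same function can be applied per class by treating that class as positive and all others as negative.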

As shown in fig. 2, the noise type classifier operates in the usage phase as follows:

(1) extracting preferred features from the input real-time audio signal according to a feature directory which is preferably selected in a training stage and faces to a noise classification task;

(2) the extracted preferred feature values for the noise-oriented classification task are input to a noise type classifier to identify the noise type in the audio signal.

S2: according to the noise type, determining the preferred characteristics and the classifier applicable to the voice and noise-oriented classification task of the audio signal;

S3: Extracting the preferred feature values for the voice and noise classification task from the audio signal, inputting them into the voice noise classifier, and judging whether a voice signal exists in the audio signal. As shown in fig. 3, the voice noise classifier is constructed by:

Aiming at different noise types, constructing a training data set of a noise-containing voice and noise classification model based on the low-dimensional characteristics with good separability; training a noise-containing voice and noise classification model (random forest classifier) according to the training data set, wherein the specific steps are as follows:

(1) building training data sets

(2) Selecting optimal characteristic combination for node branching

(3) Voting

The performance evaluation indexes of the trained noisy speech and noise classification model, namely the speech noise classifier, comprise the Precision Rate (PR), the Recall Rate (RR), the F1 score (F1-score) and the accuracy rate (ACC).

The operation of the voice noise classifier in the use phase is as follows:

(1) based on the noise type, preferred features and classifiers suitable for the speech and noise oriented classification tasks of the audio signal are determined.

(2) Extracting the characteristic value of the preferred characteristic facing the voice and noise classification task from the audio signal, inputting the characteristic value of the preferred characteristic facing the voice and noise classification task into a voice noise classifier, and judging whether the voice signal exists in the audio signal.

The core of active voice detection is to distinguish noisy speech signals from noise signals effectively. In practical applications, the noise types in the background environment of the speech are complex and changeable, and the features that distinguish noisy speech from noise differ from one noise to another, so it is difficult to obtain the best recognition result under all noise types with a single unified set of features and a single classifier.

The following will analyze the actual effects of the present invention in conjunction with specific application examples:

1. audio data source

In the case analysis of the present invention, the speech signals are the audio of 30 different speakers (15 male and 15 female) randomly selected from the THCHS-30 data set. The noise signals are 6 kinds of noise selected from the NOISEX-92 standard noise library as analysis objects: white noise (white), restaurant noise (babble), factory noise (factory2), car noise (volvo), tank noise (m109) and fighter noise (f16).

2. Construction of noise type classifier based on t-SNE cluster analysis and random forest

(1) Time-frequency domain feature extraction

The 6 noise signals are first uniformly resampled to 8 kHz and then framed with a frame length of 20 ms and a frame shift of 10 ms, and 37-dimensional time-frequency domain features are extracted. The correspondence between feature dimensions and feature names is shown in Table 1.

TABLE 1 Audio feature dimension and feature name correspondence

Dimension	Feature name	Dimension	Feature name	Dimension	Feature name
1	Zero-crossing rate	18	Spectral flux	22	Frequency-domain energy
2-14	MFCC	19	Spectral rolloff	23	Bandwidth
15-16	Spectral centroid and spectral spread	20	Harmonic ratio	24-31	Wavelet energy
17	Spectral entropy	21	Fundamental frequency	32-37	Wavelet singular values
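The framing step (20 ms frames with a 10 ms shift at 8 kHz) and the first feature in Table 1 can be sketched as follows. The sketch assumes the signal has already been resampled to 8 kHz; the function names and the 1 kHz test tone are illustrative only.

```python
import numpy as np


def frame_signal(x, sr=8000, frame_ms=20, hop_ms=10):
    """Slice a 1-D signal into overlapping frames (rows)."""
    frame_len = sr * frame_ms // 1000      # 160 samples at 8 kHz
    hop = sr * hop_ms // 1000              # 80 samples at 8 kHz
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.sign(frames)
    signs[signs == 0] = 1                  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


t = np.arange(8000) / 8000.0               # one second at 8 kHz
tone = np.sin(2 * np.pi * 1000.0 * t)      # synthetic 1 kHz tone
frames = frame_signal(tone)
zcr = zero_crossing_rate(frames)
```

Each row of `frames` is one analysis frame, so every per-frame feature in Table 1 would be computed row by row in the same way.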

(2) Clustering the characteristic values by adopting a t-SNE clustering characteristic analysis method

The results of the t-SNE visual cluster analysis of the features of the 6 noises are shown in FIG. 4. As can be seen from the figure, the 6 noises form 4 clusters: the features of babble, factory and m109 are clustered together, while volvo, f16 and white each form a cluster of their own. Therefore, under the high-dimensional features extracted by the invention, the three noise environments babble, factory and m109 can be regarded as one type of noise, while volvo, f16 and white are each regarded as a separate type, and these four types of noise can then be identified and classified.
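As a rough illustration of this kind of analysis, the sketch below projects 37-dimensional feature vectors to two dimensions with scikit-learn's `TSNE`. The data are synthetic Gaussian blobs standing in for real noise features, and the perplexity and initialization are assumptions; the patent does not state the settings it used.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_per_class, n_dims = 50, 37

# Synthetic stand-in for per-frame features of four noise types (assumption).
centers = rng.normal(scale=5.0, size=(4, n_dims))
features = np.vstack([c + rng.normal(size=(n_per_class, n_dims)) for c in centers])
labels = np.repeat(np.arange(4), n_per_class)

# Project to 2-D; in practice the points would be plotted and colored by label
# to judge visually which noise types fall into the same cluster.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(features)
```

Noise types whose points overlap in the 2-D plot are candidates for merging into a single noise scene.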

(3) Feature optimization method for noise classification recognition task by adopting random forest method

For the task of identifying and classifying multiple different noises, 1500 groups of feature samples are extracted for each class of noise during feature optimization, and each feature sample has 37 dimensions. According to the cluster analysis of the noise, babble, factory and m109 have similar features and are grouped into one class, for which 500 groups of feature samples are extracted from each of the three noises to form 1500 groups. This gives a 4 × 1500 × 37 data set, in which two thirds of the samples in each class are randomly selected as model training data and the remaining third is used as test data. The importance ranking of each feature dimension is shown as a bar graph in FIG. 5, and the test recognition accuracy obtained with the cumulatively added features is shown as a line graph in FIG. 5. As can be seen from the figure, the 7 features most useful for distinguishing different types of noise are the 15th, 37th, 29th, 19th, 31st, 35th and 30th dimensions; with these 7 features, the classification accuracy for the different noises reaches 99.55%. Adding the 17th, 2nd, 23rd, 33rd and 34th dimensions in turn brings no further improvement; only when the 13th-ranked feature is added does the accuracy reach its maximum of 99.7%, after which adding more features does not improve the recognition accuracy. Therefore, in the task of distinguishing different kinds of noise, to balance recognition accuracy against the timeliness of detection and recognition, the preferred low-dimensional feature set contains only the top 7 features by importance. Referring to Table 1, only the spectrum centroid, wavelet singular value, wavelet energy and spectrum edging features need to be calculated in real-time detection.
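The ranking-then-accumulate procedure behind FIG. 5 can be sketched as follows. This is only an illustration on synthetic data in which the first five columns are made informative by construction; the number of trees and the use of impurity-based `feature_importances_` are assumptions, since the patent does not say how importance was measured.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_classes, n_per_class, n_dims = 4, 150, 37
y = np.repeat(np.arange(n_classes), n_per_class)
X = rng.normal(size=(y.size, n_dims))
X[:, :5] += 3.0 * y[:, None]  # only columns 0-4 carry class information (assumption)

# Two thirds for training, one third for testing, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)

# Rank all features by importance (bar graph of FIG. 5).
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
ranking = np.argsort(forest.feature_importances_)[::-1]

# Test accuracy with the top-k features added cumulatively (line graph of FIG. 5).
cumulative_accuracy = []
for k in range(1, 11):
    cols = ranking[:k]
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_tr[:, cols], y_tr)
    cumulative_accuracy.append(model.score(X_te[:, cols], y_te))
```

The preferred feature set is then the shortest prefix of `ranking` after which the accuracy curve flattens.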

To further verify the validity of the random forest feature optimization result, FIG. 6 shows the value distributions of 2 features each from the top, middle and bottom of the importance ranking. It is evident from FIG. 6 that the top-ranked features are clearly separable among the 4 kinds of noise, the middle-ranked features can distinguish only some of the noises, and the bottom-ranked features cannot distinguish the different noises. The preferred features obtained with the random forest therefore do have better separability and a clear advantage in distinguishing different noises.

(4) Training noise type classification model

For the noise classification and recognition task, a noise type classification model (random forest classifier) is trained to form the noise type classifier. Through the noise clustering and feature optimization described above, the 6 kinds of noise are finally grouped into 4 noise scenes, namely babble\factory\m109, volvo, f16 and white, and the 7 features with better separability selected from the 37-dimensional time-frequency domain features take part in model training and verification testing. The training data comprise 1500 groups of samples for each noise scene, 6000 groups in total; the test data comprise 1500 groups of samples for each noise scene, 6000 groups in total.
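The grouping of 6 noises into 4 scenes, and the even split of a scene's sample budget over its member noises, can be written down as a small lookup. The class indices follow Table 2; the names `NOISE_TO_SCENE` and `samples_per_noise` are illustrative only.

```python
from collections import Counter

# Scene indices follow Table 2: class0 merges babble, factory and m109.
NOISE_TO_SCENE = {
    "babble": 0, "factory": 0, "m109": 0,
    "volvo": 1, "white": 2, "f16": 3,
}

def samples_per_noise(samples_per_scene=1500):
    """Split each scene's sample budget evenly over the noises merged into it."""
    members = Counter(NOISE_TO_SCENE.values())
    return {noise: samples_per_scene // members[scene]
            for noise, scene in NOISE_TO_SCENE.items()}

quota = samples_per_noise()
```

With the default budget, each of the three merged noises contributes 500 groups and each stand-alone noise contributes 1500, so every scene is represented equally.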

During model training, to avoid overfitting and to ensure that the model generalizes to future data, the invention adopts 5-fold cross-validation. To obtain the parameter combination that maximizes the training and verification test accuracy, the invention uses grid search for parameter optimization. Considering how strongly each parameter influences the performance of the random forest model, three parameters are mainly optimized: the number of trees n_estimators, the maximum tree depth max_depth, and the minimum number of samples per leaf node min_samples_leaf. The search ranges are n_estimators in [10:10:100], max_depth in [2:1:10] and min_samples_leaf in [1:1:5]. The parameter optimization results are shown in FIG. 7; the optimal values are finally determined to be n_estimators=20, max_depth=9 and min_samples_leaf=1.
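A grid search with 5-fold cross-validation over these three parameters might look like the sketch below, using scikit-learn's `GridSearchCV`. The `full_grid` reproduces the ranges stated above; the much smaller `demo_grid` and the synthetic 7-dimensional data are assumptions made only to keep the example fast, and the patent does not name the library it used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search ranges as stated in the text: [10:10:100], [2:1:10], [1:1:5].
full_grid = {
    "n_estimators": list(range(10, 101, 10)),
    "max_depth": list(range(2, 11)),
    "min_samples_leaf": list(range(1, 6)),
}

# Reduced grid so the example finishes quickly (assumption, not the patent's grid).
demo_grid = {"n_estimators": [10, 20], "max_depth": [3, 9], "min_samples_leaf": [1, 5]}

# Synthetic 4-class data with 7 features standing in for the preferred features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 60)
X = rng.normal(size=(y.size, 7)) + y[:, None]

search = GridSearchCV(RandomForestClassifier(random_state=0), demo_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
best = search.best_params_
```

Passing `full_grid` instead of `demo_grid` would evaluate all 450 parameter combinations, each with 5 folds.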

Under the parameter settings obtained by grid search, training and verification testing are run 5 times, and the average of the 5 runs is taken as the final result. On the overall data, the training accuracy is 99.81% and the test accuracy is 98.97%. The per-class performance indexes of the noise type classifier are shown in Table 2, and its confusion matrix on the test data of the 4 noise types is shown in FIG. 8. The noise type classifier clearly has good noise recognition accuracy and generalizes well to unseen test data. In actual use, the accuracy of noise recognition can be further ensured by a voting decision over the test results of multiple segments of continuous audio.
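The multi-segment voting decision can be sketched as a simple majority vote. The patent does not specify the voting rule, so the plain majority and the tie-breaking behavior below are assumptions, and `vote_noise_type` is a hypothetical helper name.

```python
from collections import Counter

def vote_noise_type(segment_predictions):
    """Return the most frequent label; ties go to the label seen first."""
    if not segment_predictions:
        raise ValueError("need at least one segment prediction")
    return Counter(segment_predictions).most_common(1)[0][0]

# Per-segment outputs of the noise type classifier over continuous audio
# (made-up values for illustration).
decision = vote_noise_type(["white", "f16", "white", "white", "volvo"])
```

A single misclassified segment is outvoted as long as most segments of the same recording are classified correctly.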

TABLE 2 Performance indexes of the noise type classifier of the present invention

Category	PR	RR	F1-score
class0: babble\factory\m109	0.974	0.988	0.980
class1: volvo	1	1	1
class2: white	1	1	1
class3: f16	0.988	0.974	0.980

The performance evaluation indexes of the noise type classifier comprise the precision (PR), the recall (RR), the F1 score (F1-score) and the accuracy (ACC), which are specifically defined as follows:
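The formulas themselves are not reproduced in this text. Assuming the standard definitions (precision as correct predictions of a class over all predictions of that class, recall as correct predictions over all true members of the class, F1 as their harmonic mean, accuracy as overall fraction correct), the indexes can be computed from a confusion matrix as in this sketch. The function name and the example matrix are illustrative and are not the values of FIG. 8.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[i][c] for i in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        pr = tp / predicted if predicted else 0.0
        rr = tp / actual if actual else 0.0
        f1 = 2 * pr * rr / (pr + rr) if pr + rr else 0.0
        metrics.append({"PR": pr, "RR": rr, "F1": f1})
    return metrics, correct / total

# Made-up two-class confusion matrix for illustration.
metrics, acc = per_class_metrics([[8, 2], [1, 9]])
```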

3. Speech noise classifier based on random forest

(1) Time-frequency domain feature extraction

First, the clean speech is mixed with each of the 6 noise signals at signal-to-noise ratios of 10 dB, 5 dB, 0 dB and -5 dB. The noisy speech and the noise sample signals are then uniformly resampled to 8 kHz and divided into frames with a frame length of 20 ms and a frame shift of 10 ms, and 37-dimensional time-frequency domain features are extracted; the correspondence between feature dimensions and feature names is shown in Table 1.
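One common way to add noise at a target signal-to-noise ratio is to scale the noise so that the ratio of average speech power to average noise power matches the target. The sketch below follows that approach; the patent does not describe its mixing procedure, so this is an assumption, and `mix_at_snr` and the random test signals are illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals the target SNR."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(p_speech / (gain**2 * p_noise)), solved for gain.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise, gain

# Random signals standing in for clean speech and a noise recording.
rng = np.random.default_rng(0)
speech = rng.normal(size=8000)
noise = 3.0 * rng.normal(size=8000)
noisy_by_snr = {snr: mix_at_snr(speech, noise, snr)[0] for snr in (10, 5, 0, -5)}
```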

(2) Speech feature optimization for noisy speech and noise classification tasks

In the task of classifying noisy speech and noise, the degree to which noise and speech are confused differs with the signal-to-noise ratio, so the preferred features for distinguishing them also differ. For a given noise scene, features are therefore optimized separately at each signal-to-noise ratio.

For the task of classifying noisy speech and noise under the 4 noise scenes, 3000 groups of feature samples are extracted during feature optimization from each class of noise and from its noisy speech at each of the four signal-to-noise ratios of 10 dB, 5 dB, 0 dB and -5 dB, and each feature sample has 37 dimensions. For the first noise scene (babble, factory and m109), 1000 groups of feature samples are extracted from each noise and from its noisy speech to form the 3000 groups. This gives a 2 × 3000 × 37 data set at each signal-to-noise ratio of each noise scene, in which half of the samples in each category are randomly selected as random forest training data and the remaining half are used as test data. The Top 10 speech feature optimization results at each signal-to-noise ratio of each noise class are summarized in Table 3. The table shows that, in all 4 noise scenes, the preferred feature dimensions at 10 dB, 5 dB and 0 dB overlap to a large extent, which indicates that these features not only distinguish noisy speech from noise at these signal-to-noise ratios but are also robust to changes in operating conditions; the preferred features obtained at -5 dB, by contrast, differ considerably from those obtained at the first 3 signal-to-noise ratios. Therefore, in each noise scene, the invention takes the union of the preferred features at 10 dB, 5 dB and 0 dB as the final feature sequence for training one model and trains a separate model at -5 dB, with grid search used for model parameter optimization in every training run.
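Taking the union of the preferred features at the three higher signal-to-noise ratios can be sketched as an order-preserving merge. The rankings below are placeholders invented for illustration; they are not the values of Table 3, and `ordered_union` is a hypothetical helper.

```python
def ordered_union(*rankings):
    """Merge ranked feature lists, keeping the first occurrence of each dimension."""
    seen, merged = set(), []
    for ranking in rankings:
        for dim in ranking:
            if dim not in seen:
                seen.add(dim)
                merged.append(dim)
    return merged

# Placeholder rankings for one noise scene, keyed by SNR in dB (NOT Table 3 values).
top_by_snr = {10: [3, 5, 8], 5: [5, 3, 12], 0: [8, 12, 20], -5: [30, 31, 3]}

# One feature set shared by the 10, 5 and 0 dB model, one for the -5 dB model.
high_snr_features = ordered_union(top_by_snr[10], top_by_snr[5], top_by_snr[0])
low_snr_features = list(top_by_snr[-5])
```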

TABLE 3 details of preferred results for speech characteristics of noisy speech and noise classification in different noise environments

(3) Training of noisy speech and noise classification models

For the task of classifying noisy speech and noise, a noisy speech and noise classification model (random forest classifier) is trained to form the speech noise classifier and is then tested. In each noise scene, one model is trained jointly for the three signal-to-noise ratios of 10 dB, 5 dB and 0 dB, with training features selected according to the feature optimization results in Table 3 and with 1500 groups of samples each for training and testing. A separate model is trained for the -5 dB signal-to-noise ratio, with training features selected according to the speech feature optimization results of the preceding section and with 1500 groups of samples each for training and testing. To verify the advantage of the random forest in classifying noisy speech and noise, the invention also trains an SVM model and a two-layer perceptron (MLP) model on the same training and test samples; all three models use grid search to tune their parameters. The recognition accuracy of each classifier is listed in Table 4 and compared in FIG. 9. The speech noise classifier (random forest) achieves the best recognition accuracy across the different noise scenes and signal-to-noise ratios, while the SVM and MLP classifiers perform comparably to each other. For all noise types, the classification accuracy of the speech noise classifier exceeds 95% when the signal-to-noise ratio is not lower than 5 dB. At 0 dB, the accuracy is above 96% under volvo and white noise and above 91% under f16 noise, and drops to 85.3% under class0 noise. When the signal-to-noise ratio falls further to -5 dB, the recognition accuracy generally drops considerably, and a speech noise reduction algorithm needs to be combined to ensure the accuracy of voice detection.

TABLE 4 recognition result list of different classifiers under different noise environments and different signal to noise ratios

Example 2

The active speech detection method based on noise scene recognition in embodiment 1 can be implemented by the following active speech detection system.

As shown in fig. 10, an active speech detection system based on noise scene recognition includes:

the noise classification and identification unit is used for identifying the noise type in the audio signal through a noise type classifier according to the preferable characteristics facing the noise classification task;

the second feature extraction unit is used for extracting feature values of preferred features facing the voice and noise classification task from the audio signal;

and the voice detection unit is used for judging whether a voice signal exists in the audio signal through the voice noise classifier according to the characteristic value of the preferable characteristic facing the voice and noise classification task.

The noise type classifier is constructed by a t-SNE cluster analysis and random forest method, and the voice noise classifier is constructed by adopting a random forest method.
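The way these units cooperate can be sketched as a two-stage dispatch: the noise type is identified first, and the result selects which preferred features and which speech noise classifier are applied. The class names, the 0-based feature indices and the threshold-based stand-in classifiers below are all assumptions for illustration; in the patent both stages are random forest classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

@dataclass
class SceneModel:
    """Preferred features and speech/noise classifier for one noise scene."""
    feature_dims: Sequence[int]                    # 0-based indices (assumption)
    is_speech: Callable[[Sequence[float]], bool]

@dataclass
class ActiveVoiceDetector:
    noise_dims: Sequence[int]                      # preferred features for noise typing
    classify_noise: Callable[[Sequence[float]], str]
    scene_models: Dict[str, SceneModel]

    def detect(self, features: Sequence[float]) -> bool:
        # Stage 1: identify the noise scene from the noise-oriented features.
        noise_type = self.classify_noise([features[i] for i in self.noise_dims])
        # Stage 2: apply the scene-specific features and classifier.
        scene = self.scene_models[noise_type]
        return scene.is_speech([features[i] for i in scene.feature_dims])

# Stand-in classifiers (simple thresholds) used only to exercise the dispatch.
detector = ActiveVoiceDetector(
    noise_dims=[0],
    classify_noise=lambda v: "white" if v[0] > 0 else "volvo",
    scene_models={
        "white": SceneModel(feature_dims=[1], is_speech=lambda v: v[0] > 0.5),
        "volvo": SceneModel(feature_dims=[2], is_speech=lambda v: v[0] > 0.1),
    },
)
speech_in_white = detector.detect([1.0, 0.9, 0.0])
speech_in_volvo = detector.detect([-1.0, 0.9, 0.0])
```

Keeping the scene models in a dictionary keyed by noise type means a new noise scene can be supported by adding one entry without touching the detection logic.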

In view of the complex and changeable noise types and noise intensities in the application scenarios of active voice detection, and of the fact that existing detection methods rarely consider dynamic noise conditions, the invention provides an effective method and system for active voice detection in dynamic noise environments, which effectively ensures the accuracy of voice detection under different noise types and noise intensities.

In view of the problem that a single audio feature can hardly represent audio characteristics fully and comprehensively in active voice detection, and of the non-stationary nature of speech signals in noise scenes, a time-frequency domain feature extraction method based on MFCC, wavelet decomposition, singular value decomposition and other methods is provided, so that the characteristic information of the audio signal is mined from multiple views.

In a dynamic noise environment, the noise type and noise intensity are changeable, so features and classification models designed for general scenes can hardly show stable and effective detection capability. To address this, a noise type classifier based on t-SNE cluster analysis and random forest classification is constructed: N noise signals are clustered by the t-SNE visual clustering method into M (M ≤ N) noise types with different characteristics, and feature selection and classifier training are then carried out on the M noise types with a random forest. In real-time voice detection, the dynamic, open noise environment can thus be converted into specific noise scenes for processing, which further ensures the accuracy of active voice detection.

To overcome the limited modeling capability of a single classifier, the random forest method from ensemble learning is applied. Because the features that separate noisy speech from noise differ between noise types, random forest feature selection is used to find the most discriminative features under each type of noise, and a corresponding noisy speech and noise classification model is trained on the optimized feature combination. Because the signal characteristics of the different noise types are fully considered in modeling, the method can cope effectively with dynamic noise environments and obtains stable voice detection capability.

The analysis of the test data verifies the effectiveness of the invention for voice detection in dynamic noise environments and shows that it has good practical engineering value; it also verifies that the classification accuracy of the proposed method is higher than that of methods such as SVM and MLP.

While the invention has been described with reference to preferred embodiments, various modifications may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In particular, the technical features mentioned in the respective embodiments may be combined in any manner as long as there is no structural conflict. The invention is not intended to be limited to the particular embodiments disclosed herein, but includes all embodiments falling within the scope of the appended claims. Parts of the invention not described in detail are well known to those skilled in the art.

Claims

1. The active voice detection method based on noise scene recognition is characterized by comprising the following steps of:

s3: extracting the preferred features of the voice-oriented and noise-oriented classification tasks from the audio signals, inputting the feature values of the preferred features of the voice-oriented and noise-oriented classification tasks into a voice noise classifier, and judging whether voice signals exist in the audio signals;

the noise type classifier is constructed through t-SNE cluster analysis and a random forest method; the speech noise classifier is constructed by using a random forest method.

2. The active speech detection method of claim 1, wherein the noise type classifier is constructed by:

3. The active speech detection method of claim 2, wherein the audio features include any or all of zero-crossing rate, MFCC, spectral centroid, spectral spread, spectral entropy, spectral flux, spectrum edging, harmonic ratio, fundamental frequency, frequency domain energy, bandwidth, and wavelet components.

4. The active speech detection method of claim 2, wherein the noise classification preference feature comprises one or more of a spectral centroid, wavelet singular values, wavelet energy, and spectrum edging feature.

5. The active speech detection method of claim 1, wherein the speech noise classifier is constructed by:

6. The method of claim 1, wherein the noise types include white noise, car noise, fighter noise, and other noise.

7. An active speech detection system based on noise scene recognition, comprising:

the noise classification identification unit is used for identifying the noise type in the audio signal through a noise type classifier according to the optimized characteristic value facing the noise classification task; the noise type classifier is constructed by t-SNE cluster analysis and a random forest method;

the voice detection unit is used for judging whether a voice signal exists in the audio signal or not through a voice noise classifier according to the characteristic value of the preferable characteristic facing the voice and noise classification task; the speech noise classifier is constructed by using a random forest method.