CN111933175A

CN111933175A - Active voice detection method and system based on noise scene recognition

Info

Publication number: CN111933175A
Application number: CN202010783583.5A
Authority: CN
Inventors: 田野
Original assignee: Third Research Institute Of China Electronics Technology Group Corp; Beijing Zhongdian Huisheng Technology Co ltd
Current assignee: Third Research Institute Of China Electronics Technology Group Corp; Beijing Zhongdian Huisheng Technology Co ltd
Priority date: 2020-08-06
Filing date: 2020-08-06
Publication date: 2020-11-13
Anticipated expiration: 2040-08-06
Also published as: CN111933175B

Abstract

The invention discloses an active voice detection method based on noise scene recognition. The method extracts preferred features for a noise classification task from an audio signal and inputs the feature values into a noise type classifier to recognize the noise type in the audio signal; determines, according to the noise type, the preferred features and the classifier suited to the speech and noise classification task; and extracts the feature values of those preferred features from the audio signal, inputs them into the speech noise classifier, and judges whether a speech signal is present in the audio signal. The invention also discloses an active voice detection system based on noise scene recognition. Because the method detects and identifies the current noise type before performing the binary classification of noisy speech and noise, it can select the most discriminative feature combination and design the model parameters for each specific noise type, which ensures that the whole detection process remains effective and stable under different noise types.

Description

Active voice detection method and system based on noise scene recognition

Technical Field

The invention relates to the technical field of voice data processing, in particular to a method and a system for detecting active voice based on noise scene recognition.

Background

A segment of speech often contains pauses and interruptions. These silent segments, superposed with environmental noise, form signal portions that carry no useful speech information; they occupy considerable data transmission resources and interfere with speech signal processing. The objective of the Voice Activity Detection (VAD) technique is to detect the real speech segments in a signal and remove these non-speech portions, so as to reduce the burden on subsequent speech signal processing.

Because active voice detection is widely needed, researchers have proposed many detection methods in recent years, which can be divided into unsupervised and supervised methods. Unsupervised methods center on the design of features and thresholds and on the formulation of threshold rules; typical features include short-time energy, short-time zero-crossing rate and spectral entropy. Their performance drops markedly in noisy environments, so they are usually combined with a noise reduction algorithm. Supervised methods treat active voice detection as a binary classification problem between speech signals and noise signals; because they learn from noise data in advance, they perform better than unsupervised methods in noisy environments.

A supervised method mainly comprises two stages: feature extraction and classifier design. For feature extraction, in order to distinguish the acoustic characteristics of noise and speech effectively, researchers extract high-dimensional features from different angles, such as energy features, zero-crossing rate features, Mel Frequency Cepstrum Coefficient (MFCC) features, fuzzy entropy features, autocorrelation coefficient features and wavelet coefficient features, and combine several features to exploit information from multiple angles. Although such feature combinations can separate noise from speech under a specific noise type, noise types vary over time in practical applications, so a feature combination designed for the general case often fails to discriminate stably in a dynamic noise scene, and the high dimensionality of the features often burdens the subsequent classifier.

For classifier design, in order to construct a binary classification model of noise and speech signals, the prior art commonly uses detection methods based on MFCC features and a Support Vector Machine (SVM), on fuzzy entropy features and an SVM, and on a multi-layer perceptron (MLP). For classifier selection, as machine learning has developed, ensemble learning and deep learning methods have emerged in recent years to address the limited data modeling capability of a single classifier; they improve the generalization of the classifier model by increasing the breadth and depth of modeling. For classifier modeling strategy, supervised methods model noise and speech signals of specific types separately during training, and thereby detect speech signals in an audio stream.

These classifier designs, classifier selections and modeling strategies perform well under a single noise type. However, because noise types vary, the modeling parameters suited to different noise characteristics are not the same, so how to train a classifier that discriminates well under many different noise types is a problem that urgently needs to be solved.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide an active voice detection method based on noise scene recognition. Before performing the binary classification of noisy speech and noise signals, the method detects and recognizes the current noise type, converting the dynamic noise environment into a limited set of noise environments. It can then select, from the high-dimensional features, the most discriminative feature combination for the specific noise type and design model parameters for that noise type, which ensures that the whole detection process remains effective and stable under different noise types. The invention constructs a noise type classifier and a speech noise classifier: for noise type identification, it provides a noise clustering and classification method based on t-SNE and random forests; for distinguishing noisy speech from noise signals, it provides a feature selection and classifier construction method based on random forests.

A second object of the present invention is to provide an active speech detection system based on noise scene recognition, which is easy to implement and convenient to debug.

The first technical scheme adopted by the invention is as follows: a method for detecting active voice based on noise scene recognition comprises the following steps:

S1: extracting preferred features for a noise classification task from an audio signal, and inputting the feature values of the preferred features into a noise type classifier to identify the noise type in the audio signal;

S2: determining, according to the noise type, the preferred features and the classifier suited to the speech and noise classification task;

S3: extracting the feature values of the preferred features for the speech and noise classification task from the audio signal, inputting them into the speech noise classifier, and judging whether a speech signal is present in the audio signal.
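The three steps S1 to S3 amount to a two-stage dispatch: recognize the noise type first, then apply the feature set and classifier trained for that noise type. The following is a minimal sketch of that control flow only. The function `detect_speech`, the single toy feature `mean_abs`, the thresholds and the label strings are all hypothetical stand-ins chosen for illustration; the patent's actual classifiers are random forests over the preferred features.

```python
from typing import Callable, Dict, List, Sequence

FeatureExtractor = Callable[[Sequence[float]], List[float]]
Classifier = Callable[[List[float]], str]


def detect_speech(
    frame: Sequence[float],
    noise_features: FeatureExtractor,
    noise_classifier: Classifier,
    per_noise_models: Dict[str, tuple],
) -> bool:
    """Two-stage decision: recognize the noise type, then run the matching model."""
    # S1: preferred features for noise classification -> noise type
    noise_type = noise_classifier(noise_features(frame))
    # S2: look up the feature extractor and classifier trained for that noise type
    speech_features, speech_classifier = per_noise_models[noise_type]
    # S3: preferred features for speech/noise classification -> decision
    return speech_classifier(speech_features(frame)) == "speech"


# Toy stand-ins (hypothetical): mean absolute amplitude as the only feature.
def mean_abs(frame: Sequence[float]) -> List[float]:
    return [sum(abs(x) for x in frame) / len(frame)]


toy_noise_classifier = lambda f: "white" if f[0] > 0.05 else "other"
models = {
    "white": (mean_abs, lambda f: "speech" if f[0] > 0.5 else "noise"),
    "other": (mean_abs, lambda f: "speech" if f[0] > 0.1 else "noise"),
}

loud = [0.8, -0.9, 0.7, -0.6]
quiet = [0.02, -0.01, 0.03, -0.02]
print(detect_speech(loud, mean_abs, toy_noise_classifier, models))   # True
print(detect_speech(quiet, mean_abs, toy_noise_classifier, models))  # False
```

Keeping the per-noise models in a dictionary keyed by noise type means a new noise type can be supported by adding one entry, without touching the detection logic.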

Preferably, the noise type classifier is constructed by t-SNE cluster analysis and a random forest method.

Preferably, the noise type classifier is constructed by:

S1-1: constructing a noise signal library, wherein the noise signal library comprises a plurality of types of noise signals;

S1-2: extracting feature values of a plurality of audio features of each noise signal in the noise signal library by a time-frequency domain signal processing method;

S1-3: performing cluster analysis on the noise signals in the noise signal library by the t-SNE method, based on the feature values of the audio features;

S1-4: selecting a plurality of preferred noise classification features from the plurality of audio features by a random forest method;

S1-5: training a noise type classification model by a random forest method, based on the preferred noise classification features.

Preferably, the audio features include a plurality or all of zero crossing rate, MFCC, spectral centroid, spectral dispersion, spectral entropy, spectral flux, spectral edge, harmonic ratio, fundamental frequency, frequency domain energy, bandwidth, and wavelet components.

Preferably, the noise classification preference features include one or more of spectral centroid, wavelet singular values, wavelet energy and spectral border features.

Preferably, the speech noise classifier is constructed by using a random forest method.

Preferably, the speech noise classifier is constructed by:

S3-1: adding different types of noise signals to clean speech to obtain the noisy speech and noise signals corresponding to each noise type;

S3-2: extracting feature values of a plurality of audio features of each noisy speech signal and of the corresponding noise signal by a time-frequency domain signal processing method;

S3-3: selecting, by a random forest method and based on the noisy speech signals and noise signals corresponding to each noise type, the preferred features for the speech and noise classification task under each noise type from the plurality of audio features;

S3-4: training, by a random forest method, a noisy speech and noise classification model for each noise type, based on the preferred features for the speech and noise classification task under that noise type.
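Step S3-1 adds noise to clean speech, but the text here does not state how the mixing is done or which signal-to-noise ratios are used. A common approach, shown below as an assumption rather than the patented procedure, scales the noise so that the ratio of speech power to noise power matches a target SNR in dB. The names `mix_at_snr` and `snr_db_of` and the 5 dB target are illustrative.

```python
import math
import random
from typing import List, Sequence


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Scale `noise` so that speech power / noise power equals the target SNR, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Choose gain g so that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def snr_db_of(speech: Sequence[float], noisy: Sequence[float]) -> float:
    """Measure the SNR of `noisy` given the clean `speech` it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(sum(s * s for s in speech) / sum(r * r for r in residual))


random.seed(0)
speech = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(800)]  # 200 Hz tone at 8 kHz
noise = [random.gauss(0.0, 1.0) for _ in range(800)]                    # white noise stand-in
noisy = mix_at_snr(speech, noise, snr_db=5.0)
print(round(snr_db_of(speech, noisy), 6))  # 5.0
```

Repeating the call with each noise type and several SNR values would yield one labeled noisy-speech set per noise type, paired with the noise-only signal as the negative class.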

Preferably, the noise types include white noise, car interior noise, fighter noise and other noise.

The second technical scheme adopted by the invention is as follows: an active speech detection system based on noise scene recognition, comprising:

a first feature extraction unit, configured to extract the preferred features for the noise classification task from the audio signal;

a noise classification and identification unit, configured to identify the noise type in the audio signal through the noise type classifier according to the feature values of the preferred features for the noise classification task;

a model selection unit, configured to determine, according to the noise type, the preferred features and the classifier suited to the speech and noise classification task for the audio signal;

a second feature extraction unit, configured to extract the feature values of the preferred features for the speech and noise classification task from the audio signal;

and a voice detection unit, configured to judge, through the speech noise classifier and according to the feature values of the preferred features for the speech and noise classification task, whether a speech signal is present in the audio signal.

Preferably, the noise type classifier is constructed by t-SNE cluster analysis and a random forest method, and the speech noise classifier is constructed by using the random forest method.

The beneficial effects of the above technical scheme are that:

(1) The application scenarios of active voice detection involve complex and changeable noise types and noise intensities, yet existing detection methods rarely consider dynamic noise environments. The invention provides an effective method for detecting active voice in a dynamic noise environment, which ensures the accuracy of voice detection under different noise types and different noise intensities.

(2) A single audio feature can hardly represent the audio information fully and comprehensively, and speech signals in noisy scenes are non-stationary. The invention therefore provides a time-frequency domain feature extraction method based on MFCC, wavelet decomposition, singular value decomposition and other methods, which mines the feature information of the audio signal from multiple perspectives.

(3) Because the noise type and noise intensity vary, features and classification models designed for general scenes can hardly show stable and effective detection capability in a dynamic scene. The invention constructs a noise type classifier based on t-SNE cluster analysis and random forest classification: N noise signals are clustered into M (M ≤ N) noise types with distinct characteristics by the t-SNE visual clustering method, and a random forest then performs feature selection and classifier training on the M noise types. In real-time voice detection, the open, dynamic noise environment can thus be converted into a specific noise scene for processing, which further ensures the accuracy of active voice detection.

(4) To overcome the limited modeling capability of a single classifier, the random forest method from ensemble learning is applied. Because the features that separate noisy speech from noise differ across noise types, random forest feature selection is used to pick the most discriminative features under each noise type, and a corresponding noisy speech and noise classification model is trained on the selected feature combination. Since the signal characteristics of the different noise types are fully considered during modeling, the method copes effectively with dynamic noise environments and obtains stable voice detection capability.

(5) Analysis of the test data verifies that the method detects voice effectively in a dynamic noise environment and has good practical engineering value.

(6) Analysis of the test data shows that the classification and recognition accuracy of the proposed method is higher than that of methods such as SVM and MLP.

Drawings

FIG. 1 is a flow chart of a method for detecting active speech based on noise scene recognition according to the present invention;

FIG. 2 is a block diagram of a process flow for training and using a noise type classification model;

FIG. 3 is a block diagram of a process for training and using a noisy speech and noise classification model;

FIG. 4 is a graph of the visual result of the clustering characteristic analysis of 6 kinds of noise based on t-SNE;

FIG. 5 is a graph of the feature importance ranking for noise classification and the recognition accuracy corresponding to the cumulative features;

FIG. 6 is a chart of the separability of the features ranked at the top, middle and bottom of the feature importance ranking for noise classification;

FIG. 7 is a diagram showing the relationship between the values of various parameters and the classification accuracy in the noise type classification model training;

FIG. 8 is a graph of a test result confusion matrix for a noise type classifier;

FIG. 9 is a comparison graph of the recognition results of the classifiers under different noise environments and different signal-to-noise ratios;

FIG. 10 is a diagram of an active speech detection system based on noise scene recognition according to the present invention.

Detailed Description

The embodiments of the present invention will be described in further detail with reference to the drawings and examples. The following detailed description of the embodiments and the accompanying drawings are provided to illustrate the principles of the invention and are not intended to limit the scope of the invention, which is defined by the claims, i.e., the invention is not limited to the preferred embodiments described.

In the description of the present invention, it is to be noted that, unless otherwise specified, "a plurality" means two or more; the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance; the specific meaning of the above terms in the present invention can be understood as appropriate to those of ordinary skill in the art.

Example 1

As shown in fig. 1, the present embodiment discloses an active speech detection method based on noise scene recognition, which includes the following steps:

S1: Extract the preferred features for the noise classification task from the audio signal, and input the feature values of the preferred features into a noise type classifier to identify the noise type in the audio signal.

As shown in fig. 2, the noise type classifier is constructed by the following steps:

For the task of noise type discrimination, in order to capture discriminative information among different noise signals from multiple angles, the invention extracts 37-dimensional time-frequency domain features from the noise signals, including zero crossing rate, MFCC, spectral centroid, spectral dispersion, spectral entropy, spectral flux, spectral border (edge roll-off), harmonic ratio, fundamental frequency, frequency domain energy, bandwidth and wavelet component features. The wavelet component features are the 8-dimensional energy features and the 6-dimensional singular value features extracted from the wavelet components obtained by wavelet decomposition of the audio signal.

Specifically, in the feature calculation, the audio signal is decomposed into 8 wavelet components by a three-layer wavelet decomposition, and the energy $E_{3j}$ of each component is calculated as a feature:

$$E_{3j} = \int \left| S_{3j}(t) \right|^2 \, dt = \sum_{k=1}^{n} \left| x_{jk} \right|^2$$

where $S_{3j}$ is the reconstructed signal and $x_{jk}$ ($j = 0, 1, \ldots, 7$; $k = 1, 2, \ldots, n$) are the amplitudes of the discrete points of $S_{3j}$.

In addition, singular value decomposition is performed on the matrix formed by the 8 wavelet components obtained from the wavelet decomposition, and the first 6 singular values are taken as features.
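The 14 wavelet component features (8 band energies and 6 singular values) can be sketched as follows, assuming NumPy is available. Several choices here are assumptions rather than details from the patent: the Haar wavelet is used because it needs no external library, the three-layer decomposition is implemented as a wavelet packet tree so that it yields 8 components, the energies are computed from the packet coefficients (for an orthonormal wavelet these equal the energies of the reconstructed band signals), and the matrix passed to the SVD stacks the 8 coefficient sequences as rows. The bands come out in the natural order of the recursion, not sorted by frequency.

```python
import numpy as np


def haar_packet_level3(x: np.ndarray) -> np.ndarray:
    """Three-level Haar wavelet packet decomposition; returns an (8, len(x)//8) array."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(3):
        next_nodes = []
        for node in nodes:
            even, odd = node[0::2], node[1::2]
            next_nodes.append((even + odd) / np.sqrt(2.0))  # low-pass branch
            next_nodes.append((even - odd) / np.sqrt(2.0))  # high-pass branch
        nodes = next_nodes
    return np.vstack(nodes)


def wavelet_features(frame: np.ndarray) -> np.ndarray:
    """8 band energies followed by the 6 largest singular values (14 values in total)."""
    bands = haar_packet_level3(frame)
    energies = np.sum(bands ** 2, axis=1)                      # sum_k |x_jk|^2 per band
    singular_values = np.linalg.svd(bands, compute_uv=False)[:6]
    return np.concatenate([energies, singular_values])


rng = np.random.default_rng(0)
frame = rng.standard_normal(160)        # one 20 ms frame at 8 kHz
features = wavelet_features(frame)
print(features.shape)                   # (14,)
```

Because the Haar packet transform is orthonormal, the 8 band energies sum to the energy of the input frame, which gives a quick sanity check on any implementation. The frame length must be divisible by 8 for this simple version.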

The t-SNE method is a probability-based subspace embedding method. Its core is to embed original data distributed in a high-dimensional space into a low-dimensional subspace, describing the similarity between pairs of data points by conditional probability instead of Euclidean distance, so that the local neighborhood structure of the data in the low-dimensional space stays as consistent as possible with that of the original high-dimensional data while the global clustering structure of the original data is also preserved. The invention uses this ability of t-SNE to preserve the global clustering structure of high-dimensional data in a low-dimensional space to analyze the clustering relations among the noise signals visually, so that noises with similar time-frequency domain features are grouped into one class for classification and identification, which improves the identification accuracy.

In the high-dimensional space, the similarity between a pair of data points $x_j$ and $x_i$ is the conditional probability $p_{j|i}$, which represents the probability that point $x_i$ selects point $x_j$ as its neighbor. A larger $p_{j|i}$ means the two points are close together, and a smaller value means they are far apart. Similarly, in the low-dimensional space the conditional probability $q_{j|i}$ represents the similarity of the mapped pair of data points $y_j$ and $y_i$. The core aim in embedding high-dimensional data into a low-dimensional space is therefore to find the low-dimensional representation for which the deviation between $q_{j|i}$ and $p_{j|i}$ is smallest.

The t-SNE algorithm uses the K-L divergence (Kullback-Leibler divergence) to measure how well the conditional probabilities in the high-dimensional and low-dimensional spaces match. To counter the crowding of low-dimensional data caused by the asymmetry of the K-L divergence, a Gaussian distribution models the probability distribution between pairs of data points in the high-dimensional space, and a heavy-tailed t distribution models it in the low-dimensional space. This stretching mechanism pushes the low-dimensional features toward aggregation within a class and separation between classes, reduces the stacking that occurs when high-dimensional data are mapped to a low-dimensional space, and improves the separability between different classes of data, so that the global clustering structure of the high-dimensional data is preserved as far as possible together with its local structure.
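For reference, the quantities described above take the following form in the standard t-SNE formulation. These expressions are not printed in this text and are given here as the usual definitions, with $n$ the number of data points and $\sigma_i$ the Gaussian bandwidth at point $x_i$ (set in practice through the perplexity parameter); note that the standard algorithm symmetrizes the high-dimensional conditional probabilities into joint probabilities $p_{ij}$ before computing the divergence.

```latex
% High-dimensional similarity: Gaussian kernel centred on x_i
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarity: heavy-tailed Student t kernel (one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost minimised over the embedding y_1, ..., y_n
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy tail of the $q_{ij}$ kernel is what lets moderately distant points in the high-dimensional space sit much farther apart in the embedding, which is the stretching mechanism the text refers to.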

The invention adopts a feature selection method based on the classification accuracy on OOB data: the importance of each feature dimension is measured by the change in OOB classification accuracy before and after that feature is perturbed. The specific process is as follows:

K classification decision trees are constructed from a training data set of N samples, where the training data of each decision tree is drawn at random from the total training data set by Bootstrap sampling. The importance level of the i-th dimension feature is then calculated as follows:

① Denote the OOB data corresponding to the k-th decision tree $T_k$ as $D_k$;

② Classify the test data $D_k$ with decision tree $T_k$, and record the number of correctly identified samples as $R_k$;

③ Perturb the feature $X_i$ in the test data $D_k$, classify the perturbed test data $D_{k,i}$ with decision tree $T_k$, and record the number of correctly identified samples as $R_{k,i}$;

④ Repeat steps ① to ③ for k = 1, 2, …, K, recording $R_k$ and $R_{k,i}$ each time;

The importance level of feature $X_i$ is then calculated from the recorded values of $R_k$ and $R_{k,i}$.
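The OOB perturbation procedure can be sketched in a few lines. The exact importance formula is not reproduced in this text, so the sketch assumes the common form in which the importance of feature $X_i$ is the average over the K trees of the drop $R_k - R_{k,i}$; the perturbation is assumed to be a random shuffle of that feature's values across the OOB samples. The names `oob_importance`, `permute_feature` and `count_correct` are illustrative, and the "trees" are arbitrary callables standing in for trained decision trees.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[List[float], int]            # (feature vector, class label)
Tree = Callable[[Sequence[float]], int]     # any trained predictor


def count_correct(tree: Tree, data: Sequence[Sample]) -> int:
    """Number of samples in `data` that `tree` classifies correctly."""
    return sum(1 for x, y in data if tree(x) == y)


def permute_feature(data: Sequence[Sample], i: int, rng: random.Random) -> List[Sample]:
    """Shuffle feature i across samples, leaving other features and labels intact."""
    column = [x[i] for x, _ in data]
    rng.shuffle(column)
    return [(x[:i] + [v] + x[i + 1:], y) for (x, y), v in zip(data, column)]


def oob_importance(trees: Sequence[Tree], oob_sets: Sequence[Sequence[Sample]],
                   i: int, rng: random.Random) -> float:
    """Assumed form: mean over trees of R_k - R_{k,i}."""
    drops = []
    for tree, oob in zip(trees, oob_sets):
        r_k = count_correct(tree, oob)                             # step 2
        r_ki = count_correct(tree, permute_feature(oob, i, rng))   # step 3
        drops.append(r_k - r_ki)
    return sum(drops) / len(drops)


# Toy check: the label depends only on feature 0, and each "tree" is a stump on feature 0.
rng = random.Random(0)
data = [([float(v), rng.random()], int(v >= 5)) for v in range(10)]
trees = [lambda x: int(x[0] >= 5)] * 3
oob_sets = [data[0:6], data[2:8], data[4:10]]

print(oob_importance(trees, oob_sets, 1, rng))        # 0.0, feature 1 is never used
print(oob_importance(trees, oob_sets, 0, rng) >= 0)   # True
```

A feature the trees never rely on produces no drop at all, while a feature that carries the decision can only lose correct classifications when shuffled, which is why the measure ranks informative features first.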

Extracting audio features with different time-frequency domain methods describes the time-frequency characteristics of the audio more comprehensively from several perspectives. However, while multi-perspective features provide richer audio information, their high dimensionality and complicated data structure increase the burden on the subsequent classification algorithm; high-dimensional data are also correlated, and the redundant information interferes with the expression of the effective features. The invention therefore adopts a random-forest-based feature optimization method. The preferred low-dimensional features comprise only the 7 most important features, so that common redundant features are removed while the features that capture the separability between different noise signals, and between speech and noise signals, are retained, which improves the accuracy of classification and recognition among different types of noise signals and between speech and noise signals.

S1-5: Train a noise type classification model by the random forest method based on the preferred noise classification features. For the noise classification task, a noise type classification model (a random forest classifier) is trained to form the noise type classifier. The noises are first grouped into types by the noise clustering described above; feature optimization then selects, from the extracted features, those with better separability to take part in model training and verification tests. A training data set for the noise type classification model is constructed for the different noise types from these preferred low-dimensional features with good separability, and the model is trained on this data set to form the noise type classifier.

Random Forest (RF) is an ensemble learning method that uses the Bagging strategy. It is an ensemble classifier composed of several decision-tree classifiers, and its final classification result is determined by the votes of the individual decision trees. Several classifiers are thereby combined into one strong classifier that performs better than a single decision tree.

The specific steps of training the noise type classification model are as follows:

① Construct the training data set

The training data set of each decision tree is drawn at random, with replacement, from the total training data set N in a certain proportion using the Bootstrap resampling method. The data left over after each draw is called out-of-bag data (OOB data) and serves as test data for the training effect of that decision tree; the performance of each classifier is evaluated by its test error on the OOB data.

② Select the optimal feature combination for node splitting

During the training of each decision tree, a subset of the feature dimensions is drawn at random from all the feature data, and the optimal feature combination is selected for node splitting according to the principle of maximizing the Gini gain.

③ Voting

The outputs of all decision trees are collected and put to a vote; the category with the most votes is the final decision of the model.
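The three training ingredients listed above (Bootstrap sampling with OOB data, Gini-based node splitting, and majority voting) can each be illustrated with a small helper. This is a sketch of the building blocks, not a full random forest. It assumes that each bootstrap draw has the same size as the training set, whereas the text only says "a certain proportion", and it uses the usual definition of Gini impurity; all function names are illustrative.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def bootstrap_split(n: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n indices with replacement; indices never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob


def gini(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum_c p_c^2 of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(parent: Sequence[int], left: Sequence[int], right: Sequence[int]) -> float:
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)


def majority_vote(votes: Sequence[int]) -> int:
    """Class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
in_bag, oob = bootstrap_split(1000, rng)
print(len(in_bag), len(set(in_bag)) + len(oob))   # 1000 1000
print(gini_gain([0, 0, 1, 1], [0, 0], [1, 1]))    # 0.5
print(majority_vote([2, 0, 2, 1, 2]))             # 2
```

With draws of the same size as the training set, roughly a third of the samples end up out of bag for each tree, which is what makes the OOB error a usable estimate without a separate validation set.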

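The performance of the trained noise type classification model, i.e. the noise type classifier, is evaluated by precision (PR), recall (RR), F1 score (F1-score) and accuracy (ACC), which are defined as follows:

$$PR = \frac{TP}{TP + FP}, \qquad RR = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot PR \cdot RR}{PR + RR}, \qquad ACC = \frac{TP + TN}{TP + FP + TN + FN}$$

where TP is the number of positive samples correctly identified as positive, FP is the number of negative samples incorrectly identified as positive, TN is the number of negative samples correctly identified as negative, and FN is the number of positive samples incorrectly identified as negative.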
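The four evaluation indexes follow directly from the confusion counts. The sketch below uses the standard definitions of precision, recall, F1 score and accuracy for one designated positive class; for the four-class noise type classifier these would be computed per class in a one-vs-rest fashion, which is an assumption about usage rather than something stated in the text. The function name `binary_metrics` and the sample labels are illustrative.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Dict[str, float]:
    """Precision (PR), recall (RR), F1 score and accuracy (ACC) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    pr = tp / (tp + fp) if tp + fp else 0.0
    rr = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rr / (pr + rr) if pr + rr else 0.0
    acc = (tp + tn) / len(pairs)
    return {"PR": pr, "RR": rr, "F1": f1, "ACC": acc}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # {'PR': 0.75, 'RR': 0.75, 'F1': 0.75, 'ACC': 0.75}
```

In this example there are 3 true positives, 1 false positive, 3 true negatives and 1 false negative, so all four indexes come out at 0.75.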

As shown in fig. 2, the operation of the noise type classifier in the use phase is:

① Extract the preferred features from the input real-time audio signal according to the feature list selected for the noise classification task in the training stage;

② Input the extracted feature values of the preferred features for the noise classification task into the noise type classifier to identify the noise type in the audio signal.

S2: Determine, according to the noise type, the preferred features and the classifier suited to the speech and noise classification task for the audio signal;

S3: Extract the feature values of the preferred features for the speech and noise classification task from the audio signal, input them into the speech noise classifier, and judge whether a speech signal is present in the audio signal. As shown in fig. 3, the speech noise classifier is constructed by the following steps:

For each noise type, a training data set for the noisy speech and noise classification model is constructed from the preferred low-dimensional features with good separability, and a noisy speech and noise classification model (a random forest classifier) is trained on that data set. The specific steps are as follows:

① Construct the training data set

② Select the optimal feature combination for node splitting

③ Voting

The performance evaluation indexes of the trained noise-containing speech and noise classification model, namely the speech noise classifier, comprise Precision (PR), recall (RR), F1 score (F1-score) and Accuracy (ACC).

The operation of the speech noise classifier in the use stage is as follows:

① According to the noise type, determine the preferred features and the classifier suited to the speech and noise classification task for the audio signal.

② Extract the feature values of the preferred features for the speech and noise classification task from the audio signal, input them into the speech noise classifier, and judge whether a speech signal is present in the audio signal.

The core of active voice detection is to distinguish noisy speech signals from noise signals effectively. In practical applications the noise types in the background of the speech are complex and changeable, and the features that distinguish noisy speech from noise differ from one noise to another, so a uniform set of features and a single classifier can hardly give the best recognition result under all noise types. The invention therefore provides a method of feature selection and speech noise classifier construction based on random forests: for each noise type a different feature combination is selected and a dedicated speech noise classifier is trained, which improves the adaptability of the algorithm model in different environments.

The practical effects of the present invention are analyzed by combining the specific application examples as follows:

1. Audio data source

In the case analysis of the present invention, the speech signals are the audio of 30 different speakers, 15 male and 15 female, randomly selected from the THCHS-30 data set. The noise signals are 6 kinds of noise selected from the NOISEX-92 standard noise library as analysis objects: white noise (white), restaurant interior noise (babble), factory interior noise (factory2), car interior noise (volvo), tank interior noise (m109) and fighter noise (f16).

2. Construction of noise type classifier based on t-SNE cluster analysis and random forest

(1) Time-frequency domain feature extraction

The 6 kinds of noise signals are first uniformly resampled to 8 kHz and then divided into frames with a frame length of 20 ms and a frame shift of 10 ms, and 37-dimensional time-frequency domain features are extracted. The correspondence between the feature dimensions and the feature names is shown in Table 1.
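At 8 kHz, a 20 ms frame holds 160 samples and a 10 ms frame shift is 80 samples, so consecutive frames overlap by half. The sketch below shows this framing together with the zero crossing rate, the first feature in Table 1. Dropping the trailing partial frame and normalizing the zero crossing rate by the number of adjacent sample pairs are assumptions made for illustration; the function names are hypothetical.

```python
from typing import List, Sequence


def frame_signal(x: Sequence[float], sample_rate: int = 8000,
                 frame_ms: int = 20, shift_ms: int = 10) -> List[List[float]]:
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    frame_len = sample_rate * frame_ms // 1000   # 160 samples at 8 kHz
    hop = sample_rate * shift_ms // 1000         # 80 samples at 8 kHz
    return [list(x[s:s + frame_len]) for s in range(0, len(x) - frame_len + 1, hop)]


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


signal = [(-1.0) ** n for n in range(8000)]   # 1 s of a sign-alternating test signal
frames = frame_signal(signal)
print(len(frames), len(frames[0]))            # 99 160
print(zero_crossing_rate(frames[0]))          # 1.0
```

One second of audio therefore yields 99 frames, each of which would be turned into one 37-dimensional feature vector.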

TABLE 1 Audio feature dimension and feature name correspondence

Dimension	Feature name	Dimension	Feature name	Dimension	Feature name
1	Zero crossing rate	18	Spectral flux	22	Frequency domain energy
2～14	MFCC	19	Spectral border	23	Bandwidth
15～16	Spectral centroid, spectral dispersion	20	Harmonic ratio	24～31	Wavelet energy
17	Spectral entropy	21	Fundamental frequency	32～37	Wavelet singular value

(2) Performing clustering analysis on the characteristic values by using a t-SNE clustering characteristic analysis method

The results of the t-SNE visual cluster analysis of the features of the 6 noises are shown in Fig. 4. As can be seen from the figure, the 6 kinds of noise form 4 clusters: the features of babble, factory2 and m109 are clustered together, while volvo, f16 and white each form a separate cluster. Therefore, under the high-dimensional features extracted by the invention, the three noise environments babble, factory2 and m109 can be regarded as one type of noise, and volvo, f16 and white can each be regarded as a noise type of its own; the resulting four types of noise are then identified and classified.
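A t-SNE projection of this kind could be produced roughly as below. The sketch uses scikit-learn's `TSNE` on synthetic stand-in features; the data, the perplexity value and the variable names are assumptions made for illustration, since the patent does not state its t-SNE settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 3 noise types, 40 frames each, 37 features.
centers = rng.normal(scale=5.0, size=(3, 37))
features = np.vstack([c + rng.normal(size=(40, 37)) for c in centers])
labels = np.repeat(np.arange(3), 40)

# Project the 37-dimensional features to 2-D for visual inspection.
embedding = TSNE(
    n_components=2, perplexity=15, init="pca", random_state=0
).fit_transform(features)

# A scatter plot of `embedding` colored by `labels` would show which
# noise types overlap and could therefore be merged into one class.
```

Noise types whose points overlap in the 2-D plot are candidates for merging, which is how babble, factory2 and m109 end up as one class in the analysis above.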

(3) Random forest method is adopted for carrying out feature optimization for noise classification and identification tasks

For the task of identifying and classifying the different noises, 1500 groups of feature samples are extracted from each type of noise for feature optimization, and each group of feature sample data has 37 dimensions. According to the noise cluster analysis, babble, factory2 and m109 have similar features and are merged into one type, so 500 groups of feature samples are extracted from each of these three noises to form 1500 groups. This yields a 4 × 1500 × 37 data set, in which two thirds of the samples of each noise type are randomly drawn as model training data and the remaining third is used as test data. The importance ranking of the feature dimensions is shown as a bar chart in Fig. 5, and the recognition accuracy on the test data as features are accumulated in order of importance is shown as a line chart in Fig. 5. The figure shows that the 7 features most useful for distinguishing the different noises are the 15th, 37th, 29th, 19th, 31st, 35th and 30th dimensions; with these 7 features the classification accuracy reaches 99.55%. Adding the 17th, 2nd, 23rd, 33rd and 34th dimensions brings no improvement; only when the 13th feature is added does the accuracy reach its maximum of 99.7%, and adding further features does not improve it. Therefore, for the task of distinguishing noise types, in order to keep the recognition accuracy high while improving the timeliness of detection and recognition, the preferred low-dimensional feature set comprises only the top 7 features by importance. Referring to Table 1, only the spectral centroid, wavelet singular value, wavelet energy and spectral border features need to be calculated during real-time detection.
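The ranking-then-accumulating procedure described above can be sketched with scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The data here are synthetic, with three arbitrarily chosen informative columns, and the sample counts are scaled down; none of this reproduces the patent's data or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 4 noise classes, 150 samples each, 37 features.
y = np.repeat(np.arange(4), 150)
X = rng.normal(size=(y.size, 37))
informative = {14: 4.0, 28: 3.0, 36: 2.0}   # 0-based column -> class shift
for col, strength in informative.items():
    X[:, col] += strength * y

# Two thirds for training, one third for testing, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

# Rank all features by random forest importance, most important first.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
ranking = np.argsort(forest.feature_importances_)[::-1]

# Test accuracy when only the top-k ranked features are used.
cumulative_accuracy = {}
for k in range(1, 8):
    cols = ranking[:k]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[:, cols], y_train)
    cumulative_accuracy[k] = model.score(X_test[:, cols], y_test)
```

Plotting `cumulative_accuracy` against `k` gives a curve like the line chart in Fig. 5; the preferred feature set is the smallest prefix of `ranking` after which the curve flattens.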

To further verify the random forest feature optimization result, Fig. 6 shows the value distributions of 2 features each from the top, the middle and the bottom of the importance ranking. It is evident from Fig. 6 that the top-ranked features separate the 4 kinds of noise clearly, the middle-ranked features can distinguish only some of the noises, and the bottom-ranked features cannot distinguish the noises at all. The preferred features obtained with the random forest therefore do have better separability and a clear advantage in distinguishing different noises.

(4) Training noise type classification model

For the noise classification and identification task, a noise type classification model (a random forest classifier) is trained to form the noise type classifier. Through the noise clustering and feature optimization described above, the 6 kinds of noise are finally grouped into 4 noise scenes, namely babble\factory2\m109, volvo, f16 and white, and the 7 features with the best separability are selected from the 37-dimensional time-frequency domain features for model training and verification testing. The training data comprise 1500 groups of samples per noise scene, 6000 groups in total; the test data likewise comprise 1500 groups of samples per noise scene, 6000 groups in total.

During model training, 5-fold cross validation is used to avoid overfitting and to ensure that the model generalizes to future data. To obtain the parameter combination that maximizes training and verification accuracy, a grid search is used for parameter optimization. Considering how strongly each parameter affects the performance of the random forest model, three parameters are optimized: the number of trees n_estimators, the maximum tree depth max_depth, and the minimum number of samples per leaf node min_samples_leaf. The search ranges are n_estimators = [10:10:100], max_depth = [2:1:10] and min_samples_leaf = [1:1:5]. As shown in Fig. 7, the optimal values are finally determined as n_estimators = 20, max_depth = 9 and min_samples_leaf = 1.
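The parameter names above match those of scikit-learn's `RandomForestClassifier`, so the search could plausibly be set up with `GridSearchCV` as sketched below. The use of scikit-learn is an assumption, the data are synthetic, and the search is run on a reduced `demo_grid` (a hypothetical subset) only to keep the example quick; `full_grid` encodes the ranges stated in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Search ranges as stated in the text.
full_grid = {
    "n_estimators": list(range(10, 101, 10)),   # [10:10:100]
    "max_depth": list(range(2, 11)),            # [2:1:10]
    "min_samples_leaf": list(range(1, 6)),      # [1:1:5]
}
n_combinations = len(ParameterGrid(full_grid))  # 10 * 9 * 5 = 450

# Reduced grid used here only so the sketch runs in a few seconds.
demo_grid = {
    "n_estimators": [10, 20],
    "max_depth": [3, 9],
    "min_samples_leaf": [1, 5],
}

# Hypothetical stand-in data: 4 classes, 7 preferred features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 60)
X = rng.normal(size=(y.size, 7)) + y[:, None]

# 5-fold cross validation for every parameter combination in the grid.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
best_params = search.best_params_
```

Passing `full_grid` instead of `demo_grid` would evaluate all 450 combinations, each with 5 folds.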

With the parameter settings obtained by grid search optimization, training and verification testing are carried out 5 times and the results are averaged to give the final result. Over all the data, the training accuracy is 99.81% and the test accuracy is 98.97%. The per-class performance indexes of the noise type classifier are shown in Table 2, and its confusion matrix on the test data of the 4 noise types is shown in Fig. 8. The noise type classifier clearly has good recognition accuracy and generalizes well to unseen test data. In actual use, the accuracy of noise recognition can be further ensured by voting over the test results of multiple segments of continuous audio.
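The multi-segment voting decision mentioned for continuous audio might look like the following minimal sketch. The function name `vote_noise_type` and the tie-breaking rule (the label seen first wins) are assumptions; the patent does not describe the voting scheme in detail.

```python
from collections import Counter

def vote_noise_type(segment_labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not segment_labels:
        raise ValueError("need at least one segment prediction")
    return Counter(segment_labels).most_common(1)[0][0]

# Per-segment predictions for consecutive segments of one recording.
predictions = ["white", "f16", "white", "white", "f16"]
decided = vote_noise_type(predictions)
```

A single misclassified segment is then outvoted by its neighbors, provided the noise scene does not change within the voting window.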

Table 2 list of performance indicators for noise type classifier of the present invention

Categories	PR	RR	F1-score
class0:babble\factory\m109	0.974	0.988	0.980
class1:volvo	1	1	1
class2:white	1	1	1
class3:f16	0.988	0.974	0.980

The performance evaluation indexes of the noise type classifier comprise Precision (PR), recall (RR), F1 score (F1-score) and Accuracy (ACC), and are specifically defined as follows:
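The defining formulas do not appear in this text. Assuming the usual definitions were intended, and writing TP, FP, TN and FN for the true positive, false positive, true negative and false negative counts of a given class (symbols introduced here for illustration), the four indexes would read:

```latex
\begin{aligned}
\mathrm{PR} &= \frac{TP}{TP + FP}, &
\mathrm{RR} &= \frac{TP}{TP + FN}, \\
\text{F1-score} &= \frac{2 \cdot \mathrm{PR} \cdot \mathrm{RR}}{\mathrm{PR} + \mathrm{RR}}, &
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN}.
\end{aligned}
```

For a multi-class classifier, PR, RR and F1-score are computed per class in this reading, while ACC over all classes is the share of correctly classified samples.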

3. Speech noise classifier based on random forest

(1) Time-frequency domain feature extraction

First, clean speech is mixed with each of the 6 noise signals at signal-to-noise ratios of 10 dB, 5 dB, 0 dB and -5 dB. The noisy speech and noise sample signals are then uniformly resampled to 8 kHz and divided into frames with a frame length of 20 ms and a frame shift of 10 ms, and 37-dimensional time-frequency domain features are extracted. The correspondence between feature dimensions and feature names is shown in Table 1.
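Mixing clean speech with noise at a target signal-to-noise ratio is commonly done by scaling the noise so that the power ratio matches the target. The sketch below follows that common approach; the function name `mix_at_snr` and the choice to repeat short noise clips are assumptions, and the signals are random stand-ins rather than THCHS-30 or NOISEX-92 audio.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return (noisy speech, scaled noise) with 10*log10(Ps/Pn) == snr_db."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat the noise if it is shorter than the speech, then trim it.
    reps = -(-len(speech) // len(noise))   # ceiling division
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # Solve Ps / (gain^2 * Pn) = 10^(snr_db / 10) for the noise gain.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
clean = rng.normal(size=8000)        # stand-in for one second of speech
background = rng.normal(size=3000)   # stand-in for a shorter noise clip
noisy_by_snr = {snr: mix_at_snr(clean, background, snr)[0]
                for snr in (10, 5, 0, -5)}
```

Lower SNR values give a larger noise gain, so at -5 dB the noise power exceeds the speech power.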

(2) Speech feature optimization for noisy speech and noise classification tasks

In the task of classifying noisy speech and noise, noise and speech are confusable to different degrees at different signal-to-noise ratios, so the best features for telling them apart also differ. For a given noise scene, feature optimization is therefore carried out separately at each signal-to-noise ratio.

For the noisy speech and noise classification task under the 4 noise scenes, feature optimization extracts 3000 groups of feature samples from each type of noise and from its noisy speech at each of the four signal-to-noise ratios 10 dB, 5 dB, 0 dB and -5 dB, each group having 37 dimensions. For the first noise scene (babble, factory2 and m109), 3000 groups of feature samples are extracted from each type of noise and noisy speech thereof. This gives a 2 × 3000 × 37 data set for each noise scene at each signal-to-noise ratio, in which half of the samples of each class are randomly drawn as random forest training data and the remaining half is used as test data. The Top10 preferred speech features for each noise type at each signal-to-noise ratio are summarized in Table 3. The table shows that, in all 4 noise scenes, the preferred feature dimensions obtained at 10 dB, 5 dB and 0 dB overlap to a large degree, which indicates that these features not only distinguish noisy speech from noise effectively at these signal-to-noise ratios but are also robust to changes in operating conditions; the preferred features obtained at -5 dB, by contrast, differ considerably from those obtained at the other 3 signal-to-noise ratios. Therefore, in each noise scene the invention takes the union of the preferred features at 10 dB, 5 dB and 0 dB as the final feature set for training one model, trains a separate model for -5 dB, and uses a grid search for parameter optimization when training each model.
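The rule of taking the union of the preferred features at 10 dB, 5 dB and 0 dB, while keeping -5 dB separate, can be written in a few lines. The feature indices below are hypothetical placeholders, not the contents of Table 3, and the function name `build_feature_sets` is likewise an assumption.

```python
def build_feature_sets(preferred_by_snr, merged_snrs=(10, 5, 0), separate_snr=-5):
    """Return (union of features over merged_snrs, features for separate_snr)."""
    merged = sorted(set().union(*(preferred_by_snr[s] for s in merged_snrs)))
    separate = sorted(preferred_by_snr[separate_snr])
    return merged, separate

# Hypothetical preferred feature dimensions for one noise scene.
preferred_by_snr = {
    10: [2, 3, 15, 17, 22],
    5: [2, 3, 15, 18, 22],
    0: [2, 4, 15, 17, 22],
    -5: [20, 21, 24, 30, 36],
}
shared_features, low_snr_features = build_feature_sets(preferred_by_snr)
```

When the three higher-SNR lists overlap strongly, as they do here, the union stays only slightly larger than any single list, which keeps the jointly trained model compact.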

TABLE 3 Speech feature optimization result list for noisy speech and noise classification in different noise environments

(3) Training of noisy speech and noise classification model

For the noisy speech and noise classification task, a noisy speech and noise classification model (a random forest classifier) is trained to form the speech noise classifier and then tested. In each noise scene, one model is trained jointly for the three signal-to-noise ratios 10 dB, 5 dB and 0 dB, with training features selected according to the feature optimization results in Table 3 and 1500 groups of samples each for training and testing. A separate model is trained for the -5 dB signal-to-noise ratio, with training features selected according to the speech feature optimization results of the previous section and 1500 groups of samples each for training and testing. To verify the advantage of the random forest in classifying noisy speech and noise, an SVM model and a two-layer perceptron (MLP) model are trained on the same training and test samples, and all three models are tuned by grid search. The recognition accuracy of each classifier is shown in Table 4 and compared in Fig. 9. The speech noise classifier clearly achieves the best recognition accuracy across the noise environment categories and signal-to-noise ratios, while the SVM and MLP classifiers perform comparably to each other. For all noise types, the classification accuracy of the speech noise classifier exceeds 95% when the signal-to-noise ratio is not lower than 5 dB. At 0 dB, the accuracy is above 96% under volvo and white noise and above 91% under f16 noise, and drops to 85.3% under class0 noise. When the signal-to-noise ratio falls further to -5 dB, the recognition accuracy generally drops considerably, and the accuracy of voice detection then has to be ensured by combining the method with a speech noise reduction algorithm.
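A comparison of the three model families on identical training and test splits could be organized as below. This is a sketch on synthetic data using scikit-learn; the hyperparameters are arbitrary rather than grid-searched, reading "two-layer perceptron" as two hidden layers of 32 units is an assumption, and the resulting numbers say nothing about Table 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in data: label 1 = noisy speech, label 0 = noise only.
y = np.repeat([0, 1], 300)
X = rng.normal(size=(y.size, 10))
X[:, :3] += 1.5 * y[:, None]   # only the first 3 columns carry signal

# Half for training, half for testing, shared by all three models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "mlp": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000, random_state=0),
    ),
}

# Fit each model on the same training half and score it on the same test half.
accuracy = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
```

Using one shared split for all models keeps the comparison fair; wrapping each model in a grid search would mirror the tuning step described in the text.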

TABLE 4 identification result List of different classifiers under different noise environments and different signal-to-noise ratios

Example 2

The active speech detection method based on noise scene recognition in embodiment 1 can be implemented by the following active speech detection system.

As shown in fig. 10, an active speech detection system based on noise scene recognition includes:

the noise classification identification unit is used for identifying the noise type in the audio signal through a noise type classifier according to the preferable characteristics facing the noise classification task;

the model selection unit is used for determining the optimal characteristics and the classifier suitable for the voice and noise oriented classification task of the audio signal according to the noise type;

the second characteristic extraction unit is used for extracting the characteristic value of the preferred characteristic facing to the voice and noise classification task from the audio signal;

and the voice detection unit is used for judging whether a voice signal exists in the audio signal through the voice noise classifier according to the characteristic value of the preferred characteristic facing the voice and noise classification task.

The noise type classifier is constructed through t-SNE cluster analysis and a random forest method, and the voice noise classifier is constructed through the random forest method.
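The way these units hand results to one another can be sketched as a small dispatcher: the noise type is recognized first, and that result selects which preferred features and which speech noise classifier are used. All class names, function names, indices and thresholds below are hypothetical, and the toy lambda classifiers stand in for trained random forests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

@dataclass
class SpeechModel:
    feature_indices: Sequence[int]                 # preferred features for one noise type
    classify: Callable[[Sequence[float]], bool]    # True means speech is present

class ActiveVoiceDetector:
    def __init__(self,
                 noise_feature_indices: Sequence[int],
                 noise_classifier: Callable[[Sequence[float]], str],
                 speech_models: Dict[str, SpeechModel]):
        self.noise_feature_indices = noise_feature_indices
        self.noise_classifier = noise_classifier
        self.speech_models = speech_models

    def detect(self, features: Sequence[float]) -> Tuple[str, bool]:
        # Noise classification unit: recognize the noise type.
        noise_feats = [features[i] for i in self.noise_feature_indices]
        noise_type = self.noise_classifier(noise_feats)
        # Model selection unit: pick features and classifier for that type.
        model = self.speech_models[noise_type]
        # Second feature extraction unit and voice detection unit.
        speech_feats = [features[i] for i in model.feature_indices]
        return noise_type, model.classify(speech_feats)

# Toy stand-ins for trained classifiers.
detector = ActiveVoiceDetector(
    noise_feature_indices=[0],
    noise_classifier=lambda f: "white" if f[0] > 0.5 else "volvo",
    speech_models={
        "white": SpeechModel([1], lambda f: f[0] > 1.0),
        "volvo": SpeechModel([2], lambda f: f[0] > 0.2),
    },
)
result = detector.detect([0.9, 1.5, 0.0])
```

Keeping one `SpeechModel` per noise type is what lets each scene use its own preferred features and its own tuned classifier.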

The noise type and noise intensity in the application scenes of active voice detection are complex and changeable, and current detection methods give little consideration to dynamic noise environments. To address this, the invention provides an effective method and system for detecting active voice in a dynamic noise environment, which effectively ensures the accuracy of voice detection under different noise types and different noise intensities.

A single audio feature can hardly represent audio feature information fully and comprehensively in active voice detection, and the speech signal in a noise scene is non-stationary. To address these problems, a time-frequency domain feature extraction method based on MFCC, wavelet decomposition, singular value decomposition and other methods is provided, which mines the feature information of the audio signal from multiple perspectives.

Because the noise type and noise intensity are changeable, features and classification models designed for general scenes can hardly show stable and effective detection capability in a dynamic scene. To address this, a noise type classifier based on t-SNE cluster analysis and random forest classification is constructed: N noise signals are clustered into M (M ≤ N) noise types with different characteristics by the t-SNE visual clustering method, and a random forest is then used for feature selection and classifier training on the M noise types. In real-time voice detection, the dynamic, open noise environment can thus be converted into a specific noise scene for processing, which further ensures the accuracy of active voice detection.

To overcome the limited modeling capability of a single classifier, the random forest method from ensemble learning is applied. Because the features that separate noisy speech from noise differ between noise types, random forest feature selection is used to choose the most discriminative features under each type of noise, and a corresponding noisy speech and noise classification model is trained on the selected feature combination. Since the signal characteristics of the different noise types are fully considered during modeling, the method copes effectively with dynamic noise environments and achieves stable voice detection capability.

The analysis of the test data verifies the effectiveness of the method for voice detection in dynamic noise environments and shows that it has good practical engineering value; it also verifies that the classification and recognition accuracy of the proposed method is higher than that of methods such as SVM and MLP.

While the invention has been described with reference to a preferred embodiment, various modifications may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In particular, the technical features mentioned in the embodiments can be combined in any way as long as there is no structural conflict. It is intended that the invention not be limited to the particular embodiments disclosed, but that the invention will include all embodiments falling within the scope of the appended claims. Matters not described in detail in the invention belong to the common general knowledge of a person skilled in the art.

Claims

1. A method for detecting active voice based on noise scene recognition is characterized by comprising the following steps:

S3: extracting the preferred features for the voice and noise classification task from the audio signal, inputting the feature values of these preferred features into the voice noise classifier, and judging whether a voice signal exists in the audio signal.

2. The method of claim 1, wherein the noise type classifier is constructed by t-SNE cluster analysis and a random forest method.

3. The method according to claim 2, wherein the noise type classifier is constructed by:

4. The method of claim 3, wherein the audio features include one or more or all of zero crossing rate, MFCC, spectral centroid, spectral dispersion, spectral entropy, spectral flux, spectral edge, harmonic ratio, fundamental frequency, frequency domain energy, bandwidth, and wavelet components.

5. The method of claim 1, wherein the noise classification preference feature comprises one or more of spectral centroid, wavelet singular value, wavelet energy, and spectral border feature.

6. The method of claim 1, wherein the speech noise classifier is constructed by using a random forest method.

7. The method according to claim 1, wherein the speech noise classifier is constructed by:

8. The method of claim 1, wherein the noise types include white noise, noise in cars, fighter noise and other noise.

9. An active speech detection system based on noise scene recognition, comprising:

10. The active speech detection system of claim 9, wherein the noise type classifier is constructed by t-SNE cluster analysis and a random forest method, and wherein the speech noise classifier is constructed by using the random forest method.