CN113555004A - Voice depression state identification method based on feature selection and transfer learning - Google Patents
- Publication number: CN113555004A
- Application number: CN202110801507.7A
- Authority: CN (China)
- Prior art keywords: speech, voice, feature, transfer learning, depression
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/02 — Speech recognition; feature extraction for speech recognition; selection of recognition unit
- G10L15/063 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/08 — Speech recognition; speech classification or search
- G10L25/63 — Speech or voice analysis techniques specially adapted for estimating an emotional state
Abstract
The invention provides a speech depression state recognition method based on feature selection and transfer learning, fusing the Lasso feature-selection method with the transfer learning method CORAL, aimed at two problems that arise when modeling depression from speech: high feature dimensionality, and a feature distribution influenced by individual differences among subjects beyond their depression level. The method has two advantages: 1. Lasso filters redundant information out of the features and retains the effective features, further improving recognition accuracy while improving model efficiency; 2. without leaking depression-label information, the transfer learning method CORAL draws the feature distributions of the training and test sets together, reducing the influence of factors other than depression level on the feature distribution. Combining the two methods further improves the accuracy and stability of depression screening.
Description
Technical Field
The invention belongs to the field of voice signal processing, and particularly relates to a voice depression state identification method based on feature selection and transfer learning.
Background
Depression is a typical and common psychogenic disease worldwide, covering all age groups; its diagnosis depends on the clinical experience of doctors and on scales filled in by patients, so the whole process is time-consuming and the diagnostic procedure inefficient. Speech is an important external expression of emotion, and its unique advantages, such as few usage restrictions, low equipment cost, and a contactless, non-invasive, and convenient acquisition procedure, make it a key direction for researchers working toward automatic depression recognition.
At present there is no specific feature with clear theoretical support for depression identification; at the feature-design level, the aim is to extract as much depression-related information from speech as possible, generally by using high-dimensional features from multiple domains and comparing the classification results of different feature combinations. However, when too many features are used, the model becomes overly complex, recognition takes too long, and diagnostic efficiency drops.
Speaking is a very complex process, and many studies have explored the differences in brain structure and function of patients with depression, as well as the potential factors that affect speech besides depression, mainly: gender, age, emotional state, language style, and educational and occupational background. These factors further increase the differences in feature distribution across subjects' speech signals and raise the difficulty of model recognition.
In addition, machine learning on speech signals usually assumes, when splitting the data into training and test sets, that the two are independently and identically distributed. However, the feature distribution of a subject's speech signal is affected not only by the depression level but also by individual differences such as age, gender, and occupation, so this assumption does not hold and the model's performance degrades.
Disclosure of Invention
In order to solve the problems, the invention provides a speech depression state identification method based on feature selection and transfer learning, which adopts the following technical scheme:
The invention provides a speech depression state identification method based on feature selection and transfer learning, characterized by comprising the following steps: step S1, collecting speech with a recording device to obtain speech samples; step S2, preprocessing the speech samples; step S3, extracting the speech features from the speech samples, the speech features at least comprising chroma features; step S4, calculating statistics of the speech features and taking the statistics as the feature set; step S5, performing feature selection on the feature set with a Lasso model to obtain an effective feature set; step S6, based on the effective feature set, performing transfer learning with the CORAL method to obtain the transferred training-set features; and step S7, classifying the speech samples based on the training-set features and outputting the classification result.
The speech depression state identification method based on feature selection and transfer learning provided by the invention can also have the technical characteristics, wherein the speech characteristics further comprise acoustic characteristics, frequency domain characteristics, pause characteristics and Mel frequency cepstrum coefficients.
The speech depression state identification method based on feature selection and transfer learning provided by the invention can also have the technical characteristic that the statistics comprise the maximum value, the minimum value, the range, the mean, the median, the intercept term of a linear regression, the independent-variable coefficient of the linear regression, the R2 of the linear regression, the standard deviation, the skewness, the kurtosis, and the coefficient of variation.
The speech depression state identification method based on feature selection and transfer learning provided by the invention can also have the technical characteristic that the classifier model used in classification is XGBoost.
The speech depression state identification method based on feature selection and transfer learning provided by the invention can also have the technical characteristics that the preprocessing comprises the removal of noise fragments, the removal of mute fragments, high-pass filtering and down-sampling.
The invention provides a speech depression state recognition device based on feature selection and transfer learning, which is characterized by comprising the following components: a voice collecting part for collecting the voice sample; a preprocessing section for preprocessing the voice sample; a feature extraction unit configured to extract the speech feature of the speech sample; the characteristic processing part is used for processing the voice characteristics to obtain the effective characteristic set; the transfer learning part is used for carrying out transfer learning on the effective characteristic set to obtain the characteristics of the training set after the transfer; a classification section for classifying the voice sample.
Action and Effect of the invention
According to the speech depression state recognition method based on feature selection and transfer learning, after the collected speech samples are preprocessed, the speech features are extracted and 12 statistics of the speech features are calculated as the feature set; the feature set then undergoes feature selection and transfer learning, yielding the training-set features used to classify the speech samples. Because the Lasso model is used for feature selection, redundant information in the features is filtered out and effective features are retained, so the method achieves better recognition accuracy with fewer features and lower model complexity, solving the technical problem of high feature dimensionality when modeling from speech while also improving model efficiency.
On the other hand, because the feature-based unsupervised transfer learning method CORAL is used, the feature distributions of the training and test sets can be drawn together by aligning their second-order covariance matrices without revealing depression-label information, and the influence of factors other than depression level on the feature distribution is reduced; this solves the technical problem that, when modeling from speech, the feature distribution is affected by individual differences among subjects beyond their depression level. Combining the two methods further improves the accuracy and stability of depression screening.
Drawings
FIG. 1 is a flow chart of the speech depression state recognition method based on feature selection and transfer learning according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the speech depression state recognition apparatus based on feature selection and transfer learning according to an embodiment of the present invention.
Detailed Description
In order to make the technical means, creative features, objectives, and effects of the present invention easy to understand, the speech depression state identification method based on feature selection and transfer learning is described in detail below with reference to the embodiments and the accompanying drawings.
<Example 1>
Fig. 1 is a flowchart of the speech depression state recognition method based on feature selection and transfer learning according to the embodiment of the present invention.
As shown in Fig. 1, the speech depression state recognition method based on feature selection and transfer learning of this embodiment comprises the following steps:
and step S1, acquiring voice information, acquiring voice by using a recording device, designing questions of different speech task types, answering the test according to prompts on a screen, acquiring the complete speaking process of the test by using the recording device, and recording the complete speaking process as a wav file, wherein the file is a voice sample.
Step S2, speech signal preprocessing: the collected speech samples are preprocessed; obvious noise segments, such as coughs and the sound of dropped objects, are screened out manually, and high-pass filtering, down-sampling, and silent-segment detection and removal are performed.
In this Embodiment 1, a second-order Butterworth filter with a cutoff frequency of 137.8 Hz is used for high-pass filtering to reduce the interference of low-frequency noise with the effective speech information; the speech signal is uniformly down-sampled to 16000 Hz with the toolkit librosa; and the toolkit pyAudioAnalysis is used to detect voiced and silent segments and remove the silent segments, which carry no speech. Short-time Fourier transform: window length 0.1 s, sliding step 0.05 s, Hamming window, NFFT = 1024.
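The following Python sketch applies this preprocessing chain with the same parameters, assuming scipy and librosa; the function name and the energy-based silence threshold are our illustrative choices (the patent itself uses pyAudioAnalysis for silence detection):

```python
import librosa
import numpy as np
from scipy.signal import butter, sosfilt

def preprocess(wav_path, target_sr=16000, cutoff_hz=137.8):
    # Load the recording at its native sampling rate.
    y, sr = librosa.load(wav_path, sr=None)
    # Second-order Butterworth high-pass filter at 137.8 Hz
    # to suppress low-frequency noise.
    sos = butter(2, cutoff_hz, btype="highpass", fs=sr, output="sos")
    y = sosfilt(sos, y)
    # Uniformly down-sample to 16000 Hz with librosa.
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    # Detect and drop silent segments; a 30 dB energy threshold stands in
    # here for the patent's pyAudioAnalysis-based detection (an assumption).
    intervals = librosa.effects.split(y, top_db=30)
    y = np.concatenate([y[s:e] for s, e in intervals])
    return y, target_sr
```

The short-time analysis that follows would then frame this signal with a 0.1 s Hamming window and a 0.05 s hop, per the parameters above.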
Step S3, extracting the speech features from the speech sample, including acoustic features, frequency-domain features, pause features, Mel-frequency cepstral coefficients, and chroma features; see Table 1.
TABLE 1 Summary of speech features
As shown in Table 1, there are 6 acoustic features, related to fundamental frequency, energy, and zero-crossing rate. The energy features comprise the sound intensity and the sound-intensity envelope; the zero-crossing-rate-related features comprise the zero-crossing rate, the zero-crossing amplitude (the maximum signal amplitude between two zero crossings), and the zero-crossing interval (the time interval between two zero crossings).
There are 5 frequency-domain features: spectral centroid, spectral entropy, spectral spread, spectral roll-off point, and spectral flux.
There are 13 Mel-frequency cepstral coefficients in total, a common feature in speech signal processing.
There are 12 chroma features in total, a general term covering chromagrams and chroma vectors; they represent the energy in each of the 12 pitch classes per unit time, with the energy of the same pitch class across different octaves accumulated. Chroma features are widely used in the music field, and this method introduces them into depression identification.
There are 3 pause features: the pause count, the pause-time ratio, and the average pause duration.
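These frame-level features can be computed with librosa, as in the sketch below; the function name, the pitch-search range for the fundamental frequency, and the silence threshold behind the pause features are illustrative assumptions rather than values fixed by the patent:

```python
import librosa
import numpy as np

def extract_features(y, sr=16000, hop=800):
    feats = {}
    # 13 Mel-frequency cepstral coefficients per frame.
    feats["mfcc"] = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    # 12 chroma features: energy per pitch class, octaves accumulated.
    feats["chroma"] = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    # Frequency-domain features (two of the five shown here).
    feats["centroid"] = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)
    feats["rolloff"] = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop)
    # Acoustic features: fundamental frequency, energy, zero-crossing rate.
    feats["f0"] = librosa.yin(y, fmin=50, fmax=500, sr=sr, hop_length=hop)
    feats["energy"] = librosa.feature.rms(y=y, hop_length=hop)
    feats["zcr"] = librosa.feature.zero_crossing_rate(y, hop_length=hop)
    # Pause features from the silent intervals (assumed 30 dB threshold):
    # pause count, pause-time ratio, average pause duration in seconds.
    voiced = librosa.effects.split(y, top_db=30)
    n_pauses = max(len(voiced) - 1, 0)
    voiced_len = sum(int(e - s) for s, e in voiced)
    pause_len = len(y) - voiced_len
    feats["pauses"] = (n_pauses,
                       pause_len / len(y),
                       pause_len / n_pauses / sr if n_pauses else 0.0)
    return feats
```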
Step S4, calculating feature statistics: 12 statistics of each speech feature are calculated, and these statistics form the feature set. The 12 statistics are: maximum, minimum, range, mean, median, the intercept term of a linear regression (with time as the independent variable), the independent-variable coefficient of that linear regression, the R2 of that linear regression, standard deviation, skewness, kurtosis, and coefficient of variation.
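A minimal sketch of these 12 statistics with numpy and scipy, using the frame index as the time variable of the regression (the function name is ours):

```python
import numpy as np
from scipy import stats

def feature_statistics(x):
    """x: 1-D array holding one feature over frames; returns the 12 statistics."""
    t = np.arange(len(x))
    reg = stats.linregress(t, x)  # linear regression with time as the argument
    mean, std = np.mean(x), np.std(x, ddof=1)
    return {
        "max": np.max(x), "min": np.min(x), "range": np.ptp(x),
        "mean": mean, "median": np.median(x),
        "lr_intercept": reg.intercept,   # intercept term of the regression
        "lr_slope": reg.slope,           # independent-variable coefficient
        "lr_r2": reg.rvalue ** 2,        # R2 of the regression
        "std": std,
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        "cv": std / mean if mean != 0 else np.nan,  # coefficient of variation
    }
```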
Step S5, feature selection: feature selection is performed on the feature set using a Lasso model, and the coefficients of non-significant variables are compressed, yielding the effective feature set.
Lasso selects feature variables through a penalty function and extracts effective features by compressing coefficients. Consider a general linear regression model $Y = X\beta + \varepsilon$ with response variable $Y = (y_1, y_2, \ldots, y_n)^T$ and independent variables $X = (X^{(1)}, X^{(2)}, \ldots, X^{(m)})$, where each $X^{(i)}$ is an $n \times 1$ vector and the regression coefficients are $\beta = (\beta_1, \beta_2, \ldots, \beta_m)^T$. Starting from ordinary least-squares estimation, the regression coefficients are compressed by adding a penalty function; some coefficients are compressed to 0, the features whose coefficients become 0 are discarded, and the remaining features are the retained effective features. The Lasso estimate is:

$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{m} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{m}\lvert\beta_j\rvert\right\}$$
the method adopts Lasso-Logistic regression for classification tasks, compares different lambda parameters on the basis of the fixed parameters of a Logistic regression model, and determines the hyper-parameters according to the optimal accuracy. The penalty coefficient λ is determined by adjusting parameters through multi-round experimental cross validation, and it is tried to set the penalty coefficient λ to 1, 0.1, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, and finally 0.005.
Step S6, transfer learning: based on the effective feature set, transfer learning is performed with the domain-adaptation method CORAL, which draws the feature distributions of the test set and the training set together by aligning their second-order covariance matrices, yielding the transferred training-set features.
To reduce the difference in feature distribution between the training and test sets caused by individual factors other than depression level, and without leaking depression-label information, the invention introduces a feature-based unsupervised transfer learning method: CORrelation ALignment (CORAL), which aligns the second-order covariance matrices to draw the feature distributions of the training and test sets together. After white-noise information is added to the covariance matrices, a linear transformation is applied; CORAL needs to compute only two things: (1) the covariance matrices of the source-domain and target-domain features; (2) the linear transformation using the white-noise-augmented matrices. The specific steps of the transfer algorithm are shown in Table 2.
TABLE 2 CORAL Algorithm steps
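A minimal sketch of this CORAL transform, assuming the standard formulation: whiten the source (training) features with their noise-augmented covariance, then re-color them with the target (test) covariance:

```python
import numpy as np
from scipy import linalg

def coral(Xs, Xt):
    """Align source (training) features Xs to target (test) features Xt."""
    d = Xs.shape[1]
    # Covariance matrices with identity (white noise) added.
    Cs = np.cov(Xs, rowvar=False) + np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + np.eye(d)
    # Linear transformation: Xs * Cs^(-1/2) * Ct^(1/2).
    Xs_aligned = Xs @ linalg.fractional_matrix_power(Cs, -0.5)
    Xs_aligned = Xs_aligned @ linalg.fractional_matrix_power(Ct, 0.5)
    return np.real(Xs_aligned)
```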
Step S7, classification: based on the transferred training-set features, an XGBoost classifier model classifies the speech samples and outputs the classification results.
XGBoost is a tree-boosting model based on the Boosting framework; it reduces the model's recognition error and variance by integrating multiple CART decision trees into one strong classifier. Built on gradient-boosted trees, XGBoost at each round learns the residual between the previous round's prediction and the target; the score obtained at each node is computed per sample, and the sum of all scores is the sample's classification result. Let the model trained in the t-th iteration be $f_t(x)$; then

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

where $\hat{y}_i^{(t)}$ is the model's classification result for the i-th sample after t iterations, $x_i$ denotes the i-th sample, $\hat{y}_i^{(t-1)}$ is the prediction of the first t−1 trees, and $f_t$ is the t-th tree. The objective function is set to

$$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k)$$

where $Obj^{(t)}$ is the objective value after t iterations, $l(y_i, \hat{y}_i^{(t)})$ is the training error of the i-th sample, and $\sum_{k=1}^{t}\Omega(f_k)$, the sum of the model complexities of the t trees, serves as the regularization term in the objective. The model complexity Ω is determined by the total number of decision-tree leaf nodes T and the leaf weight coefficients w, written as

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda\,\lVert w\rVert^2$$

where $\lVert w\rVert^2$ is the L2 norm of the weight coefficients, γ is the coefficient for splitting a leaf node, used to control the total number of nodes, and λ is the regularization coefficient.
During training, termination is judged from the objective function above. In implementation, a greedy algorithm traverses all features as candidate split points; a split is kept if it improves the objective relative to not splitting, and splitting stops once the weight coefficients or the tree depth exceed their thresholds, which avoids overfitting the model.
After training is completed, the model can classify a speech sample, judging whether it belongs to a depressed subject or a normal subject, and finally outputs the classification result.
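A sketch of this classification step with the xgboost package; the hyper-parameter values shown are illustrative assumptions, as the patent does not specify them:

```python
from xgboost import XGBClassifier

def train_and_classify(X_train_aligned, y_train, X_test):
    clf = XGBClassifier(
        n_estimators=100,   # number of CART trees (assumed)
        max_depth=4,        # depth threshold against overfitting (assumed)
        learning_rate=0.1,  # shrinkage per boosting round (assumed)
        reg_lambda=1.0,     # L2 regularization coefficient lambda
        gamma=0.0,          # leaf-split coefficient gamma
        eval_metric="logloss",
    )
    clf.fit(X_train_aligned, y_train)
    # Convention assumed here: 1 = depressed subject, 0 = normal subject.
    return clf.predict(X_test), clf.predict_proba(X_test)[:, 1]
```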
This embodiment also provides three evaluation indexes for the speech depression-state classification results: accuracy, the F1 score, and the AUC value, defined as follows.

Accuracy is the proportion of correctly classified samples:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

The F1 score is the harmonic mean of recall and precision, with range [0, 1]:

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$

The AUC value is the area enclosed by the receiver operating characteristic (ROC) curve and the coordinate axes. The abscissa of the ROC curve is the false positive rate $\frac{FP}{FP + TN}$ and the ordinate is the true positive rate $\frac{TP}{TP + FN}$; the curve lies above y = x, so the AUC ranges over [0.5, 1].
The definitions of TP, FP, FN, and TN are shown in Table 3.
TABLE 3 Confusion matrix of speech depression-state classification results

| | Audio from a depressed subject | Audio from a normal subject |
|---|---|---|
| Judged as audio from a depressed subject | True Positive (TP) | False Positive (FP) |
| Judged as audio from a normal subject | False Negative (FN) | True Negative (TN) |
The values of all three evaluation indexes are positively correlated with classification performance: the larger the value, the better the classification result.
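These three indexes can be computed with scikit-learn, as in the short sketch below (function and variable names are ours):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """y_pred: hard labels; y_score: predicted probability of depression."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),  # (TP+TN)/(TP+FP+FN+TN)
        "f1": f1_score(y_true, y_pred),              # harmonic mean of P and R
        "auc": roc_auc_score(y_true, y_score),       # area under the ROC curve
    }
```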
Thus, the speech depression-state identification method based on feature selection and transfer learning identifies the depression state of a subject's speech segment, produces the segment's classification result, and provides an evaluation of that result.
<Example 2>
As described above, Embodiment 1 provides a speech depression-state recognition method based on feature selection and transfer learning, mainly comprising steps S1 to S7. In practical application, the steps of the method of Embodiment 1 can be configured as corresponding computer modules, namely a speech acquisition section, a preprocessing section, a feature extraction section, a feature processing section, a transfer learning section, and a classification section, which together form a device for classifying and identifying the speech depression state; a speech depression-state recognition device based on feature selection and transfer learning can therefore also be provided.
Fig. 2 is a schematic diagram of a speech depression state recognition apparatus based on feature selection and transfer learning according to an embodiment of the present invention.
As shown in Fig. 2, a speech depression-state recognition apparatus 100 based on feature selection and transfer learning (hereinafter simply the recognition apparatus) comprises a speech acquisition section 11, a preprocessing section 12, a feature extraction section 13, a feature processing section 14, a transfer learning section 15, and a classification section 16. The speech depression-state recognition device 100 recognizes a target speech segment and obtains the recognition result, i.e., whether the segment belongs to a depressed subject or a normal subject.
The speech acquisition unit 11 obtains the speech sample by recording the subject's speech segment, using the acquisition method of step S1.
The preprocessing section 12 is for preprocessing the voice sample by the preprocessing method of step S2.
The feature extraction unit 13 is configured to extract a speech feature in the speech sample, and employs the speech feature extraction method of step S3.
The feature processing unit 14 is configured to process the extracted speech features to obtain an active feature set, and to adopt the feature processing method of steps S4 to S5.
The transfer learning unit 15 performs transfer learning to obtain the transferred training-set features, using the transfer-learning method of step S6.
The classification unit 16 classifies the speech segment and outputs the result, and adopts the classification method of step S7.
The execution process of each section is consistent with the corresponding step of the speech depression-state recognition method based on feature selection and transfer learning described above, and is not repeated here.
Effects of the Embodiments
According to the speech depression-state recognition method of the above embodiments, the collected speech samples are preprocessed, the speech features are extracted, and 12 statistics of the speech features are calculated as the feature set; the feature set then undergoes feature selection and transfer learning to obtain the training-set features used to classify the speech samples. Because the Lasso model performs the feature selection, redundant information in the features is filtered out and the effective features are retained, so better recognition accuracy is achieved with fewer features and lower model complexity, solving the technical problem of high feature dimensionality when modeling from speech while also improving model efficiency.
On the other hand, because the embodiments use the feature-based unsupervised transfer learning method CORAL, the feature distributions of the training and test sets can be drawn together by aligning their second-order covariance matrices without revealing depression-label information, and the influence of factors other than depression level on the feature distribution is reduced; this solves the technical problem that, when modeling from speech, the feature distribution is affected by individual differences among subjects beyond their depression level. Combining the two methods further improves the accuracy and stability of depression screening.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
For example, in the embodiment, the penalty coefficient λ of the Lasso model is set to 0.005, and in the present invention, the penalty coefficient λ may also be adjusted to other suitable values, so that the technical effects of the present invention can also be achieved.
In the embodiments, the classifier model used for classification is XGBoost; in the present invention, other classifier models may also be used for classification, for example LightGBM, and the technical effects of the present invention can still be achieved.
Claims (6)
1. A speech depression state identification method based on feature selection and transfer learning is used for identifying speech depression states and is characterized by comprising the following steps:
step S1, collecting voice by using a recording device to obtain a voice sample;
step S2, preprocessing the voice sample;
step S3, extracting the speech features from the speech sample, wherein the speech features at least comprise chroma features;
step S4, calculating statistic of the voice features, and taking the statistic as a feature set;
step S5, using a Lasso model to perform feature selection on the feature set to obtain an effective feature set;
step S6, based on the effective feature set, using a CORAL method to perform transfer learning to obtain the characteristics of the training set after transfer;
and step S7, classifying the voice samples based on the training set characteristics, and outputting a classification result.
2. The speech depression state recognition method based on feature selection and transfer learning according to claim 1, characterized in that:
wherein the speech features further include acoustic features, frequency domain features, pause features, and mel-frequency cepstrum coefficients.
3. The speech depression state recognition method based on feature selection and transfer learning according to claim 1, characterized in that:
wherein the statistics include a maximum value, a minimum value, a range, a mean, a median, an intercept term of the linear regression, an independent variable coefficient of the linear regression, R2 of the linear regression, a standard deviation, a skewness, a kurtosis, and a coefficient of variation.
4. The speech depression state recognition method based on feature selection and transfer learning according to claim 1, characterized in that:
wherein the classifier model used for classification is XGBoost.
5. The speech depression state recognition method based on feature selection and transfer learning according to claim 1, characterized in that:
wherein the preprocessing includes removal of noise segments, removal of silence segments, high-pass filtering, and down-sampling.
6. A speech depression state recognition apparatus based on feature selection and transfer learning, comprising:
a voice collecting part for collecting the voice sample;
a preprocessing section for preprocessing the voice sample;
a feature extraction unit configured to extract the speech feature of the speech sample;
the characteristic processing part is used for processing the voice characteristics to obtain the effective characteristic set;
the transfer learning part is used for carrying out transfer learning on the effective characteristic set to obtain the characteristics of the training set after the transfer;
a classification section for classifying the voice sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110801507.7A CN113555004A (en) | 2021-07-15 | 2021-07-15 | Voice depression state identification method based on feature selection and transfer learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110801507.7A CN113555004A (en) | 2021-07-15 | 2021-07-15 | Voice depression state identification method based on feature selection and transfer learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113555004A (en) | 2021-10-26
Family
ID=78131917
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110801507.7A Pending CN113555004A (en) | 2021-07-15 | 2021-07-15 | Voice depression state identification method based on feature selection and transfer learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113555004A (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2017021267A (en) * | 2015-07-14 | 2017-01-26 | 日本電信電話株式会社 | Wiener filter design device, sound enhancement device, acoustic feature amount selection device, and method and program therefor |
CN106725532A (en) * | 2016-12-13 | 2017-05-31 | 兰州大学 | Depression automatic evaluation system and method based on phonetic feature and machine learning |
CN107221344A (en) * | 2017-04-07 | 2017-09-29 | 南京邮电大学 | A kind of speech emotional moving method |
CN107657964A (en) * | 2017-08-15 | 2018-02-02 | 西北大学 | Depression aided detection method and grader based on acoustic feature and sparse mathematics |
CN108830645A (en) * | 2018-05-31 | 2018-11-16 | 厦门快商通信息技术有限公司 | A kind of visitor's attrition prediction method and system |
US20190385711A1 (en) * | 2018-06-19 | 2019-12-19 | Ellipsis Health, Inc. | Systems and methods for mental health assessment |
CN110808072A (en) * | 2019-11-08 | 2020-02-18 | 广州科慧健远医疗科技有限公司 | Method for evaluating dysarthria of children based on optimized acoustic parameters of data mining technology |
CN110956310A (en) * | 2019-11-14 | 2020-04-03 | 佛山科学技术学院 | Fish feed feeding amount prediction method and system based on feature selection and support vector |
CN110991535A (en) * | 2019-12-04 | 2020-04-10 | 中山大学 | pCR prediction method based on multi-type medical data |
CN111210846A (en) * | 2020-01-07 | 2020-05-29 | 重庆大学 | Parkinson voice recognition system based on integrated manifold dimensionality reduction |
CN111444747A (en) * | 2019-01-17 | 2020-07-24 | 复旦大学 | Epileptic state identification method based on transfer learning and cavity convolution |
CN111898095A (en) * | 2020-07-10 | 2020-11-06 | 佛山科学技术学院 | Deep migration learning intelligent fault diagnosis method and device, storage medium and equipment |
CN111915596A (en) * | 2020-08-07 | 2020-11-10 | 杭州深睿博联科技有限公司 | Method and device for predicting benign and malignant pulmonary nodules |
US20210064829A1 (en) * | 2019-08-27 | 2021-03-04 | Nuance Communications, Inc. | System and method for language processing using adaptive regularization |
CN112906644A (en) * | 2021-03-22 | 2021-06-04 | 重庆大学 | Mechanical fault intelligent diagnosis method based on deep migration learning |
CN112927722A (en) * | 2021-01-25 | 2021-06-08 | 中国科学院心理研究所 | Method for establishing depression perception system based on individual voice analysis and depression perception system thereof |
Non-Patent Citations (5)
Title |
---|
LEI SHEN ET AL.: "Epileptic States Recognition Using Transfer Learning", 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) * |
NIVEDHITHA MAHENDRAN ET AL.: "Realizing a Stacking Generalization Model to Improve the Prediction Accuracy of Major Depressive Disorder in Adults", IEEE Access * |
CUI Hongyan et al.: "Research and prospects of feature selection methods in machine learning", Journal of Beijing University of Posts and Telecommunications * |
Peter Bühlmann et al.: "Statistics for High-Dimensional Data: Methods, Theory and Applications", National Defense Industry Press, September 2018 * |
WANG Jingxing: "Research on regression-based house price prediction models", National Circulation Economy * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107657964B (en) | Depression auxiliary detection method and classifier based on acoustic features and sparse mathematics | |
CN109044396B (en) | Intelligent heart sound identification method based on bidirectional long-time and short-time memory neural network | |
Ittichaichareon et al. | Speech recognition using MFCC | |
Dibazar et al. | Feature analysis for automatic detection of pathological speech | |
CN101620853A (en) | Speech-emotion recognition method based on improved fuzzy vector quantization | |
CN109285551A (en) | Disturbances in patients with Parkinson disease method for recognizing sound-groove based on WMFCC and DNN | |
Srinivasan et al. | Artificial neural network based pathological voice classification using MFCC features | |
Ramashini et al. | Robust cepstral feature for bird sound classification | |
CN111489763A (en) | Adaptive method for speaker recognition in complex environment based on GMM model | |
CN115457966B (en) | Pig cough sound identification method based on improved DS evidence theory multi-classifier fusion | |
CN113674767A (en) | Depression state identification method based on multi-modal fusion | |
CN116842460A (en) | Cough-related disease identification method and system based on attention mechanism and residual neural network | |
da Silva et al. | Evaluation of a sliding window mechanism as DataAugmentation over emotion detection on speech | |
CN114299996A (en) | AdaBoost algorithm-based speech analysis method and system for key characteristic parameters of symptoms of frozen gait of Parkinson's disease | |
Dibazar et al. | A system for automatic detection of pathological speech | |
CN113724731A (en) | Method and device for audio discrimination by using audio discrimination model | |
Sabet et al. | COVID-19 detection in cough audio dataset using deep learning model | |
Roy et al. | Pathological voice classification using deep learning | |
Kumar et al. | Parkinson’s Speech Detection Using YAMNet | |
Vieira et al. | Combining entropy measures and cepstral analysis for pathological voices assessment | |
Cai et al. | The best input feature when using convolutional neural network for cough recognition | |
CN113555004A (en) | Voice depression state identification method based on feature selection and transfer learning | |
Faseela et al. | Machine Learning Based Parkinson's Disease Detection from Enhanced Speech | |
CN114299925A (en) | Method and system for obtaining importance measurement index of dysphagia symptom of Parkinson disease patient based on voice | |
Satyasai et al. | A gammatonegram based abnormality detection in PCG signals using CNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211026 |