CN113823267B - Automatic depression recognition method and device based on voice recognition and machine learning - Google Patents

Automatic depression recognition method and device based on voice recognition and machine learning

Info

Publication number
CN113823267B
CN113823267B (application CN202110986901.2A)
Authority
CN
China
Prior art keywords
features
term
feature
frequency
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110986901.2A
Other languages
Chinese (zh)
Other versions
CN113823267A (en)
Inventor
张莉
辛逸男
刘欣阳
刘志宽
邓冉琪
吴鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities filed Critical South Central University for Nationalities
Priority to CN202110986901.2A priority Critical patent/CN113823267B/en
Publication of CN113823267A publication Critical patent/CN113823267A/en
Application granted granted Critical
Publication of CN113823267B publication Critical patent/CN113823267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis characterised by the analysis technique
    • G10L 25/51 Speech or voice analysis specially adapted for comparison or discrimination
    • G10L 25/66 Speech or voice analysis for extracting parameters related to health condition
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/24 Classification techniques
    • G06F 18/24323 Tree-organised classifiers
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion of classification results, e.g. of results related to same input data
    • G06F 18/259 Fusion by voting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses an automatic depression recognition method and device based on voice recognition and machine learning, comprising the following steps: S1, acquiring voice data of a patient; S2, selecting features of the voice data and recombining the selected features to generate long-term features; and S3, identifying the degree of depression from the long-term features with a random forest algorithm. The technical scheme of the invention effectively addresses the difficulty of detecting depression patients at an early stage and lowers the barrier to diagnosis for depression patients.

Description

Automatic depression recognition method and device based on voice recognition and machine learning
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an automatic depression recognition method and device based on voice recognition and machine learning.
Background
As of 2014, the prevalence of depression in China was 2.1%, and by 2017 there were 5.81 million registered patients with severe mental disorders nationwide; the disease causes serious harm to patients, families, and society. A notice and work plan on exploring special services for depression prevention and treatment, published in September 2020, indicates that public awareness, diagnosis, and treatment rates for depression in China remain low: only about one tenth of all depression patients receive treatment, diagnosis and treatment still depend on doctors in psychiatric hospitals, and China is expanding the training of doctors in non-psychiatric hospitals. Automation is therefore particularly important for the early diagnosis of depression.
Disclosure of Invention
The technical problem the invention aims to solve is to provide an automatic depression recognition method and device based on voice recognition and machine learning, which record the voice data of ordinary people through designed questions and answers and then use a machine learning algorithm to recognize and classify the voice data, effectively addressing the difficulty of detecting depression patients at an early stage and lowering the barrier to diagnosis for depression patients.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an automatic depression recognition method based on voice recognition and machine learning comprises the following steps:
s1, acquiring voice data of a patient;
s2, selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
and step S3, identifying the degree of depression from the long-term features according to a random forest algorithm.
Preferably, step S2 includes:
step 2.1, carrying out feature extraction on the voice data by framing and windowing;
step 2.2, selecting the extracted features according to the decision tree;
and 2.3, recombining the selected characteristics to generate long-term characteristics.
Preferably, the features extracted in step 2.1 are time domain features and frequency domain features; the time domain features include short-time energy, zero-crossing rate, and energy entropy, and the frequency domain features include spectral entropy, fundamental frequency, and centroid.
Preferably, in step 2.3, the short-term features are discretized: thresholds are set at the lower and upper tertile points of each feature's values, dividing each feature into three discrete values (low, median, high); the discretized features are then combined by co-occurrence, and after combination the frequency with which the combined features occur within one frame of the speech signal is counted to generate the long-term features.
Preferably, in step S3, each feature value in the long-term features represents the frequency with which several specific discrete feature values of the speech data co-occur within one frame; when classifying by feature value, classification is therefore performed according to the frequency of co-occurrence of the discrete features of the speech data within one frame.
The invention also provides an automatic depression recognition device based on voice recognition and machine learning, which comprises:
the acquisition module is used for acquiring voice data of a patient;
the combination module is used for selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
and the identification module is used for identifying the degree of depression from the long-term features according to a random forest algorithm.
Preferably, the combination module includes:
the extraction unit is used for carrying out feature extraction on the voice data by framing and windowing;
the selection unit is used for selecting the extracted characteristics according to the decision tree;
and the combination unit is used for recombining the selected characteristics to generate long-term characteristics.
Preferably, the extracted features are time domain features and frequency domain features, the time domain features comprising: short-time energy, zero crossing rate and energy entropy, and frequency domain features comprise: spectral entropy, fundamental frequency, and centroid.
Preferably, the combining unit includes:
the discretization component is used for discretizing the short-term features, setting thresholds at the lower and upper tertile points of each feature's values, and dividing each feature into three discrete values: low, median, and high;
the combination component is used for combining the features after discretization in a co-occurrence manner;
and the generation component is used for generating the long-term features by counting the frequency with which the combined features occur within one frame of the voice signal.
Preferably, each feature value of the long-term features in the recognition module represents the frequency with which several specific discrete feature values of the speech data co-occur within one frame; when classifying by feature value, classification is performed according to the frequency of co-occurrence of the discrete features of the speech data within one frame.
According to the invention, by collecting voice signals, selecting their features, recombining them into new long-term features, and identifying the degree of depression with the random forest algorithm from machine learning, people can be helped to detect and diagnose depression at an early stage in a simpler way.
Drawings
FIG. 1 is a flow chart of an automatic depression recognition method based on voice recognition and machine learning technology of the invention;
FIG. 2 is a schematic diagram of voice data acquisition and recording;
FIG. 3 is a schematic diagram of threshold points;
FIG. 4 is a schematic diagram of a combined long-term feature machine learning classification;
fig. 5 is a schematic structural diagram of an automatic depression recognition device based on voice recognition and machine learning technology.
Detailed Description
The invention will be described in further detail below with reference to the drawings by means of specific embodiments.
As shown in fig. 1, the present invention provides an automatic depression recognition method based on voice recognition and machine learning technology, comprising the following steps:
s1, acquiring voice data of a patient;
s2, selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
and S2, identifying the depression degree of the long-term features according to a random forest algorithm.
Further, in step S1, voice data is collected by a smart device such as a mobile phone or a wristband, as shown in fig. 2. To build the voice dataset, multiple subjects may be invited to an interview and recorded; the interviewer talks with each subject through fixed, pre-designed questions covering family, work, mood, self-assessment of depression, and other topics. The interview is recorded with an ordinary smartphone with recording capability, in a quiet environment and at a microphone sampling frequency of 44.1 kHz; the voice data is saved in wav format with an average duration of 20 minutes, and the start and stop times of each utterance in the interview are recorded in a text document. Before the interview, all subjects fill in a psychiatric questionnaire (PHQ-8) with scores from 0 to 30; subjects scoring 15 or above are considered depressed and those scoring below 15 non-depressed.
Further, the features of the voice data in step S2 include time domain features and frequency domain features: the time domain features extracted by the invention are short-time energy, zero-crossing rate, and energy entropy, and the frequency domain features are spectral entropy, fundamental frequency, and centroid. The different features capture different physical properties of speech. Short-time energy reflects whether speech is present: in silent segments the energy drops sharply, approaching 0. The zero-crossing rate is the number of times the signal changes sign at zero and is generally used to distinguish noise segments from voiced segments. The energy entropy measures the richness of the information contained in the voice data; different speech segments carry different content, and energy entropy quantifies that richness. Spectral entropy distinguishes unvoiced from voiced sounds and reflects the breathiness of the speaker's voice; it is the frequency-domain counterpart of the information-richness measure. The fundamental frequency is the lowest frequency among the sine waves composing the speech signal and reflects the pitch of the voice. The centroid represents the center of the distribution of the speech waveform. These features of the speech signal can be diagnostic of the degree of depression.
step S2 comprises the steps of:
step 2.1, extracting the characteristics of the voice data
First, according to the utterance start and stop times in the recorded file, the interviewer's speech is removed from the voice data and the subject's speech segments are spliced together. A speech signal is non-stationary overall but approximately stationary within 10-50 ms, so the voice data is framed and windowed: the speech signal is first divided into segments, and features are extracted through a sliding Hamming window whose hop is smaller than the frame length, preventing the information loss caused by the Hamming window's side lobes. The Hamming window is calculated as follows:
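The standard Hamming window of length N samples, consistent with the description above (supplied here; the source does not reproduce the formula), is:

$$ w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 $$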
short time energy:
the short-time energy is an average value of the energy of the voice data, the voice signal of each sampling point is set as A (n), the short-time energy E (energy) is defined as follows, and k is the sampling frequency of the window.
Zero-crossing rate:
the zero crossing rate refers to the number of times the energy of the voice signal passes through the zero crossing point, and is abbreviated as ZCR, and the ZCR rate is calculated as shown in the following formula.
The sgn function in the above formula is the sign function, defined as follows.
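In the usual convention:

$$ \operatorname{sgn}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases} $$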
Energy entropy:
the energy entropy represents an information uncertainty measure of the energy of the voice data in the time domain, and is calculated as shown in the following formula.
$$ EE = -\sum_{i} p_i \log(p_i) $$
where p_i is the ratio of the energy of a given value to the total energy of the speech segment.
Fundamental frequency:
the speech signal may be considered to consist of sine waves of different frequencies, the lowest frequency sine wave being the fundamental tone, the fundamental frequency representing the frequency of the fundamental tone.
Centroid:
the centroid may reflect the instability of the speech signal, and is calculated as shown in the following equation, assuming a centroid (spectral centroid, SC).
where f(n) is the signal frequency and E(n) is the spectral energy at the corresponding frequency of the voice signal A(n) after the short-time Fourier transform.
Spectral entropy:
the spectral entropy is calculated in the frequency domain, unlike the energy entropy, and the spectral entropy is calculated after the voice signal is subjected to short-time fourier transform, and is calculated as shown in the following formula.
$$ SE = -\sum_{i} p_i \log(p_i) $$
where p_i is the ratio of the value at a given sample point to the sum over all sample points.
Step 2.2, selecting the extracted features
Because combining features can produce an excessive number of combined features, the extracted features are first selected according to a decision tree, and the four most important features are chosen for study; the method is not limited to these four features.
Step 2.3, recombining the selected features
Short-term features are extracted by framing and windowing and correspond to 10-50 ms speech segments. The short-term features of a single window carry too little information and vary widely; long-term features, whose values are the averages of all windows in a frame of roughly 2-4 s, in turn fail to preserve the information in the short-term features, and neither classifies well on its own. The invention therefore adopts a feature combination method. The short-term features are first discretized: thresholds are set at the lower and upper tertile points of each feature's values, dividing each feature into three values (low, median, high), and the discretized features are then combined. Specifically, the four selected features are each discretized into low, median, and high values using their lower and upper tertile points as thresholds; choosing tertile points as thresholds also reduces the influence of outliers to some extent. The thresholds of each feature are shown in Table 1, where the two values in the threshold column are the lower and upper tertile points respectively, and the feature values '0', '1', and '2' correspond to low, median, and high. A discretization schematic is shown in fig. 3.
TABLE 1
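As a concrete illustration, the tertile discretization can be sketched in Python as follows (a minimal sketch with stand-in data; in the patented method the thresholds are the precomputed per-feature values of Table 1):

```python
import numpy as np

def discretize_tertiles(values):
    """Map short-term feature values to 0 (low), 1 (median), 2 (high)
    using the lower and upper tertile points as thresholds."""
    lower, upper = np.percentile(values, [100 / 3, 200 / 3])
    return np.digitize(values, [lower, upper])  # 0 below lower, 2 at/above upper

# Hypothetical per-window zero-crossing-rate values (stand-in data)
zcr_windows = np.random.rand(500)
zcr_discrete = discretize_tertiles(zcr_windows)  # array of 0/1/2 labels
```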
The feature combination method combines the discrete values of any two short-term features into new short-term features by co-occurrence. Taking energy entropy and zero-crossing rate as an example, after discretization the energy entropy EE is divided into three features EE(0), EE(1), and EE(2), representing its low, median, and high values respectively; after combination, energy entropy and zero-crossing rate generate a 9-dimensional feature vector, as shown in Table 2. The method is not limited to pairwise combination; three or four short-term features can also be combined to generate new feature vectors.
TABLE 2
After feature combination, the generated feature vector is still a short-term feature. To overcome the shortcomings of both short-term and long-term features, the invention counts the frequency with which the combined features occur within one frame of the speech signal, generating long-term features with richer information, calculated as shown in the following formula.
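The formula is not reproduced in this text; one plausible reconstruction, consistent with the symbol definitions that follow and offered only as an assumption, is an indicator of co-occurrence accumulated over the frame:

$$ W_{ZCR\times EE}(t) = \begin{cases} 1, & a_1 \le ZCR(t) \le a_2 \ \text{and} \ b_1 \le EE(t) \le b_2 \\ 0, & \text{otherwise} \end{cases} \qquad V_{ZCR\times EE} = \frac{1}{\Delta t}\sum_{t=1}^{\Delta t} W_{ZCR\times EE}(t) $$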
where ZCR(t) and EE(t) denote the zero-crossing rate and energy entropy values at time t; a_1, a_2 and b_1, b_2 are the lower and upper thresholds of the zero-crossing rate and the energy entropy respectively; W_{ZCR×EE}(t) is the combined feature vector at time t; and V_{ZCR×EE} is the combined long-term feature obtained by counting the co-occurrences of zero-crossing rate and energy entropy over a duration Δt. Because the features above differ greatly in order of magnitude, the combined features are normalized so that differences between speech segments caused by depression severity can be compared and a more stable model built; normalization does not change the original distribution of the data, it only rescales it.
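A minimal Python sketch of this step (assuming two already-discretized short-term feature sequences for one frame; names are illustrative, not the patent's):

```python
import numpy as np
from itertools import product

def combined_long_term_feature(feat_a, feat_b):
    """For one frame, count the frequency of each of the 9 co-occurring
    discrete value pairs (e.g. ZCR x EE), giving a 9-dimensional
    combined long-term feature; frequencies already lie in [0, 1]."""
    return np.array([np.mean((feat_a == a) & (feat_b == b))
                     for a, b in product(range(3), repeat=2)])

# Stand-in discretized windows of one 2-4 s frame
zcr_discrete = np.random.randint(0, 3, 200)
ee_discrete = np.random.randint(0, 3, 200)
frame_vector = combined_long_term_feature(zcr_discrete, ee_discrete)  # shape (9,)
```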
Further, step S3 adopts a machine learning classification algorithm based on the combined features. The random forest algorithm is an ensemble learning method composed of multiple decision trees; it is a strong classifier that outputs its prediction by majority vote over the predictions of the individual trees. A single decision tree is a weak classifier, meaning its classification result is only slightly better than random; by the law of large numbers, the majority vote of many decision trees is clearly better than a single tree. In the random forest algorithm, with N total samples and M features, n (n < N) samples and m (m < M) features are drawn, the single decision tree splits its left and right subtrees on the best feature, and the sampling is repeated T times to generate T decision trees forming the random forest; the final prediction is produced by majority vote of the T trees. When generating a single tree, random sampling ensures that each tree sees different samples and thus produces different predictions; sampling with replacement (bagging) ensures the samples of different trees overlap, preventing the trees from diverging too much after training, and the random selection of features avoids overfitting. The feature importance of a single decision tree is measured by how much a feature reduces the Gini index at the tree's nodes, calculated as shown in the following equation.
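The standard Gini index of a node N, matching the definitions that follow, is:

$$ Gini(N) = 1 - \sum_{c=1}^{C} p(c\mid N)^{2} $$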
where p(c|N) is the proportion of samples in node N belonging to class c and C is the total number of classes. If all samples in node N belong to the same class, the Gini index is 0; if every class occupies the same proportion of node N, the Gini index takes its maximum value. The smaller the Gini index, the purer the class separation. A random forest represents feature importance by averaging, over all trees, the reduction in the Gini index attributable to each feature. The random forest algorithm can therefore handle complex classification tasks and rank feature importance; it is also easy to train in parallel, has low hardware requirements, can run on a CPU, and is suitable for all kinds of smart devices with recording functions. In the combined long-term features created by the invention, each feature value does not represent a single property of the voice data but the frequency with which several specific discrete values co-occur within one frame; when the decision tree classifies by these feature values, it classifies by the combined long-term feature values of one frame of voice data. Compared with plain long-term or short-term features, the combined long-term features carry richer information, the contribution of different feature values to the classifier can be judged, and the magnitude of a feature value corresponds to particular acoustic characteristics. Because the invention generates a long-term feature from a single frame of voice data, short speech segments can be classified, so people can screen for depression through a short everyday conversation; depression can thus be detected early, lowering the threshold for depressed patients to seek help. The combined-feature random forest classification framework is shown in fig. 4.
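A sketch of the classification stage with scikit-learn (an assumption about tooling; the patent does not name a library, and the data here is a stand-in):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One combined long-term feature vector per frame, labelled 1 (depressed) / 0 (non-depressed)
X = np.random.rand(300, 9)            # stand-in feature matrix
y = np.random.randint(0, 2, 300)      # stand-in PHQ-8-derived labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# T trees (n_estimators), m < M random features per split (max_features),
# bootstrap sampling with replacement (bagging) by default
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))          # accuracy by majority vote
print(clf.feature_importances_)       # mean Gini-index reduction per feature
```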
The present invention marks the depression category as 1 (positive) and the non-depression category as 0 (negative). The classification performance of the system is evaluated by accuracy, sensitivity, specificity, and F1 score, calculated by the following formulas:
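The standard definitions of these measures, in terms of the quantities explained below, are:

$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN}, \quad Sensitivity = R = \frac{TP}{TP+FN}, \quad Specificity = \frac{TN}{TN+FP}, \quad P = \frac{TP}{TP+FP}, \quad F1 = \frac{2PR}{P+R} $$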
where TP denotes positive samples predicted by the model as positive, TN negative samples predicted as negative, FP negative samples predicted as positive, and FN positive samples predicted as negative; R is the recall, which is equivalent to the sensitivity, and P is the precision.
As shown in fig. 5, the present invention further provides an automatic depression recognition device based on voice recognition and machine learning, comprising:
the acquisition module is used for acquiring voice data of a patient;
the combination module is used for selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
and the identification module is used for identifying the degree of depression from the long-term features according to a random forest algorithm.
Further, the combination module includes:
the extraction unit is used for carrying out feature extraction on the voice data by framing and windowing;
the selection unit is used for selecting the extracted characteristics according to the decision tree;
and the combination unit is used for recombining the selected characteristics to generate long-term characteristics.
Further, the extracted features are time domain features and frequency domain features, the time domain features comprising: short-time energy, zero crossing rate and energy entropy, and frequency domain features comprise: spectral entropy, fundamental frequency, and centroid.
Further, the combining unit includes:
the discretization component is used for discretizing the short-term features, setting thresholds at the lower and upper tertile points of each feature's values, and dividing each feature into three values: low, median, and high;
the combination component is used for carrying out feature combination on the discretized features;
and the generation component is used for generating the long-term features by counting the frequency with which the combined features occur within one frame of the voice signal.
Further, each feature value in the long-term features in the recognition module represents the frequency with which several specific discrete feature values of the speech data co-occur within one frame; when classifying by feature value, classification is performed according to the frequency of co-occurrence of the discrete features of the speech data within one frame.
Voice data from depression patients is easy to obtain: it can be collected with devices such as wristbands and mobile phones, and it carries rich emotional information. Compared with the speech of healthy people, the speech of depression patients differs markedly: their speech rate is slower, their tone is flatter, their hoarseness increases, and their voice is breathier. Early diagnosis of depression can therefore be achieved by analyzing the informational characteristics of voice data. The invention processes the voice signal, extracts several of its features, and combines them with a machine learning algorithm to provide a novel method for recognizing depression. The method can recognize depression from a short conversation; during recognition the extracted short-term speech features are combined, and effective classification is then performed. By combining machine learning theory with feature combination, the invention judges the everyday degree of depression of a population effectively and accurately.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme and inventive concept of the present invention, within the scope disclosed herein, shall be covered by the scope of the present invention.

Claims (2)

1. An automatic depression recognition method based on voice recognition and machine learning is characterized by comprising the following steps:
s1, acquiring voice data of a patient;
s2, selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
s3, recognizing the depression degree of the long-term features according to a random forest algorithm;
the step S2 comprises the following steps:
step 2.1, carrying out feature extraction on the voice data by framing and windowing;
step 2.2, selecting the extracted features according to the decision tree;
step 2.3, recombining the selected characteristics to generate long-term characteristics;
the extracted features in step 2.1 are time domain features and frequency domain features, wherein the time domain features comprise: short-time energy, zero crossing rate and energy entropy, and frequency domain features comprise: spectral entropy, fundamental frequency, and centroid;
in step 2.3, the short-term features are discretized: thresholds are set at the lower and upper tertile points of each feature's values, each feature is divided into three discrete values of low, median, and high, and the discretized features are then combined by co-occurrence; after feature combination, the frequency with which the features occur within one frame of the speech signal is counted to generate the long-term features;
in step S3, each feature value in the long-term features represents the frequency with which several specific discrete feature values of the speech data co-occur within one frame, and classification by feature value is performed according to the frequency of co-occurrence of the discrete features of the speech data within one frame.
2. An automatic depression recognition device based on voice recognition and machine learning, comprising:
the acquisition module is used for acquiring voice data of a patient;
the combination module is used for selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
the identification module is used for identifying the degree of depression from the long-term features according to a random forest algorithm;
the combination module comprises:
the extraction unit is used for carrying out feature extraction on the voice data by framing and windowing;
the selection unit is used for selecting the extracted characteristics according to the decision tree;
a combination unit for recombining the selected features to generate long-term features;
the extracted features are time domain features and frequency domain features, the time domain features comprising: short-time energy, zero crossing rate and energy entropy, and frequency domain features comprise: spectral entropy, fundamental frequency, and centroid;
the combination unit includes:
the discretization component is used for discretizing the short-term features, setting thresholds at the lower and upper tertile points of each feature's values, and dividing each feature into three discrete values of low, median, and high;
the combination component is used for combining the features after discretization in a co-occurrence manner;
the generation component is used for generating the long-term features by counting the frequency with which the combined features occur within one frame of the voice signal;
each feature value in the long-term features in the identification module represents the frequency with which several specific discrete feature values of the voice data co-occur within one frame, and classification by feature value is performed according to the frequency of co-occurrence of the discrete features of the voice data within one frame.
CN202110986901.2A 2021-08-26 2021-08-26 Automatic depression recognition method and device based on voice recognition and machine learning Active CN113823267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110986901.2A CN113823267B (en) 2021-08-26 2021-08-26 Automatic depression recognition method and device based on voice recognition and machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110986901.2A CN113823267B (en) 2021-08-26 2021-08-26 Automatic depression recognition method and device based on voice recognition and machine learning

Publications (2)

Publication Number Publication Date
CN113823267A CN113823267A (en) 2021-12-21
CN113823267B true CN113823267B (en) 2023-12-29

Family

ID=78913628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110986901.2A Active CN113823267B (en) 2021-08-26 2021-08-26 Automatic depression recognition method and device based on voice recognition and machine learning

Country Status (1)

Country Link
CN (1) CN113823267B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10147423B2 (en) * 2016-09-29 2018-12-04 Intel IP Corporation Context-aware query recognition for electronic devices
US20190110754A1 (en) * 2017-10-17 2019-04-18 Satish Rao Machine learning based system for identifying and monitoring neurological disorders

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Predictive Modeling of Depression with a Large Claim Dataset; Riyi Qiu et al.; 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 1589-1595 *
Application of machine learning in the field of depression; Dong Jianyu et al.; Advances in Psychological Science; Vol. 28, No. 2; 266-274 *

Also Published As

Publication number Publication date
CN113823267A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
CN112750465B (en) Cloud language ability evaluation system and wearable recording terminal
Ooi et al. Multichannel weighted speech classification system for prediction of major depression in adolescents
US10478111B2 (en) Systems for speech-based assessment of a patient&#39;s state-of-mind
Low et al. Detection of clinical depression in adolescents’ speech during family interactions
Levitan et al. Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection.
WO2023139559A1 (en) Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation
Liu et al. Speech personality recognition based on annotation classification using log-likelihood distance and extraction of essential audio features
Islam et al. Early detection of COVID-19 patients using chromagram features of cough sound recordings with machine learning algorithms
Xiao et al. Hierarchical classification of emotional speech
Houari et al. Study the Influence of Gender and Age in Recognition of Emotions from Algerian Dialect Speech.
Joshy et al. Dysarthria severity assessment using squeeze-and-excitation networks
Liu et al. A novel decision tree for depression recognition in speech
Alimuradov et al. A method to determine cepstral markers of speech signals under psychogenic disorders
Vasuki Speech Emotion Recognition Using Adaptive Ensemble of Class Specific Classifiers
CN113823267B (en) Automatic depression recognition method and device based on voice recognition and machine learning
Ding et al. Automatic recognition of student emotions based on deep neural network and its application in depression detection
Milani et al. A real-time application to detect human voice disorders
Yagnavajjula et al. Detection of neurogenic voice disorders using the fisher vector representation of cepstral features
Suwannakhun et al. Characterizing Depressive Related Speech with MFCC
Mangalam et al. Emotion Recognition from Mizo Speech: A Signal Processing Approach
Bhavya et al. Speech Emotion Analysis Using Machine Learning for Depression Recognition: a Review
Kumar et al. Towards Improving the Performance of Dysarthric Speech Severity Assessment System
Fathan et al. An Ensemble Approach for the Diagnosis of COVID-19 from Speech and Cough Sounds
Manikandan et al. Speaker identification using a novel prosody with fuzzy based hierarchical decision tree approach
Akkaralaertsest et al. Classification of depressed speech samples with spectral energy ratios as depression indicator

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant