CN111951824A - Detection method for distinguishing depression based on sound - Google Patents

Detection method for distinguishing depression based on sound

Info

Publication number
CN111951824A
CN111951824A CN202010817892.XA CN202010817892A CN111951824A CN 111951824 A CN111951824 A CN 111951824A CN 202010817892 A CN202010817892 A CN 202010817892A CN 111951824 A CN111951824 A CN 111951824A
Authority
CN
China
Prior art keywords
depression
sound
layer
output
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010817892.XA
Other languages
Chinese (zh)
Inventor
陆可
李青青
赵双双
王颖捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Guoling Technology Research Intelligent Technology Co ltd
Original Assignee
Suzhou Guoling Technology Research Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Guoling Technology Research Intelligent Technology Co ltd filed Critical Suzhou Guoling Technology Research Intelligent Technology Co ltd
Priority to CN202010817892.XA priority Critical patent/CN111951824A/en
Publication of CN111951824A publication Critical patent/CN111951824A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Abstract

The invention discloses a detection method for judging depression based on sound, which discriminates depression through voice feature extraction and deep-learning processing. The sound elements are collected and stored in digital form, the sound file data are analysed with a blind source separation (BSS) algorithm, and the speech is identified; the speech signal to be processed is analysed with MFCCs as characteristic parameters, converted to the Mel frequency scale and subjected to cepstrum analysis; several groups of training data are used to collect data from the recordings, and a convolutional neural network model is established for discrimination; the obtained test sample data are classified and analysed with a BP neural network; and an ROC (receiver operating characteristic) and AUC (area under the curve) model evaluation method based on the confusion matrix is used to verify the accuracy of the sound-based estimate of an individual's probability of suffering from depression. The depression discrimination rate is significantly improved, and the cost is low.

Description

Detection method for distinguishing depression based on sound
Technical Field
The invention belongs to the technical field of voice processing, and particularly relates to a detection method for judging depression based on voice.
Background
Depression is a mental disorder accompanied by abnormalities in thought and behaviour, and has become a serious public-health and social problem worldwide. A report published by the World Health Organization in 2017 shows that more than 300 million people worldwide are afflicted by depression; in China, the number of depression patients has reached about 54 million (roughly 4.2 percent of the population), an incidence close to the global level (4.4 percent). Among Chinese young people aged 15-24, about 1.2 million suffer from depression, and the incidence of depression among Chinese college students is as high as 23.8 percent (similar to data from British universities). A 2015 report of the United Nations Children's Fund shows that the incidence of depression among teenagers in rural areas is higher than that of their urban peers. In China alone, absenteeism and medical and funeral costs caused by depression result in losses of 78 million US dollars each year. Patients with depression look no different from healthy people on the outside, but they suffer inwardly and often feel persistently low; the symptoms progress from gloominess in the early stage to self-harm and social difficulties later on, and can even include suicidal thoughts or behaviour. Therefore, one effective way to reduce the suicide rate is to detect depression in advance and treat it in time, which requires an effective depression detection method. For years, the diagnosis of depression has depended on traditional detection instruments such as the SDS depression self-rating scale; the SDS is mainly suitable for adults with depressive symptoms and can be used in psychological counselling as well as for psychiatric outpatients or inpatients, but assessment is difficult for depressed patients with severe retardation symptoms. Scholars at home and abroad have also carried out a great deal of research; Ozdas et al. explored risk factors for depression and suicide based on vocal jitter and the spectral characteristics of the glottal wave. However, the number of experimental samples was small, verification on large samples was lacking, and the samples were collected from different communication devices and environments, which affects the accuracy of the experimental results to a certain extent.
In addition, some journal literature at home and abroad discloses methods for detecting depression based on sound. For example, Yanchu Jade et al. studied depression recognition technology based on speech and facial features and analysed the audio data recorded in interviews from the speech-feature side. The audio features provided by the data set were extracted from the audio recordings with the COVAREP toolkit; a timestamp is placed every 0.3334 s, and the extracted audio features are recorded under each timestamp. In view of the time-series nature of the audio features, a long short-term memory network (LSTM) was built, the data set was also split by gender, and the features were fed to the LSTM in timestamp order to obtain a prediction based on the audio features. Wang Tianyang et al. studied effective feature analysis of speech data and its application to depression-level assessment; they used GMMs to build a multi-feature-set decision system, trained models on several feature sets separately and then fused the predictions by decision fusion, obtaining 70% and 75% classification accuracy on male and female data respectively.
In addition, some domestic patent documents disclose methods for detecting depression based on sound. For example, Chinese patent CN106725532A discloses an automatic depression assessment system and method based on speech features and machine learning, which uses speech processing, feature extraction and machine-learning technology to find the relation between speech features and depression and provides an objective reference for the clinical diagnosis of depression. Chinese patent CN107657964A discloses a depression auxiliary detection method and classifier based on acoustic features and sparse mathematics, in which the depression judgement relies on joint recognition of voice and facial emotion; the glottal signal is estimated with an inverse filter, the speech signal is analysed globally, characteristic parameters are extracted, their temporal and distributional characteristics are analysed, and the prosodic patterns of different emotional speech are identified and used as the basis of emotion recognition; the speech signal to be processed is analysed with MFCCs as characteristic parameters, data are collected from the recordings with several groups of training data, and a neural network model is established for discrimination. Chinese patent CN109171769A discloses a method and system for extracting voice and facial features for depression detection: features are extracted from the audio data by an energy-information method to obtain spectral and acoustic parameters; these parameters are fed into a first deep neural network model to obtain deep voice features; static features are extracted from the video images to obtain frame images, which are fed into a second deep neural network model to obtain facial feature data; dynamic features are extracted from the video images to obtain optical-flow images, which are fed into a third deep neural network model to obtain facial-motion feature data; the facial feature data and the motion feature data are fed into the third deep neural network model to obtain deep facial features; and the deep voice features and deep facial features are fed into a fourth neural network model to obtain the fused data. Chinese patent CN111329494A discloses a depression detection method based on voice keyword retrieval and speech emotion recognition, which collects the voice information of the person to be tested and automatically recognizes depression from the speech features and speech text extracted from that voice.
While there have been many attempts to detect depression from audio using neural networks, existing methods label one sample with a single audio file during training and ultimately output only the total prediction accuracy; a single file does not receive its own probability of being predicted correctly. The present invention is more representative because it processes each file individually and produces an estimate and judgement for the uniqueness of each single individual.
In summary, the problems of the prior art are as follows: the traditional depression detection methods rely on the SDS depression self-rating scale and the subjective judgement of clinicians, which carry large errors; they do not apply BP neural network binary classification and AUC accuracy verification after MFCC voice-feature extraction, and therefore lack scientific rigour and an effective objective evaluation index.
Disclosure of Invention
1. Problems to be solved
Aiming at the defects in the prior art, the invention provides a detection method for distinguishing depression based on sound, which greatly improves the depression recognition rate; the system implementing the method can easily be built on a hospital detector or a computer, so the software and hardware cost is low.
2. Technical scheme
In order to solve the problems, the technical scheme adopted by the invention is as follows:
the invention relates to a detection method for distinguishing depression based on voice, which discriminates depression through voice feature extraction and deep-learning processing. The sound elements are collected and stored in digital form, the sound file data are analysed with a blind source separation (BSS) algorithm, and the speech is identified; the speech signal to be processed is analysed with MFCCs as characteristic parameters, converted to the Mel frequency scale and subjected to cepstrum analysis; several groups of training data are used to collect data from the recordings, and a convolutional neural network model is established for discrimination; the obtained test sample data are classified and analysed with a BP neural network; and an ROC (receiver operating characteristic) and AUC (area under the curve) model evaluation method based on the confusion matrix is used to verify the accuracy of the sound-based estimate of an individual's probability of suffering from depression.
The invention discloses a detection method for distinguishing depression based on sound, which comprises the following steps:
step S101, blind source separation (BSS) analysis is carried out on the collected voice WAV files, and then the sound is digitized;
step S102, the physical speech information is encoded and cepstrum analysis (spectral envelope and spectral detail) is carried out to obtain the 13-dimensional MFCC feature vector for machine recognition; the original 13 static MFCC coefficients are extended to the 39-dimensional MFCC used in recognition, namely 13 static coefficients + 13 first-order difference coefficients + 13 second-order difference coefficients, which are input into the convolutional neural network model;
step S103, a convolutional neural network model is established and trained, and features are autonomously extracted and selected;
step S104, the BP network end receives the output feature vector, carries out error back-propagation training and binary-classifies the input vector;
step S105, an accumulated value is obtained by statistical analysis to give the probability that the individual suffers from depression;
and step S106, the binary classification model is evaluated with the AUC and ROC metrics to verify its accuracy (an end-to-end outline of these steps is sketched below).
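For illustration only, the following Python sketch outlines how steps S101 to S106 could be connected. The function names and the use of the librosa and scikit-learn libraries are assumptions made for this sketch, not part of the claimed method.

```python
# Hypothetical outline of steps S101-S106 (illustrative sketch only).
import numpy as np
import librosa
from sklearn.metrics import roc_auc_score

def extract_mfcc_39(wav_path, sr=16000):
    """S101/S102: load the recording and build 39-dimensional MFCC frames
    (13 static coefficients + 13 first-order + 13 second-order differences)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T                   # shape (n_frames, 39)

def individual_probability(frame_is_depressed):
    """S105: fraction of frames whose classification points to depression."""
    return float(np.mean(frame_is_depressed))

# S103/S104: a CNN + BP classifier is trained on the 39-dimensional frames
# (see the separate sketches later in this description).
# S106: frame-level scores are evaluated with ROC/AUC, e.g.
#   auc = roc_auc_score(frame_labels, frame_scores)
```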
Further, the step S101 specifically includes:
(1) sampling, quantizing and coding the recording to ensure precision;
(2) explicitly extracting the three main parameters of sound-signal digitization: the sampling frequency, the number of quantization bits and the number of channels.
Further, the step S102 specifically includes:
(1) MFCC feature extraction comprises two key steps: converting to Mel frequency, and performing cepstrum analysis;
(2) the Mel-scale filter bank has high resolution in the low-frequency region, which is consistent with the auditory characteristics of the human ear and is the physical meaning of the Mel scale; conversion to Mel frequency proceeds by first applying a Fourier transform to bring the time-domain signal into the frequency domain, then dividing the frequency-domain signal with the Mel-scale filter bank, so that each frequency band finally corresponds to one numerical value;
(3) cepstrum analysis applies a Fourier transform to the time-domain signal, takes the logarithm and then applies an inverse Fourier transform; cepstra can be divided into the complex cepstrum, the real cepstrum and the power cepstrum, and the power cepstrum is preferred here.
Further, the specific MFCC feature-extraction process of step S102 is as follows:
(1) pre-emphasis: the spectrum is multiplied by a coefficient that is positively correlated with frequency, which boosts the high-frequency amplitude; in practice a high-pass filter H(z) = 1 - k·z^(-1) is used, i.e. S'(n) = S(n) - k·S(n-1);
(2) windowing: the signal is windowed with a Hamming window, S'(n) = {0.54 - 0.46·cos[2π(n-1)/(N-1)]}·S(n), where N is the window length; compared with a rectangular window this reduces the side lobes and the spectral leakage after the FFT;
(3) frequency-domain conversion: the time-domain signal is converted into the frequency domain for the subsequent frequency analysis;
(4) Mel-scale filter-bank filtering: the amplitude spectrum obtained from the FFT is multiplied with each filter and accumulated, giving one value per filter, namely the energy of the frame in that filter's frequency band; with 22 filters, 22 energy values are obtained;
(5) taking the logarithm of the energy values: human perception of sound is not linear and is better described by a logarithmic relation, and taking the log also makes the cepstrum analysis possible;
(6) discrete cosine transform: an inverse transform is applied and a low-pass filter keeps the low-frequency part, giving the final characteristic parameters;
(7) dynamic differences: so that the features better reflect temporal continuity, information from neighbouring frames is appended to the feature dimensions, typically as first-order and second-order differences, converting the 13-dimensional MFCC into the 39-dimensional MFCC input to the convolutional neural network model.
Further, step S103 specifically includes:
(1) the first stage is a stage of data propagation from a low level to a high level, namely a forward propagation stage;
(2) the other stage is a stage of carrying out propagation training on the error from a high level to a bottom level when the result obtained by the current propagation is inconsistent with the expectation, namely a back propagation stage;
the method comprises the following specific steps:
a. initializing a weight value by the network;
b. the input data is transmitted forwards through a convolution layer, a down-sampling layer and a full-connection layer to obtain an output value;
c. calculating the error between the output value of the network and the target value;
d. when the error is larger than the expected value, the error is transmitted back to the network, and the errors of the full connection layer, the down sampling layer and the convolution layer are sequentially obtained;
e. when the error is equal to or less than our expected value, the training is finished;
f. update the weights according to the obtained error, and then return to step b.
Further, the step S104 specifically includes:
(1) network initialization: determine the number of input-layer nodes n, hidden-layer nodes l and output-layer nodes m from the system input-output sequence (X, Y); initialize the connection weights ωij and ωjk between the neurons of the input, hidden and output layers, initialize the hidden-layer thresholds a and the output-layer thresholds b, and set the learning rate and the neuron activation function;
(2) hidden-layer output calculation: from the input vector X, the input-to-hidden connection weights ωij and the hidden-layer thresholds a, compute the hidden-layer output H: Hj = f(Σi ωij·xi - aj), j = 1, 2, …, l, where l is the number of hidden-layer nodes and f is the hidden-layer activation function;
(3) output-layer output calculation: from the hidden-layer output H, the connection weights ωjk and the thresholds b, compute the BP neural network output O: Ok = Σj Hj·ωjk - bk, k = 1, 2, …, m;
(4) error calculation: from the network prediction output O and the expected output Y, compute the network prediction error e: ek = Yk - Ok, k = 1, 2, …, m;
(5) weight update: update the network connection weights ωij and ωjk according to the prediction error e: ωij = ωij + η·Hj·(1 - Hj)·xi·Σk ωjk·ek, i = 1, 2, …, n, j = 1, 2, …, l; ωjk = ωjk + η·Hj·ek, j = 1, 2, …, l, k = 1, 2, …, m, where η is the learning rate;
(6) threshold update: update the node thresholds a and b according to the prediction error e: aj = aj + η·Hj·(1 - Hj)·Σk ωjk·ek, j = 1, 2, …, l; bk = bk + ek, k = 1, 2, …, m;
(7) judging whether the algorithm iteration is finished or not, and if not, returning to the step (2);
(8) the supervised-learning classification algorithm outputs a qualitative classification, each frame being classified as pointing to depression or not.
Further, the step S105 specifically includes:
(1) 10 million frames of test data are extracted, and the cumulative number of frames pointing to depression is counted;
(2) a threshold is set: if 8 million of those frames are classified as pointing to depression, the person can be said to suffer from depression with a probability of 80%; with one frame lasting 20 ms, this is equivalent to a 10-minute recording in which sound totalling 8 minutes points to depression.
Further, the step S106 specifically includes:
(1) based on the concepts of Positive, Negative, True and False in the confusion matrix: a predicted class of 1 is Positive and a predicted class of 0 is Negative; a correct prediction is True and a wrong prediction is False; combining these four concepts yields the confusion matrix;
(2) the True Positive Rate and False Positive Rate are calculated as TPRate = TP/(TP + FN) and FPRate = FP/(FP + TN); TPRate is the proportion predicted as 1 among all samples whose true class is 1, and FPRate is the proportion predicted as 1 among all samples whose true class is 0;
(3) when the classifier is effective, the probability of predicting 1 for a sample whose true class is 1 (TPRate) is greater than the probability of predicting 1 for a sample whose true class is 0 (FPRate), i.e. y > x on the ROC plot;
(4) in the experiments, with 0.8 used as the threshold, a series of TPRate and FPRate values is obtained, the points are plotted and the area under the curve is computed to give the AUC; the AUC value is high, so the accuracy of the sound-based depression assessment method is reliable.
In contrast, Chinese patent CN109599129A discloses a speech depression recognition method based on an attention mechanism and a convolutional neural network. It first preprocesses the speech data and segments longer recordings, on the premise that the segments still fully contain depression-related features; a Mel spectrogram is then extracted from each segment and resized for input to the neural network model so as to facilitate training; the weights of a pre-trained AlexNet deep convolutional neural network are then fine-tuned to extract higher-level speech features from the Mel spectrogram; an attention-mechanism algorithm then re-weights the segment-level speech features to obtain sentence-level speech features; finally the sentence-level features are classified for depression with an SVM model. That patent also extracts features from speech data with a convolutional neural network, extracts a Mel spectrogram for optimization and adjustment, uses the Mel-frequency cepstral coefficients (MFCC) of the speech signal as matrix-vector features to represent the participant's voice, and then continuously updates the weights to obtain the best prediction. There are, however, many differences. First, in the preprocessing of the speech data, we delete the long silent part of each audio file and splice the remainder into a new whole; after this, a label indicating whether the participant is healthy is added to each file, label 0 for healthy persons and label 1 for depressed persons. Through supervised learning, the prediction probability of the individual file is finally output through the softmax layer, from which it is judged how likely the tested person is to suffer from depression.
3. Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
(1) compared with simple clinical examination or the SDS depression self-rating scale, the method avoids the interference of illumination, behaviour, age and similar factors with the detection; it extracts voice features based on the MFCC (Mel-frequency cepstral coefficients) and processes them with deep learning, frames and analyses a large amount of recorded data, statistically accumulates the output classifications of the BP neural network to obtain the probability that an individual suffers from depression, and evaluates the binary classification model with the AUC and the ROC (receiver operating characteristic) curve; the experimental results support its accuracy, so the method of the invention can serve as a low-cost and efficient way of detecting whether depression is present;
(2) the detection method for judging depression based on sound greatly improves the depression recognition rate, and the system implementing it can easily be built on a hospital detector or a computer, so the software and hardware cost is low; it is an accurate and effective depression detection method.
Drawings
The technical solutions of the present invention will be described in further detail below with reference to the accompanying drawings and examples, but it should be understood that these drawings are designed for illustrative purposes only and thus do not limit the scope of the present invention. Furthermore, unless otherwise indicated, the drawings are intended to be illustrative of the structural configurations described herein and are not necessarily drawn to scale.
FIG. 1 is a schematic flow chart of the detection method for distinguishing depression based on sound according to the present invention;
FIG. 2 shows a processing procedure of the detection method for determining depression based on voice according to the present invention;
FIG. 3 shows another processing procedure of the detection method for discriminating depression based on sound according to the present invention.
Detailed Description
Exemplary embodiments of the present invention are described in detail below. Although these exemplary embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, it should be understood that other embodiments may be realized and that various changes to the invention may be made without departing from the spirit and scope of the present invention. The following more detailed description of the embodiments of the invention is not intended to limit the scope of the invention, as claimed, but is presented for purposes of illustration only and not limitation to describe the features and characteristics of the invention, to set forth the best mode of carrying out the invention, and to sufficiently enable one skilled in the art to practice the invention. Accordingly, the scope of the invention is to be limited only by the following claims.
As shown in fig. 1, the detection method for discriminating depression based on sound includes the steps of:
step S101, blind source separation (BSS) analysis is carried out on the collected voice WAV files, and then the sound is digitized;
the step S101 specifically includes:
(1) sampling, quantizing and coding the recording to ensure precision;
(2) explicitly extracting the three main parameters of sound-signal digitization: the sampling frequency, the number of quantization bits and the number of channels (a short sketch of reading these parameters is given below).
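As an illustration of the three digitization parameters in step S101, the following sketch reads a WAV file with Python's standard wave module; the file name and the 16 kHz / 16-bit / mono values in the comments are assumptions, not requirements of the method.

```python
# Sketch: inspect the three main digitization parameters of a recording.
import wave

with wave.open("interview.wav", "rb") as wav:        # hypothetical file name
    sample_rate = wav.getframerate()                 # sampling frequency, e.g. 16000 Hz
    bit_depth = wav.getsampwidth() * 8               # quantization bits, e.g. 16 bit
    n_channels = wav.getnchannels()                  # channel count, e.g. 1 (mono)
    print(sample_rate, bit_depth, n_channels)
```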
Step S102, coding operation is carried out on the voice physical information, cepstrum (spectrum envelope and details) is carried out, 13-dimensional feature vectors of the MFCC are obtained for machine identification, 13-dimensional static coefficients of the original MFCC are supplemented, and the 13-dimensional static coefficients are converted into 39-dimensional MFCC used in identification, and the method comprises the following steps: inputting the static coefficient +13 first-order difference coefficient +13 second-order difference coefficient into a convolutional neural network model;
the step S012 specifically includes:
(1) MFCC feature extraction comprises two key steps: converting to Mel frequency, and performing cepstrum analysis;
the specific process of feature extraction of the MFCC is as follows:
(1) pre-emphasis: the spectrum is multiplied by a coefficient that is positively correlated with frequency, which boosts the high-frequency amplitude; in practice a high-pass filter H(z) = 1 - k·z^(-1) is used, i.e. S'(n) = S(n) - k·S(n-1);
(2) windowing: the signal is windowed with a Hamming window, S'(n) = {0.54 - 0.46·cos[2π(n-1)/(N-1)]}·S(n), where N is the window length; compared with a rectangular window this reduces the side lobes and the spectral leakage after the FFT;
(3) frequency-domain conversion: the time-domain signal is converted into the frequency domain for the subsequent frequency analysis;
(4) Mel-scale filter-bank filtering: the amplitude spectrum obtained from the FFT is multiplied with each filter and accumulated, giving one value per filter, namely the energy of the frame in that filter's frequency band; with 22 filters, 22 energy values are obtained;
(5) taking the logarithm of the energy values: human perception of sound is not linear and is better described by a logarithmic relation, and taking the log also makes the cepstrum analysis possible;
(6) discrete cosine transform: an inverse transform is applied and a low-pass filter keeps the low-frequency part, giving the final characteristic parameters;
(7) dynamic differences: so that the features better reflect temporal continuity, information from neighbouring frames is appended to the feature dimensions, typically as first-order and second-order differences, converting the 13-dimensional MFCC into the 39-dimensional MFCC input to the convolutional neural network model.
(2) The Mel-scale filter bank has high resolution in the low-frequency region, which is consistent with the auditory characteristics of the human ear and is the physical meaning of the Mel scale; conversion to Mel frequency proceeds by first applying a Fourier transform to bring the time-domain signal into the frequency domain, then dividing the frequency-domain signal with the Mel-scale filter bank, so that each frequency band finally corresponds to one numerical value;
(3) cepstrum analysis applies a Fourier transform to the time-domain signal, takes the logarithm and then applies an inverse Fourier transform; cepstra can be divided into the complex cepstrum, the real cepstrum and the power cepstrum, and the power cepstrum is preferred here (a step-by-step extraction sketch follows below).
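The following numpy sketch walks through the listed steps, pre-emphasis, Hamming windowing, FFT, Mel filter-bank energies, logarithm and DCT, for a single recording. The 16 kHz sampling rate, 20 ms frames, 512-point FFT and 22 filters are assumed values, not values fixed by the method; appending first- and second-order differences (e.g. with librosa.feature.delta) then yields the 39-dimensional vectors.

```python
# Sketch of MFCC steps (1)-(6); parameter values are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_13(signal, sr=16000, frame_len=0.020, frame_step=0.010,
            n_fft=512, n_filters=22, n_ceps=13, k=0.97):
    # (1) pre-emphasis: S'(n) = S(n) - k*S(n-1)
    sig = np.append(signal[0], signal[1:] - k * signal[:-1])
    # (2) framing and Hamming windowing
    flen, fstep = int(frame_len * sr), int(frame_step * sr)
    n_frames = 1 + (len(sig) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(flen)
    # (3) FFT to the frequency domain (power spectrum)
    pspec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # (4) Mel-scale triangular filter bank -> one energy value per filter
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ce, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fbank[i - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    energies = np.maximum(pspec @ fbank.T, 1e-10)
    # (5) log of the band energies, (6) DCT keeps the first n_ceps coefficients
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]
```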
S103, establishing a convolutional neural network model for training, and autonomously extracting selection characteristics;
the method specifically comprises the following steps:
(1) the first stage is a stage of data propagation from a low level to a high level, namely a forward propagation stage;
(2) the other stage is a stage of carrying out propagation training on the error from a high level to a bottom level when the result obtained by the current propagation is inconsistent with the expectation, namely a back propagation stage;
the method comprises the following specific steps:
a. initializing a weight value by the network;
b. the input data is transmitted forwards through a convolution layer, a down-sampling layer and a full-connection layer to obtain an output value;
c. calculating the error between the output value of the network and the target value;
d. when the error is larger than the expected value, the error is transmitted back to the network, and the errors of the full connection layer, the down sampling layer and the convolution layer are sequentially obtained;
e. when the error is equal to or less than our expected value, the training is finished;
f. update the weights according to the obtained error, and then return to step b (an illustrative training-loop sketch follows below).
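A minimal PyTorch sketch of this forward/back-propagation loop is given below. The layer sizes, learning rate and target error are illustrative assumptions, and the input is taken to be one 39-dimensional MFCC frame per sample; the patent does not prescribe these values.

```python
# Illustrative sketch of step S103: convolution, down-sampling (pooling) and
# fully connected layers, with error back-propagation when the error is too large.
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),                                  # down-sampling layer
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(32 * 9, n_classes)        # 39 -> 19 -> 9 after pooling

    def forward(self, x):                                     # x: (batch, 1, 39)
        return self.classifier(self.features(x).flatten(1))

model, criterion = FrameCNN(), nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)      # step a: initialize weights

def train_epoch(loader, target_error=0.05):
    for x, y in loader:                    # x: (batch, 1, 39), y: 0 = healthy, 1 = depressed
        out = model(x)                     # step b: forward propagation
        loss = criterion(out, y)           # step c: error between output and target
        if loss.item() > target_error:     # steps d/e: back-propagate while error too large
            optimizer.zero_grad()
            loss.backward()                # error flows back through FC, pooling, conv layers
            optimizer.step()               # step f: update weights, continue from step b
```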
Step S104, the BP network end receives the output characteristic vector, carries out error back-propagation training and classifies the input vector II;
the method specifically comprises the following steps:
(1) network initialization: determine the number of input-layer nodes n, hidden-layer nodes l and output-layer nodes m from the system input-output sequence (X, Y); initialize the connection weights ωij and ωjk between the neurons of the input, hidden and output layers, initialize the hidden-layer thresholds a and the output-layer thresholds b, and set the learning rate and the neuron activation function;
(2) hidden-layer output calculation: from the input vector X, the input-to-hidden connection weights ωij and the hidden-layer thresholds a, compute the hidden-layer output H: Hj = f(Σi ωij·xi - aj), j = 1, 2, …, l, where l is the number of hidden-layer nodes and f is the hidden-layer activation function;
(3) output-layer output calculation: from the hidden-layer output H, the connection weights ωjk and the thresholds b, compute the BP neural network output O: Ok = Σj Hj·ωjk - bk, k = 1, 2, …, m;
(4) error calculation: from the network prediction output O and the expected output Y, compute the network prediction error e: ek = Yk - Ok, k = 1, 2, …, m;
(5) weight update: update the network connection weights ωij and ωjk according to the prediction error e: ωij = ωij + η·Hj·(1 - Hj)·xi·Σk ωjk·ek, i = 1, 2, …, n, j = 1, 2, …, l; ωjk = ωjk + η·Hj·ek, j = 1, 2, …, l, k = 1, 2, …, m, where η is the learning rate;
(6) threshold update: update the node thresholds a and b according to the prediction error e: aj = aj + η·Hj·(1 - Hj)·Σk ωjk·ek, j = 1, 2, …, l; bk = bk + ek, k = 1, 2, …, m;
(7) judging whether the algorithm iteration is finished or not, and if not, returning to the step (2);
(8) the supervised-learning classification algorithm outputs a qualitative classification, each frame being classified as pointing to depression or not (a compact sketch of these update rules is given below).
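The update rules of step S104 can be written compactly in numpy as below; the sigmoid activation and the learning-rate value are assumptions made for this sketch.

```python
# One BP training step following the formulas of step S104 (illustrative sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(x, y, w_ij, w_jk, a, b, eta=0.1):
    """x: (n,) input, y: (m,) expected output,
    w_ij: (n, l) input->hidden weights, w_jk: (l, m) hidden->output weights,
    a: (l,) hidden thresholds, b: (m,) output thresholds."""
    H = sigmoid(x @ w_ij - a)                  # (2) Hj = f(sum_i wij*xi - aj)
    O = H @ w_jk - b                           # (3) Ok = sum_j Hj*wjk - bk
    e = y - O                                  # (4) ek = Yk - Ok
    grad_h = H * (1 - H) * (w_jk @ e)          # shared hidden-layer error term
    w_ij += eta * np.outer(x, grad_h)          # (5) weight updates
    w_jk += eta * np.outer(H, e)
    a += eta * grad_h                          # (6) threshold updates
    b += e
    return O                                   # (8) classify by thresholding the output
```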
S105, obtaining an accumulated value by using a statistical analysis method to obtain the probability of suffering from depression of an individual;
the method specifically comprises the following steps:
(1) 10 million frames of test data are extracted, and the cumulative number of frames pointing to depression is counted;
(2) a threshold is set: if 8 million of those frames are classified as pointing to depression, the person can be said to suffer from depression with a probability of 80%; with one frame lasting 20 ms, this is equivalent to a 10-minute recording in which sound totalling 8 minutes points to depression (a short accumulation sketch follows below).
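A minimal sketch of this accumulation, assuming the per-frame classifications are already available from the classifier:

```python
# Sketch of step S105: turn per-frame classifications into an individual probability.
import numpy as np

def depression_probability(frame_labels, threshold=0.8, frame_ms=20):
    """frame_labels: 1 if the frame's classification points to depression, else 0."""
    frame_labels = np.asarray(frame_labels)
    prob = frame_labels.mean()                            # e.g. 8M of 10M frames -> 0.80
    depressed_minutes = frame_labels.sum() * frame_ms / 1000.0 / 60.0
    return prob, bool(prob >= threshold), depressed_minutes

# Example: in a 10-minute recording with 20 ms frames, frames totalling 8 minutes
# pointing to depression give prob = 0.8, which meets the 0.8 threshold.
```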
And step S106, the binary classification model is evaluated with the AUC and ROC metrics to verify its accuracy.
The method specifically comprises the following steps:
(1) based on the concepts of Positive, Negative, True and False in the confusion matrix: a predicted class of 1 is Positive and a predicted class of 0 is Negative; a correct prediction is True and a wrong prediction is False; combining these four concepts yields the confusion matrix;
(2) the True Positive Rate and False Positive Rate are calculated as TPRate = TP/(TP + FN) and FPRate = FP/(FP + TN); TPRate is the proportion predicted as 1 among all samples whose true class is 1, and FPRate is the proportion predicted as 1 among all samples whose true class is 0;
(3) when the classifier is effective, the probability of predicting 1 for a sample whose true class is 1 (TPRate) is greater than the probability of predicting 1 for a sample whose true class is 0 (FPRate), i.e. y > x on the ROC plot;
(4) in the experiments, with 0.8 used as the threshold, a series of TPRate and FPRate values is obtained, the points are plotted and the area under the curve is computed to give the AUC; the AUC value is high, so the accuracy of the sound-based depression assessment method is reliable (a short evaluation sketch follows below).
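The following scikit-learn sketch computes the confusion-matrix rates at the 0.8 threshold and the AUC over all thresholds; the choice of library is an assumption for illustration.

```python
# Sketch of step S106: confusion matrix, TPRate/FPRate and AUC.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

def evaluate(y_true, y_score, threshold=0.8):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tp_rate = tp / (tp + fn)                  # TPRate = TP / (TP + FN)
    fp_rate = fp / (fp + tn)                  # FPRate = FP / (FP + TN)
    fprs, tprs, _ = roc_curve(y_true, y_score)
    return tp_rate, fp_rate, auc(fprs, tprs)  # area under the ROC curve
```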
Example 1
As shown in FIGS. 2 and 3, the Distress Analysis Interview Corpus Wizard-of-Oz (DAIC-WOZ) dataset was used as the experimental data, and the above-described method was applied to it.
Firstly, the DAIC-WOZ samples are preprocessed to reduce noise interference with the subsequent feature extraction. Without preprocessing, the raw speech data contains intermittent silent stretches. In this method a threshold is set flexibly to decide whether the current segment is silent; segments exceeding the silence threshold are deleted, and blanks of 0.03 s are added at both ends of the audio to keep the sound stable. At the same time each file is labelled "depressed" or "healthy" to facilitate the subsequent data processing (a minimal preprocessing sketch follows below);
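A minimal preprocessing sketch using librosa is shown below; the 30 dB silence threshold is an assumed value, and 0.03 s of silence is padded at both ends as described above.

```python
# Sketch of the preprocessing step: drop long silences, pad, and attach a label.
import numpy as np
import librosa

def preprocess(wav_path, label, top_db=30, pad_s=0.03, sr=16000):
    """label: 1 = depressed, 0 = healthy; top_db is an assumed silence threshold."""
    y, sr = librosa.load(wav_path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)              # voiced segments
    voiced = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y
    pad = np.zeros(int(pad_s * sr))                                   # 0.03 s of silence
    return np.concatenate([pad, voiced, pad]), label
```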
secondly, the Mel-frequency cepstral coefficients of the speech signals are extracted: through pre-emphasis, framing and windowing, FFT, Mel filtering and the related steps applied to the speech files, the MFCC feature data capturing the unique voice attributes of the participants are finally obtained; these feature data are vital to the normal training of the network model;
and finally, the extracted MFCC features are input into the convolutional neural network model; classification prediction through the convolutional layers, fully connected layers and softmax function gives the error between the actual result and the target value; the error value is then back-propagated with the BP algorithm to update the network weights and optimize the network structure, finally giving the probability that a single file is predicted as healthy or depressed. The trained model is evaluated on the test set to obtain the proportion of correctly predicted frames within a single file, i.e. the final prediction accuracy of that file.
Overall, the total prediction accuracy was 0.86, and the average prediction accuracy for a single file was 0.84. The ROC and AUC evaluation considers both the rate at which healthy persons are predicted to be healthy and the rate at which depressed persons are predicted to be depressed. When the relevant training parameters are adjusted, the model retains high stability and prediction accuracy, which demonstrates the effectiveness of the method.

Claims (9)

1. A detection method for judging depression based on voice, characterized in that the sound elements are collected and stored in digital form, the sound file data are analysed with a blind source separation (BSS) algorithm and the speech is identified; the speech signal to be processed is analysed with MFCCs as characteristic parameters, converted to the Mel frequency scale and subjected to cepstrum analysis; several groups of training data are used to collect data from the recordings, and a convolutional neural network model is established for discrimination; the obtained test sample data are classified and analysed with a BP neural network; and an ROC (receiver operating characteristic) and AUC (area under the curve) model evaluation method based on the confusion matrix is used to verify the accuracy of the sound-based estimate of an individual's probability of suffering from depression.
2. The detection method for distinguishing depression based on sound according to claim 1, characterized by comprising the following specific steps:
step S101, carrying out blind source separation (BSS) analysis on the collected voice WAV files, and then digitizing the sound;
step S102, encoding the physical speech information and performing cepstrum analysis to obtain the 13-dimensional MFCC feature vector for machine recognition, and extending the original 13 static MFCC coefficients to the 39-dimensional MFCC used in recognition, namely 13 static coefficients + 13 first-order difference coefficients + 13 second-order difference coefficients, which are input into the convolutional neural network model;
step S103, establishing and training a convolutional neural network model that autonomously extracts and selects features;
step S104, the BP network end receiving the output feature vector, carrying out error back-propagation training and binary-classifying the input vector;
step S105, obtaining an accumulated value by statistical analysis to give the probability that the individual suffers from depression;
and step S106, evaluating the binary classification model with the AUC and ROC metrics to verify its accuracy.
3. The detection method for distinguishing depression based on sound according to claim 2, wherein the step S101 specifically includes:
(1) sampling, quantizing and coding the audio record to ensure the precision;
(2) explicitly extracting the three main parameters of sound-signal digitization: the sampling frequency, the number of quantization bits and the number of channels.
4. The detection method for distinguishing depression based on sound according to claim 2, wherein the step S102 specifically includes:
(1) MFCC feature extraction comprises two key steps: converting to Mel frequency, and performing cepstrum analysis;
(2) firstly, Fourier transform is carried out on a time domain signal to convert the time domain signal into a frequency domain, then, a filter bank with a Mel frequency scale is utilized to divide the frequency domain signal, and finally, each frequency segment corresponds to a numerical value;
(3) the cepstrum analysis is to perform Fourier transform on time domain signals, then take log, perform inverse Fourier transform, and can be divided into complex cepstrum, real cepstrum and power cepstrum, and preferentially select the power cepstrum.
5. The sound-based depression detection method according to claim 2, wherein the MFCC extraction features specifically include:
(1) pre-emphasis: multiplying the spectrum by a coefficient positively correlated with frequency, implemented with a high-pass filter H(z) = 1 - k·z^(-1), i.e. S'(n) = S(n) - k·S(n-1);
(2) windowing: windowing the signal with a Hamming window, S'(n) = {0.54 - 0.46·cos[2π(n-1)/(N-1)]}·S(n), where N is the window length;
(3) converting the frequency domain, namely converting the time domain signal into the frequency domain for subsequent frequency analysis;
(4) filtering by using a Mel scale filter bank, and respectively multiplying and accumulating the amplitude spectrum obtained by FFT with each filter to obtain a value, namely the energy value of the frame data in the corresponding frequency band of the filter;
(5) taking log of the energy value, and performing cepstrum analysis after the log is taken;
(6) discrete cosine transform, performing inverse Fourier transform, and then obtaining a final low-frequency signal through a low-pass filter to obtain a final characteristic parameter;
(7) converting the 13-dimensional MFCC into the 39-dimensional MFCC input to the convolutional neural network model by appending first-order and second-order differences.
6. The detection method for distinguishing depression based on sound according to claim 2, wherein the step S103 specifically includes:
(1) the first stage is a stage of data propagation from a low level to a high level, namely a forward propagation stage;
(2) the other stage is a stage of propagating the error from the high level back to the low level for training when the result obtained by the current propagation is inconsistent with the expectation, namely the back-propagation stage; specifically as follows:
a. initializing a weight value by the network;
b. the input data is transmitted forwards through a convolution layer, a down-sampling layer and a full-connection layer to obtain an output value;
c. calculating the error between the output value of the network and the target value;
d. when the error is larger than the expected value, the error is transmitted back to the network, and the errors of the full connection layer, the down sampling layer and the convolution layer are sequentially obtained;
e. when the error is equal to or less than our expected value, the training is finished;
f. updating the weights according to the obtained error, and then returning to step b.
7. The detection method for distinguishing depression based on sound according to claim 2, wherein the step S104 specifically includes:
(1) network initialization: determining the number of input-layer nodes n, hidden-layer nodes l and output-layer nodes m from the system input-output sequence (X, Y); initializing the connection weights ωij and ωjk between the neurons of the input, hidden and output layers, initializing the hidden-layer thresholds a and the output-layer thresholds b, and setting the learning rate and the neuron activation function;
(2) hidden-layer output calculation: from the input vector X, the input-to-hidden connection weights ωij and the hidden-layer thresholds a, calculating the hidden-layer output H: Hj = f(Σi ωij·xi - aj), j = 1, 2, …, l, where l is the number of hidden-layer nodes and f is the hidden-layer activation function;
(3) output-layer output calculation: from the hidden-layer output H, the connection weights ωjk and the thresholds b, calculating the BP neural network output O: Ok = Σj Hj·ωjk - bk, k = 1, 2, …, m;
(4) error calculation: from the network prediction output O and the expected output Y, calculating the network prediction error e: ek = Yk - Ok, k = 1, 2, …, m;
(5) weight update: updating the network connection weights ωij and ωjk according to the prediction error e: ωij = ωij + η·Hj·(1 - Hj)·xi·Σk ωjk·ek, i = 1, 2, …, n, j = 1, 2, …, l; ωjk = ωjk + η·Hj·ek, j = 1, 2, …, l, k = 1, 2, …, m, where η is the learning rate;
(6) threshold update: updating the node thresholds a and b according to the prediction error e: aj = aj + η·Hj·(1 - Hj)·Σk ωjk·ek, j = 1, 2, …, l; bk = bk + ek, k = 1, 2, …, m;
(7) judging whether the algorithm iteration is finished or not, and if not, returning to the step (2);
(8) the supervised-learning classification algorithm outputs a qualitative classification, each frame being classified as pointing to depression or not.
8. The detection method for distinguishing depression based on sound according to claim 2, wherein the step S105 specifically includes:
(1) 10 million frames of test data are extracted, and the cumulative number of frames pointing to depression is counted;
(2) setting a threshold: if 8 million of those frames are classified as pointing to depression, the person suffers from depression with a probability of 80%; with one frame lasting 20 ms, in a 10-minute recording, if sound totalling 8 minutes points to depression, the person is judged to suffer from depression.
9. The detection method for distinguishing depression based on sound according to claim 2, wherein the step S106 specifically includes:
(1) based on the concepts of Positive, Negative, True and False in the confusion matrix: a predicted class of 1 is Positive, a predicted class of 0 is Negative, a correct prediction is True and a wrong prediction is False;
(2) calculating the True Positive Rate and False Positive Rate: TPRate = TP/(TP + FN) and FPRate = FP/(FP + TN); TPRate is the proportion predicted as 1 among all samples whose true class is 1, and FPRate is the proportion predicted as 1 among all samples whose true class is 0;
(3) when the classifier is effective, the probability of predicting 1 for a sample whose true class is 1 (TPRate) is greater than the probability of predicting 1 for a sample whose true class is 0 (FPRate), i.e. y > x;
(4) through experiments, 0.8 is set as the threshold, a series of TPRate and FPRate values is obtained, the points are plotted and the area is calculated to give the AUC; the AUC value is high, so the accuracy of the sound-based depression assessment method is reliable.
CN202010817892.XA 2020-08-14 2020-08-14 Detection method for distinguishing depression based on sound Pending CN111951824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010817892.XA CN111951824A (en) 2020-08-14 2020-08-14 Detection method for distinguishing depression based on sound

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010817892.XA CN111951824A (en) 2020-08-14 2020-08-14 Detection method for distinguishing depression based on sound

Publications (1)

Publication Number Publication Date
CN111951824A true CN111951824A (en) 2020-11-17

Family

ID=73343223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010817892.XA Pending CN111951824A (en) 2020-08-14 2020-08-14 Detection method for distinguishing depression based on sound

Country Status (1)

Country Link
CN (1) CN111951824A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112472065A (en) * 2020-11-18 2021-03-12 天机医用机器人技术(清远)有限公司 Disease detection method based on cough sound recognition and related equipment thereof
CN112908435A (en) * 2021-01-28 2021-06-04 南京脑科医院 Depression cognitive behavior training system and voice data processing method
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN113509183A (en) * 2021-04-21 2021-10-19 杭州聚视鼎特科技有限公司 Method for analyzing emotional anxiety, depression and tension based on AR artificial intelligence
CN113274023A (en) * 2021-06-30 2021-08-20 中国科学院自动化研究所 Multi-modal mental state assessment method based on multi-angle analysis
CN113274023B (en) * 2021-06-30 2021-12-14 中国科学院自动化研究所 Multi-modal mental state assessment method based on multi-angle analysis
CN115346561A (en) * 2022-08-15 2022-11-15 南京脑科医院 Method and system for estimating and predicting depression mood based on voice characteristics
CN115346561B (en) * 2022-08-15 2023-11-24 南京医科大学附属脑科医院 Depression emotion assessment and prediction method and system based on voice characteristics
CN116978408A (en) * 2023-04-26 2023-10-31 新疆大学 Depression detection method and system based on voice pre-training model
CN116978409A (en) * 2023-09-22 2023-10-31 苏州复变医疗科技有限公司 Depression state evaluation method, device, terminal and medium based on voice signal

Similar Documents

Publication Publication Date Title
CN111951824A (en) Detection method for distinguishing depression based on sound
Godino-Llorente et al. Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors
Fujimura et al. Classification of voice disorders using a one-dimensional convolutional neural network
CN111798874A (en) Voice emotion recognition method and system
US10548534B2 (en) System and method for anhedonia measurement using acoustic and contextual cues
Vrindavanam et al. Machine learning based COVID-19 cough classification models-a comparative analysis
CN112820279B (en) Parkinson detection model construction method based on voice context dynamic characteristics
Dahmani et al. Vocal folds pathologies classification using Naïve Bayes Networks
CN113012720A (en) Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction
CN111329494A (en) Depression detection method based on voice keyword retrieval and voice emotion recognition
CN115346561B (en) Depression emotion assessment and prediction method and system based on voice characteristics
CN113674767A (en) Depression state identification method based on multi-modal fusion
WO2023139559A1 (en) Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
Jiang et al. A novel infant cry recognition system using auditory model‐based robust feature and GMM‐UBM
Whitehill et al. Whosecough: In-the-wild cougher verification using multitask learning
CN112466284B (en) Mask voice identification method
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Sabet et al. COVID-19 detection in cough audio dataset using deep learning model
Villanueva et al. Respiratory Sound Classification Using Long-Short Term Memory
CN113571050A (en) Voice depression state identification method based on Attention and Bi-LSTM
Akshay et al. Identification of Parkinson disease patients classification using feed forward technique based on speech signals
CN114038562A (en) Psychological development assessment method, device and system and electronic equipment
Baquirin et al. Artificial neural network (ANN) in a small dataset to determine neutrality in the pronunciation of english as a foreign language in filipino call center agents: Neutrality classification of Filipino call center agent's pronunciation
Jothi et al. Speech intelligence using machine learning for aphasia individual

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 706, 7th Floor, Building 1, No. 2 Litai Road, Taiping Street, Xiangcheng District, Suzhou City, Jiangsu Province, 215100

Applicant after: Suzhou Guoling technology research Intelligent Technology Co.,Ltd.

Address before: Room 609, building C, Caohu science and Technology Park, xijiaoda, No.1, Guantang Road, Caohu street, economic and Technological Development Zone, Xiangcheng District, Suzhou City, Jiangsu Province

Applicant before: Suzhou Guoling technology research Intelligent Technology Co.,Ltd.

CB02 Change of applicant information