CN115346561A - Method and system for estimating and predicting depression mood based on voice characteristics - Google Patents

Method and system for estimating and predicting depression mood based on voice characteristics

Info

Publication number
CN115346561A
Authority
CN
China
Prior art keywords
features
voice
neural network
signal
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210974876.0A
Other languages
Chinese (zh)
Other versions
CN115346561B (en)
Inventor
王菲
张锡哲
尹舒络
姚菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Brain Hospital
Original Assignee
Nanjing Brain Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Brain Hospital filed Critical Nanjing Brain Hospital
Priority to CN202210974876.0A priority Critical patent/CN115346561B/en
Publication of CN115346561A publication Critical patent/CN115346561A/en
Application granted granted Critical
Publication of CN115346561B publication Critical patent/CN115346561B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique, using neural networks
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use, for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
    • G10L25/66 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use, for comparison or discrimination, for extracting parameters related to health condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a method and a system for assessing and predicting depressed mood based on voice features, and relates to the technical field of depressed-mood assessment. The method comprises the following specific steps: collecting a speech signal data set; calculating the upper envelope, spectrogram, Mel cepstrum coefficients and LLD speech features of the speech signal data set as the speech signal features; inputting each speech signal feature into a pre-trained deep neural network feature extraction sub-model to extract the neural network features corresponding to that speech signal feature; concatenating the neural network features output by the sub-models into a one-dimensional vector as the multi-modal speech feature; and inputting the multi-modal speech feature into a trained evaluation model for emotion assessment. The method can evaluate depressed mood effectively and accurately, and improves the accuracy of depressed-mood assessment compared with traditional scales.

Description

Method and system for estimating and predicting depression mood based on voice characteristics
Technical Field
The invention relates to the technical field of depressed-mood assessment, and in particular to a method and system for assessing and predicting depressed mood based on voice characteristics.
Background
In the prior art, treatment effect is determined mainly by quantitative assessment of the patient's basic condition using various depression scales. Depression scales are currently an important basis for judging whether a patient suffers from depression and for judging treatment effect. The main clinical depression scales include the Hamilton Depression Rating Scale, the PHQ-9 and others. Patients are usually examined by specially trained clinicians or psychologists through conversation and observation, and are finally scored on the scale to determine efficacy. Such heavy reliance on subjective judgment easily leads to inconsistent evaluation criteria between doctors, so the assessment of the patient's condition is not accurate enough. How to objectively and accurately assess and predict a patient's depressed mood is therefore a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above, the present invention provides a method and a system for estimating and predicting a depressed mood based on speech features, so as to solve the problems in the background art.
In order to achieve the above purpose, the invention adopts the following technical solution: a depression treatment effect prediction method based on a convolutional neural network, comprising the following specific steps:
collecting a speech signal data set;
calculating an upper envelope line, a spectrogram, a Mel cepstrum coefficient and LLDs voice characteristics of the voice signal data set as voice signal characteristics;
respectively inputting the voice signal characteristics into a pre-trained deep neural network characteristic extraction sub-model to extract neural network characteristics corresponding to the voice signal characteristics;
splicing the neural network features output by each submodel into a one-dimensional vector as a multi-modal voice feature;
and inputting the multi-modal voice features into a trained evaluation model to carry out depression emotion evaluation prediction.
Optionally, the method further comprises preprocessing the speech signal data set, wherein the preprocessing comprises performing a down-sampling operation on the speech signal data set, performing endpoint detection on the signal with a double-threshold endpoint detection algorithm to identify the start point and end point of the subject's audio signal, and clipping the signal.
Optionally, the upper envelope of the speech signal data set is calculated using a sliding window, and the spectrogram is calculated with the librosa toolkit.
Optionally, the computation process of the mel-frequency cepstrum coefficient is as follows:
s1, pre-emphasis is carried out on the voice signal data set through a high-pass filter to obtain a first signal, and the formula is as follows:
H(z) = 1 - μz^(-1), where the value of μ is 0.97;
s2, performing framing operation on the first signal, and multiplying each frame by a Hamming window in order to increase the continuity of the frame, wherein the formula is as follows:
S'(n) = S(n) × W(n), W(n) = (1 - α) - α·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
where S(n) is the framed signal, n = 0, 1, ..., N - 1, N is the frame size, and the value of α is 0.46;
S3, after each frame is multiplied by the Hamming window, a fast Fourier transform is applied to the frame to obtain its energy distribution over the spectrum, i.e. the energy spectrum;
s4, enabling the energy spectrum to pass through a triangular filter bank, and calculating logarithmic energy passing through the triangular filter bank, wherein the formula is as follows:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 0 ≤ m ≤ M;
where M is the number of filters, H_m(k) is the m-th triangular filter, and m = 1, 2, ..., M;
S5, on the basis of the passing logarithmic energy, obtaining low-frequency information of a frequency spectrum through discrete cosine transform, wherein the formula is as follows:
C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, ..., L;
where s(m) is the logarithmic energy output by the m-th triangular filter, m = 1, 2, ..., M, and L is the order of the Mel cepstrum coefficients.
Optionally, the training process of the evaluation model includes:
collecting voice signals;
using the upper envelope curve, spectrogram, mel cepstrum coefficient and LLDs speech features of the speech signal as speech signal features;
pre-training the feature extraction sub-models with the speech signal features, where the training label is the total HAMD-17 depression scale score of the corresponding sample;
after the pre-training is finished, splicing the full connection layer outputs of the feature extraction submodels into a one-dimensional vector serving as the deep neural network feature of the sample;
the deep neural network features are used as inputs to train the evaluation model.
Optionally, the upper envelope, the spectrogram and the Mel cepstrum coefficients are all time-series features; in the deep neural network feature extraction sub-model for the time-series features, a CuDNNLSTM layer is adopted as the recurrent layer of the neural network, an attention layer is added after the recurrent layer to weight the outputs of the time steps, and finally a fully connected layer operates on the weighted vectors to predict the label; the attention layer adopts a self-attention mechanism.
By adopting the above technical solution, the following beneficial technical effects are obtained: the time-series speech feature frames are weighted by a self-attention mechanism, so that feature frames highly related to depressed mood are learned with greater focus and the training difficulty of the model is reduced; the CuDNNLSTM layer can combine long-term and short-term memory information and supports fast inference on a GPU.
Optionally, the LLDs speech features are non-structural feature vectors, and the deep neural network feature extraction submodel of the non-structural feature vectors is constructed by stacking fully connected layers.
On the other hand, a depression treatment effect prediction system based on a convolutional neural network comprises a data acquisition module, a first feature extraction module, a second feature extraction module, a feature splicing module and a depression emotion assessment and prediction module which are connected in sequence; wherein:
the data acquisition module is used for acquiring a voice signal data set;
the first feature extraction module is used for calculating an upper envelope line, a spectrogram, a Mel cepstrum coefficient and LLDs voice features of the voice signal data set as voice signal features;
the second feature extraction module is used for respectively inputting the voice signal features into a pre-trained deep neural network feature extraction sub-model to extract neural network features corresponding to the voice signal features;
the feature splicing module is used for splicing the neural network features output by each sub-model into a one-dimensional vector as multi-modal voice features;
and the depression emotion assessment and prediction module is used for inputting the multi-modal voice features into a trained assessment and prediction model to carry out depression emotion assessment and prediction.
Optionally, the system further comprises a data preprocessing module connected to the data acquisition module, wherein the data preprocessing module is configured to perform down-sampling on the voice signal data set, perform endpoint detection on the signal by using a double-threshold endpoint detection algorithm to identify a start point and an end point of the audio signal of the subject, and perform clipping on the signal.
Compared with the prior art, the invention discloses a method and system for predicting the treatment effect of depression based on a convolutional neural network, which have practical significance for depression treatment effect prediction and for the clinical treatment of depression. The method and system not only eliminate the subjective influence of depression patients and clinicians on the diagnosis of the depressive condition, but also reduce the workload of doctors who repeatedly evaluate patients' treatment effect, and can improve the patients' treatment experience. The method analyses and processes the audio information of depression patients and inputs it into the neural network model for feature extraction, feature fusion and autonomous learning; feature fusion provides more information for the model's decision, so the accuracy of the overall decision result is improved, the treatment effect of depression patients can be effectively evaluated, and the diagnosis and treatment efficiency of clinicians and the treatment experience of patients are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of a method for estimating and predicting depressed mood of the present invention;
FIG. 2 is a flow chart of the evaluation model training of the present invention;
FIG. 3 is a diagram of the Attention mechanism structure of the present invention;
FIG. 4 is a diagram of a deep neural network feature extraction submodel structure for timing features according to the present invention;
FIG. 5 is a diagram of a sub-model for unstructured data according to the present invention;
FIG. 6 is a diagram of a neural network feature stitching architecture in accordance with the present invention;
FIG. 7 is a plot of the area under the ROC curve of the present invention;
FIG. 8 is a schematic diagram of a confusion matrix according to the present invention;
fig. 9 is a system configuration diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a method for estimating and predicting depressed mood based on voice characteristics, which comprises the following specific steps as shown in figure 1:
step one, collecting a voice signal data set;
step two, calculating an upper envelope line, a spectrogram, a Mel cepstrum coefficient and LLDs voice characteristics of the voice signal data set as voice signal characteristics;
step three, respectively inputting the voice signal characteristics into a pre-trained deep neural network characteristic extraction sub-model to extract neural network characteristics corresponding to the voice signal characteristics;
splicing the neural network characteristics output by each sub-model into a one-dimensional vector as a multi-modal voice characteristic;
and fifthly, inputting the multi-modal voice features into the trained evaluation model to carry out depression emotion evaluation prediction.
Further, as shown in fig. 2, the training process of the evaluation model is as follows:
collecting voice signals;
using the upper envelope curve, spectrogram, mel cepstrum coefficient and LLDs as the features of the speech signal;
pre-training the feature extraction sub-models with the speech signal features, where the training label is the total HAMD-17 depression scale score of the corresponding sample;
after the pre-training is finished, splicing the full connection layer output of the feature extraction sub-model into a one-dimensional vector as the deep neural network feature of the sample;
the deep neural network features are used as inputs to train the evaluation model.
Further, in step one, the data used in this embodiment are voice data, comprising recordings of healthy subjects and depression patients made at enrolment and at the weekly evaluations. Participants were asked to read the prose poem "Life Like Summer Flowers" freely in a quiet environment, and a voice recording pen was used to record the speech signals. The recording pen was placed on a table 20-30 cm from the subject with the microphone pointed at the subject. The audio signal was recorded with a Newman RD07 recording pen at a sampling frequency of 44.1 kHz and a sampling depth of 16 bits, and finally stored in mono WAV format.
During the interviews with the experimental volunteers, the doctor needs to turn on the recording device in advance and turn it off after the audio acquisition is finished. As a result, the acquired speech signal contains segments unrelated to the interview at its beginning and end, and the subsequent data analysis is disturbed by background noise or the device's noise floor. To avoid this interference, a double-threshold endpoint detection algorithm is used to perform endpoint detection on the signal. Endpoint detection identifies the beginning and end of the speech signal within a segment of audio, after which the signal can be clipped.
Further, the method comprises preprocessing the speech signal data set. The preprocessing first uses the ffmpeg tool to transcode the speech signals collected by the recording device, down-sampling the 44.1 kHz recordings to a sampling frequency of 16 kHz. To avoid interference from background noise in the subsequent analysis and to reduce the computational load of speech signal processing, a double-threshold endpoint detection algorithm is then used to detect the start point and end point of the subject's audio signal, and the signal is clipped accordingly.
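As a rough illustration of this preprocessing stage, the sketch below resamples a recording to 16 kHz and applies a simplified double-threshold rule (short-time energy plus zero-crossing rate) before clipping. The frame sizes, thresholds and output file name are illustrative assumptions, not values taken from the patent, and librosa's resampling stands in for the ffmpeg transcoding step only to keep the sketch self-contained.

```python
# Minimal preprocessing sketch: resample to 16 kHz, detect speech endpoints with a
# simplified dual-threshold rule (short-time energy + zero-crossing rate), and clip.
# Frame sizes and thresholds are assumed values for illustration only.
import numpy as np
import librosa
import soundfile as sf

def preprocess(path, sr_out=16000, frame_len=400, hop=160):
    y, _ = librosa.load(path, sr=sr_out)          # resamples 44.1 kHz input to 16 kHz

    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    energy = (frames ** 2).sum(axis=0)            # short-time energy per frame
    zcr = librosa.feature.zero_crossing_rate(
        y, frame_length=frame_len, hop_length=hop)[0][: energy.shape[0]]

    # A frame counts as speech if it clears the high energy threshold, or clears the
    # low energy threshold together with the zero-crossing-rate threshold.
    e_hi, e_lo, z_th = 0.10 * energy.max(), 0.02 * energy.max(), 0.1
    voiced = (energy > e_hi) | ((energy > e_lo) & (zcr > z_th))

    idx = np.nonzero(voiced)[0]
    start, end = idx[0] * hop, (idx[-1] + 1) * hop + frame_len
    y_cut = y[start:end]                          # clip to the detected endpoints
    sf.write(path.replace(".wav", "_16k_cut.wav"), y_cut, sr_out)
    return y_cut
```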
Further, in the second step, the specific way of calculating the envelope curve, the spectrogram, the mel cepstrum coefficient and the speech features of the LLDs is as follows:
(1) Upper envelope curve
Frequency-related information is reflected in the spectrum, so for the waveform signal the expectation is only that the model can learn some useful information from the envelope of the signal. The invention calculates a low-resolution upper envelope of the waveform signal as one input to the model, using a sliding window. The size of the sliding window is set to 800 sample points and the window shift is set to 400 sample points. The average of the samples with values greater than 0 within the window is used to represent the value of the current window.
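A minimal numpy sketch of this envelope computation, using the 800-sample window and 400-sample shift stated above; the fallback value for windows containing no positive samples is an added assumption.

```python
import numpy as np

def upper_envelope(y, win=800, hop=400):
    """Low-resolution upper envelope: mean of the positive samples in each window."""
    env = []
    for start in range(0, len(y) - win + 1, hop):
        w = y[start:start + win]
        pos = w[w > 0]
        env.append(pos.mean() if pos.size else 0.0)  # guard: window with no positive samples
    return np.asarray(env, dtype=np.float32)
```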
(2) Speech spectrum
Although the time domain waveform is simple and intuitive, for complex signals such as speech, some characteristics are exhibited in the frequency domain.
A speech signal is short-time stationary: at each moment a spectrum is obtained by analysing the speech signal within a short time interval around that moment, and performing such spectral analysis continuously over the speech signal yields a two-dimensional map whose abscissa represents time and whose ordinate represents frequency; the grey level of each pixel reflects the energy at the corresponding time and frequency. Such a time-frequency map is called a spectrogram. The energy power spectrum can be calculated by the formula:
S(n, w) = |X(n, w)|²;
where X(n, w) represents the magnitude of the Fourier transform, at frequency w, of the frame signal centred at time point n, and is calculated by the formula:
X(n, w) = Σ_{m=-N}^{N} x(n + m)·w[m]·e^(-jwm);
where w[n] is a window function of length 2N+1; a Hamming window is typically used as the window function.
A long time window (at least two pitch periods) is often used to compute the narrow-band spectrogram. The narrow-band spectrogram has higher frequency resolution and lower time resolution, and the good frequency resolution can make each harmonic component of the voice more easily distinguished and is displayed as a horizontal stripe on the spectrogram. In the invention, a narrowband spectrogram of the voice is calculated by using a librosa toolkit, the window size of a window function is 10ms, and the window displacement is 2.5ms.
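A sketch of this spectrogram computation with librosa, using the 10 ms window and 2.5 ms shift given above (160 and 40 samples at 16 kHz); the choice of a Hamming window and the log scaling of the output are assumptions.

```python
import numpy as np
import librosa

def spectrogram(y, sr=16000):
    win = int(0.010 * sr)                 # 10 ms window -> 160 samples
    hop = int(0.0025 * sr)                # 2.5 ms shift -> 40 samples
    X = librosa.stft(y, n_fft=win, hop_length=hop, win_length=win, window="hamming")
    power = np.abs(X) ** 2                # energy power spectrum |X(n, w)|^2
    return librosa.power_to_db(power)     # log-scaled time-frequency map
```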
(3) Mel frequency cepstrum coefficient
MFCC (Mel-Frequency Cepstral Coefficients) are the Mel frequency cepstrum coefficients. They were proposed by Davis and Mermelstein in 1980 and combine the auditory perception characteristics of the human ear with the speech production mechanism; they are a feature widely used in automatic speech and speaker recognition.
The human ear can distinguish various sounds normally in a noisy environment by focusing only on certain specific frequency components. The filtering action of the cochlea operates on a logarithmic frequency scale and is approximately linear below 1000 Hz, which makes the human ear more sensitive to low-frequency signals. Based on this phenomenon, phoneticians designed a set of filter banks that mimic the cochlear filtering function, called Mel-frequency filter banks.
The MFCC calculation flow is as follows:
s1, pre-emphasis is carried out on a voice signal data set through a high-pass filter to obtain a first signal, and the formula is as follows:
H(z) = 1 - μz^(-1), where the value of μ is 0.97;
s2, performing framing operation on the first signal, and multiplying each frame by a Hamming window in order to increase the continuity of the frame, wherein the formula is as follows:
S'(n) = S(n) × W(n), W(n) = (1 - α) - α·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
where S(n) is the framed signal, n = 0, 1, ..., N - 1, N is the frame size, and the value of α is 0.46;
S3, after each frame is multiplied by the Hamming window, a fast Fourier transform is applied to the frame to obtain its energy distribution over the spectrum, i.e. the energy spectrum;
s4, passing the energy spectrum through a triangular filter bank, and calculating logarithmic energy passing through the triangular filter bank, wherein the formula is as follows:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 0 ≤ m ≤ M;
where M is the number of filters, H_m(k) is the m-th triangular filter, and m = 1, 2, ..., M;
S5, on the basis of the passing logarithmic energy, obtaining low-frequency information of a frequency spectrum through discrete cosine transform, wherein the formula is as follows:
C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, ..., L;
where s(m) is the logarithmic energy output by the m-th triangular filter, m = 1, 2, ..., M, and L is the order of the Mel cepstrum coefficients.
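The numpy sketch below follows steps S1-S5 (pre-emphasis with μ = 0.97, framing and Hamming windowing with α = 0.46, FFT energy spectrum, Mel triangular filter bank, DCT). The frame length, hop, FFT size, number of filters and number of retained coefficients are assumed values, not parameters stated in the patent.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(y, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_mfcc=13):
    # S1: pre-emphasis, H(z) = 1 - 0.97 z^-1
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])

    # S2: framing, then Hamming window W(n) = (1 - 0.46) - 0.46 cos(2*pi*n / (N - 1))
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)

    # S3: energy spectrum via the FFT of each windowed frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # S4: Mel triangular filter bank, then the log energy of each filter
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)

    # S5: DCT keeps the low-frequency cepstral information
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```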
(4)LLDs
In the field of speech emotion recognition, prosodic and similar features of speech are generally used for analysis. Most speech features are calculated by carefully designed short-time analysis algorithms after preprocessing operations such as framing and windowing of the original speech waveform. Since speech is one-dimensional time-series data, the feature sequence obtained after short-time analysis, which tends to capture the temporal changes of speech, is called Low-Level Descriptors (LLDs). In order to map low-level descriptors of variable length to fixed-size feature vectors, utterance-level dynamic features are obtained by computing statistics of the LLDs over all frames of a whole utterance; these are called High-level Statistical Functions (HSFs). The invention adopts the emoLarge speech emotion feature set provided by openSMILE to calculate the HSFs of the audio, comprising 6552 features in total. In addition to voiceProb, the first- and second-order differences of the remaining LLDs of the emoLarge feature set are calculated as dynamic features.
To follow the general practice of neural network feature engineering, the 6552 extracted HSFs were further screened. Pearson correlation coefficients between the HSFs are calculated, collinear feature pairs in the original feature set are removed using 0.7 as the threshold, and the remaining 590 HSFs are used as the HSF speech features in the experiment, defined as the hand-crafted speech features.
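A pandas sketch of this collinearity screening: for every feature pair whose absolute Pearson correlation exceeds 0.7, one member of the pair is dropped. The column layout of the HSF table is an assumption.

```python
import numpy as np
import pandas as pd

def drop_collinear(hsfs: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    corr = hsfs.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each feature pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return hsfs.drop(columns=to_drop)
```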
The method models the patterns of depressed mood in different types of speech features based on the self-learning ability of the neural network, and uses pre-trained neural network models to extract high-level neural network feature representations of the speech. To address the problem that the time steps of some speech features are too long, a self-attention mechanism is designed to weight the time-series speech feature frames, so that feature frames highly related to depressed mood are learned with greater focus and the training difficulty of the model is reduced. Finally, a feature fusion method is used to fuse the multiple high-level neural network features; feature fusion allows the unique information of different features to complement one another, further improving prediction accuracy.
Further, the upper envelope, the spectrogram and the Mel cepstrum coefficients are all time-series features; in the deep neural network feature extraction sub-model for the time-series features, a CuDNNLSTM layer is adopted as the recurrent layer of the neural network, an attention layer is added after the recurrent layer to weight the outputs of the time steps, and finally a fully connected layer operates on the weighted vectors to predict the label; the attention layer adopts a self-attention mechanism.
The calculation method of the self-attention mechanism used in the present embodiment is as follows:
step1, the input vector is dot-multiplied with three learnable matrices to obtain Q, K and V respectively;
step2, the attention score is calculated: score = Q·K^T;
Step3, normalizing the score and calculating a Softmax activation function to obtain a weighting weight;
Weight = softmax( score / √d_k ), where d_k is the dimension of K;
step4, weighting the v vector by the weighting weight to obtain weighted output;
Context=Weight·V。
In order to capture as many patterns related to the depressive state in the speech signal as possible and make full use of the self-learning ability of the neural network, in this embodiment the attention operation is packaged into a neural network layer, and the three matrices Q, K and V are computed from the original input by the neural network. The square matrix Q·K^T is the projection between pairs of feature frames of the original input sequence and represents the correlation of the speech features between different time steps. After normalization, the square matrix Q·K^T is multiplied by V, a transformed version of the original input sequence, to obtain the weighted attention layer output. The attention mechanism is shown in Fig. 3, and the deep neural network feature extraction sub-model for the time-series features is shown in Fig. 4.
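A Keras sketch of the attention layer described in steps 1-4: Q, K and V are computed from the input by learnable dense projections, the normalized Q·K^T scores pass through a softmax, and the result weights V. The projection size and the √d_k scaling factor are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention(layers.Layer):
    """Self-attention over time steps: Context = softmax(Q.K^T / sqrt(d_k)) . V."""
    def __init__(self, d_k=64, **kwargs):
        super().__init__(**kwargs)
        self.d_k = d_k
        self.wq, self.wk, self.wv = layers.Dense(d_k), layers.Dense(d_k), layers.Dense(d_k)

    def call(self, x):                               # x: (batch, time_steps, features)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        score = tf.matmul(q, k, transpose_b=True)    # correlation between time steps
        weight = tf.nn.softmax(
            score / tf.math.sqrt(tf.cast(self.d_k, tf.float32)), axis=-1)
        return tf.matmul(weight, v)                  # weighted output
```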
The LLDs speech features are non-structural feature vectors, and the deep neural network feature extraction submodel of the non-structural feature vectors is constructed by stacking fully connected layers, as shown in FIG. 5.
Further, in step three, the specific process of extracting and concatenating the neural network features is as follows: the first three inputs (upper envelope, spectrogram, Mel-frequency cepstral coefficients) are time-series data and are trained in this embodiment with a sub-model consisting of 2 LSTM layers and 4 fully connected layers. The LLD speech features are non-structural features and are trained with a sub-model consisting of 4 fully connected layers. After the sub-models are trained, the output of the last fully connected layer of each of the four sub-models is used as the neural network feature extracted for the corresponding type of input.
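A hedged Keras sketch of the two kinds of sub-models: a recurrent model (two LSTM layers, the SelfAttention layer sketched above, then a fully connected stack) for the time-series inputs, and a stacked fully connected model for the LLD/HSF vector. Apart from the 16-unit last hidden layer mentioned below, all layer widths, and whether the output layer counts among the four fully connected layers, are assumptions.

```python
from tensorflow.keras import layers, models
# Reuses the SelfAttention layer sketched earlier in this description.

def timeseries_submodel(time_steps, n_features):
    inp = layers.Input(shape=(time_steps, n_features))
    x = layers.LSTM(64, return_sequences=True)(inp)   # cuDNN-accelerated on GPU with defaults
    x = layers.LSTM(64, return_sequences=True)(x)
    x = SelfAttention(64)(x)                          # weight the time steps
    x = layers.Flatten()(x)
    for units in (128, 64, 32, 16):                   # fully connected stack, ends at 16 units
        x = layers.Dense(units, activation="relu")(x)
    out = layers.Dense(1)(x)                          # pre-training target: HAMD-17 total score
    return models.Model(inp, out)

def lld_submodel(n_features):
    inp = layers.Input(shape=(n_features,))
    x = inp
    for units in (256, 64, 32, 16):                   # stacked fully connected layers
        x = layers.Dense(units, activation="relu")(x)
    out = layers.Dense(1)(x)
    return models.Model(inp, out)
```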
Further, in step four, the neural network features are concatenated. Research shows that combining the features extracted from multiple inputs to predict the therapeutic effect can effectively improve prediction performance compared with prediction using a single feature. As shown in fig. 6, the last hidden layer of each pre-trained deep neural network feature extraction sub-model is a fully connected layer of 16 units, so each sub-model outputs a 1×16 vector. To fuse the information extracted by the different sub-models, the features computed by the different models are made to complement one another: the outputs of the four sub-models are concatenated in sequence to finally form a 1×64 vector as the deep neural network feature of the current sample.
The concatenated neural network features are also non-structural features, so the network structure of the prediction model uses the non-structural data sub-model corresponding to the LLD features above (shown in fig. 5) as the network architecture of the therapeutic effect prediction model. The model parameters are adjusted according to the size of the input data and the output target: the number of units in each hidden layer is set to 32, the prediction labels are one-hot encoded, the size of the output layer is set to 2, and softmax is used as the activation function.
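A sketch of the fusion and prediction stage under the same assumptions: the 16-unit last hidden layer of each pre-trained sub-model yields a 1×16 vector, the four vectors are concatenated into a 1×64 multi-modal feature, and a stacked fully connected model with 32-unit hidden layers and a 2-unit softmax output is trained on it. The layer index used to reach the 16-unit layer and the number of hidden layers are assumptions.

```python
import numpy as np
from tensorflow.keras import Model, layers, models

def extract_fused_features(submodels, inputs):
    """Concatenate the last-hidden-layer (16-unit) outputs of the four sub-models."""
    feats = []
    for sub, x in zip(submodels, inputs):
        extractor = Model(sub.input, sub.layers[-2].output)  # assumed: 16-unit layer before output
        feats.append(extractor.predict(x))
    return np.concatenate(feats, axis=1)                     # shape (n_samples, 64)

def evaluation_model(input_dim=64, n_hidden=3):
    inp = layers.Input(shape=(input_dim,))
    x = inp
    for _ in range(n_hidden):
        x = layers.Dense(32, activation="relu")(x)           # 32 units per hidden layer
    out = layers.Dense(2, activation="softmax")(x)           # one-hot depressed / non-depressed
    m = models.Model(inp, out)
    m.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return m
```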
Furthermore, the voice signal data collected during the treatment of the depression patients is used for effect verification in the embodiment.
The voice of psychiatric inpatients was collected every week using a recording pen; the speech signal was recorded in a quiet ward, and the doctor asked the patient to read a passage of the prose poem "Life Like Summer Flowers". The collected recordings are transcoded into WAV audio files with a 16 kHz sampling frequency and 16-bit sample depth. The upper envelope, spectrogram, Mel cepstrum coefficients and HSF speech features of the speech signal are extracted with a sliding-window algorithm, the librosa tool and the openSMILE tool, respectively. Pearson correlation coefficients are calculated for the HSF speech features, and collinear feature pairs with a correlation coefficient greater than 0.7 are eliminated. The speech features of the four modalities are input into the sub-models for pre-training, and the vectors output by the last hidden layer of each sub-model are concatenated and then input into the prediction model as the multi-modal speech feature to predict the depressive state of the depression patient.
The data used in the experiment included speech signal data from 90 depression patients together with the assessment results from clinicians. The audio data assessed as being in a depressed state came from 12 males and 30 females with an age range of 12-22 years (12 ± 5.61); the audio data assessed as being in a non-depressed state came from 11 males and 37 females with an age range of 12-31 years (12 ± 12.12).
The verification uses five-fold cross-validation: the data set is divided evenly into five parts, one part is used as the validation set each time and the other four parts are used as the training set. The validation is repeated five times, ensuring that every piece of data serves once as validation data. The results of the five runs are combined to draw a confusion matrix and an ROC curve for effect evaluation; the model's prediction accuracy is 63.33%, the confusion matrix is shown in fig. 8 and the ROC curve in fig. 7.
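A scikit-learn sketch of this five-fold validation: predictions are accumulated across folds and then summarized as a confusion matrix and an ROC AUC. The fused features X, the integer labels y, the stratified splitting and the training epochs are assumptions about details the text does not specify.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import auc, confusion_matrix, roc_curve

def five_fold_evaluate(X, y, build_model, epochs=50):
    y_true, y_prob = [], []
    for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=0).split(X, y):
        model = build_model()                                 # e.g. evaluation_model()
        model.fit(X[train_idx], np.eye(2)[y[train_idx]], epochs=epochs, verbose=0)
        y_prob.extend(model.predict(X[val_idx])[:, 1])        # probability of "depressed"
        y_true.extend(y[val_idx])

    y_pred = (np.asarray(y_prob) > 0.5).astype(int)
    cm = confusion_matrix(y_true, y_pred)                     # cf. Fig. 8
    fpr, tpr, _ = roc_curve(y_true, y_prob)                   # cf. Fig. 7
    return cm, auc(fpr, tpr)
```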
Embodiment 2 of the present invention provides a system for assessing and predicting depressed mood based on voice features, as shown in fig. 9, comprising a data acquisition module, a first feature extraction module, a second feature extraction module, a feature splicing module and a depression emotion assessment and prediction module which are connected in sequence; wherein:
the data acquisition module is used for acquiring a voice signal data set;
the first characteristic extraction module is used for calculating the upper envelope curve, the spectrogram, the Mel cepstrum coefficient and the voice characteristics of LLDs of the voice signal data set as voice signal characteristics;
the second feature extraction module is used for respectively inputting the voice signal features into the pre-trained deep neural network feature extraction submodel to extract neural network features corresponding to the voice signal features;
the feature splicing module is used for splicing the neural network features output by each sub-model into a one-dimensional vector as multi-modal voice features;
and the depression emotion assessment and prediction module is used for inputting the multi-modal voice features into the trained assessment and prediction model to carry out depression emotion assessment and prediction.
The system further comprises a data preprocessing module connected to the data acquisition module; the data preprocessing module is used to down-sample the speech signal data set, perform endpoint detection on the signal with a double-threshold endpoint detection algorithm to identify the start point and end point of the subject's audio signal, and clip the signal.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A depression mood evaluation prediction method based on voice characteristics is characterized by comprising the following specific steps:
collecting a speech signal data set;
calculating an upper envelope line, a spectrogram, a Mel cepstrum coefficient and LLDs voice features of the voice signal data set as voice signal features;
respectively inputting the voice signal features into a pre-trained deep neural network feature extraction sub-model to extract neural network features corresponding to the voice signal features;
splicing the neural network features output by each submodel into a one-dimensional vector as a multi-modal voice feature;
and inputting the multi-modal voice features into a trained evaluation model to carry out depression emotion evaluation prediction.
2. The method of claim 1, further comprising preprocessing the speech signal data set, wherein the preprocessing comprises downsampling the speech signal data set, performing endpoint detection on the signal by using a double-threshold endpoint detection algorithm to identify a start point and an end point of the audio signal of the subject, and clipping the signal.
3. A speech feature based depression emotion assessment prediction method according to claim 1, wherein the upper envelope of the speech signal data set is calculated using a sliding window; the spectrogram was calculated by librosa toolkit.
4. The method for predicting the evaluation of depressive mood based on speech features according to claim 1, wherein the Mel cepstral coefficients are calculated by:
s1, pre-emphasis is carried out on the voice signal data set through a high-pass filter to obtain a first signal;
s2, performing framing operation on the first signal, and multiplying each frame by a Hamming window in order to increase the continuity of the frame;
s3, multiplying each frame by a Hamming window, and then obtaining energy distribution on a frequency spectrum, namely an energy spectrum, through fast Fourier transform of each frame;
s4, enabling the energy spectrum to pass through a triangular filter bank, and calculating logarithmic energy passing through the triangular filter bank;
and S5, obtaining low-frequency information of a frequency spectrum through discrete cosine transform on the basis of the passed logarithmic energy.
5. The method for predicting the evaluation of depressed mood based on voice characteristics as claimed in claim 1, wherein the training process of the evaluation model is as follows:
collecting voice signals;
using the upper envelope curve, spectrogram, mel cepstrum coefficient and LLDs speech features of the speech signal as speech signal features;
pre-training the feature extraction sub-models with the speech signal features, wherein the training label is the total score of the HAMD-17 depression scale of the corresponding sample;
after the pre-training is finished, splicing the full connection layer outputs of the feature extraction submodels into a one-dimensional vector serving as the deep neural network feature of the sample;
the deep neural network features are used as inputs to train the evaluation model.
6. The method for predicting depression emotion assessment based on speech features according to claim 1, wherein the upper envelope, the spectrogram and the Mel cepstrum coefficients are all time-series features; in the deep neural network feature extraction sub-model for the time-series features, a CuDNNLSTM layer is adopted as the recurrent layer of the neural network, an attention layer is added after the recurrent layer to weight the outputs of the time steps, and finally the label is predicted by operating on the weighted vectors through a fully connected layer; wherein the attention layer adopts a self-attention mechanism.
7. The method according to claim 1, wherein the LLDs speech features are non-structural feature vectors, and the deep neural network feature extraction submodel of the non-structural feature vectors is constructed by stacking fully connected layers.
8. A depression emotion assessment and prediction system based on voice features, characterized by comprising a data acquisition module, a first feature extraction module, a second feature extraction module, a feature splicing module and a depression emotion assessment and prediction module which are connected in sequence; wherein:
the data acquisition module is used for acquiring a voice signal data set;
the first feature extraction module is used for calculating an upper envelope line, a spectrogram, a Mel cepstrum coefficient and LLDs voice features of the voice signal data set as voice signal features;
the second feature extraction module is used for respectively inputting the voice signal features into a pre-trained deep neural network feature extraction sub-model to extract neural network features corresponding to the voice signal features;
the feature splicing module is used for splicing the neural network features output by each sub-model into a one-dimensional vector as multi-modal voice features;
and the depression emotion assessment and prediction module is used for inputting the multi-modal voice features into a trained assessment and prediction model to carry out depression emotion assessment and prediction.
9. The system of claim 8, further comprising a data preprocessing module coupled to the data acquisition module, the data preprocessing module configured to down-sample the speech signal data set and perform endpoint detection on the signal using a dual-threshold endpoint detection algorithm to identify a start point and an end point of the audio signal of the subject, and crop the signal.
CN202210974876.0A 2022-08-15 2022-08-15 Depression emotion assessment and prediction method and system based on voice characteristics Active CN115346561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210974876.0A CN115346561B (en) 2022-08-15 2022-08-15 Depression emotion assessment and prediction method and system based on voice characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210974876.0A CN115346561B (en) 2022-08-15 2022-08-15 Depression emotion assessment and prediction method and system based on voice characteristics

Publications (2)

Publication Number Publication Date
CN115346561A true CN115346561A (en) 2022-11-15
CN115346561B CN115346561B (en) 2023-11-24

Family

ID=83951178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210974876.0A Active CN115346561B (en) 2022-08-15 2022-08-15 Depression emotion assessment and prediction method and system based on voice characteristics

Country Status (1)

Country Link
CN (1) CN115346561B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482837A (en) * 2022-07-25 2022-12-16 科睿纳(河北)医疗科技有限公司 Emotion classification method based on artificial intelligence
CN116978409A (en) * 2023-09-22 2023-10-31 苏州复变医疗科技有限公司 Depression state evaluation method, device, terminal and medium based on voice signal

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960269A (en) * 2018-04-02 2018-12-07 阿里巴巴集团控股有限公司 Characteristic-acquisition method, device and the calculating equipment of data set
CN109241669A (en) * 2018-10-08 2019-01-18 成都四方伟业软件股份有限公司 A kind of method for automatic modeling, device and its storage medium
CN112351443A (en) * 2019-08-08 2021-02-09 华为技术有限公司 Communication method and device
WO2021104099A1 (en) * 2019-11-29 2021-06-03 中国科学院深圳先进技术研究院 Multimodal depression detection method and system employing context awareness
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system
CN111951824A (en) * 2020-08-14 2020-11-17 苏州国岭技研智能科技有限公司 Detection method for distinguishing depression based on sound
CN112002348A (en) * 2020-09-07 2020-11-27 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN113012720A (en) * 2021-02-10 2021-06-22 杭州医典智能科技有限公司 Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"听视觉抑郁症识别方法研究", 中国博士学位论文电子期刊 *
李伟: "音频音乐与计算机交融-音频音乐技术", 复旦大学出版社, pages: 232 - 233 *

Also Published As

Publication number Publication date
CN115346561B (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN115346561B (en) Depression emotion assessment and prediction method and system based on voice characteristics
Shama et al. Study of harmonics-to-noise ratio and critical-band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology
CN109044396B (en) Intelligent heart sound identification method based on bidirectional long-time and short-time memory neural network
CN101023469B (en) Digital filtering method, digital filtering equipment
CN111798874A (en) Voice emotion recognition method and system
CN108896878A (en) A kind of detection method for local discharge based on ultrasound
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Mittal et al. Analysis of production characteristics of laughter
WO2019023879A1 (en) Cough sound recognition method and device, and storage medium
Dubey et al. Bigear: Inferring the ambient and emotional correlates from smartphone-based acoustic big data
CN108682432B (en) Speech emotion recognition device
CN113012720A (en) Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction
CN112820279A (en) Parkinson disease detection method based on voice context dynamic characteristics
Hsu et al. Robust voice activity detection algorithm based on feature of frequency modulation of harmonics and its DSP implementation
CN110415824B (en) Cerebral apoplexy disease risk assessment device and equipment
CN113974607B (en) Sleep snore detecting system based on pulse neural network
Usman et al. Heart rate detection and classification from speech spectral features using machine learning
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
CN114255783A (en) Method for constructing sound classification model, sound classification method and system
CN114842878A (en) Speech emotion recognition method based on neural network
CN115910097A (en) Audible signal identification method and system for latent fault of high-voltage circuit breaker
Ezzine et al. Towards a computer tool for automatic detection of laryngeal cancer
Ouzounov A robust feature for speech detection
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
Manjutha et al. An optimized cepstral feature selection method for dysfluencies classification using Tamil speech dataset

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No.264, Guangzhou road, Nanjing, Jiangsu 210029

Applicant after: NANJING MEDICAL UNIVERSITY AFFILIATED BRAIN Hospital

Address before: No.264, Guangzhou road, Nanjing, Jiangsu 210029

Applicant before: NANJING BRAIN Hospital

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant