CN115346561B - Depression emotion assessment and prediction method and system based on voice characteristics - Google Patents

Depression emotion assessment and prediction method and system based on voice characteristics Download PDF

Info

Publication number
CN115346561B
CN115346561B
Authority
CN
China
Prior art keywords
voice
neural network
voice signal
model
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210974876.0A
Other languages
Chinese (zh)
Other versions
CN115346561A (en)
Inventor
王菲
张锡哲
尹舒络
姚菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING MEDICAL UNIVERSITY AFFILIATED BRAIN HOSPITAL
Original Assignee
NANJING MEDICAL UNIVERSITY AFFILIATED BRAIN HOSPITAL
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING MEDICAL UNIVERSITY AFFILIATED BRAIN HOSPITAL filed Critical NANJING MEDICAL UNIVERSITY AFFILIATED BRAIN HOSPITAL
Priority to CN202210974876.0A priority Critical patent/CN115346561B/en
Publication of CN115346561A publication Critical patent/CN115346561A/en
Application granted granted Critical
Publication of CN115346561B publication Critical patent/CN115346561B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a depression emotion assessment and prediction method and system based on voice features, and relates to the technical field of depression emotion assessment. The method comprises the following specific steps: collecting a voice signal data set; calculating the upper envelope, spectrogram, mel-frequency cepstral coefficients and LLDs (Low-Level Descriptors) voice features of the voice signal data set as voice signal features; inputting each voice signal feature into a pre-trained deep neural network feature extraction sub-model to extract the neural network features corresponding to that voice signal feature; splicing the neural network features output by each sub-model into a one-dimensional vector serving as the multi-modal voice feature; and inputting the multi-modal voice feature into a trained evaluation model for emotion evaluation. The invention can evaluate depressed emotion effectively and accurately, and improves the accuracy of depression assessment compared with traditional scales.

Description

Depression emotion assessment and prediction method and system based on voice characteristics
Technical Field
The invention relates to the technical field of depression emotion assessment, in particular to a depression emotion assessment prediction method and system based on voice characteristics.
Background
In the prior art, treatment effects are determined mainly by quantitative assessment of the patient's underlying condition using various depression scales. Depression scales are currently the main basis for judging whether a patient suffers from depression and whether treatment is effective. The major scales used clinically include the Hamilton Depression Rating Scale, the PHQ-9 and others. Patients are typically examined by a trained clinician or psychological professional using conversation and observation, and efficacy is ultimately judged by the scale score. Such heavily subjective judgment easily leads to inconsistent assessment criteria between doctors and therefore to inaccurate assessment of the patient's condition. How to objectively and accurately assess and predict a patient's depressed emotion is therefore an urgent problem for those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method and a system for estimating and predicting depressed emotion based on voice features, so as to solve the problems in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a depression treatment effect prediction method based on a convolutional neural network, comprising the following specific steps:
collecting a voice signal data set;
calculating an upper envelope curve, a spectrogram, a mel-frequency cepstrum coefficient and LLDs voice characteristics of the voice signal data set as voice signal characteristics;
respectively inputting the voice signal characteristics into a pre-trained deep neural network characteristic extraction sub-model to extract neural network characteristics corresponding to the voice signal characteristics;
splicing the neural network characteristics output by each sub-model into one-dimensional vectors serving as multi-mode voice characteristics;
and inputting the multi-modal voice characteristics into a trained evaluation model to evaluate and predict the depression emotion.
Optionally, the method further comprises preprocessing the voice signal data set, wherein the preprocessing comprises downsampling the voice signal data set, and adopting a double-threshold endpoint detection algorithm to detect endpoints of the signal to identify a starting point and an endpoint of the audio signal of the subject, and clipping the signal.
Optionally, the upper envelope of the voice signal data set is calculated using a sliding window, and the spectrogram is calculated with the librosa toolkit.
Optionally, the calculating process of the mel-frequency cepstrum coefficient is as follows:
S1, pre-emphasis is performed on the voice signal data set through a high-pass filter to obtain a first signal, with the formula:
H(z) = 1 - μ·z^(-1), where μ takes the value 0.97;
S2, the first signal is framed, and each frame is multiplied by a Hamming window to increase the continuity between frames, with the formula:
W(n, α) = (1 - α) - α·cos(2πn/(N - 1)), n = 0, 1, ..., N-1;
where the windowed signal is S'(n) = S(n)·W(n, α), S(n) is the framed signal, N is the frame size, and α takes the value 0.46;
S3, after each frame is multiplied by the Hamming window, a fast Fourier transform is applied to each frame to obtain the energy distribution over the spectrum, i.e. the energy spectrum;
S4, the energy spectrum is passed through a triangular filter bank, and the logarithmic energy output by the filter bank is calculated, with the formula:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), m = 1, 2, ..., M;
where M is the number of filters;
S5, on the basis of the logarithmic energy, the low-frequency information of the spectrum is obtained by a discrete cosine transform, with the formula:
C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, ..., L;
where s(m) is the logarithmic energy output by the triangular filter bank, m = 1, 2, ..., M, and L is the order of the MFCC.
Optionally, the training process of the evaluation model is as follows:
collecting voice signals;
calculating an upper envelope, a spectrogram, a mel-frequency cepstral coefficient and LLDs voice characteristics as voice signal characteristics by using the voice signals;
pre-training the feature extraction sub-models with the voice signal features, wherein the training label is the HAMD-17 depression scale total score of the corresponding sample;
after the pre-training is completed, the full-connection layer output of the feature extraction sub-model is spliced into a one-dimensional vector serving as a deep neural network feature of a sample;
the deep neural network features are used as inputs for training the assessment model.
Optionally, the upper envelope, the spectrogram and the mel-frequency cepstral coefficients are all time-series features; the deep neural network feature extraction sub-model for time-series features adopts a CuDNNLSTM layer as the recurrent layer of the neural network, an attention layer is added after the recurrent layer to weight the outputs of the time steps, and the weighted vectors are finally passed through fully connected layers to predict the label; the attention layer employs a self-attention mechanism.
This technical scheme has the following beneficial effects: the time-series voice feature frames are weighted by a self-attention mechanism so that feature frames highly related to depressed emotion are preferentially learned, which reduces the training difficulty of the model; the CuDNNLSTM layer combines long-term and short-term memory information and supports fast inference on a GPU.
Optionally, the LLDs voice feature is an unstructured feature vector, and its deep neural network feature extraction sub-model is constructed by stacking fully connected layers.
On the other hand, the depression treatment effect prediction system based on the convolutional neural network comprises a data acquisition module, a first feature extraction module, a second feature extraction module, a feature stitching module and a depression emotion assessment prediction module which are connected in sequence; wherein,
the data acquisition module is used for acquiring a voice signal data set;
the first feature extraction module is used for calculating an upper envelope curve, a spectrogram, a mel-frequency cepstrum coefficient and LLDs voice features of the voice signal data set as voice signal features;
the second feature extraction module is used for respectively inputting the voice signal features into a pre-trained deep neural network feature extraction sub-model to extract neural network features corresponding to the voice signal features;
the characteristic splicing module is used for splicing the neural network characteristics output by each sub-model into one-dimensional vectors serving as multi-mode voice characteristics;
the depression emotion assessment and prediction module is used for inputting the multi-modal voice characteristics into a trained assessment and prediction model to carry out depression emotion assessment and prediction.
Optionally, the system further comprises a data preprocessing module connected with the data acquisition module, wherein the data preprocessing module is used for performing downsampling operation on the voice signal data set, performing endpoint detection on the signal by adopting a double-threshold endpoint detection algorithm to identify a starting point and an endpoint of the audio signal of the subject, and clipping the signal.
Compared with the prior art, the convolutional-neural-network-based method and system for predicting the treatment effect of depression have practical significance for predicting the treatment effect of depression and for its clinical treatment: they remove the subjective influence of depressed patients and clinicians on the diagnosis of the condition, reduce the workload of doctors who repeatedly evaluate treatment effects, and can improve the treatment experience of patients. The invention analyses and processes the audio of depressed patients and then feeds it into neural network models for feature extraction, feature fusion and autonomous learning; feature fusion provides more information for the model's decision, improving the accuracy of the overall result, so that the treatment effect for depressed patients can be evaluated effectively and both the diagnostic efficiency of clinicians and the treatment experience of patients are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for assessing and predicting depressed mood in accordance with the present invention;
FIG. 2 is a flow chart of the evaluation model training of the present invention;
FIG. 3 is a block diagram of the Attention mechanism of the present invention;
FIG. 4 is a diagram of a deep neural network feature extraction sub-model of the timing features of the present invention;
FIG. 5 is a diagram of a sub-model structure of unstructured data according to the present invention;
FIG. 6 is a diagram of a neural network feature splice architecture of the present invention;
FIG. 7 is an area under ROC curve for the present invention;
FIG. 8 is a schematic diagram of a confusion matrix according to the present invention;
fig. 9 is a system configuration diagram of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a depression emotion assessment and prediction method based on voice characteristics, which is shown in fig. 1 and comprises the following specific steps:
step one, collecting a voice signal data set;
step two, calculating an upper envelope curve, a spectrogram, a Mel cepstrum coefficient and LLDs voice characteristics of a voice signal data set as voice signal characteristics;
step three, respectively inputting the voice signal characteristics into a pre-trained deep neural network characteristic extraction sub-model to extract the neural network characteristics corresponding to the voice signal characteristics;
splicing the neural network characteristics output by each sub-model into one-dimensional vectors serving as multi-mode voice characteristics;
inputting the multi-modal voice characteristics into a trained evaluation model to evaluate and predict the depression.
Further, as shown in fig. 2, the training process of the evaluation model is as follows:
collecting voice signals;
calculating an upper envelope, a spectrogram, a mel-frequency cepstral coefficient and LLDs voice characteristics as voice signal characteristics by using voice signals;
pre-training the feature extraction sub-models with the voice signal features, wherein the training label is the HAMD-17 depression scale total score of the corresponding sample;
after the pre-training is completed, the full-connection layer output of the feature extraction sub-model is spliced into a one-dimensional vector serving as a deep neural network feature of a sample;
the deep neural network features are used as inputs to train an assessment model.
Further, in step one, the data used in this embodiment are voice data, recorded weekly from a healthy control group and a group of depressed patients. Participants were asked to freely recite a prose poem, such as "Life Like Summer Flowers" (生如夏花), in a quiet environment while the voice signal was recorded with a recording pen. The recording pen was placed on a desk surface 20-30 cm from the subject with the microphone pointed at the subject. The audio signal was recorded with a Newsmy RD07 recording pen at a 44.1 kHz sampling frequency and 16-bit sample depth and saved in mono WAV format.
During the interviews with the experimental volunteers, the physician needs to turn on the recording device in advance and turn it off after audio collection is completed. This leaves segments unrelated to the interview content before and after the collected voice signal, and these segments interfere with subsequent data analysis because of background or device noise. To avoid this interference, a dual-threshold endpoint detection algorithm is used to detect the endpoints of the signal. Endpoint detection identifies the start and end points of the speech signal within a piece of audio, after which the signal can be clipped.
Still further, the method includes preprocessing the voice signal data set: the ffmpeg tool is used to transcode the audio collected by the recording device, and the 44.1 kHz audio recorded by the recording pen is downsampled to a 16 kHz sampling frequency. In order to prevent background noise from interfering with subsequent analysis and to reduce the computational cost of speech signal processing, a dual-threshold endpoint detection algorithm is used to detect the start and end points of the subject's audio signal, and the signal is clipped accordingly.
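As an illustration only, a minimal Python sketch of this preprocessing is given below. It assumes librosa for resampling; the energy and zero-crossing thresholds of the dual-threshold detector are illustrative values, not those used in the patent.

```python
import numpy as np
import librosa

def preprocess(wav_path, target_sr=16000, frame_len=400, hop=160,
               energy_high=0.1, energy_low=0.02, zcr_thresh=0.1):
    """Downsample to 16 kHz and crop the utterance with a simple dual-threshold
    (energy + zero-crossing-rate) endpoint detector. Threshold values are
    illustrative assumptions, not taken from the patent."""
    y, _ = librosa.load(wav_path, sr=target_sr)            # resample to 16 kHz

    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    energy = np.mean(frames ** 2, axis=0)                   # short-time energy per frame
    energy = energy / (energy.max() + 1e-12)                 # normalise to [0, 1]
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=0)) > 0, axis=0)

    # Primary decision from the high energy threshold, then extend the
    # boundaries outward while the low threshold or the ZCR still indicates speech.
    speech = energy > energy_high
    if not speech.any():
        return y
    start = int(np.argmax(speech))
    end = len(speech) - int(np.argmax(speech[::-1])) - 1
    while start > 0 and (energy[start - 1] > energy_low or zcr[start - 1] > zcr_thresh):
        start -= 1
    while end < len(speech) - 1 and (energy[end + 1] > energy_low or zcr[end + 1] > zcr_thresh):
        end += 1
    return y[start * hop: end * hop + frame_len]
```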
Further, in the second step, the specific way of calculating the upper envelope, the spectrogram, the mel-frequency cepstral coefficient and the LLDs voice features is as follows:
(1) Upper envelope line
Frequency-related information is captured by the spectral features, so for the waveform signal the model is only expected to learn useful information from the envelope of the signal. The present invention uses a sliding window to calculate a low-resolution upper envelope of the waveform signal as one input to the model. The sliding window size is set to 800 sampling points and the window shift to 400 sampling points; the mean of the samples in the window whose values are greater than 0 represents the value of the current window.
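The envelope computation can be written directly from the parameters above (800-sample window, 400-sample shift, mean of the positive samples); a short Python sketch:

```python
import numpy as np

def upper_envelope(y, win=800, shift=400):
    """Low-resolution upper envelope: for each sliding window, take the mean of
    the samples whose amplitude is greater than zero."""
    values = []
    for start in range(0, len(y) - win + 1, shift):
        frame = y[start:start + win]
        positive = frame[frame > 0]
        values.append(positive.mean() if positive.size else 0.0)
    return np.asarray(values)
```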
(2) Spectrogram
The time-domain waveform is simple and intuitive, but for complex signals such as speech some characteristics only become apparent in the frequency domain.
The speech signal is a short-time stationary signal: a spectrum can be obtained by analysing the signal over a short interval around each instant, and performing such spectral analysis continuously over the whole signal yields a two-dimensional map in which the abscissa represents time, the ordinate represents frequency, and the gray level of each pixel reflects the energy at the corresponding time and frequency. Such a time-frequency map is called a spectrogram. The energy power spectrum can be calculated by the formula:
S(n, w) = |X(n, w)|²;
where X(n, w) represents the magnitude of the Fourier transform at frequency w of the frame signal centred at point n in the time domain, calculated by the formula:
X(n, w) = Σ_m x(m)·w(n - m)·e^(-jwm);
where w[n] is a window function of length 2N+1, and a Hamming window is typically used as the window function.
A long time window (at least two pitch periods) is usually used to calculate a narrow-band spectrogram. A narrow-band spectrogram has higher frequency resolution and lower time resolution; the good frequency resolution makes the individual harmonic components of the voice easier to distinguish, and they appear as horizontal stripes. In the present invention, the librosa toolkit is used to calculate the narrow-band spectrogram of the voice, with a window size of 10 ms and a window shift of 2.5 ms.
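For illustration, one possible librosa-based computation of such a spectrogram with the stated 10 ms window and 2.5 ms shift at 16 kHz is sketched below; the conversion to decibels is an assumption about how the map is fed to the network, not something fixed by the patent.

```python
import numpy as np
import librosa

def spectrogram(y, sr=16000, win_ms=10.0, hop_ms=2.5):
    """Log-power spectrogram with a 10 ms Hamming window and 2.5 ms shift."""
    win = int(sr * win_ms / 1000)       # 160 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)       # 40 samples at 16 kHz
    stft = librosa.stft(y, n_fft=win, win_length=win, hop_length=hop,
                        window="hamming")
    power = np.abs(stft) ** 2           # energy power spectrum |X(n, w)|^2
    return librosa.power_to_db(power)   # gray levels proportional to log energy
```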
(3) Mel-frequency cepstral coefficients
MFCC (Mel-Frequency Cepstral Coefficients) were proposed by Davis and Mermelstein in 1980. They combine the auditory perception characteristics of the human ear with the speech production mechanism, and are widely used in automatic speech and speaker recognition.
The human ear can distinguish the various sounds in a noisy environment by focusing only on certain specific frequency components. The filtering performed by the cochlea is approximately linear below 1000 Hz and logarithmic above, which makes the ear more sensitive to low-frequency signals. Based on this phenomenon, acousticians designed a set of filter banks that mimic cochlear filtering, called the mel filter bank.
The MFCC calculation flow is as follows (a code sketch of these steps is given after S5):
S1, pre-emphasis is performed on the voice signal data set through a high-pass filter to obtain a first signal, with the formula:
H(z) = 1 - μ·z^(-1), where μ takes the value 0.97;
S2, the first signal is framed, and each frame is multiplied by a Hamming window to increase the continuity between frames, with the formula:
W(n, α) = (1 - α) - α·cos(2πn/(N - 1)), n = 0, 1, ..., N-1;
where the windowed signal is S'(n) = S(n)·W(n, α), S(n) is the framed signal, N is the frame size, and α takes the value 0.46;
S3, after each frame is multiplied by the Hamming window, a fast Fourier transform is applied to each frame to obtain the energy distribution over the spectrum, i.e. the energy spectrum;
S4, the energy spectrum is passed through a triangular filter bank, and the logarithmic energy output by the filter bank is calculated, with the formula:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), m = 1, 2, ..., M;
where M is the number of filters;
S5, on the basis of the logarithmic energy, the low-frequency information of the spectrum is obtained by a discrete cosine transform, with the formula:
C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, ..., L;
where s(m) is the logarithmic energy output by the triangular filter bank, m = 1, 2, ..., M, and L is the order of the MFCC.
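As a step-by-step illustration of S1-S5, a Python sketch follows. Only μ = 0.97 and α = 0.46 are fixed by the text above; the frame length, FFT size, number of mel filters and cepstral order below are assumed values.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(y, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_ceps=13):
    """MFCC computed step by step following S1-S5; sizes other than mu and alpha
    are assumptions, not values stated in the patent."""
    # S1: pre-emphasis  H(z) = 1 - 0.97 z^-1
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])

    # S2: framing + Hamming window  w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)

    # S3: FFT of each frame -> energy spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # S4: triangular mel filter bank -> log energy s(m)
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv_mel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_energy = np.log(power @ fbank.T + 1e-10)

    # S5: DCT keeps the low-frequency (cepstral) information
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_ceps]
```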
(4) LLDs
In the field of speech emotion recognition, analysis is generally performed using rhythm, prosody and similar features of speech. Most voice features are obtained by preprocessing the raw speech waveform with operations such as framing and windowing and then applying carefully designed short-time analysis algorithms. Since speech is one-dimensional time-series data, the feature sequences that describe how the speech changes over time, obtained after short-time analysis, are also called Low-Level Descriptors (LLDs). In order to map low-level descriptors of variable length to feature vectors of fixed size, statistics are computed over all frames of the utterance or over the LLDs; the resulting utterance-level dynamic features are called High-level Statistical Functions (HSFs). The invention calculates the HSFs of the audio using the emo_large voice emotion feature set provided by openSMILE, comprising 6552 features in total. For the LLDs of the emo_large feature set, the first- and second-order differences of the remaining LLDs (other than the spectrogram-related descriptors) are calculated as dynamic features.
The 6552 extracted HSFs were further screened to follow common practice in neural network feature engineering. Pearson correlation coefficients between the HSFs were calculated, collinear feature pairs were removed from the original feature set using 0.7 as the threshold, and the remaining 590 HSFs were used as the HSFs voice features in the experiment, defined here as hand-crafted voice features.
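A short sketch of this collinearity screening with pandas is given below; the 0.7 threshold matches the text, while the choice of which feature of each correlated pair to drop is an implementation detail assumed here.

```python
import numpy as np
import pandas as pd

def drop_collinear(hsfs: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Remove one feature from every pair whose absolute Pearson correlation
    exceeds the threshold. 'hsfs' is assumed to be a samples x features table
    of openSMILE emo_large HSFs."""
    corr = hsfs.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return hsfs.drop(columns=to_drop)
```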
Based on the self-learning ability of neural networks, the invention models the patterns of depressed emotion contained in different types of voice features and uses pre-trained neural network models to extract high-level neural network feature representations of the voice. To deal with the large number of time steps in some voice features, a self-attention mechanism is designed to weight the time-series voice feature frames so that feature frames highly related to depressed emotion are preferentially learned, reducing the training difficulty of the model. Finally, a feature fusion method combines the multiple high-level neural network features; the fused features complement each other with information unique to each feature, further improving prediction accuracy.
Further, the upper envelope, the spectrogram and the mel-frequency cepstral coefficients are all time-series features; the deep neural network feature extraction sub-model for time-series features adopts a CuDNNLSTM layer as the recurrent layer of the neural network, an attention layer is added after the recurrent layer to weight the outputs of the time steps, and the weighted vectors are finally passed through fully connected layers to predict the label; the attention layer employs a self-attention mechanism.
The self-attention mechanism calculation method used in the present embodiment is as follows:
Step 1: the input vector is multiplied by three learnable matrices to obtain the three vectors Q, K and V;
Step 2: the attention score is computed as Score = Q·K^T;
Step 3: the score is normalised and passed through a Softmax activation function to obtain the attention weights;
Step 4: the weights are applied to the V vector to obtain the weighted output, Context = Weight·V.
In order to capture as many depression-related patterns in the voice signal as possible and make full use of the self-learning ability of the neural network, in this embodiment the attention mechanism is packaged as a neural network layer, and the three matrices Q, K and V are all computed from the original input by the network. The square matrix QK^T acts as a projection between every pair of feature frames of the original input sequence and represents the correlation of the voice features between different time steps. The normalised QK^T matrix is multiplied with V, a transformed version of the original input sequence, to perform the weighting, finally yielding the weighted output of the attention layer. The attention mechanism is shown in FIG. 3, and the deep neural network feature extraction sub-model for the time-series features is shown in FIG. 4.
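One way to package the Step 1 to Step 4 computation as a Keras layer is sketched below; the projection width and the scaling of the score by sqrt(d_k) are assumptions added for numerical stability, not details stated in the patent.

```python
import tensorflow as tf

class SelfAttention(tf.keras.layers.Layer):
    """Self-attention over time steps: Q, K, V are linear projections of the
    input, Score = Q.K^T is scaled and softmax-normalised, and the resulting
    weights are applied to V (Context = Weight.V)."""
    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.wq = tf.keras.layers.Dense(units, use_bias=False)
        self.wk = tf.keras.layers.Dense(units, use_bias=False)
        self.wv = tf.keras.layers.Dense(units, use_bias=False)

    def call(self, x):                                   # x: (batch, time, features)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        score = tf.matmul(q, k, transpose_b=True)        # Q.K^T
        score /= tf.math.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))  # assumed scaling
        weight = tf.nn.softmax(score, axis=-1)           # normalised attention weights
        return tf.matmul(weight, v)                      # Context = Weight.V
```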
The LLDs speech features are unstructured feature vectors whose deep neural network feature extraction submodels are constructed by fully connected layer stacks as shown in FIG. 5.
Further, in step three, the specific process of extracting and splicing the neural network features is as follows: the first three inputs (upper envelope, spectrogram and mel-frequency cepstral coefficients) are time-series data and are trained in this embodiment with a sub-model consisting of 2 LSTM layers and 4 fully connected layers; the LLDs voice features are unstructured features and are trained with a sub-model consisting of 4 fully connected layers. After the sub-models are trained, the output of the last fully connected layer of each of the four sub-models is taken as the neural network feature extracted for the corresponding type of input.
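As an illustration only, the following sketch shows one possible Keras realisation of these sub-models: a time-series sub-model with two recurrent layers, the SelfAttention layer sketched earlier and fully connected layers ending in 16 units, and an LLDs sub-model built from a stack of fully connected layers. The hidden-layer widths other than the final 16 units, the pooling step and the regression output for the HAMD-17 pre-training target are assumptions, not values taken from the patent.

```python
import tensorflow as tf
# SelfAttention is the custom layer sketched after the attention-mechanism description above.

def timeseries_submodel(time_steps, n_features):
    """Time-series sub-model: recurrent layers + self-attention + dense layers,
    pre-trained to regress the HAMD-17 total score (assumed regression setup)."""
    inp = tf.keras.Input(shape=(time_steps, n_features))
    x = tf.keras.layers.LSTM(64, return_sequences=True)(inp)   # CuDNNLSTM kernel on GPU
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
    x = SelfAttention(64)(x)                                    # weight the time steps
    x = tf.keras.layers.GlobalAveragePooling1D()(x)             # assumed pooling step
    for units in (128, 64, 32):                                 # assumed widths
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    feat = tf.keras.layers.Dense(16, activation="relu", name="nn_feature")(x)
    out = tf.keras.layers.Dense(1)(feat)                        # HAMD-17 total score target
    return tf.keras.Model(inp, out)

def llds_submodel(n_features):
    """LLDs (unstructured) sub-model: a stack of fully connected layers."""
    inp = tf.keras.Input(shape=(n_features,))
    x = inp
    for units in (256, 64, 32):                                 # assumed widths
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    feat = tf.keras.layers.Dense(16, activation="relu", name="nn_feature")(x)
    out = tf.keras.layers.Dense(1)(feat)
    return tf.keras.Model(inp, out)
```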
Furthermore, in step four, the neural network features are spliced; studies show that combining the features extracted from multiple inputs to predict the therapeutic effect improves prediction performance compared with using a single feature. As shown in fig. 6, the last hidden layer of each pre-trained deep neural network feature extraction sub-model is a fully connected layer of 16 units, so each sub-model outputs a 1 x 16 vector. In order to fuse the information extracted by the different sub-models and let the features computed by the different models complement each other, the outputs of the four sub-models are spliced in sequence into a 1 x 64 vector that serves as the deep neural network feature of the current sample.
The spliced neural network feature is also an unstructured feature, so the network structure of the prediction model reuses the unstructured-data sub-model corresponding to the LLDs features (shown in fig. 5). The model parameters are adjusted to the size of the input data and the output target: the number of units per hidden layer is set to 32, the prediction label is one-hot encoded, the output size is set to 2, and softmax is used as the activation function.
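For illustration, a possible way to perform the splicing and build the prediction head is sketched below. It assumes each pre-trained sub-model exposes its 16-unit hidden layer under the name "nn_feature" (a naming convention introduced in the earlier sketch), and the number of 32-unit hidden layers is an assumption.

```python
import numpy as np
import tensorflow as tf

def extract_and_fuse(submodels, inputs_per_modality):
    """Take the 1 x 16 output of each pre-trained sub-model's last hidden layer
    and splice the four of them into a 1 x 64 multi-modal feature per sample."""
    feats = []
    for model, x in zip(submodels, inputs_per_modality):
        extractor = tf.keras.Model(model.input, model.get_layer("nn_feature").output)
        feats.append(extractor.predict(x, verbose=0))
    return np.concatenate(feats, axis=1)            # shape: (n_samples, 64)

def prediction_model(input_dim=64):
    """Evaluation/prediction head: 32-unit hidden layers and a 2-unit softmax
    output over one-hot labels, as described above."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
```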
Further, in this example, the effect verification was performed using voice signal data collected during the course of treatment for a patient suffering from depression.
The voice of psychiatric inpatients was collected every week with a recording pen: the voice signal was recorded in a quiet ward, and the patient was asked by the doctor to read a section of the prose poem "Life Like Summer Flowers" (生如夏花). The collected recording was transcribed into a WAV audio file with a 16 kHz sampling frequency and 16-bit sample depth. The upper envelope was extracted with the sliding-window algorithm, the spectrogram and mel-frequency cepstral coefficients with the librosa toolkit, and the HSFs voice features with the openSMILE tool. Pearson correlation coefficients were calculated for the HSFs voice features, and collinear feature pairs with a correlation coefficient greater than 0.7 were removed. The voice features of the four modalities were separately input into the sub-models for pre-training, and the vectors output by the last hidden layer of each sub-model were spliced and then input into the prediction model as the multi-modal voice feature to predict the depression state of the patient.
The data used in the experiment contained voice signal data from 90 depressed patients together with the clinicians' evaluation results. The audio data evaluated as depressed comprised 12 men and 30 women aged 12-22 years (12±5.61); the audio data evaluated as non-depressed comprised 11 men and 37 women aged 12-31 years (12±12.12).
The verification uses five-fold cross-validation: the data set is divided evenly into five parts, one part is used as the validation set each time and the remaining four as the training set. The validation is repeated five times so that every sample is used once in the validation set. By combining the results of the five runs, a confusion matrix and an ROC curve are drawn to evaluate the effect; the model prediction accuracy is 63.33%, the confusion matrix is shown in fig. 8, and the ROC curve is shown in fig. 7.
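A sketch of this five-fold cross-validation using scikit-learn follows; the stratified split, epoch count and 0.5 decision threshold are assumptions not stated in the patent.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score

def five_fold_evaluate(build_model, X, y, epochs=100):
    """Five-fold cross-validation: each sample serves exactly once as validation
    data, and the pooled predictions yield the accuracy, confusion matrix and
    ROC-AUC. 'build_model' must return a fresh, compiled Keras model."""
    y_true, y_prob = [], []
    for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=0).split(X, y):
        model = build_model()
        model.fit(X[train_idx], tf.keras.utils.to_categorical(y[train_idx], 2),
                  epochs=epochs, verbose=0)
        y_prob.append(model.predict(X[val_idx], verbose=0)[:, 1])
        y_true.append(y[val_idx])
    y_true, y_prob = np.concatenate(y_true), np.concatenate(y_prob)
    y_pred = (y_prob > 0.5).astype(int)
    return (accuracy_score(y_true, y_pred),
            confusion_matrix(y_true, y_pred),
            roc_auc_score(y_true, y_prob))
```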
Embodiment 2 of the invention provides a depression emotion assessment and prediction system based on voice features, which, as shown in fig. 9, comprises a data acquisition module, a first feature extraction module, a second feature extraction module, a feature splicing module and a depression emotion assessment and prediction module connected in sequence; wherein,
the data acquisition module is used for acquiring a voice signal data set;
the first feature extraction module is used for calculating an upper envelope curve, a spectrogram, a mel-frequency cepstrum coefficient and LLDs voice features of the voice signal data set as voice signal features;
the second feature extraction module is used for respectively inputting the voice signal features into the pre-trained deep neural network feature extraction sub-model to extract the neural network features corresponding to the voice signal features;
the feature splicing module is used for splicing the neural network features output by each sub-model into one-dimensional vectors serving as multi-modal voice features;
the depression emotion assessment and prediction module is used for inputting the multi-modal voice characteristics into the trained assessment and prediction model to carry out depression emotion assessment and prediction.
Further, the system also comprises a data preprocessing module connected with the data acquisition module, wherein the data preprocessing module is used for carrying out downsampling operation on the voice signal data set, adopting a double-threshold endpoint detection algorithm to carry out endpoint detection on the signal so as to identify the starting point and the endpoint of the audio signal of the subject, and cutting the signal.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and reference may be made to the description of the method where relevant.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. The depression emotion assessment and prediction method based on the voice characteristics is characterized by comprising the following specific steps of:
step one: collecting a voice signal data set;
step two: calculating an upper envelope curve, a spectrogram, a mel-frequency cepstrum coefficient and LLDs voice characteristics of the voice signal data set as voice signal characteristics;
the calculation process of the mel-frequency cepstrum coefficient comprises the following steps:
s1, pre-emphasis is carried out on the voice signal data set through a high-pass filter, so that a first signal is obtained;
s2, carrying out framing operation on the first signal, and multiplying each frame by a Hamming window in order to increase the continuity of the frames;
s3, multiplying each frame by a Hamming window, and obtaining energy distribution on a frequency spectrum, namely an energy spectrum, by fast Fourier transform of each frame;
s4, the energy spectrum passes through a triangular filter bank, and logarithmic energy passing through the triangular filter bank is calculated;
s5, obtaining low-frequency information of a frequency spectrum through discrete cosine transform on the basis of passing logarithmic energy;
step three: respectively inputting the voice signal characteristics into a pre-trained deep neural network characteristic extraction sub-model to extract neural network characteristics corresponding to the voice signal characteristics;
the specific process for extracting and splicing the neural network features is as follows: the three inputs of the upper envelope, the spectrogram and the mel-frequency cepstral coefficients belong to time-series data and are trained using a sub-model consisting of 2 LSTM layers and 4 fully connected layers; the LLDs voice features, which are unstructured features, are trained using a sub-model consisting of 4 fully connected layers; after the sub-model training is completed, the output of the last fully connected layer of each of the four sub-models is taken as the neural network feature extracted for the corresponding type of input;
step four: splicing the neural network characteristics output by each sub-model into one-dimensional vectors serving as multi-mode voice characteristics;
step five: inputting the multi-modal voice characteristics into a trained evaluation model to evaluate and predict depression;
the training process of the evaluation model is as follows:
collecting voice signals;
calculating an upper envelope, a spectrogram, a mel-frequency cepstral coefficient and LLDs voice characteristics as voice signal characteristics by using the voice signals;
pre-training the feature extraction sub-models with the voice signal features, wherein the training label is the HAMD-17 depression scale total score of the corresponding sample;
after the pre-training is completed, the full-connection layer output of the feature extraction sub-model is spliced into a one-dimensional vector serving as a deep neural network feature of a sample;
the method further comprises the steps of preprocessing the voice signal data set, performing downsampling operation on the voice signal data set, performing endpoint detection on signals by adopting a double-threshold endpoint detection algorithm to identify a starting point and an endpoint of a subject audio signal, and cutting the signals;
calculating an upper envelope of the speech signal dataset using a sliding window;
the spectrogram is calculated through the librosa toolkit, specifically: continuous spectral analysis is performed on the voice signal to obtain a two-dimensional map, in which the abscissa represents time, the ordinate represents frequency, and the gray level of each pixel reflects the energy at the corresponding time and frequency; the spectral analysis consists in analysing the short-time voice signal near each instant to obtain a spectrum; wherein the energy power spectrum is calculated by the following formula:
S(n, w) = |X(n, w)|²;
where X(n, w) represents the magnitude of the Fourier transform at frequency w of the frame signal centred at point n in the time domain, and is calculated by the formula:
X(n, w) = Σ_m x(m)·w(n - m)·e^(-jwm);
where w[n] is a window function of length 2N+1, and a Hamming window is typically used as the window function;
calculating the HSFs of the audio through the emo_large voice emotion feature set provided by openSMILE, calculating the Pearson correlation coefficients of the HSFs, and deleting the collinear feature pairs in the original feature set using a preset threshold;
the deep neural network features are used as inputs for training the assessment model; the upper envelope, the spectrogram and the mel-frequency cepstral coefficients are all time-series features; the deep neural network feature extraction sub-model for time-series features adopts a CuDNNLSTM layer as the recurrent layer of the neural network, an attention layer is added after the recurrent layer to weight the outputs of the time steps, and the weighted vectors are finally passed through a fully connected layer to predict the label; wherein the attention layer employs a self-attention mechanism.
2. A system for implementing the speech feature-based depressed emotion assessment prediction method as defined in claim 1, comprising a data acquisition module, a first feature extraction module, a second feature extraction module, a feature stitching module, and a depressed emotion assessment prediction module connected in sequence;
the data acquisition module is used for acquiring a voice signal data set;
the first feature extraction module is used for calculating an upper envelope curve, a spectrogram, a mel-frequency cepstrum coefficient and LLDs voice features of the voice signal data set as voice signal features;
the second feature extraction module is used for respectively inputting the voice signal features into the pre-trained deep neural network feature extraction sub-models to extract the neural network features corresponding to the voice signal features;
the characteristic splicing module is used for splicing the neural network characteristics output by each sub-model into one-dimensional vectors serving as multi-mode voice characteristics;
the depression emotion assessment and prediction module is used for inputting the multi-modal voice characteristics into a trained assessment and prediction model to carry out depression emotion assessment and prediction.
3. The speech feature based depression mood assessment prediction system as recited in claim 2, further comprising a data preprocessing module coupled to the data acquisition module, the data preprocessing module configured to downsample the speech signal data set and to employ a double threshold endpoint detection algorithm to endpoint detect the signal to identify a start point and an end point of the subject's audio signal, and to crop the signal.
CN202210974876.0A 2022-08-15 2022-08-15 Depression emotion assessment and prediction method and system based on voice characteristics Active CN115346561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210974876.0A CN115346561B (en) 2022-08-15 2022-08-15 Depression emotion assessment and prediction method and system based on voice characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210974876.0A CN115346561B (en) 2022-08-15 2022-08-15 Depression emotion assessment and prediction method and system based on voice characteristics

Publications (2)

Publication Number Publication Date
CN115346561A CN115346561A (en) 2022-11-15
CN115346561B true CN115346561B (en) 2023-11-24

Family

ID=83951178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210974876.0A Active CN115346561B (en) 2022-08-15 2022-08-15 Depression emotion assessment and prediction method and system based on voice characteristics

Country Status (1)

Country Link
CN (1) CN115346561B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482837B (en) * 2022-07-25 2023-04-28 科睿纳(河北)医疗科技有限公司 Emotion classification method based on artificial intelligence
CN116978409A (en) * 2023-09-22 2023-10-31 苏州复变医疗科技有限公司 Depression state evaluation method, device, terminal and medium based on voice signal

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960269A (en) * 2018-04-02 2018-12-07 阿里巴巴集团控股有限公司 Characteristic-acquisition method, device and the calculating equipment of data set
CN109241669A (en) * 2018-10-08 2019-01-18 成都四方伟业软件股份有限公司 A kind of method for automatic modeling, device and its storage medium
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system
CN111951824A (en) * 2020-08-14 2020-11-17 苏州国岭技研智能科技有限公司 Detection method for distinguishing depression based on sound
CN112002348A (en) * 2020-09-07 2020-11-27 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112351443A (en) * 2019-08-08 2021-02-09 华为技术有限公司 Communication method and device
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
WO2021104099A1 (en) * 2019-11-29 2021-06-03 中国科学院深圳先进技术研究院 Multimodal depression detection method and system employing context awareness
CN113012720A (en) * 2021-02-10 2021-06-22 杭州医典智能科技有限公司 Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960269A (en) * 2018-04-02 2018-12-07 阿里巴巴集团控股有限公司 Characteristic-acquisition method, device and the calculating equipment of data set
CN109241669A (en) * 2018-10-08 2019-01-18 成都四方伟业软件股份有限公司 A kind of method for automatic modeling, device and its storage medium
CN112351443A (en) * 2019-08-08 2021-02-09 华为技术有限公司 Communication method and device
WO2021104099A1 (en) * 2019-11-29 2021-06-03 中国科学院深圳先进技术研究院 Multimodal depression detection method and system employing context awareness
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system
CN111951824A (en) * 2020-08-14 2020-11-17 苏州国岭技研智能科技有限公司 Detection method for distinguishing depression based on sound
CN112002348A (en) * 2020-09-07 2020-11-27 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN113012720A (en) * 2021-02-10 2021-06-22 杭州医典智能科技有限公司 Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Audio-Visual Depression Recognition Methods. China Doctoral Dissertations Electronic Journal, 2022, full text. *
Li Wei. The Fusion of Audio, Music and Computers: Audio and Music Technology. Fudan University Press, 2020, pp. 232-233. *

Also Published As

Publication number Publication date
CN115346561A (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN115346561B (en) Depression emotion assessment and prediction method and system based on voice characteristics
CN101023469B (en) Digital filtering method, digital filtering equipment
CN109044396B (en) Intelligent heart sound identification method based on bidirectional long-time and short-time memory neural network
CN111798874A (en) Voice emotion recognition method and system
CN112006697B (en) Voice signal-based gradient lifting decision tree depression degree recognition system
CN108896878A (en) A kind of detection method for local discharge based on ultrasound
Mittal et al. Analysis of production characteristics of laughter
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN108682432B (en) Speech emotion recognition device
CN105448291A (en) Parkinsonism detection method and detection system based on voice
CN111951824A (en) Detection method for distinguishing depression based on sound
CN112820279A (en) Parkinson disease detection method based on voice context dynamic characteristics
CN113539294A (en) Method for collecting and identifying sound of abnormal state of live pig
CN109272986A (en) A kind of dog sound sensibility classification method based on artificial neural network
CN110415824B (en) Cerebral apoplexy disease risk assessment device and equipment
WO2023139559A1 (en) Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation
CN113974607B (en) Sleep snore detecting system based on pulse neural network
CN114842878A (en) Speech emotion recognition method based on neural network
CN114255783A (en) Method for constructing sound classification model, sound classification method and system
CN115910097A (en) Audible signal identification method and system for latent fault of high-voltage circuit breaker
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN114403878A (en) Voice fatigue detection method based on deep learning
CN114299925A (en) Method and system for obtaining importance measurement index of dysphagia symptom of Parkinson disease patient based on voice
Manjutha et al. An optimized cepstral feature selection method for dysfluencies classification using Tamil speech dataset

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No.264, Guangzhou road, Nanjing, Jiangsu 210029

Applicant after: NANJING MEDICAL UNIVERSITY AFFILIATED BRAIN Hospital

Address before: No.264, Guangzhou road, Nanjing, Jiangsu 210029

Applicant before: NANJING BRAIN Hospital

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant