CN113274023A - Multi-modal mental state assessment method based on multi-angle analysis - Google Patents
- Publication number: CN113274023A (application CN202110732115.XA)
- Authority
- CN
- China
- Prior art keywords
- analysis module
- anxiety
- depression
- video
- features
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- A — HUMAN NECESSITIES; A61 — MEDICAL OR VETERINARY SCIENCE; HYGIENE; A61B — DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/165 — Evaluating the state of mind, e.g. depression, anxiety
- A61B5/7257 — Details of waveform analysis characterised by using transforms using Fourier transforms
- A61B5/7264 — Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
Abstract
The invention provides a multi-modal mental state assessment method based on multi-angle analysis, comprising the following steps: collecting audio files and video files from an original video and preprocessing them; extracting time-domain waveform points and Mel-frequency cepstral coefficients from the audio file as audio features; sampling the video into a picture sequence and inputting it into a pre-trained network to obtain a video coding vector; extracting facial action units from the picture sequence; taking the video coding vector and the facial action units as video features; inputting the audio features and video features into a depression analysis module, an anxiety analysis module and a stress analysis module respectively for multi-angle analysis to obtain depression, anxiety and stress features; inputting these three features into a fusion analysis module for attention-based feature fusion to obtain a fused feature; and inputting the fused feature into a support vector regression to assess the mental state of the individual in the audio and video files.
Description
Technical Field
The invention relates to the field of voice processing and image processing, in particular to a multi-modal mental state assessment method based on multi-angle analysis.
Background
Mental state analysis not only describes psychological phenomena but also seeks the underlying motivations of behaviour; it reveals not only surface psychological rules but also deep, unconscious psychological mechanisms, and it is important for exploring self-awareness. For example, analysing a patient's mental state allows different treatment schemes to be applied for different mental states; severe psychological or physiological reactions can affect the endocrine system and other systems, and thereby the effect of treatment.
Application publication No. CN108888281A provides a mental state assessment method, device and system in the technical field of mental state assessment. The method comprises: collecting audio data and video data of a person to be evaluated within a preset time; extracting multi-modal physiological features of the person from the audio and video data, including pupil data features, voice data features and heart-rate-variability data features; and outputting a mental state evaluation result according to the multi-modal physiological features and a preset correlation model, where the correlation model is trained to classify individual data under different mental states based on a neural network or a support vector machine (SVM).
Application publication No. CN109547695A provides a holographic video monitoring system and method that directionally captures pictures based on a sound classification algorithm, comprising a front-end acquisition system, transmission equipment, a central control platform and display/recording equipment. The front-end acquisition system collects on-site audio and video data and transmits them to the central control platform through the transmission equipment. The central control platform denoises and classifies the audio data with a support vector machine recognition algorithm over Mel-frequency cepstral coefficients, extracts the audio segments required by the user, sends the required audio and the corresponding video to the display/recording equipment, and directionally captures and enlarges the corresponding video frames according to the selected specific sound. The display/recording equipment synchronously plays the monitoring data in real time, can retrieve the monitoring data of any time interval, and plays the video pictures captured and enlarged for the selected sound.
The problem with the prior art is that most methods evaluate only a single mental dimension, without considering the various mental aspects of the subject, such as depression and anxiety. In addition, most conventional methods use multiple steps and multiple models for prediction, so the objective of each model deviates from the final prediction target, errors accumulate easily, and the prediction result is inaccurate.
Disclosure of Invention
In view of this, the present invention provides a multi-modal mental state assessment method based on multi-angle analysis, and specifically, the present invention is implemented by the following technical solutions:
s1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from the audio file, and taking the time domain waveform points and the Mel frequency cepstrum coefficients as audio features;
sampling a video file according to a certain frequency to obtain a picture sequence, and inputting the picture sequence into a pre-training network to obtain a video coding vector;
extracting the facial action units of the picture sequence using the OpenFace tool;
taking the video coding vector and the facial action units as video features;
s2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
s3: inputting the depression feature, the anxiety feature and the stress feature into a fusion analysis module for attention feature fusion to obtain a fused feature;
s4: and inputting the fusion features into a support vector regression, and evaluating the mental state of the individual in the audio file and the video file.
Preferably, the specific method for extracting the time domain waveform points from the audio file comprises the following steps:
extracting an audio file from an original MP4 long video file, and saving the audio file in a wav file format; extracting original waveform points of the audio file in the wav file format, and storing the original waveform points in the mat format;
the specific method for extracting the time domain waveform points and the Mel frequency cepstrum coefficients from the audio file comprises the following steps:
pre-emphasis, framing and windowing are carried out on the audio file in the wav file format, and then fast Fourier transform is carried out to obtain a Fourier spectrum;
the Fourier spectrum passes through a Mel filter bank, then logarithm operation is carried out, and finally discrete cosine transformation is carried out to obtain a Mel frequency cepstrum coefficient;
and storing the Mel frequency cepstrum coefficients in a mat format.
Preferably, the network of depression analysis modules comprises:
the system comprises a gate control circulation unit, a multi-head attention mechanism, an activation function, data standardization, a convolutional neural network and multi-modal feature fusion; the video characteristics and the audio characteristics are respectively input into a gating cycle unit of a depression analysis module, a multi-head attention mechanism of the depression analysis module and a convolutional neural network of the depression analysis module; and performing primary activation function activation and data standardization on the gated circulation unit of the depression analysis module, the multi-head attention mechanism of the depression analysis module and the output of the convolutional neural network of the depression analysis module, and inputting the output of the gated circulation unit of the depression analysis module, the multi-head attention mechanism of the depression analysis module and the output of the convolutional neural network of the depression analysis module after data standardization into the multi-mode feature fusion of the depression analysis module to obtain the depression feature.
Preferably, the loss function applied in the depression analysis module training process is the root mean square error between the predicted and true values of the depression degree:

$$\mathrm{RMSE}_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i^{D} - y_i^{D}\right)^2}$$

where $\mathrm{RMSE}_D$ is the root mean square error between the predicted and true values of the depression degree, $n$ is the number of samples, and $\hat{y}_i^{D}$ and $y_i^{D}$ are the predicted and true depression scores of the $i$-th sample.
the criteria for the degree of depression were: normal for a score of 0-9, mild depression for a score of 10-13, moderate depression for a score of 14-20, severe depression for a score of 21-27, and very severe for a score of greater than 27.
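The RMSE loss and the severity bands above can be sketched in plain Python (an illustrative sketch; the function names are ours, the thresholds are the ones listed):

```python
import math

def rmse(predicted, true):
    """Root mean square error between predicted and true depression scores."""
    n = len(true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / n)

def depression_level(score):
    """Map a depression score to the severity bands given in the text."""
    if score <= 9:
        return "normal"
    if score <= 13:
        return "mild"
    if score <= 20:
        return "moderate"
    if score <= 27:
        return "severe"
    return "very severe"
```

A perfect prediction yields an RMSE of zero, so minimizing this loss drives the module's predicted score toward the annotated depression score.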
Preferably, the network of anxiety analysis modules comprises:
the system comprises a gate control circulation unit, a multi-head attention mechanism, an activation function, data standardization, a convolutional neural network and multi-modal feature fusion; the video characteristic and the audio characteristic are respectively input into a gating cycle unit of an anxiety analysis module, a multi-head attention mechanism of the anxiety analysis module and a convolution neural network of the anxiety analysis module; and performing one-time activation function activation and data standardization on the gated circulation unit of the anxiety analysis module, the multi-head attention mechanism of the anxiety analysis module and the output of the convolutional neural network of the anxiety analysis module, and inputting the output of the gated circulation unit of the anxiety analysis module, the multi-head attention mechanism of the anxiety analysis module and the output of the convolutional neural network of the anxiety analysis module after data standardization into the multi-mode feature fusion of the anxiety analysis module to obtain the anxiety feature.
Preferably, the loss function applied in the anxiety analysis module training process is the root mean square error between the predicted and true values of the anxiety degree:

$$\mathrm{RMSE}_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i^{A} - y_i^{A}\right)^2}$$

where $\mathrm{RMSE}_A$ is the root mean square error between the predicted and true values of the anxiety degree, $n$ is the number of samples, and $\hat{y}_i^{A}$ and $y_i^{A}$ are the predicted and true anxiety scores of the $i$-th sample.
the criteria for the degree of anxiety were: normal for scores of 0-7, mild anxiety for scores of 8-9, moderate anxiety for scores of 10-14, severe anxiety for scores of 15-19, and very severe for scores of more than 19.
Preferably, the network of pressure analysis modules comprises:
the system comprises a gate control circulation unit, a multi-head attention mechanism, an activation function, data standardization, a convolutional neural network and multi-modal feature fusion; the video characteristics and the audio characteristics are respectively input into a gating circulation unit of the pressure analysis module, a multi-head attention mechanism of the pressure analysis module and a convolution neural network of the pressure analysis module; and performing one-time activation function activation and data standardization on the gated circulation unit of the pressure analysis module, the multi-head attention mechanism of the pressure analysis module and the output of the convolutional neural network of the pressure analysis module, and inputting the output of the gated circulation unit of the pressure analysis module, the multi-head attention mechanism of the pressure analysis module and the output of the convolutional neural network of the pressure analysis module after data standardization into the multi-mode feature fusion of the pressure analysis module to obtain pressure features.
Preferably, the loss function applied in the stress analysis module training process is the root mean square error between the predicted and true values of the stress degree:

$$\mathrm{RMSE}_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i^{S} - y_i^{S}\right)^2}$$

where $\mathrm{RMSE}_S$ is the root mean square error between the predicted and true values of the stress degree, $n$ is the number of samples, and $\hat{y}_i^{S}$ and $y_i^{S}$ are the predicted and true stress scores of the $i$-th sample.
the evaluation criteria of the degree of stress were: normal for 0-14 points, mild for 15-18 points, moderate for 19-25 points, severe for 26-33 points, and very severe for more than 33 points.
Preferably, the fusion analysis module performs feature fusion using an attention mechanism.
Preferably, the support vector regression is formulated as follows:

$$\min_{w,\, b}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{m} l_{\varepsilon}\!\left(f(x_i) - y_i\right), \qquad f(x) = w^{\top} x + b$$

where $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $l_{\varepsilon}$ is the $\varepsilon$-insensitive loss function, $f(x_i)$ is the support vector regression prediction for the feature vector $x_i$, and $y_i$ is the true mental state value of the individual sample in the audio and video files;
the specific evaluation criteria for the mental state of the individual in the audio and video files are: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severe for 41-50.
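The $\varepsilon$-insensitive loss, the regularized SVR objective, and the final severity mapping can be sketched as follows (a minimal illustration; the function names are ours, and the "moderately impaired" label for 21-30 assumes the duplicated "severe" in the original listing was a translation slip):

```python
def eps_insensitive_loss(pred, true, eps=0.1):
    """l_eps: zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(pred - true) - eps)

def svr_objective(w, b, X, y, C=1.0, eps=0.1):
    """Regularized SVR objective: 0.5*||w||^2 + C * sum of tube losses
    with the linear predictor f(x) = w.x + b."""
    reg = 0.5 * sum(wi * wi for wi in w)
    losses = sum(
        eps_insensitive_loss(sum(wi * xi for wi, xi in zip(w, x)) + b, t, eps)
        for x, t in zip(X, y)
    )
    return reg + C * losses

def mental_state_level(score):
    """Severity bands for the final mental state score."""
    if score <= 10:
        return "normal"
    if score <= 20:
        return "mildly impaired"
    if score <= 30:
        return "moderately impaired"
    if score <= 40:
        return "severely impaired"
    return "very severe"
```

The $\varepsilon$-tube means small prediction errors incur no loss at all, which is what distinguishes SVR from a plain squared-error regressor.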
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
(1) through the multi-head attention mechanism and multi-modal fusion, the method considers not only the mental state information within each modality but also the dependency between modalities; fusing the audio and video modalities improves the accuracy of mental state analysis;
(2) the depression, anxiety and stress features of the individual are considered together and modelled as multiple tasks from multiple angles; compared with traditional mental state analysis, information about all aspects of the individual is considered more comprehensively, enhancing the comprehensiveness of the analysis;
(3) compared with traditional direct concatenation, attention fusion considers the different importance of each feature and assigns different weights, making better use of the neural network and achieving better performance.
Drawings
FIG. 1 is a block diagram of a multi-modal mental state assessment method based on multi-angle analysis according to an embodiment of the present invention;
fig. 2 is a data flow diagram of an anxiety/depression/stress analysis module of a multi-modal mental state assessment method based on multi-angle analysis according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The method for multi-modal mental state assessment based on multi-angle analysis provided by the embodiment of the application as shown in FIG. 1 comprises the following steps:
s1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from an audio file, and taking the time domain waveform points and the Mel frequency cepstrum coefficients as audio features;
extracting an audio file from the subject's original MP4 long video file using the FFmpeg tool, and saving it in wav format; sampling the audio at a rate of 16 kHz, extracting the raw waveform points of the wav file, and storing them in mat format;
the specific method for extracting the time domain waveform points and the Mel frequency cepstrum coefficients from the audio file comprises the following steps:
pre-emphasizing the wav audio by passing it through a first-order finite impulse response (FIR) high-pass filter, so that the spectrum of the signal becomes flatter;
framing with 512 sample points as an observation unit, i.e. each frame is 32 ms, with a 50% overlap between adjacent frames;
windowing each speech frame with a Hamming window to reduce the influence of the Gibbs effect;
then performing a fast Fourier transform to obtain the Fourier spectrum;
passing the Fourier spectrum through a Mel filter bank, taking the logarithm, and finally applying a discrete cosine transform; the first-order and second-order differences are then computed and one energy dimension is appended, yielding the Mel-frequency cepstral coefficients;
storing the mel frequency cepstrum coefficient in a mat format;
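The pre-emphasis, framing, windowing, FFT, mel filterbank, log and DCT steps above can be sketched in NumPy (a minimal illustration, not the patent's exact implementation; the filterbank size of 26 and the 13 retained coefficients are assumptions, and the delta and energy features are omitted):

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=512, n_mels=26, n_ceps=13, alpha=0.97):
    """Minimal MFCC sketch: pre-emphasis, 50%-overlap Hamming frames,
    FFT power spectrum, triangular mel filterbank, log, DCT-II."""
    # Pre-emphasis flattens the spectrum (first-order FIR high-pass filter).
    sig = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    hop = frame_len // 2                        # 50% overlap -> 16 ms hop at 16 kHz
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)   # windowing reduces Gibbs ripple
    spec = np.abs(np.fft.rfft(frames, frame_len)) ** 2   # power spectrum

    # Triangular mel filterbank spanning 0 Hz to Nyquist.
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((frame_len + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(spec @ fb.T + 1e-10)        # log mel energies

    # DCT-II decorrelates the log-mel energies; keep the first n_ceps.
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_mels))
    return logmel @ dct.T

coeffs = mfcc(np.random.randn(16000))           # one second of noise at 16 kHz
```

With one second of 16 kHz audio and a 256-sample hop, this yields 61 frames of 13 coefficients each.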
sampling the video file at 6 frames per second to obtain a picture sequence, inputting the picture sequence into a ResNet-50 pre-trained network to obtain a video coding vector, and storing the video coding vector in mat format;
extracting the facial action units of the picture sequence using the OpenFace tool, and storing them in csv format;
taking the video coding vector and the facial action units as video features;
s2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
as shown in fig. 2, the network of depression analysis modules comprises:
the system comprises a gate control circulation unit, a multi-head attention mechanism, an activation function, data standardization, a convolutional neural network and multi-modal feature fusion; the video characteristics and the audio characteristics are respectively input into a gating cycle unit of a depression analysis module, a multi-head attention mechanism of the depression analysis module and a convolutional neural network of the depression analysis module; performing primary activation function activation and data standardization on the gated circulation unit of the depression analysis module, the multi-head attention mechanism of the depression analysis module and the output of the convolutional neural network of the depression analysis module, and inputting the output of the gated circulation unit of the depression analysis module, the multi-head attention mechanism of the depression analysis module and the output of the convolutional neural network of the depression analysis module after data standardization into the multi-modal feature fusion of the depression analysis module to obtain the depression features;
the loss function applied in the depression analysis module training process is the root mean square error between the predicted and true values of the depression degree:

$$\mathrm{RMSE}_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i^{D} - y_i^{D}\right)^2}$$

where $\mathrm{RMSE}_D$ is the root mean square error between the predicted and true values of the depression degree, $n$ is the number of samples, and $\hat{y}_i^{D}$ and $y_i^{D}$ are the predicted and true depression scores of the $i$-th sample;
the criteria for the degree of depression were: normal for a score of 0-9, mild depression for a score of 10-13, moderate depression for a score of 14-20, major depression for a score of 21-27, and very severe for a score greater than 27;
as shown in fig. 2, the network of anxiety analysis modules includes:
the system comprises a gate control circulation unit, a multi-head attention mechanism, an activation function, data standardization, a convolutional neural network and multi-modal feature fusion; the video characteristic and the audio characteristic are respectively input into a gating circulation unit of an anxiety analysis module, a multi-head attention mechanism of the anxiety analysis module and a convolutional neural network of the anxiety analysis module; performing one-time activation function activation and data standardization on the gated circulation unit of the anxiety analysis module, the multi-head attention mechanism of the anxiety analysis module and the output of the convolutional neural network of the anxiety analysis module, and inputting the output of the gated circulation unit of the anxiety analysis module, the multi-head attention mechanism of the anxiety analysis module and the output of the convolutional neural network of the anxiety analysis module after data standardization into the multi-mode feature fusion of the anxiety analysis module to obtain the anxiety feature;
the loss function applied in the anxiety analysis module training process is the root mean square error between the predicted and true values of the anxiety degree:

$$\mathrm{RMSE}_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i^{A} - y_i^{A}\right)^2}$$

where $\mathrm{RMSE}_A$ is the root mean square error between the predicted and true values of the anxiety degree, $n$ is the number of samples, and $\hat{y}_i^{A}$ and $y_i^{A}$ are the predicted and true anxiety scores of the $i$-th sample;
the evaluation criteria of the anxiety degree are: normal for 0-7, mild anxiety for 8-9, moderate anxiety for 10-14, severe anxiety for 15-19, and very severe for more than 19;
as shown in fig. 2, the network of pressure analysis modules comprises:
the system comprises a gate control circulation unit, a multi-head attention mechanism, an activation function, data standardization, a convolutional neural network and multi-modal feature fusion; the video characteristic and the audio characteristic are respectively input into a gating circulation unit of the pressure analysis module, a multi-head attention mechanism of the pressure analysis module and a convolution neural network of the pressure analysis module; performing one-time activation function activation and data standardization on the gated circulation unit of the pressure analysis module, the multi-head attention mechanism of the pressure analysis module and the output of the convolutional neural network of the pressure analysis module, and inputting the gated circulation unit of the pressure analysis module, the multi-head attention mechanism of the pressure analysis module and the output of the convolutional neural network of the pressure analysis module after data standardization into the multi-mode feature fusion of the pressure analysis module to obtain the pressure feature;
the loss function applied in the stress analysis module training process is the root mean square error between the predicted and true values of the stress degree:

$$\mathrm{RMSE}_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i^{S} - y_i^{S}\right)^2}$$

where $\mathrm{RMSE}_S$ is the root mean square error between the predicted and true values of the stress degree, $n$ is the number of samples, and $\hat{y}_i^{S}$ and $y_i^{S}$ are the predicted and true stress scores of the $i$-th sample;
the evaluation standard of the pressure degree is as follows: normal for 0-14 points, mild pressure for 15-18 points, moderate pressure for 19-25 points, severe pressure for 26-33 points, and very severe for more than 33 points;
the specific parameter settings for each module are as follows:
Each analysis module first inputs the audio and video features into a gated recurrent unit. The gated recurrent unit is a variant of the long short-term memory network; it captures context dependencies, alleviates the long-dependency and vanishing-gradient problems, and has a simpler structure with better effect. A multi-head attention mechanism is then applied, with the number of attention heads set to 8, computing feature representations from 8 different angles. Features are then extracted by a convolutional neural network with 512 convolution kernels of size 3×3, which excels at extracting local features. After each of these three operations, an activation function and data standardization are applied once: the activation function is the parametric rectified linear unit (PReLU), which adds non-linearity, and the data standardization is batch normalization, which counteracts the influence of data shift and accelerates training. Finally, the audio and video features are concatenated and fused through a fully connected neural network with 1024 neurons to form the multi-angle depression, anxiety and stress features. The loss function is the root mean square error between the predicted and true values; after several iterations of training, the depression, anxiety and stress analysis modules can respectively analyse the subject's degree of depression, anxiety and stress. The three trained modules are then placed into the model for final training;
the concrete model structure is as follows:
The gated recurrent unit is computed as follows:

$$z_t = \sigma\!\left(W_z x_t + U_z h_{t-1} + b_z\right)$$
$$r_t = \sigma\!\left(W_r x_t + U_r h_{t-1} + b_r\right)$$
$$\tilde{h}_t = \tanh\!\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right)$$
$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $x_t$ is the input feature, $h_{t-1}$ is the hidden-layer output at the previous moment, $h_t$ is the hidden-layer output at this moment, $W$ and $U$ are weight matrices, and $b$ is a bias. The gated recurrent unit has two gate functions: the reset gate $r_t$ controls the extent to which the hidden state at the previous moment enters the current candidate hidden state $\tilde{h}_t$, and the update gate $z_t$ controls the degree to which the hidden state is updated from the previous moment to the current moment;
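A single GRU step following these equations can be sketched in NumPy (a toy cell; the dimensions, random initialization and parameter packing are illustrative, not the patent's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate z, reset gate r, candidate state, new state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)            # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)            # reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand               # convex combination

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = (rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)), np.zeros(d_h),
          rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)), np.zeros(d_h),
          rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)), np.zeros(d_h))
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), params)
```

Because the new state is a convex combination of the previous state and a tanh candidate, its entries stay bounded, which is part of what tames the vanishing-gradient problem mentioned above.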
the multi-head attention mechanism formula is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_8)\,W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V)$$

where $Q$, $K$ and $V$ represent the sets of input queries, keys and values respectively, and each head computes scaled dot-product attention as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
applying multi-head self-attention to the input allows the features to be analyzed from multiple angles, enhancing useful features and suppressing useless ones;
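A minimal NumPy sketch of multi-head scaled dot-product attention follows; for brevity it omits the learned projection matrices and simply splits the input into 8 subspaces, which is a simplifying assumption of this illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads=8):
    """Scaled dot-product attention computed in num_heads subspaces.

    Q, K, V: arrays of shape (seq_len, d_model); d_model must be
    divisible by num_heads. Each head attends in its own slice of the
    feature dimension, so the representation is computed from several
    angles; the head outputs are concatenated at the end.
    """
    seq_len, d_model = Q.shape
    d_k = d_model // num_heads
    heads = []
    for i in range(num_heads):
        s = slice(i * d_k, (i + 1) * d_k)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1)  # shape (seq_len, d_model)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 16))
out = multi_head_attention(X, X, X, num_heads=8)
```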
the activation function is the parametric rectified linear unit (PReLU), with the following formula:

$$f(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases}$$

where $x$ is the input and $a$ is a trainable parameter;
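The PReLU can be sketched in one line of NumPy; the initial slope of 0.25 is an illustrative choice (in training, `a` is learned), not a value specified in the description:

```python
import numpy as np

def prelu(x, a=0.25):
    """Parametric ReLU: identity for positive inputs, slope `a` otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, a * x)
```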
data normalization uses batch normalization, with the following formula:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where $x$ is the input sample data, $\mu$ is the sample mean, $\sigma^2$ is the sample variance, $\epsilon$ is a small constant for numerical stability, and $\hat{x}$ is the normalized sample data; batch normalization effectively alleviates internal covariate shift;
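The normalization step can be sketched as follows; the learnable scale and shift parameters of full batch normalization are omitted here for brevity:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize a batch to zero mean and unit variance per feature.

    x: array of shape (batch, features). The small eps keeps the
    division stable when a feature has near-zero variance.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
batch = rng.standard_normal((32, 4)) * 5.0 + 3.0  # shifted, scaled data
normed = batch_norm(batch)
```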
s3: inputting the depression feature, the anxiety feature and the stress feature into a fusion analysis module for attention-based feature fusion to obtain a fused feature; the fusion analysis module performs the feature fusion with an attention mechanism;
the fusion analysis module performs feature fusion with an attention mechanism, formulated as follows:

$$u = \tanh\big(W\,[\,W_a a_i,\ W_d d_i,\ W_s s_i\,]\big)$$

$$[\alpha_a, \alpha_d, \alpha_s] = \mathrm{softmax}(v^{\top} u)$$

$$f = \alpha_a a_i + \alpha_d d_i + \alpha_s s_i$$

where $u$ is the fused intermediate feature obtained by linear transformation and splicing, used to subsequently compute the attention weight of each feature; $a_i$, $d_i$ and $s_i$ are respectively the anxiety, depression and stress features of the $i$-th sequence; $W_a$, $W_d$, $W_s$ and $W$ are trainable parameter matrices; $v$ is a trainable parameter vector; $\alpha_a$, $\alpha_d$ and $\alpha_s$ are the attention weights of the anxiety, depression and stress features; and $f$ is the fused feature obtained through attention computation;
because the anxiety, depression and stress states of a subject contribute differently to mental state assessment, the anxiety, depression and stress features are fused by an attention mechanism, so that the model automatically learns the weight of each feature, emphasizing features with large contributions and suppressing useless ones;
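The attention-weighted fusion can be sketched as follows; scoring each feature against a single trainable vector `v` is a simplifying assumption of this illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(anxiety, depression, stress, v):
    """Fuse three feature vectors with learned attention weights.

    Each feature is scored against the trainable vector v; softmax
    turns the scores into weights summing to 1, and the fused feature
    is the weighted sum, so strongly contributing features dominate.
    """
    feats = np.stack([anxiety, depression, stress])  # shape (3, d)
    scores = feats @ v                               # shape (3,)
    alpha = softmax(scores)
    return alpha @ feats, alpha

rng = np.random.default_rng(3)
fused, alpha = attention_fuse(*rng.standard_normal((3, 8)),
                              rng.standard_normal(8))
```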
s4: inputting the fusion features into a support vector regression to evaluate the mental state of the individual in the audio file and the video file;
the support vector regression formula is as follows:

$$\min_{w,b}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m} \ell_{\varepsilon}\big(f(x_i) - y_i\big),\qquad f(x_i) = w^{\top}\phi(x_i) + b$$

$$\ell_{\varepsilon}(z) = \begin{cases} 0, & |z| \le \varepsilon \\ |z| - \varepsilon, & \text{otherwise} \end{cases}$$

where $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $\ell_{\varepsilon}$ is the $\varepsilon$-insensitive loss function, $f(x_i)$ is the value predicted by the support vector regression for feature $x_i$, and $y_i$ is the actual mental-state value of the individual sample in the audio file and the video file;
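The epsilon-insensitive loss used in the support vector regression objective can be sketched as (the default tube width 0.1 is an illustrative choice):

```python
def eps_insensitive_loss(pred, actual, eps=0.1):
    """Epsilon-insensitive loss: zero inside the eps tube, linear outside.

    Errors smaller than eps are ignored entirely, which gives support
    vector regression its sparse, noise-tolerant fit.
    """
    err = abs(pred - actual)
    return max(0.0, err - eps)
```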
the specific evaluation criteria for evaluating the mental state of the individual in the audio file and the video file are as follows: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severely impaired for 41-50.
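The banded criteria above can be expressed as a small lookup; note the label for the 21-30 band is assumed to be "moderately impaired", consistent with the five-level normal/mild/moderate/severe/very-severe scales used elsewhere in the claims:

```python
def mental_state_category(score):
    """Map an assessment score to its band per the described criteria."""
    bands = [(10, "normal"),
             (20, "mildly impaired"),
             (30, "moderately impaired"),   # assumed label for 21-30
             (40, "severely impaired"),
             (50, "very severely impaired")]
    for upper, label in bands:
        if score <= upper:
            return label
    return "very severely impaired"  # scores above 50 clamp to the top band
```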
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A method for multi-modal mental state assessment based on multi-angle analysis, the method comprising:
s1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from an audio file, and taking the time domain waveform points and the Mel frequency cepstrum coefficients as audio features;
sampling the video file according to a certain frequency to obtain a picture sequence, and inputting the picture sequence into a pre-training network to obtain a video coding vector;
extracting facial action units from the picture sequence by using the OpenFace tool;
taking the video coding vector and the facial action units as video features;
s2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
s3: inputting the depression characteristic, the anxiety characteristic and the pressure characteristic into a fusion analysis module for attention characteristic fusion to obtain a fusion characteristic;
s4: inputting the fusion features into a support vector regression, and evaluating the mental state of the individual in the audio file and the video file.
2. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the specific method for extracting time domain waveform points from audio files is as follows:
extracting an audio file from an original MP4 long video file, and saving the audio file in a wav file format; extracting original waveform points of the audio file in the wav file format, and storing the original waveform points in the mat format;
the specific method for extracting the time domain waveform points and the Mel frequency cepstrum coefficients from the audio file comprises the following steps:
pre-emphasis, framing and windowing are carried out on the audio file in the wav file format, and then fast Fourier transform is carried out to obtain a Fourier spectrum;
the Fourier spectrum passes through a Mel filter bank, then logarithm operation is carried out, and finally discrete cosine transformation is carried out to obtain a Mel frequency cepstrum coefficient;
and storing the Mel frequency cepstrum coefficient in a mat format.
3. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said network of depression analysis modules comprises:
the system comprises a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network and multi-modal feature fusion; the video features and the audio features are respectively input into the gated recurrent unit of the depression analysis module, the multi-head attention mechanism of the depression analysis module and the convolutional neural network of the depression analysis module; the outputs of the gated recurrent unit, the multi-head attention mechanism and the convolutional neural network of the depression analysis module each undergo one activation-function activation and one data normalization, and the normalized outputs are input into the multi-modal feature fusion of the depression analysis module to obtain the depression features.
4. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 3, wherein the loss function applied in the training process of the depression analysis module is the root mean square error between the predicted value and the true value of the depression degree, with the following formula:

$$RMSE_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}_i - y_i\big)^2}$$

where $RMSE_D$ is the root mean square error between the predicted and true values of the depression degree, $n$ is the number of samples, $\hat{y}_i$ is the predicted depression degree of the $i$-th sample, and $y_i$ is its true depression degree;
the criteria for the degree of depression were: normal for a score of 0-9, mild depression for a score of 10-13, moderate depression for a score of 14-20, severe depression for a score of 21-27, and very severe for a score of greater than 27.
5. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the network of anxiety analysis modules comprises:
the system comprises a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network and multi-modal feature fusion; the video features and the audio features are respectively input into the gated recurrent unit of the anxiety analysis module, the multi-head attention mechanism of the anxiety analysis module and the convolutional neural network of the anxiety analysis module; the outputs of the gated recurrent unit, the multi-head attention mechanism and the convolutional neural network of the anxiety analysis module each undergo one activation-function activation and one data normalization, and the normalized outputs are input into the multi-modal feature fusion of the anxiety analysis module to obtain the anxiety features.
6. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 5, wherein the loss function applied in the training process of the anxiety analysis module is the root mean square error between the predicted value and the true value of the anxiety degree, with the following formula:

$$RMSE_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}_i - y_i\big)^2}$$

where $RMSE_A$ is the root mean square error between the predicted and true values of the anxiety degree, $n$ is the number of samples, $\hat{y}_i$ is the predicted anxiety degree of the $i$-th sample, and $y_i$ is its true anxiety degree;
the evaluation criteria of the anxiety degree are: normal for scores of 0-7, mild anxiety for scores of 8-9, moderate anxiety for scores of 10-14, severe anxiety for scores of 15-19, and very severe for scores of more than 19.
7. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said network of stress analysis modules comprises:
the system comprises a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network and multi-modal feature fusion; the video features and the audio features are respectively input into the gated recurrent unit of the stress analysis module, the multi-head attention mechanism of the stress analysis module and the convolutional neural network of the stress analysis module; the outputs of the gated recurrent unit, the multi-head attention mechanism and the convolutional neural network of the stress analysis module each undergo one activation-function activation and one data normalization, and the normalized outputs are input into the multi-modal feature fusion of the stress analysis module to obtain the stress features.
8. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 7, wherein the loss function applied in the training process of the stress analysis module is the root mean square error between the predicted value and the true value of the stress degree, with the following formula:

$$RMSE_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}_i - y_i\big)^2}$$

where $RMSE_S$ is the root mean square error between the predicted and true values of the stress degree, $n$ is the number of samples, $\hat{y}_i$ is the predicted stress degree of the $i$-th sample, and $y_i$ is its true stress degree;
the evaluation standard of the pressure degree is as follows: normal for 0-14 points, mild for 15-18 points, moderate for 19-25 points, severe for 26-33 points, and very severe for more than 33 points.
9. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said fusion analysis module employs an attention mechanism for feature fusion.
10. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the support vector regression formula is as follows:
$$\min_{w,b}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m} \ell_{\varepsilon}\big(f(x_i) - y_i\big),\qquad f(x_i) = w^{\top}\phi(x_i) + b$$

$$\ell_{\varepsilon}(z) = \begin{cases} 0, & |z| \le \varepsilon \\ |z| - \varepsilon, & \text{otherwise} \end{cases}$$

where $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $\ell_{\varepsilon}$ is the $\varepsilon$-insensitive loss function, $f(x_i)$ is the value predicted by the support vector regression for feature $x_i$, and $y_i$ is the actual mental-state value of the individual sample in the audio file and the video file;
the specific evaluation criteria for evaluating the mental state of the individual in the audio file and the video file are as follows: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severely impaired for 41-50.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110732115.XA CN113274023B (en) | 2021-06-30 | 2021-06-30 | Multi-modal mental state assessment method based on multi-angle analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113274023A true CN113274023A (en) | 2021-08-20 |
CN113274023B CN113274023B (en) | 2021-12-14 |
Family
ID=77286269
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110732115.XA Active CN113274023B (en) | 2021-06-30 | 2021-06-30 | Multi-modal mental state assessment method based on multi-angle analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113274023B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115064246A (en) * | 2022-08-18 | 2022-09-16 | 山东第一医科大学附属省立医院(山东省立医院) | Depression evaluation system and equipment based on multi-mode information fusion |
CN115910329A (en) * | 2023-01-06 | 2023-04-04 | 江苏瑞康成医疗科技有限公司 | Intelligent depression identification method and device |
CN116661607A (en) * | 2023-07-24 | 2023-08-29 | 北京智精灵科技有限公司 | Emotion adjustment method and system based on multi-modal emotion interaction |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170245759A1 (en) * | 2016-02-25 | 2017-08-31 | Samsung Electronics Co., Ltd. | Image-analysis for assessing heart failure |
US20190074028A1 (en) * | 2017-09-01 | 2019-03-07 | Newton Howard | Real-time vocal features extraction for automated emotional or mental state assessment |
US20200118458A1 (en) * | 2018-06-19 | 2020-04-16 | Ellipsis Health, Inc. | Systems and methods for mental health assessment |
CN111225612A (en) * | 2017-10-17 | 2020-06-02 | 萨蒂什·拉奥 | Neural obstacle identification and monitoring system based on machine learning |
CN111951824A (en) * | 2020-08-14 | 2020-11-17 | 苏州国岭技研智能科技有限公司 | Detection method for distinguishing depression based on sound |
CN112818892A (en) * | 2021-02-10 | 2021-05-18 | 杭州医典智能科技有限公司 | Multi-modal depression detection method and system based on time convolution neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||