CN113274023A - Multi-modal mental state assessment method based on multi-angle analysis - Google Patents

Multi-modal mental state assessment method based on multi-angle analysis

Info

Publication number
CN113274023A
Authority
CN
China
Prior art keywords
analysis module
anxiety
depression
video
features
Prior art date
Legal status
Granted
Application number
CN202110732115.XA
Other languages
Chinese (zh)
Other versions
CN113274023B (en)
Inventor
陶建华
蔡聪
刘斌
柳雪飞
Current Assignee
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN202110732115.XA
Publication of CN113274023A
Application granted
Publication of CN113274023B
Legal status: Active

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B 5/165: Evaluating the state of mind, e.g. depression, anxiety
    • A61B 5/72: Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B 5/7235: Details of waveform analysis
    • A61B 5/7253: Details of waveform analysis characterised by using transforms
    • A61B 5/7257: Details of waveform analysis characterised by using transforms using Fourier transforms
    • A61B 5/7264: Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems

Abstract

The invention provides a multi-modal mental state assessment method based on multi-angle analysis, which comprises the following steps: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files; extracting time domain waveform points and Mel frequency cepstrum coefficients from the audio file as audio features; sampling the video file into a picture sequence and inputting the picture sequence into a pre-training network to obtain a video coding vector; extracting face motion units from the picture sequence; taking the video coding vector and the face motion units as video features; respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features; inputting the depression, anxiety and stress features into a fusion analysis module for attention feature fusion to obtain fused features; and inputting the fused features into a support vector regression model to evaluate the mental state of the individual in the audio file and the video file.

Description

Multi-modal mental state assessment method based on multi-angle analysis
Technical Field
The invention relates to the field of voice processing and image processing, in particular to a multi-modal mental state assessment method based on multi-angle analysis.
Background
Mental state analysis not only describes psychological phenomena but also seeks the psychological motivations behind them; it reveals not only surface-level psychological patterns but also people's deep, unconscious psychological mechanisms, and it is of great significance for exploring self-awareness. For example, analyzing patients' mental states makes it possible to apply different treatment schemes to different mental states; severe psychological or physiological reactions can disturb the endocrine and other systems and thereby compromise the treatment effect.
Application publication No. CN108888281A provides a mental state assessment method, apparatus and system in the technical field of mental state assessment. The method comprises the following steps: collecting audio data and video data of a person to be assessed within a preset time; extracting the person's multi-modal physiological features from the audio data and the video data, the multi-modal physiological features comprising facial pupil data features, voice data features, and heart rate variability data features; and outputting a mental state assessment result for the person according to the multi-modal physiological features and a preset correlation model, wherein the correlation model is a model trained, based on a neural network or a support vector machine (SVM), to classify individual data under different mental states.
Application publication No. CN109547695A provides a holographic video monitoring system and method that directionally captures pictures based on a sound classification algorithm, comprising a front-end acquisition system, transmission equipment, a central control platform and display recording equipment. The front-end acquisition system collects on-site audio and video data and transmits them to the central control platform through the transmission equipment. The central control platform performs noise reduction and sound classification on the audio data with a support vector machine recognition algorithm over Mel frequency cepstrum coefficients, extracts the audio segments required by the user, sends those segments and the corresponding video data to the display recording equipment, and directionally captures and magnifies the corresponding video frames for a selected specific sound. The display recording equipment plays the monitoring data synchronously in real time, can recall the monitoring data of any time interval, and plays the video pictures captured and magnified for the specific sound.
The problem with the prior art is that most methods assess only a single mental state, without taking the subject's various psychological aspects, such as depression and anxiety, into account. In addition, most conventional methods predict through multiple steps and multiple models, so the objective function of each model deviates from the final prediction target, errors accumulate easily, and the prediction results are inaccurate.
Disclosure of Invention
In view of this, the present invention provides a multi-modal mental state assessment method based on multi-angle analysis, and specifically, the present invention is implemented by the following technical solutions:
S1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from the audio file, and taking the time domain waveform points and the Mel frequency cepstrum coefficients as audio features;
sampling the video file at a fixed frequency to obtain a picture sequence, and inputting the picture sequence into a pre-training network to obtain a video coding vector;
extracting the face motion units of the picture sequence by using the OpenFace tool;
taking the video coding vector and the face motion units as video features;
S2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
S3: inputting the depression features, the anxiety features and the stress features into a fusion analysis module for attention feature fusion to obtain fused features;
S4: inputting the fused features into a support vector regression model, and evaluating the mental state of the individual in the audio file and the video file.
Preferably, the specific method for extracting the time domain waveform points from the audio file is as follows:
extracting the audio track from the original MP4 long video file and saving it as a WAV file; extracting the raw waveform points of the WAV audio file and storing them in .mat format;
the specific method for extracting the Mel frequency cepstrum coefficients from the audio file is as follows:
pre-emphasis, framing and windowing are applied to the WAV audio file, followed by a fast Fourier transform to obtain the Fourier spectrum;
the Fourier spectrum is passed through a Mel filter bank, a logarithm is taken, and finally a discrete cosine transform is applied to obtain the Mel frequency cepstrum coefficients;
the Mel frequency cepstrum coefficients are stored in .mat format.
Preferably, the network of the depression analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the depression analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the depression analysis module to obtain the depression features.
Preferably, the loss function applied in training the depression analysis module is the root mean square error between the predicted and true values of the depression degree:

$$RMSE_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_D$ is the root mean square error between the predicted and true depression degree, $\hat{y}_i$ is the predicted value of the depression degree, $y_i$ is the true value of the depression degree, and $n$ is the number of samples;
the criteria for the degree of depression are: normal for a score of 0-9, mild depression for 10-13, moderate depression for 14-20, severe depression for 21-27, and very severe for a score greater than 27.
Preferably, the network of the anxiety analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the anxiety analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the anxiety analysis module to obtain the anxiety features.
Preferably, the loss function applied in training the anxiety analysis module is the root mean square error between the predicted and true values of the anxiety degree:

$$RMSE_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_A$ is the root mean square error between the predicted and true anxiety degree, $\hat{y}_i$ is the predicted value of the anxiety degree, $y_i$ is the true value of the anxiety degree, and $n$ is the number of samples;
the criteria for the degree of anxiety were: normal for scores of 0-7, mild anxiety for scores of 8-9, moderate anxiety for scores of 10-14, severe anxiety for scores of 15-19, and very severe for scores of more than 19.
Preferably, the network of the stress analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the stress analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the stress analysis module to obtain the stress features.
Preferably, the loss function applied in training the stress analysis module is the root mean square error between the predicted and true values of the stress degree:

$$RMSE_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_S$ is the root mean square error between the predicted and true stress degree, $\hat{y}_i$ is the predicted value of the stress degree, $y_i$ is the true value of the stress degree, and $n$ is the number of samples;
the evaluation criteria of the degree of stress were: normal for 0-14 points, mild for 15-18 points, moderate for 19-25 points, severe for 26-33 points, and very severe for more than 33 points.
Preferably, the fusion analysis module performs feature fusion using an attention mechanism.
Preferably, the support vector regression formula is as follows:
$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{m} l_{\epsilon}\left(f(x_i) - y_i\right)$$

wherein $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $l_{\epsilon}$ is the ε-insensitive loss function, $f(x_i)$ is the support vector regression prediction for the fused feature $x_i$, and $y_i$ is the true mental state value of the individual sample in the audio file and the video file;
the specific evaluation criteria for evaluating the mental state of the individual in the audio file and the video file are as follows: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severe for 41-50.
Compared with the prior art, the technical solution provided by the embodiments of the present application has the following advantages:
(1) through the multi-head attention mechanism and multi-modal fusion, the method considers not only the mental state information within each modality but also the dependencies between modalities; fusing the audio and video modalities improves the accuracy of mental state analysis;
(2) the depression, anxiety and stress features of the individual are considered jointly, and multi-task modeling is performed from multiple angles; compared with traditional mental state analysis, information from all aspects of the individual is considered more comprehensively, enhancing the comprehensiveness of the analysis;
(3) compared with traditional direct concatenation fusion, attention fusion accounts for the different importance of each feature and assigns different weights, making better use of the strengths of neural networks and achieving better performance.
Drawings
FIG. 1 is a block diagram of a multi-modal mental state assessment method based on multi-angle analysis according to an embodiment of the present invention;
FIG. 2 is a data flow diagram of the anxiety/depression/stress analysis module of the multi-modal mental state assessment method based on multi-angle analysis according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The method for multi-modal mental state assessment based on multi-angle analysis provided by the embodiment of the application as shown in FIG. 1 comprises the following steps:
S1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from the audio file, and taking them as the audio features;
extracting the audio track from the subject's original MP4 long video file with the FFmpeg tool and saving it as a WAV file; sampling the audio at a 16 kHz sampling rate, extracting the raw waveform points of the WAV audio file, and storing them in .mat format;
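As an illustrative sketch of this extraction step (not part of the claimed method; the file names and the use of SciPy for .mat storage are assumptions), it could be scripted in Python as follows:

```python
import subprocess
from scipy.io import wavfile, savemat

# Extract a mono 16 kHz WAV track from the MP4 interview video (requires FFmpeg).
subprocess.run([
    "ffmpeg", "-i", "subject.mp4",   # hypothetical input path
    "-vn",                           # drop the video stream
    "-ac", "1", "-ar", "16000",      # mono, 16 kHz sampling rate
    "-acodec", "pcm_s16le",          # 16-bit PCM
    "subject.wav",
], check=True)

# Load the raw time-domain waveform points and store them in .mat format.
sample_rate, waveform = wavfile.read("subject.wav")
savemat("subject_waveform.mat", {"waveform": waveform, "fs": sample_rate})
```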
the specific method for extracting the time domain waveform points and the Mel frequency cepstrum coefficients from the audio file comprises the following steps:
pre-emphasizing the audio file in the wav file format in a manner that the frequency spectrum of the signal becomes flat by passing through a first-order finite excitation response high-pass filter;
dividing frames in a mode that 512 sampling point sets are used as an observation unit, namely each frame is 32ms, and the overlapping area between two adjacent frames is 50%;
windowing, namely windowing a frame of voice by adopting a Hamming window so as to reduce the influence of the Gibbs effect;
then, fast Fourier transform is carried out to obtain a Fourier spectrum;
the Fourier spectrum passes through a Mel filter bank, then logarithm operation is carried out, finally discrete pre-transformation is carried out, the first order difference and the second order difference are obtained, and then one bit of energy is added, so that a Mel frequency cepstrum coefficient is obtained;
storing the mel frequency cepstrum coefficient in a mat format;
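A minimal Python sketch of this MFCC pipeline is given below. The frame length (512 samples, 50% overlap) follows the text; the pre-emphasis coefficient of 0.97, the filter-bank size of 26, the 13 retained coefficients, and the use of np.gradient for the differences are assumed typical choices not specified here:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mfcc(signal, fs=16000, frame_len=512, n_mels=26, n_ceps=13):
    # Pre-emphasis: first-order FIR high-pass filter flattens the spectrum.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1]).astype(float)
    # Framing: 512-sample frames (32 ms at 16 kHz) with 50% overlap, Hamming windowed.
    hop = frame_len // 2
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Fast Fourier transform -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2 / frame_len
    # Mel filter bank -> logarithm -> discrete cosine transform.
    fb = librosa.filters.mel(sr=fs, n_fft=frame_len, n_mels=n_mels)
    log_mel = np.log(power @ fb.T + 1e-10)
    ceps = dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]
    # First- and second-order differences, plus one frame-energy dimension.
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    energy = np.log(power.sum(axis=1, keepdims=True) + 1e-10)
    return np.hstack([ceps, d1, d2, energy])
```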
the video file is sampled at 6 frames per second to obtain a picture sequence; the picture sequence is input into a ResNet-50 pre-training network to obtain the video coding vector, which is stored in .mat format;
the face motion units of the picture sequence are extracted with the OpenFace tool and stored in CSV format;
the video coding vector and the face motion units are taken as the video features;
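The video branch can be sketched in the same spirit (a sketch only: the input path, the torchvision preprocessing pipeline and dropping the classification head are assumptions; the 6 frames-per-second rate and ResNet-50 come from the text; the face motion units are produced separately by the OpenFace command-line tool as CSV):

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet-50 pre-trained on ImageNet; replacing the classifier with an identity
# leaves the 2048-dimensional pooled feature as the per-frame video coding vector.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

cap = cv2.VideoCapture("subject.mp4")                 # hypothetical input path
step = max(1, round(cap.get(cv2.CAP_PROP_FPS) / 6))   # keep 6 frames per second
frames, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        frames.append(preprocess(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    idx += 1
cap.release()

with torch.no_grad():
    video_vectors = resnet(torch.stack(frames))       # (n_frames, 2048)
```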
S2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
As shown in FIG. 2, the network of the depression analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the depression analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the depression analysis module to obtain the depression features;
the loss function applied in training the depression analysis module is the root mean square error between the predicted and true values of the depression degree:

$$RMSE_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_D$ is the root mean square error between the predicted and true depression degree, $\hat{y}_i$ is the predicted value of the depression degree, $y_i$ is the true value of the depression degree, and $n$ is the number of samples;
the criteria for the degree of depression are: normal for a score of 0-9, mild depression for 10-13, moderate depression for 14-20, severe depression for 21-27, and very severe for a score greater than 27;
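The same RMSE form is reused by the anxiety and stress modules below; as a sketch (not the authors' code), it is a one-liner in PyTorch:

```python
import torch

def rmse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Root mean square error between predicted and true degree scores."""
    return torch.sqrt(torch.mean((pred - target) ** 2))
```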
As shown in FIG. 2, the network of the anxiety analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the anxiety analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the anxiety analysis module to obtain the anxiety features;
the loss function applied in training the anxiety analysis module is the root mean square error between the predicted and true values of the anxiety degree:

$$RMSE_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_A$ is the root mean square error between the predicted and true anxiety degree, $\hat{y}_i$ is the predicted value of the anxiety degree, $y_i$ is the true value of the anxiety degree, and $n$ is the number of samples;
the evaluation criteria of the anxiety degree are: normal for 0-7, mild anxiety for 8-9, moderate anxiety for 10-14, severe anxiety for 15-19, and very severe for more than 19;
As shown in FIG. 2, the network of the stress analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the stress analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the stress analysis module to obtain the stress features;
the loss function applied in training the stress analysis module is the root mean square error between the predicted and true values of the stress degree:

$$RMSE_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_S$ is the root mean square error between the predicted and true stress degree, $\hat{y}_i$ is the predicted value of the stress degree, $y_i$ is the true value of the stress degree, and $n$ is the number of samples;
the evaluation criteria of the stress degree are as follows: normal for 0-14 points, mild stress for 15-18 points, moderate stress for 19-25 points, severe stress for 26-33 points, and very severe for more than 33 points;
the specific parameter settings for each module are as follows:
Each analysis module first feeds the audio features and the video features into a gated recurrent unit. The gated recurrent unit is a variant of the long short-term memory network: it captures contextual dependencies, alleviates the long-range dependency and vanishing-gradient problems, and achieves good results with a simpler structure. A multi-head attention mechanism is applied next, with the number of attention heads set to 8, so that feature representations are computed from 8 different angles. Features are also extracted through a convolutional neural network with 512 convolution kernels of size 3×3, which excels at extracting local features. After each of these three operations, one round of activation and data normalization is performed: the activation function is the parametric rectified linear unit (PReLU), which adds nonlinearity, and the normalization is batch normalization, which counteracts data shift and speeds up training. Finally, the audio features and the video features are concatenated and fused through a fully connected neural network with 1024 neurons to form the multi-angle depression, anxiety and stress features. The loss function is the root mean square error between the predicted and true values; after several iterations of training, the depression, anxiety and stress analysis modules can respectively analyze the subject's degree of depression, anxiety and stress. The three trained modules are then placed into the overall model for final training;
the concrete model structure is as follows:
the gated recurrent unit is formulated as:

$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)$$

$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)$$

$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right)$$

$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

wherein $x_t$ is the input feature, $h_{t-1}$ is the hidden-layer output at the previous moment, $h_t$ is the hidden-layer output at the current moment, $W$ and $U$ are weight matrices, and $b$ is a bias; the gated recurrent unit has two gate functions: the reset gate $r_t$ controls to what extent the hidden state of the previous moment enters the current candidate hidden state, and the update gate $z_t$ controls to what extent the hidden state of the previous moment is carried over into the current hidden state;
the multi-head attention mechanism is formulated as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_8\right) W^O$$

$$\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)$$

wherein $Q$, $K$ and $V$ represent the sets of input queries, keys and values, respectively, and the scaled dot-product attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

performing self-attention over the input with the multi-head attention mechanism analyzes the features from multiple angles, enhancing useful features and suppressing useless ones;
the activation function is the parametric rectified linear unit:

$$\mathrm{PReLU}(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases}$$

wherein $x$ is the input and $a$ is a trainable parameter;
data normalization uses batch normalization, with the following formulas:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

wherein $x_i$ is the input sample data, $\mu$ is the sample mean, $\sigma^2$ is the sample variance, and $\hat{x}_i$ is the normalized sample data; batch normalization effectively mitigates the problem of internal covariate shift;
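Putting these components together, one analysis module (depression, anxiety or stress) might look like the following PyTorch sketch. The 8 attention heads, 512 convolution kernels and 1024-neuron fusion layer follow the text; the GRU hidden size, the treatment of the 3×3 convolution as a single-channel 2-D convolution over the (time, feature) map, and the mean-pooling are assumptions:

```python
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """GRU, 8-head self-attention and CNN over one modality, each stream
    followed by PReLU activation and batch normalization."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        # feat_dim must be divisible by the number of heads (8).
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # 512 kernels of size 3x3 over the (time, feature) map as one channel.
        self.conv = nn.Conv2d(1, 512, kernel_size=3, padding=1)
        self.act = nn.PReLU()
        self.norm = nn.BatchNorm1d(hidden + feat_dim + 512)

    def forward(self, x):                   # x: (batch, time, feat_dim)
        g, _ = self.gru(x)                  # (batch, time, hidden)
        a, _ = self.attn(x, x, x)           # self-attention: Q = K = V = x
        c = self.conv(x.unsqueeze(1))       # (batch, 512, time, feat_dim)
        # Pool each stream to a fixed-size vector and concatenate.
        z = torch.cat([g.mean(1), a.mean(1), c.mean(dim=(2, 3))], dim=-1)
        return self.norm(self.act(z))

class AnalysisModule(nn.Module):
    """One multi-angle analysis module: audio and video branches fused by a
    1024-neuron fully connected layer; the fused vector is the task feature."""
    def __init__(self, audio_dim: int, video_dim: int, hidden: int = 256):
        super().__init__()
        self.audio = ModalityBranch(audio_dim, hidden)
        self.video = ModalityBranch(video_dim, hidden)
        self.fuse = nn.Linear(2 * (hidden + 512) + audio_dim + video_dim, 1024)
        self.head = nn.Linear(1024, 1)      # predicts the degree score

    def forward(self, audio, video):
        feat = self.fuse(torch.cat([self.audio(audio), self.video(video)], dim=-1))
        return feat, self.head(feat).squeeze(-1)
```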
S3: inputting the depression features, the anxiety features and the stress features into a fusion analysis module for attention feature fusion to obtain the fused features; the fusion analysis module performs feature fusion with an attention mechanism;
the fusion analysis module performs feature fusion by adopting an attention mechanism, and the formula is as follows:
Figure 97052DEST_PATH_IMAGE031
Figure 910287DEST_PATH_IMAGE032
Figure 142685DEST_PATH_IMAGE033
Figure 383174DEST_PATH_IMAGE034
wherein
Figure 333812DEST_PATH_IMAGE035
Is a fused intermediate feature obtained by linear transformation and splicing and is used for subsequently calculating the attention weight of each feature,
Figure 200006DEST_PATH_IMAGE036
Figure 552490DEST_PATH_IMAGE037
Figure 229459DEST_PATH_IMAGE038
are respectively the first
Figure 401814DEST_PATH_IMAGE040
Anxiety, depression and stress characteristics of the sequences,
Figure 556852DEST_PATH_IMAGE041
Figure 763843DEST_PATH_IMAGE042
Figure 611713DEST_PATH_IMAGE043
Figure 271364DEST_PATH_IMAGE044
are a matrix of trainable parameters that are,
Figure 230093DEST_PATH_IMAGE045
is a vector of parameters that can be trained,
Figure 291590DEST_PATH_IMAGE046
attention weights for anxiety features, depression features and stress features,
Figure 575941DEST_PATH_IMAGE047
is a fusion feature obtained through attention calculation;
because the subject's anxiety, depression and stress states contribute differently to the mental state assessment, the anxiety, depression and stress features are fused by an attention mechanism, so that the model automatically learns the feature weights, emphasizing features with large contributions and suppressing useless ones;
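Under the formulation above, the fusion step might be sketched as follows (dimension choices are assumptions):

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Learns attention weights over the anxiety, depression and stress
    features and returns their weighted sum as the fused feature."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])  # W_A, W_D, W_S
        self.mix = nn.Linear(3 * dim, dim)      # W_h over the spliced features
        self.u = nn.Linear(dim, 1, bias=False)  # trainable scoring vector u

    def forward(self, f_a, f_d, f_s):           # each: (batch, dim)
        feats = [f_a, f_d, f_s]
        # Intermediate feature h: linear transformation of the spliced features.
        h = self.mix(torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=-1))
        # Scores e_k = u^T tanh(h + f_k), then softmax over the three features.
        scores = torch.cat([self.u(torch.tanh(h + f)) for f in feats], dim=-1)
        alpha = torch.softmax(scores, dim=-1)    # (batch, 3) attention weights
        fused = sum(alpha[:, k:k + 1] * feats[k] for k in range(3))
        return fused, alpha
```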
S4: inputting the fused features into a support vector regression model to evaluate the mental state of the individual in the audio file and the video file;
the support vector regression formula is as follows:
$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{m} l_{\epsilon}\left(f(x_i) - y_i\right)$$

wherein $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $l_{\epsilon}$ is the ε-insensitive loss function, $f(x_i)$ is the support vector regression prediction for the fused feature $x_i$, and $y_i$ is the true mental state value of the individual sample in the audio file and the video file;
the specific evaluation criteria for evaluating the mental state of the individual in the audio file and the video file are as follows: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severe for 41-50.
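The final regression stage could be realized with scikit-learn's SVR, as in the sketch below; the hyperparameters and the placeholder training data are assumptions, and in practice the attention-fused features and labelled overall mental state scores would be used:

```python
import numpy as np
from sklearn.svm import SVR

# Placeholder stand-ins for the fused features and labelled scores.
rng = np.random.default_rng(0)
fused_train = rng.normal(size=(64, 1024))
scores_train = rng.uniform(0, 50, size=64)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # assumed hyperparameters
svr.fit(fused_train, scores_train)
pred = float(svr.predict(fused_train[:1])[0])

# Map the regressed score onto the evaluation bands given above.
bands = [(10, "normal"), (20, "mildly impaired"), (30, "moderately impaired"),
         (40, "severely impaired"), (50, "very severe")]
label = next(name for upper, name in bands if min(pred, 50.0) <= upper)
print(round(pred, 1), label)
```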
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms; these terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for multi-modal mental state assessment based on multi-angle analysis, the method comprising:
s1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from an audio file, and taking the time domain waveform points and the Mel frequency cepstrum coefficients as audio features;
sampling the video file according to a certain frequency to obtain a picture sequence, and inputting the picture sequence into a pre-training network to obtain a video coding vector;
extracting the face motion units of the picture sequence by using the OpenFace tool;
taking the video coding vector and the face motion units as video features;
s2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
s3: inputting the depression characteristic, the anxiety characteristic and the pressure characteristic into a fusion analysis module for attention characteristic fusion to obtain a fusion characteristic;
s4: inputting the fusion features into a support vector regression, and evaluating the mental state of the individual in the audio file and the video file.
2. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the specific method for extracting time domain waveform points from audio files is as follows:
extracting an audio file from an original MP4 long video file, and saving the audio file in a wav file format; extracting original waveform points of the audio file in the wav file format, and storing the original waveform points in the mat format;
the specific method for extracting the time domain waveform points and the Mel frequency cepstrum coefficients from the audio file comprises the following steps:
pre-emphasis, framing and windowing are carried out on the audio file in the wav file format, and then fast Fourier transform is carried out to obtain a Fourier spectrum;
the Fourier spectrum passes through a Mel filter bank, then logarithm operation is carried out, and finally discrete cosine transformation is carried out to obtain a Mel frequency cepstrum coefficient;
and storing the Mel frequency cepstrum coefficient in a mat format.
3. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said network of depression analysis modules comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the depression analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the depression analysis module to obtain the depression features.
4. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 3, wherein the loss function applied in training the depression analysis module is the root mean square error between the predicted and true values of the depression degree:

$$RMSE_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_D$ is the root mean square error between the predicted and true depression degree, $\hat{y}_i$ is the predicted value of the depression degree, $y_i$ is the true value of the depression degree, and $n$ is the number of samples;
the criteria for the degree of depression were: normal for a score of 0-9, mild depression for a score of 10-13, moderate depression for a score of 14-20, severe depression for a score of 21-27, and very severe for a score of greater than 27.
5. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the network of anxiety analysis modules comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the anxiety analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the anxiety analysis module to obtain the anxiety features.
6. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 5, wherein the loss function applied in training the anxiety analysis module is the root mean square error between the predicted and true values of the anxiety degree:

$$RMSE_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_A$ is the root mean square error between the predicted and true anxiety degree, $\hat{y}_i$ is the predicted value of the anxiety degree, $y_i$ is the true value of the anxiety degree, and $n$ is the number of samples;
the evaluation criteria of the anxiety degree are: normal for scores of 0-7, mild anxiety for scores of 8-9, moderate anxiety for scores of 10-14, severe anxiety for scores of 15-19, and very severe for scores of more than 19.
7. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said network of stress analysis modules comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the stress analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the stress analysis module to obtain the stress features.
8. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 7, wherein the loss function applied in training the stress analysis module is the root mean square error between the predicted and true values of the stress degree:

$$RMSE_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_S$ is the root mean square error between the predicted and true stress degree, $\hat{y}_i$ is the predicted value of the stress degree, $y_i$ is the true value of the stress degree, and $n$ is the number of samples;
the evaluation standard of the pressure degree is as follows: normal for 0-14 points, mild for 15-18 points, moderate for 19-25 points, severe for 26-33 points, and very severe for more than 33 points.
9. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said fusion analysis module employs an attention mechanism for feature fusion.
10. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the support vector regression formula is as follows:
$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{m} l_{\epsilon}\left(f(x_i) - y_i\right)$$

wherein $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $l_{\epsilon}$ is the ε-insensitive loss function, $f(x_i)$ is the support vector regression prediction for the fused feature $x_i$, and $y_i$ is the true mental state value of the individual sample in the audio file and the video file;
the specific evaluation criteria for evaluating the mental state of the individual in the audio file and the video file are as follows: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severe for 41-50.
CN202110732115.XA (filed 2021-06-30): Multi-modal mental state assessment method based on multi-angle analysis; active, granted as CN113274023B.

Priority Applications (1)

CN202110732115.XA (priority date 2021-06-30, filing date 2021-06-30): Multi-modal mental state assessment method based on multi-angle analysis; granted as CN113274023B.

Applications Claiming Priority (1)

CN202110732115.XA (priority date 2021-06-30, filing date 2021-06-30): Multi-modal mental state assessment method based on multi-angle analysis; granted as CN113274023B.

Publications (2)

CN113274023A: published 2021-08-20
CN113274023B: granted publication 2021-12-14

Family

ID=77286269

Family Applications (1)

CN202110732115.XA (priority date 2021-06-30, filing date 2021-06-30): Multi-modal mental state assessment method based on multi-angle analysis; active, granted as CN113274023B.

Country Status (1)

CN: CN113274023B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170245759A1 (en) * 2016-02-25 2017-08-31 Samsung Electronics Co., Ltd. Image-analysis for assessing heart failure
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN111225612A (en) * 2017-10-17 2020-06-02 萨蒂什·拉奥 Neural obstacle identification and monitoring system based on machine learning
US20200118458A1 (en) * 2018-06-19 2020-04-16 Ellipsis Health, Inc. Systems and methods for mental health assessment
CN111951824A (en) * 2020-08-14 2020-11-17 苏州国岭技研智能科技有限公司 Detection method for distinguishing depression based on sound
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115064246A (en) * 2022-08-18 2022-09-16 山东第一医科大学附属省立医院(山东省立医院) Depression evaluation system and equipment based on multi-mode information fusion
CN115064246B (en) * 2022-08-18 2022-12-20 山东第一医科大学附属省立医院(山东省立医院) Depression evaluation system and equipment based on multi-mode information fusion
CN115910329A (en) * 2023-01-06 2023-04-04 江苏瑞康成医疗科技有限公司 Intelligent depression identification method and device
CN116661607A (en) * 2023-07-24 2023-08-29 北京智精灵科技有限公司 Emotion adjustment method and system based on multi-modal emotion interaction

Also Published As

CN113274023B (en): 2021-12-14

Similar Documents

Publication Publication Date Title
CN113274023B (en) Multi-modal mental state assessment method based on multi-angle analysis
CN110556129B (en) Bimodal emotion recognition model training method and bimodal emotion recognition method
Liu et al. GMM and CNN hybrid method for short utterance speaker recognition
Brady et al. Multi-modal audio, video and physiological sensor learning for continuous emotion prediction
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN109859772B (en) Emotion recognition method, emotion recognition device and computer-readable storage medium
CN112006697B (en) Voice signal-based gradient lifting decision tree depression degree recognition system
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
Zhao et al. Hybrid network feature extraction for depression assessment from speech
Lai Contrastive predictive coding based feature for automatic speaker verification
CN107430678A (en) Use the inexpensive face recognition of Gauss received field feature
CN112669820B (en) Examination cheating recognition method and device based on voice recognition and computer equipment
Sefara The effects of normalisation methods on speech emotion recognition
Wang et al. Speaker recognition using convolutional neural network with minimal training data for smart home solutions
Sabatier et al. Measurement of the impact of identical twin voices on automatic speaker recognition
Chen et al. Cough detection using selected informative features from audio signals
Deepa et al. Speech technology in healthcare
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
JP7014761B2 (en) Cognitive function estimation method, computer program and cognitive function estimation device
Baldwin et al. Beyond speech: Generalizing d-vectors for biometric verification
Yadav et al. Portable neurological disease assessment using temporal analysis of speech
CN112863486A (en) Voice-based spoken language evaluation method and device and electronic equipment
Dua et al. Speaker recognition using noise robust features and LSTM-RNN
Siagian et al. Footstep Recognition Using Mel Frequency Cepstral Coefficients and Artificial Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant