CN113274023A - Multi-modal mental state assessment method based on multi-angle analysis - Google Patents

Multi-modal mental state assessment method based on multi-angle analysis

Info

Publication number
CN113274023A
Authority
CN
China
Prior art keywords
analysis module
anxiety
depression
video
features
Prior art date
Legal status
Granted
Application number
CN202110732115.XA
Other languages
Chinese (zh)
Other versions
CN113274023B (en)
Inventor
陶建华
蔡聪
刘斌
柳雪飞
Current Assignee
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN202110732115.XA
Publication of CN113274023A
Application granted
Publication of CN113274023B
Legal status: Active

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B 5/165: Evaluating the state of mind, e.g. depression, anxiety
    • A61B 5/72: Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B 5/7235: Details of waveform analysis
    • A61B 5/7253: Details of waveform analysis characterised by using transforms
    • A61B 5/7257: Details of waveform analysis characterised by using transforms using Fourier transforms
    • A61B 5/7264: Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems

Abstract

The invention provides a multi-modal mental state assessment method based on multi-angle analysis, which comprises the following steps: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files; extracting time domain waveform points and Mel frequency cepstrum coefficients from the audio file as audio features; sampling the video file into a picture sequence and inputting the picture sequence into a pre-training network to obtain a video coding vector; extracting face motion units from the picture sequence; taking the video coding vector and the face motion units as video features; respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features; inputting the depression, anxiety and stress features into a fusion analysis module for attention feature fusion to obtain fused features; and inputting the fused features into a support vector regression model to evaluate the mental state of the individual in the audio file and the video file.

Description

Multi-modal mental state assessment method based on multi-angle analysis
Technical Field
The invention relates to the field of voice processing and image processing, in particular to a multi-modal mental state assessment method based on multi-angle analysis.
Background
Mental state analysis not only describes psychological phenomena but also seeks the psychological motivations behind them; it reveals not only surface-level psychological patterns but also people's deep, unconscious psychological mechanisms, and it is of great significance for exploring self-awareness. For example, analyzing patients' mental states makes it possible to apply different treatment schemes to different mental states; severe psychological or physiological reactions can disturb the endocrine and other systems and thereby compromise the treatment effect.
Application publication No. CN108888281A provides a mental state assessment method, apparatus and system in the technical field of mental state assessment. The method comprises the following steps: collecting audio data and video data of a person to be assessed within a preset time; extracting the person's multi-modal physiological features from the audio data and the video data, the multi-modal physiological features comprising facial pupil data features, voice data features, and heart rate variability data features; and outputting a mental state assessment result for the person according to the multi-modal physiological features and a preset correlation model, wherein the correlation model is a model trained, based on a neural network or a support vector machine (SVM), to classify individual data under different mental states.
Application publication No. CN109547695A provides a holographic video monitoring system and method that directionally captures pictures based on a sound classification algorithm, comprising a front-end acquisition system, transmission equipment, a central control platform and display recording equipment. The front-end acquisition system collects on-site audio and video data and transmits them to the central control platform through the transmission equipment. The central control platform performs noise reduction and sound classification on the audio data with a support vector machine recognition algorithm over Mel frequency cepstrum coefficients, extracts the audio segments required by the user, sends those segments and the corresponding video data to the display recording equipment, and directionally captures and magnifies the corresponding video frames for a selected specific sound. The display recording equipment plays the monitoring data synchronously in real time, can recall the monitoring data of any time interval, and plays the video pictures captured and magnified for the specific sound.
The problem with the prior art is that most methods assess only a single mental state, without taking the subject's various psychological aspects, such as depression and anxiety, into account. In addition, most conventional methods predict through multiple steps and multiple models, so the objective function of each model deviates from the final prediction target, errors accumulate easily, and the prediction results are inaccurate.
Disclosure of Invention
In view of this, the present invention provides a multi-modal mental state assessment method based on multi-angle analysis, and specifically, the present invention is implemented by the following technical solutions:
S1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from the audio file, and taking the time domain waveform points and the Mel frequency cepstrum coefficients as audio features;
sampling the video file at a fixed frequency to obtain a picture sequence, and inputting the picture sequence into a pre-training network to obtain a video coding vector;
extracting the face motion units of the picture sequence by using the OpenFace tool;
taking the video coding vector and the face motion units as video features;
S2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
S3: inputting the depression features, the anxiety features and the stress features into a fusion analysis module for attention feature fusion to obtain fused features;
S4: inputting the fused features into a support vector regression model, and evaluating the mental state of the individual in the audio file and the video file.
Preferably, the specific method for extracting the time domain waveform points from the audio file is as follows:
extracting the audio track from the original MP4 long video file and saving it as a WAV file; extracting the raw waveform points of the WAV audio file and storing them in .mat format;
the specific method for extracting the Mel frequency cepstrum coefficients from the audio file is as follows:
pre-emphasis, framing and windowing are applied to the WAV audio file, followed by a fast Fourier transform to obtain the Fourier spectrum;
the Fourier spectrum is passed through a Mel filter bank, a logarithm is taken, and finally a discrete cosine transform is applied to obtain the Mel frequency cepstrum coefficients;
the Mel frequency cepstrum coefficients are stored in .mat format.
Preferably, the network of the depression analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the depression analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the depression analysis module to obtain the depression features.
Preferably, the loss function applied in training the depression analysis module is the root mean square error between the predicted and true values of the depression degree:

$$RMSE_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_D$ is the root mean square error between the predicted and true depression degree, $\hat{y}_i$ is the predicted value of the depression degree, $y_i$ is the true value of the depression degree, and $n$ is the number of samples;
the criteria for the degree of depression are: normal for a score of 0-9, mild depression for 10-13, moderate depression for 14-20, severe depression for 21-27, and very severe for a score greater than 27.
Preferably, the network of the anxiety analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the anxiety analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the anxiety analysis module to obtain the anxiety features.
Preferably, the loss function applied in training the anxiety analysis module is the root mean square error between the predicted and true values of the anxiety degree:

$$RMSE_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_A$ is the root mean square error between the predicted and true anxiety degree, $\hat{y}_i$ is the predicted value of the anxiety degree, $y_i$ is the true value of the anxiety degree, and $n$ is the number of samples;
the criteria for the degree of anxiety were: normal for scores of 0-7, mild anxiety for scores of 8-9, moderate anxiety for scores of 10-14, severe anxiety for scores of 15-19, and very severe for scores of more than 19.
Preferably, the network of the stress analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the stress analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the stress analysis module to obtain the stress features.
Preferably, the loss function applied in training the stress analysis module is the root mean square error between the predicted and true values of the stress degree:

$$RMSE_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_S$ is the root mean square error between the predicted and true stress degree, $\hat{y}_i$ is the predicted value of the stress degree, $y_i$ is the true value of the stress degree, and $n$ is the number of samples;
the evaluation criteria of the degree of stress were: normal for 0-14 points, mild for 15-18 points, moderate for 19-25 points, severe for 26-33 points, and very severe for more than 33 points.
Preferably, the fusion analysis module performs feature fusion using an attention mechanism.
Preferably, the support vector regression formula is as follows:
$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{m} l_{\epsilon}\left(f(x_i) - y_i\right)$$

wherein $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $l_{\epsilon}$ is the ε-insensitive loss function, $f(x_i)$ is the support vector regression prediction for the fused feature $x_i$, and $y_i$ is the true mental state value of the individual sample in the audio file and the video file;
the specific evaluation criteria for evaluating the mental state of the individual in the audio file and the video file are as follows: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severe for 41-50.
Compared with the prior art, the technical solution provided by the embodiments of the present application has the following advantages:
(1) through the multi-head attention mechanism and multi-modal fusion, the method considers not only the mental state information within each modality but also the dependencies between modalities; fusing the audio and video modalities improves the accuracy of mental state analysis;
(2) the depression, anxiety and stress features of the individual are considered jointly, and multi-task modeling is performed from multiple angles; compared with traditional mental state analysis, information from all aspects of the individual is considered more comprehensively, enhancing the comprehensiveness of the analysis;
(3) compared with traditional direct concatenation fusion, attention fusion accounts for the different importance of each feature and assigns different weights, making better use of the strengths of neural networks and achieving better performance.
Drawings
FIG. 1 is a block diagram of a multi-modal mental state assessment method based on multi-angle analysis according to an embodiment of the present invention;
FIG. 2 is a data flow diagram of the anxiety/depression/stress analysis module of the multi-modal mental state assessment method based on multi-angle analysis according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The method for multi-modal mental state assessment based on multi-angle analysis provided by the embodiment of the application as shown in FIG. 1 comprises the following steps:
S1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from the audio file, and taking them as the audio features;
extracting the audio track from the subject's original MP4 long video file with the FFmpeg tool and saving it as a WAV file; sampling the audio at a 16 kHz sampling rate, extracting the raw waveform points of the WAV audio file, and storing them in .mat format;
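As an illustrative sketch of this extraction step (not part of the claimed method; the file names and the use of SciPy for .mat storage are assumptions), it could be scripted in Python as follows:

```python
import subprocess
from scipy.io import wavfile, savemat

# Extract a mono 16 kHz WAV track from the MP4 interview video (requires FFmpeg).
subprocess.run([
    "ffmpeg", "-i", "subject.mp4",   # hypothetical input path
    "-vn",                           # drop the video stream
    "-ac", "1", "-ar", "16000",      # mono, 16 kHz sampling rate
    "-acodec", "pcm_s16le",          # 16-bit PCM
    "subject.wav",
], check=True)

# Load the raw time-domain waveform points and store them in .mat format.
sample_rate, waveform = wavfile.read("subject.wav")
savemat("subject_waveform.mat", {"waveform": waveform, "fs": sample_rate})
```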
the specific method for extracting the time domain waveform points and the Mel frequency cepstrum coefficients from the audio file comprises the following steps:
pre-emphasizing the audio file in the wav file format in a manner that the frequency spectrum of the signal becomes flat by passing through a first-order finite excitation response high-pass filter;
dividing frames in a mode that 512 sampling point sets are used as an observation unit, namely each frame is 32ms, and the overlapping area between two adjacent frames is 50%;
windowing, namely windowing a frame of voice by adopting a Hamming window so as to reduce the influence of the Gibbs effect;
then, fast Fourier transform is carried out to obtain a Fourier spectrum;
the Fourier spectrum passes through a Mel filter bank, then logarithm operation is carried out, finally discrete pre-transformation is carried out, the first order difference and the second order difference are obtained, and then one bit of energy is added, so that a Mel frequency cepstrum coefficient is obtained;
storing the mel frequency cepstrum coefficient in a mat format;
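A minimal Python sketch of this MFCC pipeline is given below. The frame length (512 samples, 50% overlap) follows the text; the pre-emphasis coefficient of 0.97, the filter-bank size of 26, the 13 retained coefficients, and the use of np.gradient for the differences are assumed typical choices not specified here:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mfcc(signal, fs=16000, frame_len=512, n_mels=26, n_ceps=13):
    # Pre-emphasis: first-order FIR high-pass filter flattens the spectrum.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1]).astype(float)
    # Framing: 512-sample frames (32 ms at 16 kHz) with 50% overlap, Hamming windowed.
    hop = frame_len // 2
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Fast Fourier transform -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2 / frame_len
    # Mel filter bank -> logarithm -> discrete cosine transform.
    fb = librosa.filters.mel(sr=fs, n_fft=frame_len, n_mels=n_mels)
    log_mel = np.log(power @ fb.T + 1e-10)
    ceps = dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]
    # First- and second-order differences, plus one frame-energy dimension.
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    energy = np.log(power.sum(axis=1, keepdims=True) + 1e-10)
    return np.hstack([ceps, d1, d2, energy])
```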
the video file is sampled at 6 frames per second to obtain a picture sequence; the picture sequence is input into a ResNet-50 pre-training network to obtain the video coding vector, which is stored in .mat format;
the face motion units of the picture sequence are extracted with the OpenFace tool and stored in CSV format;
the video coding vector and the face motion units are taken as the video features;
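The video branch can be sketched in the same spirit (a sketch only: the input path, the torchvision preprocessing pipeline and dropping the classification head are assumptions; the 6 frames-per-second rate and ResNet-50 come from the text; the face motion units are produced separately by the OpenFace command-line tool as CSV):

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet-50 pre-trained on ImageNet; replacing the classifier with an identity
# leaves the 2048-dimensional pooled feature as the per-frame video coding vector.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

cap = cv2.VideoCapture("subject.mp4")                 # hypothetical input path
step = max(1, round(cap.get(cv2.CAP_PROP_FPS) / 6))   # keep 6 frames per second
frames, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        frames.append(preprocess(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    idx += 1
cap.release()

with torch.no_grad():
    video_vectors = resnet(torch.stack(frames))       # (n_frames, 2048)
```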
S2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
As shown in FIG. 2, the network of the depression analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the depression analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the depression analysis module to obtain the depression features;
the loss function applied in training the depression analysis module is the root mean square error between the predicted and true values of the depression degree:

$$RMSE_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_D$ is the root mean square error between the predicted and true depression degree, $\hat{y}_i$ is the predicted value of the depression degree, $y_i$ is the true value of the depression degree, and $n$ is the number of samples;
the criteria for the degree of depression are: normal for a score of 0-9, mild depression for 10-13, moderate depression for 14-20, severe depression for 21-27, and very severe for a score greater than 27;
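The same RMSE form is reused by the anxiety and stress modules below; as a sketch (not the authors' code), it is a one-liner in PyTorch:

```python
import torch

def rmse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Root mean square error between predicted and true degree scores."""
    return torch.sqrt(torch.mean((pred - target) ** 2))
```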
As shown in FIG. 2, the network of the anxiety analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the anxiety analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the anxiety analysis module to obtain the anxiety features;
the loss function applied in training the anxiety analysis module is the root mean square error between the predicted and true values of the anxiety degree:

$$RMSE_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_A$ is the root mean square error between the predicted and true anxiety degree, $\hat{y}_i$ is the predicted value of the anxiety degree, $y_i$ is the true value of the anxiety degree, and $n$ is the number of samples;
the evaluation criteria of the anxiety degree are: normal for 0-7, mild anxiety for 8-9, moderate anxiety for 10-14, severe anxiety for 15-19, and very severe for more than 19;
As shown in FIG. 2, the network of the stress analysis module comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the stress analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the stress analysis module to obtain the stress features;
the loss function applied in training the stress analysis module is the root mean square error between the predicted and true values of the stress degree:

$$RMSE_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_S$ is the root mean square error between the predicted and true stress degree, $\hat{y}_i$ is the predicted value of the stress degree, $y_i$ is the true value of the stress degree, and $n$ is the number of samples;
the evaluation criteria of the stress degree are as follows: normal for 0-14 points, mild stress for 15-18 points, moderate stress for 19-25 points, severe stress for 26-33 points, and very severe for more than 33 points;
the specific parameter settings for each module are as follows:
Each analysis module first feeds the audio features and the video features into a gated recurrent unit. The gated recurrent unit is a variant of the long short-term memory network: it captures contextual dependencies, alleviates the long-range dependency and vanishing-gradient problems, and achieves good results with a simpler structure. A multi-head attention mechanism is applied next, with the number of attention heads set to 8, so that feature representations are computed from 8 different angles. Features are also extracted through a convolutional neural network with 512 convolution kernels of size 3×3, which excels at extracting local features. After each of these three operations, one round of activation and data normalization is performed: the activation function is the parametric rectified linear unit (PReLU), which adds nonlinearity, and the normalization is batch normalization, which counteracts data shift and speeds up training. Finally, the audio features and the video features are concatenated and fused through a fully connected neural network with 1024 neurons to form the multi-angle depression, anxiety and stress features. The loss function is the root mean square error between the predicted and true values; after several iterations of training, the depression, anxiety and stress analysis modules can respectively analyze the subject's degree of depression, anxiety and stress. The three trained modules are then placed into the overall model for final training;
the concrete model structure is as follows:
the gated recurrent unit is formulated as:

$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)$$

$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)$$

$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right)$$

$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

wherein $x_t$ is the input feature, $h_{t-1}$ is the hidden-layer output at the previous moment, $h_t$ is the hidden-layer output at the current moment, $W$ and $U$ are weight matrices, and $b$ is a bias; the gated recurrent unit has two gate functions: the reset gate $r_t$ controls to what extent the hidden state of the previous moment enters the current candidate hidden state, and the update gate $z_t$ controls to what extent the hidden state of the previous moment is carried over into the current hidden state;
the multi-head attention mechanism is formulated as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_8\right) W^O$$

$$\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)$$

wherein $Q$, $K$ and $V$ represent the sets of input queries, keys and values, respectively, and the scaled dot-product attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

performing self-attention over the input with the multi-head attention mechanism analyzes the features from multiple angles, enhancing useful features and suppressing useless ones;
the activation function is the parametric rectified linear unit:

$$\mathrm{PReLU}(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases}$$

wherein $x$ is the input and $a$ is a trainable parameter;
data normalization uses batch normalization, with the following formulas:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

wherein $x_i$ is the input sample data, $\mu$ is the sample mean, $\sigma^2$ is the sample variance, and $\hat{x}_i$ is the normalized sample data; batch normalization effectively mitigates the problem of internal covariate shift;
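Putting these components together, one analysis module (depression, anxiety or stress) might look like the following PyTorch sketch. The 8 attention heads, 512 convolution kernels and 1024-neuron fusion layer follow the text; the GRU hidden size, the treatment of the 3×3 convolution as a single-channel 2-D convolution over the (time, feature) map, and the mean-pooling are assumptions:

```python
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """GRU, 8-head self-attention and CNN over one modality, each stream
    followed by PReLU activation and batch normalization."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        # feat_dim must be divisible by the number of heads (8).
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # 512 kernels of size 3x3 over the (time, feature) map as one channel.
        self.conv = nn.Conv2d(1, 512, kernel_size=3, padding=1)
        self.act = nn.PReLU()
        self.norm = nn.BatchNorm1d(hidden + feat_dim + 512)

    def forward(self, x):                   # x: (batch, time, feat_dim)
        g, _ = self.gru(x)                  # (batch, time, hidden)
        a, _ = self.attn(x, x, x)           # self-attention: Q = K = V = x
        c = self.conv(x.unsqueeze(1))       # (batch, 512, time, feat_dim)
        # Pool each stream to a fixed-size vector and concatenate.
        z = torch.cat([g.mean(1), a.mean(1), c.mean(dim=(2, 3))], dim=-1)
        return self.norm(self.act(z))

class AnalysisModule(nn.Module):
    """One multi-angle analysis module: audio and video branches fused by a
    1024-neuron fully connected layer; the fused vector is the task feature."""
    def __init__(self, audio_dim: int, video_dim: int, hidden: int = 256):
        super().__init__()
        self.audio = ModalityBranch(audio_dim, hidden)
        self.video = ModalityBranch(video_dim, hidden)
        self.fuse = nn.Linear(2 * (hidden + 512) + audio_dim + video_dim, 1024)
        self.head = nn.Linear(1024, 1)      # predicts the degree score

    def forward(self, audio, video):
        feat = self.fuse(torch.cat([self.audio(audio), self.video(video)], dim=-1))
        return feat, self.head(feat).squeeze(-1)
```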
S3: inputting the depression features, the anxiety features and the stress features into a fusion analysis module for attention feature fusion to obtain the fused features; the fusion analysis module performs feature fusion with an attention mechanism;
the fusion analysis module performs feature fusion by adopting an attention mechanism, and the formula is as follows:
Figure 97052DEST_PATH_IMAGE031
Figure 910287DEST_PATH_IMAGE032
Figure 142685DEST_PATH_IMAGE033
Figure 383174DEST_PATH_IMAGE034
wherein
Figure 333812DEST_PATH_IMAGE035
Is a fused intermediate feature obtained by linear transformation and splicing and is used for subsequently calculating the attention weight of each feature,
Figure 200006DEST_PATH_IMAGE036
Figure 552490DEST_PATH_IMAGE037
Figure 229459DEST_PATH_IMAGE038
are respectively the first
Figure 401814DEST_PATH_IMAGE040
Anxiety, depression and stress characteristics of the sequences,
Figure 556852DEST_PATH_IMAGE041
Figure 763843DEST_PATH_IMAGE042
Figure 611713DEST_PATH_IMAGE043
Figure 271364DEST_PATH_IMAGE044
are a matrix of trainable parameters that are,
Figure 230093DEST_PATH_IMAGE045
is a vector of parameters that can be trained,
Figure 291590DEST_PATH_IMAGE046
attention weights for anxiety features, depression features and stress features,
Figure 575941DEST_PATH_IMAGE047
is a fusion feature obtained through attention calculation;
because the subject's anxiety, depression and stress states contribute differently to the mental state assessment, the anxiety, depression and stress features are fused by an attention mechanism, so that the model automatically learns the feature weights, emphasizing features with large contributions and suppressing useless ones;
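Under the formulation above, the fusion step might be sketched as follows (dimension choices are assumptions):

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Learns attention weights over the anxiety, depression and stress
    features and returns their weighted sum as the fused feature."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])  # W_A, W_D, W_S
        self.mix = nn.Linear(3 * dim, dim)      # W_h over the spliced features
        self.u = nn.Linear(dim, 1, bias=False)  # trainable scoring vector u

    def forward(self, f_a, f_d, f_s):           # each: (batch, dim)
        feats = [f_a, f_d, f_s]
        # Intermediate feature h: linear transformation of the spliced features.
        h = self.mix(torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=-1))
        # Scores e_k = u^T tanh(h + f_k), then softmax over the three features.
        scores = torch.cat([self.u(torch.tanh(h + f)) for f in feats], dim=-1)
        alpha = torch.softmax(scores, dim=-1)    # (batch, 3) attention weights
        fused = sum(alpha[:, k:k + 1] * feats[k] for k in range(3))
        return fused, alpha
```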
S4: inputting the fused features into a support vector regression model to evaluate the mental state of the individual in the audio file and the video file;
the support vector regression formula is as follows:
$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{m} l_{\epsilon}\left(f(x_i) - y_i\right)$$

wherein $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $l_{\epsilon}$ is the ε-insensitive loss function, $f(x_i)$ is the support vector regression prediction for the fused feature $x_i$, and $y_i$ is the true mental state value of the individual sample in the audio file and the video file;
the specific evaluation criteria for evaluating the mental state of the individual in the audio file and the video file are as follows: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severe for 41-50.
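The final regression stage could be realized with scikit-learn's SVR, as in the sketch below; the hyperparameters and the placeholder training data are assumptions, and in practice the attention-fused features and labelled overall mental state scores would be used:

```python
import numpy as np
from sklearn.svm import SVR

# Placeholder stand-ins for the fused features and labelled scores.
rng = np.random.default_rng(0)
fused_train = rng.normal(size=(64, 1024))
scores_train = rng.uniform(0, 50, size=64)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # assumed hyperparameters
svr.fit(fused_train, scores_train)
pred = float(svr.predict(fused_train[:1])[0])

# Map the regressed score onto the evaluation bands given above.
bands = [(10, "normal"), (20, "mildly impaired"), (30, "moderately impaired"),
         (40, "severely impaired"), (50, "very severe")]
label = next(name for upper, name in bands if min(pred, 50.0) <= upper)
print(round(pred, 1), label)
```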
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms; these terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for multi-modal mental state assessment based on multi-angle analysis, the method comprising:
s1: collecting audio files and video files from an original video, and carrying out data preprocessing on the audio files and the video files:
extracting time domain waveform points and Mel frequency cepstrum coefficients from an audio file, and taking the time domain waveform points and the Mel frequency cepstrum coefficients as audio features;
sampling the video file according to a certain frequency to obtain a picture sequence, and inputting the picture sequence into a pre-training network to obtain a video coding vector;
extracting the face motion units of the picture sequence by using the OpenFace tool;
taking the video coding vector and the face motion units as video features;
s2: respectively inputting the audio features and the video features into a depression analysis module, an anxiety analysis module and a stress analysis module for multi-angle analysis to obtain depression features, anxiety features and stress features;
s3: inputting the depression characteristic, the anxiety characteristic and the pressure characteristic into a fusion analysis module for attention characteristic fusion to obtain a fusion characteristic;
s4: inputting the fusion features into a support vector regression, and evaluating the mental state of the individual in the audio file and the video file.
2. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the specific method for extracting time domain waveform points from audio files is as follows:
extracting an audio file from an original MP4 long video file, and saving the audio file in a wav file format; extracting original waveform points of the audio file in the wav file format, and storing the original waveform points in the mat format;
the specific method for extracting the time domain waveform points and the Mel frequency cepstrum coefficients from the audio file comprises the following steps:
pre-emphasis, framing and windowing are carried out on the audio file in the wav file format, and then fast Fourier transform is carried out to obtain a Fourier spectrum;
the Fourier spectrum passes through a Mel filter bank, then logarithm operation is carried out, and finally discrete cosine transformation is carried out to obtain a Mel frequency cepstrum coefficient;
and storing the Mel frequency cepstrum coefficient in a mat format.
3. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said network of depression analysis modules comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the depression analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the depression analysis module to obtain the depression features.
4. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 3, wherein the loss function applied in training the depression analysis module is the root mean square error between the predicted and true values of the depression degree:

$$RMSE_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_D$ is the root mean square error between the predicted and true depression degree, $\hat{y}_i$ is the predicted value of the depression degree, $y_i$ is the true value of the depression degree, and $n$ is the number of samples;
the criteria for the degree of depression were: normal for a score of 0-9, mild depression for a score of 10-13, moderate depression for a score of 14-20, severe depression for a score of 21-27, and very severe for a score of greater than 27.
5. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the network of anxiety analysis modules comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the anxiety analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the anxiety analysis module to obtain the anxiety features.
6. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 5, wherein the loss function applied in training the anxiety analysis module is the root mean square error between the predicted and true values of the anxiety degree:

$$RMSE_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_A$ is the root mean square error between the predicted and true anxiety degree, $\hat{y}_i$ is the predicted value of the anxiety degree, $y_i$ is the true value of the anxiety degree, and $n$ is the number of samples;
the evaluation criteria of the anxiety degree are: normal for scores of 0-7, mild anxiety for scores of 8-9, moderate anxiety for scores of 10-14, severe anxiety for scores of 15-19, and very severe for scores of more than 19.
7. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said network of stress analysis modules comprises:
a gated recurrent unit, a multi-head attention mechanism, an activation function, data normalization, a convolutional neural network, and multi-modal feature fusion; the video features and the audio features are each input into the gated recurrent unit, the multi-head attention mechanism, and the convolutional neural network of the stress analysis module; the outputs of these three components each undergo one round of activation-function activation and data normalization, and the normalized outputs are input into the multi-modal feature fusion of the stress analysis module to obtain the stress features.
8. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 7, wherein the loss function applied in training the stress analysis module is the root mean square error between the predicted and true values of the stress degree:

$$RMSE_S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

wherein $RMSE_S$ is the root mean square error between the predicted and true stress degree, $\hat{y}_i$ is the predicted value of the stress degree, $y_i$ is the true value of the stress degree, and $n$ is the number of samples;
the evaluation standard of the pressure degree is as follows: normal for 0-14 points, mild for 15-18 points, moderate for 19-25 points, severe for 26-33 points, and very severe for more than 33 points.
9. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein said fusion analysis module employs an attention mechanism for feature fusion.
10. The method for multi-modal mental state assessment based on multi-angle analysis according to claim 1, wherein the support vector regression formula is as follows:
$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{m} l_{\epsilon}\left(f(x_i) - y_i\right)$$

wherein $w$ and $b$ are the model parameters to be learned, $C$ is a regularization constant, $m$ is the number of samples, $l_{\epsilon}$ is the ε-insensitive loss function, $f(x_i)$ is the support vector regression prediction for the fused feature $x_i$, and $y_i$ is the true mental state value of the individual sample in the audio file and the video file;
the specific evaluation criteria for evaluating the mental state of the individual in the audio file and the video file are as follows: normal for a score of 0-10, mildly impaired for 11-20, moderately impaired for 21-30, severely impaired for 31-40, and very severe for 41-50.
CN202110732115.XA (filed 2021-06-30): Multi-modal mental state assessment method based on multi-angle analysis; active, granted as CN113274023B.

Priority Applications (1)

CN202110732115.XA (priority date 2021-06-30, filing date 2021-06-30): Multi-modal mental state assessment method based on multi-angle analysis; granted as CN113274023B.

Applications Claiming Priority (1)

CN202110732115.XA (priority date 2021-06-30, filing date 2021-06-30): Multi-modal mental state assessment method based on multi-angle analysis; granted as CN113274023B.

Publications (2)

CN113274023A: published 2021-08-20
CN113274023B: granted publication 2021-12-14

Family

ID=77286269

Family Applications (1)

CN202110732115.XA (priority date 2021-06-30, filing date 2021-06-30): Multi-modal mental state assessment method based on multi-angle analysis; active, granted as CN113274023B.

Country Status (1)

CN: CN113274023B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170245759A1 (en) * 2016-02-25 2017-08-31 Samsung Electronics Co., Ltd. Image-analysis for assessing heart failure
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN111225612A (en) * 2017-10-17 2020-06-02 萨蒂什·拉奥 Neural obstacle identification and monitoring system based on machine learning
US20200118458A1 (en) * 2018-06-19 2020-04-16 Ellipsis Health, Inc. Systems and methods for mental health assessment
CN111951824A (en) * 2020-08-14 2020-11-17 苏州国岭技研智能科技有限公司 Detection method for distinguishing depression based on sound
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115064246A (en) * 2022-08-18 2022-09-16 山东第一医科大学附属省立医院(山东省立医院) Depression evaluation system and equipment based on multi-mode information fusion
CN115064246B (en) * 2022-08-18 2022-12-20 山东第一医科大学附属省立医院(山东省立医院) Depression evaluation system and equipment based on multi-mode information fusion
CN115910329A (en) * 2023-01-06 2023-04-04 江苏瑞康成医疗科技有限公司 Intelligent depression identification method and device
CN116661607A (en) * 2023-07-24 2023-08-29 北京智精灵科技有限公司 Emotion adjustment method and system based on multi-modal emotion interaction

Also Published As

CN113274023B (en): 2021-12-14

Similar Documents

Publication Publication Date Title
CN113274023B (en) Multi-modal mental state assessment method based on multi-angle analysis
CN110556129B (en) Bimodal emotion recognition model training method and bimodal emotion recognition method
Liu et al. GMM and CNN hybrid method for short utterance speaker recognition
Brady et al. Multi-modal audio, video and physiological sensor learning for continuous emotion prediction
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN109859772B (en) Emotion recognition method, emotion recognition device and computer-readable storage medium
CN112006697B (en) Voice signal-based gradient lifting decision tree depression degree recognition system
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
Zhao et al. Hybrid network feature extraction for depression assessment from speech
Lai Contrastive predictive coding based feature for automatic speaker verification
CN107430678A (en) Use the inexpensive face recognition of Gauss received field feature
CN112669820B (en) Examination cheating recognition method and device based on voice recognition and computer equipment
Sefara The effects of normalisation methods on speech emotion recognition
Wang et al. Speaker recognition using convolutional neural network with minimal training data for smart home solutions
Sabatier et al. Measurement of the impact of identical twin voices on automatic speaker recognition
Chen et al. Cough detection using selected informative features from audio signals
Deepa et al. Speech technology in healthcare
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
JP7014761B2 (en) Cognitive function estimation method, computer program and cognitive function estimation device
Baldwin et al. Beyond speech: Generalizing d-vectors for biometric verification
Yadav et al. Portable neurological disease assessment using temporal analysis of speech
CN112863486A (en) Voice-based spoken language evaluation method and device and electronic equipment
Dua et al. Speaker recognition using noise robust features and LSTM-RNN
Siagian et al. Footstep Recognition Using Mel Frequency Cepstral Coefficients and Artificial Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant