CN116230234A - Multi-modal feature consistency mental health abnormality identification method and system - Google Patents


Info

Publication number
CN116230234A
Authority
CN
China
Prior art keywords
video
audio
micro
feature
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310265823.6A
Other languages
Chinese (zh)
Inventor
李泽
付志刚
许铮铧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology
Priority to CN202310265823.6A
Publication of CN116230234A
Legal status: Pending

Classifications

    • A61B 5/165: Evaluating the state of mind, e.g. depression, anxiety (devices for evaluating the psychological state)
    • A61B 5/4803: Speech analysis specially adapted for diagnostic purposes
    • A61B 5/7267: Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems, involving training the classification device
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/168: Human faces; feature extraction; face representation
    • G06V 40/176: Facial expression recognition; dynamic expression
    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G16H 50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for calculating health indices or for individual health risk assessment
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Molecular Biology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Veterinary Medicine (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Educational Technology (AREA)
  • Developmental Disabilities (AREA)
  • Mathematical Physics (AREA)
  • Physiology (AREA)

Abstract

The invention relates to a multi-modal feature consistency mental health abnormality identification method and system, comprising the following contents: acquiring raw data files containing two modalities, an audio file and a video file, recorded in the same scene; preprocessing the video data and the audio data in the obtained raw data files; normalizing the micro-expression degree score of each frame in each continuous frame sequence to obtain a micro-expression key frame vector; feeding the frequency features extracted from the audio data into an audio-stream depth feature extraction network to obtain the depth speech feature F_A and the audio feature prediction result y_A; simultaneously feeding the continuous frame sequences into a video-stream depth feature extraction network to obtain the depth video feature F_V and the video feature prediction result y_V; and then fusing F_A and F_V and feeding the fused feature into a mental health classification network for the final prediction. The invention effectively uses facial micro-expression features and speech features for mental health abnormality identification.

Description

Multi-modal feature consistency mental health abnormality identification method and system
Technical Field
The invention relates to the technical field of mental health abnormality recognition, and in particular to a multi-modal feature consistency mental health abnormality identification method and system based on micro-expression amplification and speech features.
Background
With the continuous development of the economy and society, people's mental health has gradually received widespread attention. It is therefore necessary to study methods for identifying mental health abnormalities, both to improve doctors' diagnostic efficiency and to provide a relatively objective reference for their diagnoses.
Micro-expressions are involuntary movements of the facial muscles produced when a person tries to mask an inner emotion; they can be neither faked nor suppressed. Compared with deliberately displayed expressions, micro-expressions reflect a person's true feelings and motivations, so micro-expression features are often used as one of the important objective indicators for automatic depression detection. Similarly to facial micro-expression features, speech features reflect an individual's emotional changes and abnormal psychological states and objectively and reliably reflect the speaker's true psychological state, so both indicators have important reference value for analyzing psychological disorders. However, micro-expressions are very subtle, short-lived and difficult to capture, which makes video-based mental health research difficult.
In recent years, research on identifying abnormal psychological states from audio-video multi-modal information has appeared in the field of artificial intelligence and machine learning, with automatic depression detection being the most widely studied application. Most existing research that uses audio-video multi-modal information to identify mental health abnormalities does not consider micro-expression features, which are important for reflecting psychology and emotion; and most existing micro-expression research is single-modal and does not combine speech features for mental health analysis. Moreover, existing work processes the multi-modal features in a relatively simple way and does not consider the consistency between different modal features of the same sample. For example, Chinese patent CN 112560811B proposes an end-to-end automatic depression detection method based on audio and video. First, the collected raw data containing the two modalities of a long audio file and a long video file are cut and preprocessed; second, the audio segments and video segments are fed into an audio feature extraction network and a video feature extraction network, respectively, to obtain audio depth features and video depth features; the depth speech features and depth video features are then processed with a multi-head attention mechanism to obtain attention audio features and attention video features; finally, the attention audio features and attention video features are aggregated into audio-video features by a feature aggregation module and fed into a decision network to predict the individual's depression level. That method takes audio and video data as input but does not consider the micro-expression features in the video data, so micro-expression information cannot be effectively captured, and it also lacks consideration of the feature consistency problem between the two modalities, the audio features and the video features.
For these reasons, the invention provides a multi-modal feature consistency mental health abnormality identification method and system based on micro-expressions and speech features.
Disclosure of Invention
Aiming at the defects of the prior art, the technical problem to be solved by the invention is to provide a multi-modal feature consistency mental health abnormality identification method and system based on micro-expressions and speech features that can better handle the mental health abnormality identification task.
The technical scheme adopted for solving the technical problems is as follows:
in a first aspect, the present invention provides a multi-modal feature consistency mental health abnormality identification method, the method comprising:
acquiring raw data files containing two modalities, namely an audio file and a video file, recorded in the same scene, wherein the raw data files contain normal samples and abnormal samples, and the abnormal samples carry disease-type labels;
performing micro-expression amplification processing on the video data containing complete facial images in the obtained raw data file, scoring each video frame according to the amplitude of the micro-expression action in that frame's facial image, recording the micro-expression degree score of each frame, and registering and aligning the facial images, thereby completing the micro-expression amplification processing of the video data; then sampling the amplified video data into continuous frame sequences at a set sampling interval to obtain several continuous frame sequences of the same dimension, thereby completing the preprocessing of the video data;
Normalizing the micro-expression degree score of each frame in the continuous frame sequence to obtain a micro-expression key frame vector;
performing noise reduction and cleaning on the audio data in the obtained raw data file, and extracting the frequency features of the audio data;
feeding the frequency features extracted from the audio data into an audio-stream depth feature extraction network to extract audio depth features, obtaining the depth speech feature F_A; at the same time, performing speech result prediction on the speech information inside the audio-stream depth feature extraction network to obtain the audio feature prediction result y_A;
building a video-stream depth feature extraction network, which comprises a C3D ResNet50 with its pooling layers removed, a spatial attention module, a channel attention module, a temporal attention module and a classification network connected in sequence; the input of the temporal attention module is the product of the output of the channel attention module and the micro-expression key frame vector, and the output of the temporal attention module is averaged over the frame-sequence dimension to obtain the depth video feature F_V;
simultaneously feeding the several same-dimension continuous frame sequences obtained from the video data into the video-stream depth feature extraction network to extract video depth features, obtaining the depth video feature F_V; performing video result prediction on the video information of the continuous frame sequences inside the video-stream depth feature extraction network, and obtaining the video feature prediction result y_V through the classification network;
Carrying out feature fusion on the depth video features and the depth voice features by using a tensor fusion network to obtain audio-video multi-mode fusion features; sending the audio and video multi-mode fusion characteristics into a mental health classification network for final prediction;
calculating the audio-video multi-modal consistency loss L_MC according to formula (2):

L_MC = -(1/n) · Σ_{i=1}^{n} (1 + λ·h(y_V, y_A)) · Σ_{c=1}^{M} y_{i,c} · log(p_{i,c})        (2)

where n is the number of samples, M is the number of label categories, y_i denotes the label of the i-th sample (y_{i,c} = 1 when y_i = c and 0 otherwise), p_{i,c} denotes the model's predicted probability that the i-th sample belongs to category c, and λ is the penalty coefficient; the function h(y_V, y_A) increases the penalty of the loss function when the audio feature prediction result and the video feature prediction result are inconsistent, the penalty function being defined as

h(y_V, y_A) = 1 if y_V ≠ y_A, and h(y_V, y_A) = 0 if y_V = y_A        (3)
Further, λ is set in the range 1.0 to 1.6, preferably 1.55.
Further, the audio-stream depth feature extraction network comprises a pre-trained 2D ResNet18 network, a temporal attention mechanism and a fully connected layer; the output of the 2D ResNet18 network is connected to the temporal attention mechanism, and the output of the temporal attention mechanism is connected to the fully connected layer.
The temporal attention mechanism comprises one one-dimensional convolution layer, one fully connected layer and one softmax layer; it compresses the two dimensions of space and channel and extracts features along the time dimension to obtain the depth speech feature.
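For illustration, a minimal PyTorch sketch of an audio-stream branch of this kind (a pre-trained 2D ResNet18 backbone, a temporal attention block built from one 1-D convolution, one fully connected layer and a softmax, and a final fully connected classifier). The ImageNet weights, the 1-channel MFCC input, the 512-d feature width and the two-class output are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TemporalAttention(nn.Module):
    """1-D conv + fully connected + softmax over the time axis, as described above."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.fc = nn.Linear(channels, 1)

    def forward(self, x):                                   # x: (batch, time, channels)
        w = self.conv(x.transpose(1, 2)).transpose(1, 2)    # (batch, time, channels)
        w = torch.softmax(self.fc(w), dim=1)                # (batch, time, 1) attention weights
        return (w * x).sum(dim=1)                           # (batch, channels): temporal feature F_A

class AudioBranch(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        # 1-channel input for an MFCC "image" (assumption).
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
        self.temporal_attention = TemporalAttention(512)
        self.fc = nn.Linear(512, num_classes)                # W_4 in the text

    def forward(self, mfcc):                                 # mfcc: (batch, 1, n_mfcc, n_frames)
        f = self.features(mfcc)                              # (batch, 512, h, w)
        f = f.mean(dim=2).transpose(1, 2)                    # collapse frequency axis -> (batch, time, 512)
        f_a = self.temporal_attention(f)                     # depth speech feature F_A
        return f_a, self.fc(f_a)                             # F_A and audio prediction y_A

# Example: F_A and y_A for a batch of 3 MFCC maps with 40 coefficients over 200 frames.
f_a, y_a = AudioBranch()(torch.randn(3, 1, 40, 200))
```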
Further, the backbone of the video-stream depth feature extraction network is a pre-trained C3D ResNet50; features are extracted by the backbone and fed in turn into the spatial attention module and the channel attention module, which integrate information and assign weights along the spatial and channel dimensions; the feature vector F_SC produced by the channel attention module is then multiplied by the micro-expression key frame vector F_me; the result is fed into the temporal attention module, which integrates time-dimension information at the frame-sequence level;
each residual block of the C3D ResNet50 is composed of three three-dimensional convolutions in series: the first three-dimensional convolution has a kernel size of 1 x 1 and a stride of 1, the second has a kernel size of 3 x 3 and a stride of 2, and the third has a kernel size of 1 x 1 and a stride of 1; each three-dimensional convolution is followed by three-dimensional batch normalization and a ReLU activation function.
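A minimal PyTorch sketch of a residual block of this kind (three serial 3-D convolutions, each followed by 3-D batch normalization and ReLU, with no pooling layer). Treating the stated 1 x 1 and 3 x 3 kernels as full 1 x 1 x 1 and 3 x 3 x 3 kernels in three dimensions, and the projection shortcut, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """1x1x1 (stride 1) -> 3x3x3 (stride 2) -> 1x1x1 (stride 1), each followed by BatchNorm3d + ReLU."""
    def __init__(self, in_channels: int, mid_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_channels, mid_channels, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, out_channels, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm3d(out_channels),
        )
        # Projection shortcut so the residual addition matches the strided, widened output (assumed).
        self.shortcut = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=1, stride=2, bias=False),
            nn.BatchNorm3d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (batch, channels, frames, height, width)
        return self.relu(self.body(x) + self.shortcut(x))

# Example: a clip tensor of 30 frames at 64 x 64 spatial resolution with 64 channels.
out = Bottleneck3D(64, 64, 256)(torch.randn(1, 64, 30, 64, 64))
```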
Further, the micro-expression amplification processing amplifies the micro-expressions with the Eulerian motion magnification algorithm, and the scoring process is as follows: a key frame detection algorithm finds the frame with the largest micro-expression change in the amplified video, and all frames of the video are then scored according to their degree of micro-expression change;
the continuous frame sequences are obtained by dividing all frames of the whole video equally into 10 segments and extracting 30 consecutive frames from each segment, giving 10 continuous frame sequences of 30 frames each with the same dimensions.
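A small sketch of the sampling rule just described (10 equal segments, 30 consecutive frames from each). Taking the 30 frames from the start of each segment, and the assumption that every segment contains at least 30 frames, are illustrative choices.

```python
import numpy as np

def sample_frame_sequences(num_frames: int, num_segments: int = 10, seq_len: int = 30):
    """Split the frame indices of a video into equal segments and take `seq_len`
    consecutive frames from each, giving `num_segments` sequences of `seq_len` frames."""
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    sequences = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        first = min(start, max(end - seq_len, 0))   # keep the 30-frame window inside the segment
        sequences.append(list(range(first, first + seq_len)))
    return sequences                                # 10 lists of 30 frame indices

# Example: a 100-second video at 60 fps has 6000 frames.
seqs = sample_frame_sequences(6000)
assert len(seqs) == 10 and all(len(s) == 30 for s in seqs)
```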
Furthermore, the multi-modal feature consistency mental health abnormality identification method is integrated into a mobile phone applet or an application program.
In a second aspect, the invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method.
In a third aspect, the present invention provides a multi-modal feature consistency mental health abnormality identification system, comprising:
the man-machine interaction module, used for recording the subjects' micro-expression changes and speech feature information and obtaining the audio and video data;
the video data preprocessing module is used for extracting micro expression data in the video;
the micro-expression key frame vector acquisition module, used for normalizing the micro-expression degree score of each frame to obtain a weighted micro-expression key frame vector; the audio data preprocessing module, used for acquiring the frequency features of the audio data;
the mental health abnormality recognition model, which comprises a video-stream depth feature extraction network, an audio-stream depth feature extraction network, a tensor fusion network and a mental health classification network, wherein
the video-stream depth feature extraction network is used for obtaining the depth video feature F_V and the video feature prediction result y_V;
the audio-stream depth feature extraction network is used for obtaining the depth speech feature F_A and the audio feature prediction result y_A;
the tensor fusion network is used for fusing the depth video features and the depth speech features to obtain the audio-video multi-modal fusion features;
the mental health classification network is used for classifying the audio-video multi-modal fusion features;
and the mental health abnormality recognition model uses the multi-modal consistency loss L_MC to constrain the training of the video feature prediction result y_V and the audio feature prediction result y_A.
Further, the process of the man-machine interaction module is as follows: each step is presented automatically by the computer; the subject's information is entered in advance through an identity card reader, the subject's identity information is desensitized in the program, and a start button is then clicked to enter the interface that interacts with the subject. The man-machine interaction session comprises three links: watching a video, reading text, and answering questions. In the "watching a video" link, the computer plays a short emotion-inducing video for the subject; the purpose is to make the subject react to an external stimulus so that the facial micro-expression features can be captured more clearly and intuitively. In the "reading text" link, the computer displays a passage of text for the subject to read aloud; the purpose is to make the subject produce speech so that the frequency and texture characteristics of the voice can be captured clearly. In the "answering questions" link, the computer displays a question for the subject to think about and answer; in this link both visual information and speech information are captured. The whole man-machine interaction process is recorded with a head-mounted microphone and a camera at 1080 x 640 resolution and a 60 fps frame rate; the "watching a video" and "reading text" links each last 30 seconds, the "answering questions" link lasts 40 seconds, and the total length of the video and audio is 100 seconds.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention effectively uses facial micro-expression features and speech features for mental health abnormality identification. Facial micro-expressions and speech features are important indicators of a person's emotions and psychology, and research shows that multi-modal mental health abnormality identification generally achieves better accuracy and detection performance than single-modal identification; the invention therefore considers the micro-expression and speech modalities together, so that the model takes the multi-modal feature information into account and its performance is enhanced. Micro-expression features are an important reference indicator; the visual branch of the invention emphasizes them by amplifying and enhancing the micro-expression features contained in the video data, and the results show that the model performs markedly better with micro-expression-amplified video data than with ordinary, non-amplified video data.
2. An end-to-end deep neural network automatically learns deep features that help identify mental health abnormalities, avoiding time-consuming and labor-intensive manual feature annotation and manual feature extraction. The results show that the model performs well and that the automatically extracted features handle the mental health abnormality identification task well; the method can serve as a preliminary screening basis, provide a relatively objective reference indicator for the clinical diagnosis of psychologists, and be integrated into a mobile phone applet or application program for patients' self-checking and self-testing.
3. The invention considers the consistency of the multi-modal feature information: for the same sample, the prediction from its speech features and the prediction from its micro-expression video features should be consistent. A loss function is constructed from this audio-video multi-modal consistency to train the audio and video feature extraction networks, the fusion network and the mental health classification network, so that the feature information of the two modalities is used more fully and the exchange between them is increased. Compared with the traditional cross-entropy classification loss, this takes the consistency of the multi-modal prediction results into account instead of simply fusing the features. Ideally, the final prediction for a sample should agree with its label whether it is made from the video or from the audio; if the video-stream network predicts a sample as normal while the audio-stream network predicts it as abnormal, the predictions of the two modalities are inconsistent, which means the network needs further training, the model's performance needs further improvement, and the penalty of the loss function should be increased. In this way the multi-modal information is better integrated and the classification performance and expressive capability of the model are effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an overall flowchart of a multi-modal mental health anomaly recognition method based on micro-expressions and voice features according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a video stream depth feature extraction network according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present invention will be described in detail with reference to the embodiments and the accompanying drawings, but the scope of the present application is not limited thereto.
The present application applies micro-expression features, performs micro-expression amplification and feature enhancement, combines audio for multi-modal mental health abnormality identification, and considers the problem of multi-modal feature consistency during training.
The invention provides a multi-modal feature consistency mental health abnormality identification method (hereinafter the method), which uses micro-expression and speech features as the basis for judgment and comprises the following steps:
Step one: original audio and video data acquisition:
designing a man-machine interaction program, and automatically acquiring data to obtain original data files of two modes, namely an audio file and a video file in the same scene;
step two: audio and video data preprocessing:
for video data: first, the acquired video data containing complete facial images is region-cropped and re-encoded into a video containing only the head region, to facilitate the subsequent micro-expression amplification; then, because micro-expressions are subtle and difficult to capture, a micro-expression motion amplification method is applied to the head-region video, specifically the Eulerian motion magnification algorithm; next, a key frame detection algorithm is used to find the frame with the largest micro-expression change in the amplified video, all frames of the video are scored according to their degree of micro-expression change, and the facial images are registered and aligned, completing the micro-expression amplification processing of the video data; finally, each micro-expression-amplified video is sampled at a fixed sampling interval to obtain several continuous frame sequences of the same dimension; this completes the preprocessing of the video data;
for audio data: noise reduction and cleaning are applied to the collected audio data; first, a speech separation model for the cocktail-party problem is used to separate the speech from the noisy audio; then the spectral characteristics of the audio are inspected with the fast Fourier transform, noise reduction is performed with filters of different pass bands and stop bands, and clear audio data is obtained after amplification, completing the noise reduction and cleaning of the audio data; the Mel-frequency cepstral coefficients (MFCCs) of the denoised audio are then computed to extract the MFCC frequency features of the audio data;
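A sketch of an audio preprocessing chain of this kind (band-pass filtering followed by MFCC extraction), assuming the librosa and scipy libraries. The filter order, cut-off frequencies, sampling rate and number of MFCC coefficients are illustrative assumptions, and the cocktail-party speech-separation step is omitted.

```python
import librosa
import numpy as np
from scipy.signal import butter, sosfilt

def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 40):
    """Band-pass filter the recording and compute its Mel-frequency cepstral coefficients."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Noise reduction with a simple band-pass filter over the speech band (illustrative values).
    sos = butter(N=6, Wn=[80, 4000], btype="bandpass", fs=sr, output="sos")
    y = sosfilt(sos, y)
    # Amplitude normalization as a stand-in for the amplification step.
    y = y / (np.max(np.abs(y)) + 1e-8)
    # MFCC frequency features fed to the audio-stream network.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)

# mfcc = extract_mfcc("sample_001.wav")   # hypothetical file name
```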
step three: audio depth feature extraction:
the MFCC frequency features obtained from the audio data in step two are fed into the audio-stream depth feature extraction network to extract audio depth features and obtain the depth speech feature; specifically:
Audio-stream depth feature extraction network: the backbone of the audio-stream depth feature extraction network is a pre-trained 2D ResNet18 with an integrated temporal attention mechanism. The MFCC frequency features of the audio data are passed through the 2D ResNet18 for deep feature extraction; the resulting feature vector then enters the temporal attention mechanism, which compresses the spatial and channel dimensions and extracts features along the time dimension. The last layer of the audio-stream depth feature extraction network is a fully connected layer that performs audio single-modal mental health abnormality identification to obtain the audio feature prediction result y_A. After the audio data has been processed by the 2D ResNet18 and the temporal attention mechanism, the temporal feature vector F_A is obtained; this is the depth speech feature. The audio feature prediction result y_A obtained through the fully connected layer can be expressed as

y_A = W_4 · F_A        (4)

where W_4 denotes the learnable parameters of the fully connected layer;
step four: video depth feature extraction:
the several same-dimension continuous frame sequences obtained from the video data in step two are fed into the video-stream depth feature extraction network for video depth feature extraction, obtaining the depth video features;
Video-stream depth feature extraction network: the backbone of the video-stream depth feature extraction network is a pre-trained C3D ResNet50. After feature extraction by the backbone, the resulting feature vector is fed in turn into the spatial attention module and the channel attention module, which integrate information and assign weights along the spatial and channel dimensions. The resulting feature vector is then multiplied by the micro-expression key frame vector, which is obtained from the key frame detection algorithm in step two, specifically: the key frame detection algorithm finds the frame with the largest micro-expression change, all frames of the video are scored according to their degree of micro-expression change, the micro-expression degree score of each frame is normalized, and the micro-expression key frame vector is constructed from the normalized result; its dimension is the same as that of the continuous frame sequence. Using this vector as a weight and multiplying it with the output of the channel attention module addresses the facts that micro-expressions last only a short time and are hard to capture and locate in a long video, which otherwise makes them difficult for the model to recognize. The features containing the micro-expression identification information are then fed into the temporal attention module, which integrates time-dimension information at the frame-sequence level. The temporal, spatial and channel attention mechanisms can all be expressed as

Attention_k = Softmax(W_2 (W_1 F_k))        (5)

where k ∈ {S, C, T} denotes the type of attention, with S denoting spatial attention, C channel attention and T temporal attention; F_k denotes the feature vector fed into the attention module, W_1 denotes a one-dimensional convolution matrix, W_2 denotes a fully connected matrix, and W_1 and W_2 are learnable parameters.

Specifically, the continuous frame sequence first passes through the C3D ResNet50 for depth feature extraction, giving the video feature vector F; it then passes through the spatial attention module, the channel attention module and the temporal attention module in turn (the three attention modules are connected in series through full connections to form the whole network), which compress the other two dimensions and extract and integrate features along the spatial, channel and time dimensions, respectively. First, the spatial attention module yields the spatial feature vector F_S, and the channel attention module then yields the spatial-channel feature vector F_SC:

F_S = Attention_S ⊗ F        (6)
F_SC = Attention_C ⊗ F_S        (7)

where the symbol ⊗ denotes the vector product. The processed spatial-channel feature vector F_SC is then multiplied by the micro-expression key frame vector F_me to obtain the spatial-channel feature vector F_SCm fused with the micro-expression identification information:

F_SCm = F_SC ⊗ F_me        (8)

It is then fed into the temporal attention module to obtain the spatial-channel-temporal feature vector F_SCmT:

F_SCmT = Attention_T ⊗ F_SCm        (9)

The last dimension of the resulting spatial-channel-temporal feature vector F_SCmT is averaged to obtain the depth video feature F_V. Finally, a fully connected layer at the end of the video-stream depth feature extraction network performs video single-modal mental health abnormality identification to obtain the video feature prediction result y_V:

y_V = W_3 · F_V        (10)

where W_3 denotes the learnable parameters of the fully connected layer;
step five: and (3) audio and video feature fusion:
feature fusion is performed on the obtained depth video features and depth speech features using a tensor fusion network to obtain the audio-video multi-modal fusion feature F_VA;
Step six: category prediction:
the audio-video multi-modal fusion feature F_VA is fed into the mental health classification network for the final prediction, identifying mental health abnormality; the mental health classification network consists of three fully connected layers, and after it the final audio-video prediction result y is obtained, predicting whether the mental health state is abnormal; this can be expressed as

y = W_5 · F_VA        (11)

where W_5 denotes the learnable parameters of the mental health classification network;
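For illustration, a minimal sketch of a three-layer fully connected mental health classification network of the kind described; the hidden width, the activation functions and the use of raw logits rather than an explicit softmax are assumptions.

```python
import torch.nn as nn

class MentalHealthClassifier(nn.Module):
    """Three fully connected layers mapping the fused audio-video feature F_VA to the prediction y."""
    def __init__(self, in_dim: int, hidden: int = 256, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),   # logits y; W_5 in the text denotes these learnable weights
        )

    def forward(self, f_va):
        return self.net(f_va)
```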
Step seven: and (3) constructing a loss function:
considering the problem of multi-modal feature consistency, a multi-modal consistency loss function for the audio-video feature extraction and classification model is designed on the basis of the traditional cross-entropy function, based on the video feature prediction result y_V and the audio feature prediction result y_A obtained in the previous steps and the final audio-video prediction result y.
The traditional cross-entropy loss is calculated as

L_CE = -(1/n) · Σ_{i=1}^{n} Σ_{c=1}^{M} y_{i,c} · log(p_{i,c})        (1)

where n is the number of samples, M is the number of categories (in this embodiment M = 2, i.e. normal and abnormal), y_i denotes the label of the i-th sample (y_{i,c} = 1 when y_i = c and 0 otherwise), and p_{i,c} is the predicted probability that the i-th sample belongs to category c.

On the basis of the cross-entropy loss, the consistency between the audio feature prediction result y_A and the video feature prediction result y_V is taken into account: when the audio feature prediction result and the video feature prediction result are inconsistent, a larger penalty weight is applied, giving the audio-video multi-modal consistency loss L_MC:

L_MC = -(1/n) · Σ_{i=1}^{n} (1 + λ·h(y_V, y_A)) · Σ_{c=1}^{M} y_{i,c} · log(p_{i,c})        (2)

where λ is the penalty coefficient, a hyperparameter, and the function h(y_V, y_A) is the penalty function, which increases the penalty of the loss function when the audio feature prediction result and the video feature prediction result are inconsistent; it is defined as

h(y_V, y_A) = 1 if y_V ≠ y_A, and h(y_V, y_A) = 0 if y_V = y_A        (3)

It can be seen that when the predicted categories are inconsistent, the value of the loss function becomes larger because of the penalty function and the penalty coefficient.
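A sketch of one way to implement the consistency loss of formula (2) in PyTorch, under the assumptions that the penalty enters as a per-sample multiplicative weight (1 + λ·h) on the cross-entropy and that h is a non-differentiable indicator computed from the arg-max of the per-modality predictions.

```python
import torch
import torch.nn.functional as F

def multimodal_consistency_loss(logits, logits_v, logits_a, labels, lam: float = 1.55):
    """Cross-entropy on the fused prediction, weighted per sample by (1 + lam) whenever
    the video-only and audio-only predictions disagree (h(y_V, y_A) = 1), else by 1."""
    ce = F.cross_entropy(logits, labels, reduction="none")                   # per-sample cross-entropy
    disagree = (logits_v.argmax(dim=1) != logits_a.argmax(dim=1)).float()    # h(y_V, y_A)
    return ((1.0 + lam * disagree) * ce).mean()

# Example with a batch of 4 samples and 2 classes.
logits, logits_v, logits_a = torch.randn(4, 2), torch.randn(4, 2), torch.randn(4, 2)
loss = multimodal_consistency_loss(logits, logits_v, logits_a, torch.tensor([0, 1, 0, 1]))
```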
Example 1
The multi-modal feature consistency mental health abnormality identification method of this embodiment identifies mental health abnormalities on a specific mental health data set; the input is self-collected audio and video files, and the procedure is as follows:
step one: original audio and video data acquisition:
A man-machine interaction program is designed for data collection; it records the subjects' micro-expression changes and speech feature information and is driven automatically by the computer. The subject's information is entered in advance through an identity card reader, the subject's identity and other information are desensitized in the program, and a start button is then clicked to enter the interface that interacts with the subject. To better capture the subjects' micro-expression facial features and speech features, the man-machine interaction session for data collection comprises three links: watching a video (capturing micro-expressions in finer detail), reading text (capturing the texture features of the voice) and answering questions (capturing the thinking process). In the "watching a video" link, the computer plays a short emotion-inducing video for the subject; the purpose is to make the subject react to an external stimulus so that the facial micro-expression features can be captured more clearly and intuitively. In the "reading text" link, the computer displays a passage of text for the subject to read aloud; the purpose is to make the subject produce speech so that the frequency and texture characteristics of the voice can be captured clearly. In the "answering questions" link, the computer displays a question for the subject to think about and answer; in this link both visual information and speech information are captured. The whole man-machine interaction process is recorded with a head-mounted microphone and a camera at 1080 x 640 resolution and a 60 fps frame rate; the "watching a video" and "reading text" links each last 30 seconds, the "answering questions" link lasts 40 seconds, and the total length of the video and audio is 100 seconds. This yields raw data files of the two modalities, an audio file and a video file (each 100 seconds long, with the audio and video corresponding in real time and of equal length); the audio files are wav files and the video files are avi files. The data contain a total of 1469 samples, including 1278 normal samples and 371 abnormal samples; the abnormal samples comprise 15 with a history of mental disorder, 13 with adjustment disorder, 69 with dissociative disorder, 117 with schizophrenia, 14 with depression, 7 with mental retardation, 57 with mental disorders, 1 with a dream disorder, 75 with obsessive-compulsive disorder and 4 with suicidal tendencies. Each sample is labelled, and these 1469 samples form the data set. The data can be labelled as normal or abnormal for two-class classification, or with multi-class labels set according to the specific abnormal disease type and the normal class; each sample's label is assigned through professional assessment by a professional psychologist.
Taking two-class labels as an example, the normal samples and the abnormal samples are each divided in a 3:1 ratio; 3/4 of the normal samples and 3/4 of the abnormal samples are mixed as the training set and stored in a train.csv file under the data directory, and the remaining 1/4 of the normal and abnormal samples are mixed as the test set and stored in a test.csv file under the same directory. The train.csv and test.csv files are then used to generate json files for data loading.
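A sketch of the 3:1 split and file generation described above, assuming pandas and scikit-learn and a simple list of (sample, label) records; the column names, the file names other than train.csv and test.csv, and the json layout are illustrative assumptions.

```python
import json
import pandas as pd
from sklearn.model_selection import train_test_split

def build_splits(records, out_dir="."):
    """records: list of dicts like {"sample": "sample_001", "label": 0}  (0 = normal, 1 = abnormal).
    Splits each class 3:1 into train/test, writes train.csv / test.csv and a json index."""
    df = pd.DataFrame(records)
    train_df, test_df = train_test_split(df, test_size=0.25, stratify=df["label"], random_state=0)
    train_df.to_csv(f"{out_dir}/train.csv", index=False)
    test_df.to_csv(f"{out_dir}/test.csv", index=False)
    with open(f"{out_dir}/splits.json", "w") as f:          # json file used for data loading
        json.dump({"train": train_df.to_dict("records"),
                   "test": test_df.to_dict("records")}, f, indent=2)

# build_splits([{"sample": "sample_001", "label": 0}, {"sample": "sample_002", "label": 1}, ...])
```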
Step two: audio and video data preprocessing:
for video data: first, the collected avi videos are converted into frames, all frames of each video are stored in a folder named after the sample, and the samples are placed in the training-set and test-set folders. The acquired video image data containing complete facial images is then region-cropped and re-encoded into a video containing only the head region, to facilitate the subsequent micro-expression amplification. Micro-expression amplification is then applied to the head-region video: the micro-expressions are first amplified with the Eulerian motion magnification algorithm, with an amplification factor of 5; a key frame detection algorithm then locates the apex frame, scores the degree of micro-expression in each frame, stores the results in mat format and regularizes them; the facial images are then registered and aligned and a mean standard image is finally generated, completing the micro-expression amplification processing of the video data. The face in each frame is cropped so that only the facial features are kept, the cropped image size being 300 x 300. Finally, all frames of each micro-expression-amplified video sample are divided into 10 segments, and 30 consecutive frames are extracted from each segment, giving 10 continuous frame sequences of 30 frames each. This completes the preprocessing of the video data;
for audio data: noise reduction and cleaning are applied to the collected audio data. First, a speech separation model for the cocktail-party problem is used to separate the speech from the noisy audio; the audio is then transformed with the fast Fourier transform, filtered with filters of different pass bands and stop bands, and amplified to obtain clear audio data, completing the noise reduction and cleaning of the audio data. The Mel-frequency cepstral coefficients of the denoised audio are then computed to extract the frequency features of the audio data;
step three: audio depth feature extraction:
the MFCC frequency features obtained from the audio data in step two are fed into the audio-stream feature extraction network for audio depth feature extraction, obtaining the depth speech feature; specifically:
Audio-stream depth feature extraction network: the backbone of the audio-stream depth feature extraction network is a pre-trained 2D ResNet18 with an integrated temporal attention mechanism, which likewise consists of one one-dimensional convolution layer, one fully connected layer and one softmax layer. The MFCCs of the audio data are passed through the 2D ResNet18 for deep feature extraction; the extracted feature vector then enters the temporal attention mechanism, which compresses the spatial and channel dimensions and extracts features along the time dimension. The last layer of the audio-stream depth feature extraction network is a fully connected layer that performs audio single-modal mental health abnormality identification to obtain the audio feature prediction result y_A. As in the video stream, the audio data processed by the 2D ResNet18 and the temporal attention mechanism yields the depth speech feature F_A, and the audio feature prediction result y_A obtained through the fully connected layer can be expressed as

y_A = W_4 · F_A        (4)

where W_4 denotes the learnable parameters of the fully connected layer;
step four: video depth feature extraction:
the several same-dimension continuous frame sequences obtained from the video data in step two are fed into the video-stream depth feature extraction network for video depth feature extraction, obtaining the depth video features;
Video-stream depth feature extraction network: before being fed into the video-stream depth feature extraction network, the video frame sequences are arranged into tensors of dimension 10 x 3 x 3 x 30 x 300 x 300, the dimensions being, respectively, the number of continuous frame sequences, the batch size, the number of RGB channels, the length of the continuous frame sequences and the image size. The backbone of the video-stream depth feature extraction network is a C3D ResNet50 residual network pre-trained on the ImageNet and Kinetics data sets. Each residual block of the C3D ResNet50 is composed of three-dimensional convolutions: the first has a kernel size of 1 x 1 and a stride of 1, the second has a kernel size of 3 x 3 and a stride of 2, and the third has a kernel size of 1 x 1 and a stride of 1; each convolution layer is followed by three-dimensional batch normalization and a ReLU activation function. In the network configuration, because micro-expression features are subtle and easily lost, all pooling layers are removed from the C3D ResNet50 and the stride of the second convolution of each residual block is set to 2, so that the micro-expression features are not processed by pooling layers; this setting reduces the amount of computation without losing detail. The dimension of the video feature vector F processed by the C3D ResNet50 is 30 x 512 x 30, where the first dimension is the product of the number of continuous frame sequences and the batch size, the second dimension is the processed spatial dimension, and the third dimension is the channel dimension, which is also the continuous frame sequence length;
The video feature vector F processed by the C3D ResNet50 is then fed in turn into the spatial attention module and the channel attention module for information integration and weight assignment along the spatial and channel dimensions, yielding the spatial feature vector F_S and then the spatial-channel feature vector F_SC. The resulting spatial-channel feature vector F_SC is multiplied by the micro-expression key frame vector F_me, which is obtained by the key frame detection algorithm in step two; after the micro-expression degree score of each frame is obtained, it is normalized here, giving a vector of dimension 3 x 10 x 30, which is then used as a weight and multiplied with F_SC to obtain the spatial-channel feature vector F_SCm containing the micro-expression identification information. Next, F_SCm is fed into the temporal attention module, which integrates time-dimension information at the frame-sequence level, to obtain the depth video feature F_V. The spatial, channel and temporal attention modules each consist of one one-dimensional convolution layer, one fully connected layer and one softmax layer, and the attention mechanism can be expressed as

Attention_k = Softmax(W_2 (W_1 F_k))        (5)

where k ∈ {S, C, T} denotes the type of attention, with S denoting spatial attention, C channel attention and T temporal attention; F_k denotes the feature vector fed into the attention module, W_1 denotes a one-dimensional convolution matrix, W_2 denotes a fully connected matrix, and W_1 and W_2 are learnable parameters.

Specifically, the video feature vector F processed by the C3D ResNet50 first passes through the spatial attention module to obtain the spatial feature vector F_S, and then through the channel attention module to obtain the spatial-channel feature vector F_SC:

F_S = Attention_S ⊗ F        (6)
F_SC = Attention_C ⊗ F_S        (7)

where the symbol ⊗ denotes the vector product. The spatial attention module and the channel attention module do not change the vector size or dimensions, so the spatial feature vector F_S and the spatial-channel feature vector F_SC are still 30 x 512 x 30. The first dimension of F_SC, which is the product of the number of continuous frame sequences and the batch size, is then split, giving F_SC the size 3 x 10 x 512 x 30. It is then multiplied by the micro-expression key frame vector F_me, whose dimension is 3 x 10 x 30, to obtain the spatial-channel feature vector F_SCm fused with the micro-expression identification information:

F_SCm = F_SC ⊗ F_me        (8)

The resulting F_SCm still has dimension 3 x 10 x 512 x 30. F_SCm is then resized and its dimensions exchanged; specifically, it is averaged over the last dimension and the second and third dimensions of the feature vector are exchanged, giving a feature vector F_SCm of dimension 3 x 512 x 10. Finally, it is fed into the temporal attention module:

F_SCmT = Attention_T ⊗ F_SCm        (9)

yielding the spatial-channel-temporal feature vector F_SCmT of size 3 x 512 x 10. After this simple integration of the depth video features, the third dimension, i.e. the frame-sequence dimension, is averaged again to obtain the depth video feature F_V with final size 3 x 512.

At the end of the video-stream depth feature extraction network, a fully connected layer performs video single-modal mental health abnormality identification to obtain the video feature prediction result y_V:

y_V = W_3 · F_V        (10)

where W_3 denotes the learnable parameters of the fully connected layer;
During training, the parameters of the C3D ResNet50 network are frozen after pre-training and fine-tuning and do not participate in gradient back-propagation, while the parameters of the spatial attention, channel attention, temporal attention and classification networks do participate in gradient back-propagation.
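A small sketch of this parameter-freezing scheme: the fine-tuned backbone is excluded from gradient updates while the attention modules and classification head remain trainable. The attribute name `model.backbone` and the optimizer settings are assumptions for illustration.

```python
import torch

def freeze_backbone(model, optimizer_cls=torch.optim.Adam, lr=1e-4):
    """Freeze the (hypothetical) `model.backbone` C3D ResNet50 and optimize only the remaining modules."""
    for p in model.backbone.parameters():
        p.requires_grad = False              # no gradient back-propagation through the backbone
    trainable = [p for p in model.parameters() if p.requires_grad]
    return optimizer_cls(trainable, lr=lr)   # attention modules and classifier stay trainable
```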
Step five: and (3) audio and video feature fusion:
feature fusion is performed on the obtained depth video features and depth speech features using a tensor fusion network to obtain the audio-video multi-modal fusion feature F_VA. Fusing multi-modal features by simple concatenation is likely to lose dynamic inter-modal information; compared with concatenation-based methods, the tensor fusion network focuses on the correlation of the multi-modal features in a high-dimensional space, and although the computation is more complex, the features are fused with better correlation and tighter coupling, giving a better fusion effect.
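A minimal sketch of tensor-fusion-style feature fusion in the spirit of the tensor fusion network referred to above: each modality feature is projected to a small dimension, augmented with a constant 1 (so unimodal and bimodal terms are both retained), combined by an outer product in a high-dimensional space and flattened. The projection sizes are assumptions.

```python
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """Fuse F_V and F_A via the outer product of 1-augmented, projected feature vectors."""
    def __init__(self, dim_v: int, dim_a: int, proj_dim: int = 32, out_dim: int = 128):
        super().__init__()
        self.pv = nn.Linear(dim_v, proj_dim)
        self.pa = nn.Linear(dim_a, proj_dim)
        self.out = nn.Linear((proj_dim + 1) ** 2, out_dim)

    def forward(self, f_v, f_a):                              # f_v: (B, dim_v), f_a: (B, dim_a)
        ones = f_v.new_ones(f_v.size(0), 1)
        zv = torch.cat([torch.relu(self.pv(f_v)), ones], dim=1)          # (B, proj_dim + 1)
        za = torch.cat([torch.relu(self.pa(f_a)), ones], dim=1)          # (B, proj_dim + 1)
        fused = torch.bmm(zv.unsqueeze(2), za.unsqueeze(1)).flatten(1)   # outer product, flattened
        return self.out(fused)                                # audio-video fusion feature F_VA

# Example: fuse a 512-d video feature with a 512-d speech feature for a batch of 3.
f_va = TensorFusion(512, 512)(torch.randn(3, 512), torch.randn(3, 512))
```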
Step six: category prediction:
The audio-video multi-modal fusion feature F_VA is sent into the mental health classification network for the final prediction, identifying mental health abnormality. The mental health classification network consists of three fully connected layers; after passing through it, the final audio-video feature prediction result y is obtained, which predicts whether mental health is abnormal and, if so, which type of abnormality it is. The calculation formula can be expressed as
y = W_5 F_VA (11)
where W_5 represents the parameters of the mental health classification network, which are learnable;
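A possible shape for such a three-layer classifier is sketched below. Since equation (11) writes the whole network compactly as W_5, all layer sizes here are assumptions; the input dimension simply matches the tensor-fusion sketch above.

```python
import torch.nn as nn

# Sketch of the three-layer fully connected mental health classification
# network; hidden sizes, input dimension and number of classes are assumptions.
num_classes = 2  # e.g. normal vs. abnormal; extend for disease types
classifier = nn.Sequential(
    nn.Linear(263169, 512), nn.ReLU(),
    nn.Linear(512, 64), nn.ReLU(),
    nn.Linear(64, num_classes),
)
```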
Step seven: loss function construction:
Considering the problem of multi-modal feature consistency, the method adds an audio-video multi-modal consistency loss on top of the traditional cross-entropy function, constraining the results of the different modalities against each other. It is computed from the video feature prediction result y_V and the audio feature prediction result y_A obtained in the preceding steps, together with the final audio-video feature prediction result y. The traditional cross-entropy loss is calculated as
L_CE = −(1/n) Σ_{i=1}^{n} Σ_{c=1}^{M} 1(y_i = c) · log(p_{i,c})
where n is the number of samples, M is the number of categories, y_i denotes the label of the i-th sample, and p_{i,c} denotes the model's predicted probability that the i-th sample belongs to category c. On top of the cross-entropy loss, the audio feature prediction result y_A and the video feature prediction result y_V are taken into account: when the audio feature prediction and the video feature prediction are inconsistent, a larger penalty weight is applied, giving the audio-video multi-modal consistency loss
L_con. The formula is as follows:

L_con = (1 + λ·h(y_V, y_A)) · L_CE
where the introduced λ is the penalty coefficient, a hyper-parameter; in this embodiment λ is set to 1.55. The function h(y_V, y_A) is the penalty function, which increases the penalty when the audio prediction result and the video prediction result are inconsistent, and is defined as
h(y_V, y_A) = 1 if the category predicted from y_V differs from the category predicted from y_A, and h(y_V, y_A) = 0 otherwise.
It can be seen that when the predicted categories are inconsistent, the loss value becomes larger because of the penalty function and the penalty coefficient.
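The sketch below implements one plausible reading of this loss, in which the penalty scales the cross entropy multiplicatively; the exact combination used in the patent may differ, and λ = 1.55 follows the embodiment above.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits, logits_v, logits_a, labels, lam=1.55):
    """Cross entropy on the fused prediction, up-weighted by the penalty
    h(y_V, y_A) = 1 whenever the audio-only and video-only predictions
    disagree on the category, and 0 otherwise."""
    ce = F.cross_entropy(logits, labels, reduction="none")          # per-sample L_CE
    h = (logits_v.argmax(dim=1) != logits_a.argmax(dim=1)).float()  # penalty function
    return ((1.0 + lam * h) * ce).mean()                            # L_con

# toy usage with 3 samples and 2 classes
fused = torch.randn(3, 2, requires_grad=True)
loss = consistency_loss(fused, torch.randn(3, 2), torch.randn(3, 2),
                        torch.tensor([0, 1, 0]))
loss.backward()
```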
The invention effectively uses facial micro-expression features and voice features for mental health abnormality recognition: the micro-expression features contained in the video data are amplified and feature-enhanced, and two important psychological physiological indicators, micro-expression and voice characteristics, are considered together, so that the model takes multi-modal feature information into account comprehensively and its performance is enhanced. In terms of the classification evaluation indicators accuracy and recall, using the original video data and audio data for mental health abnormality identification gives an accuracy of 0.85 and a recall of 0.61; using the video data with micro-expression amplification and key-frame identification together with the audio data gives an accuracy of 0.91 and a recall of 0.67, an improvement of 6 percentage points in accuracy and 6 percentage points in recall. The invention also provides a method that uses an end-to-end deep neural network to automatically learn depth features helpful for identifying mental health anomalies, avoiding time-consuming and labor-intensive manual feature annotation and manual feature extraction. The results show that the method performs well, can serve as a basis for preliminary screening, provides a relatively objective reference indicator for clinical diagnosis by psychologists, and can be integrated into a mobile phone applet or application for patients' self-checking and self-testing.
Meanwhile, the invention uses the audio-video multi-modal feature consistency loss function to train the audio-stream and video-stream networks, making fuller use of the feature information of the two modalities and increasing the exchange of feature information between them. Compared with the traditional cross-entropy classification loss, it considers the consistency of the multi-modal prediction results rather than performing simple feature fusion, integrates the multi-modal information better, and effectively improves the classification performance and expressive power of the model. Experimentally, when the input data are the micro-expression-amplified video data and the audio data, using the traditional cross-entropy loss gives an accuracy of 0.82 and a recall of 0.58, while using the multi-modal feature consistency loss gives an accuracy of 0.91 and a recall of 0.67; both accuracy and recall are improved by 9 percentage points. The multi-modal feature consistency method based on micro-expression and voice features is therefore effective.
For aspects not described herein, the invention follows the prior art.

Claims (9)

1. A method for identifying a multimodal feature consistent mental health anomaly, the method comprising:
Acquiring original data files containing two modalities, namely an audio file and a video file, recorded in the same scene, wherein the original data files contain normal samples and abnormal samples, and the abnormal samples carry disease type labels;
performing micro-expression amplification processing on video data containing complete facial images in an obtained original data file, scoring video frames according to the micro-expression action amplitude of each frame of facial images, recording micro-expression degree scores of each frame of facial images, and registering and aligning the facial images to finish the micro-expression amplification processing on the video data; then, sampling the amplified video data in a continuous frame sequence according to a set sampling interval to obtain a plurality of continuous frame sequences with the same dimension, and thus finishing preprocessing the video data;
normalizing the micro-expression degree score of each frame in the continuous frame sequence to obtain a micro-expression key frame vector;
carrying out noise reduction and impurity removal treatment on the audio data in the obtained original data file, and extracting frequency characteristics of the audio data;
sending the frequency features extracted from the audio data into an audio-stream depth feature extraction network to extract audio depth features, thereby obtaining the depth speech feature F_A; simultaneously, performing speech feature result prediction on the speech information in the audio-stream depth feature extraction network to obtain the audio feature prediction result y_A;
constructing a video-stream depth feature extraction network, wherein the video-stream depth feature extraction network comprises a C3D ResNet50 with a pooling layer, a spatial attention module, a channel attention module, a temporal attention module and a classification network connected in sequence; wherein the input of the temporal attention module is the product of the output of the channel attention module and the micro-expression key-frame vector, and the output of the temporal attention module, after being averaged over the frame-sequence dimension, gives the depth video feature F_V;
sending the plurality of continuous frame sequences with the same dimension obtained from the video data simultaneously into the video-stream depth feature extraction network to extract video depth features, obtaining the depth video feature F_V; performing video feature result prediction on the video information of the continuous frame sequences in the video-stream depth feature extraction network, and obtaining the video feature prediction result y_V through the classification network;
Carrying out feature fusion on the depth video features and the depth voice features by using a tensor fusion network to obtain audio-video multi-mode fusion features; sending the audio and video multi-mode fusion characteristics into a mental health classification network for final prediction;
calculating the audio-video multi-modal consistency loss according to formula (2):

L_CE = −(1/n) Σ_{i=1}^{n} Σ_{c=1}^{M} 1(y_i = c) · log(p_{i,c})    (1)

L_con = (1 + λ·h(y_V, y_A)) · L_CE    (2)
where n is the number of samples, M is the number of label categories, y_i denotes the label of the i-th sample, p_{i,c} denotes the model's predicted probability that the i-th sample belongs to category c, and λ is the penalty coefficient; the function h(y_V, y_A) increases the penalty when the audio feature prediction result and the video feature prediction result are inconsistent, and is defined as
h(y_V, y_A) = 1 if the category predicted from y_V differs from the category predicted from y_A, and h(y_V, y_A) = 0 otherwise.
2. The multi-modal feature consistency mental health abnormality identification method as claimed in claim 1, wherein the set value of λ is 1.0-1.6, preferably 1.55.
3. The multi-modal feature consistency mental health abnormality identification method according to claim 1, wherein the audio-stream depth feature extraction network comprises a 2D ResNet18 network, a temporal attention mechanism and a fully connected layer; the output of the pre-trained 2D ResNet18 network is connected to the temporal attention mechanism, and the output of the temporal attention mechanism is connected to the fully connected layer;
the temporal attention mechanism comprises one one-dimensional convolution layer, one fully connected layer and one softmax layer; the two feature dimensions of space and channel are compressed and the temporal feature dimension is extracted, giving the depth speech features.
4. The multi-modal feature consistency mental health abnormality identification method according to claim 1, wherein the backbone network of the video-stream depth feature extraction network is a pre-trained C3D ResNet50; the features extracted by the backbone network are sequentially sent into the spatial attention module and the channel attention module, which integrate the information of the spatial and channel dimensions and assign weights; the feature vector F_SC obtained from the channel attention module is then multiplied by the micro-expression key-frame vector F_me; the result is then sent into the temporal attention module to integrate the temporal information at the frame-sequence level;
each residual block of the C3D ResNet50 consists of three three-dimensional convolutions in series: the first has a kernel size of 1 × 1 × 1 and a stride of 1, the second has a kernel size of 3 × 3 × 3 and a stride of 2, and the third has a kernel size of 1 × 1 × 1 and a stride of 1; each three-dimensional convolution is followed by three-dimensional batch normalization and a ReLU activation function.
5. The multi-modal feature consistency mental health abnormality identification method according to claim 1, wherein the micro-expression amplification processing amplifies the micro-expressions using the Eulerian motion magnification algorithm, and the scoring process is: finding the frame with the largest micro-expression change in the micro-expression-amplified video using a key-frame detection algorithm, and scoring all frames of the video according to the degree of micro-expression change;
the continuous frame sequences are obtained by dividing all frames of the whole video equally into 10 segments and extracting 30 consecutive frames from each segment, giving 10 continuous frame sequences of 30 frames each with the same dimensions.
6. Integrating the multi-modal feature consistency mental health abnormality identification method of any one of claims 1-5 into a mobile phone applet or application.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any of claims 1-5.
8. A multimodal feature consistent mental health anomaly recognition system, comprising:
the human-computer interaction module is used for recording the micro-expression changes and voice feature information of the sample and obtaining audio and video data;
the video data preprocessing module is used for extracting micro expression data in the video;
the micro-expression key-frame vector acquisition module is used for normalizing the micro-expression degree score of each frame to obtain the weighted micro-expression key-frame vector; the audio data preprocessing module is used for acquiring the frequency features of the audio data;
the mental health abnormality recognition model comprises a video stream depth feature extraction network, an audio stream depth feature extraction network, a tensor fusion network and a mental health classification network,
the video-stream depth feature extraction network is used for obtaining the depth video feature F_V and the video feature prediction result y_V;
the audio-stream depth feature extraction network is used for obtaining the depth speech feature F_A and the audio feature prediction result y_A;
The tensor fusion network is used for fusing the depth video features and the depth voice features to obtain audio and video multi-mode fusion features;
the psychological health classification network is used for classifying the audio and video multi-mode fusion characteristics;
the mental health abnormality identification model uses the multi-modal consistency loss L_con to impose training constraints on the video feature prediction result y_V and the audio feature prediction result y_A.
9. The multi-modal feature consistency mental health abnormality identification system of claim 8, wherein the process of the human-computer interaction module is: each link is presented automatically by computer playback; the information of the person being tested is entered in advance through an identity card reader and the identity information of the patient is desensitized within the program, after which a start button is clicked to enter the interface for interacting with the person being tested; the human-computer interaction links comprise three links: watching a video, reading a text and answering a question; in the video-watching link the computer plays a segment of emotion-inducing video for the person being tested to watch, the purpose being to make the person react to an external stimulus so that the micro-expression features of the face can be captured more clearly and intuitively; in the text-reading link the computer pops up a passage of text for the person being tested to read aloud, the purpose being to elicit sound information so that the frequency and texture features of the voice can be clearly captured; in the question-answering link the computer pops up a question for the person being tested to think about and answer, and this link captures both visual information and voice information; the whole human-computer interaction process is recorded with a head-mounted microphone and a camera with 1080 x 640 resolution and a 60 fps frame rate; the video-watching and text-reading links each last 30 seconds, the question-answering link lasts 40 seconds, and the total duration of the video and audio is 100 seconds.
CN202310265823.6A 2023-03-20 2023-03-20 Multi-mode feature consistency psychological health abnormality identification method and system Pending CN116230234A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310265823.6A CN116230234A (en) 2023-03-20 2023-03-20 Multi-mode feature consistency psychological health abnormality identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310265823.6A CN116230234A (en) 2023-03-20 2023-03-20 Multi-mode feature consistency psychological health abnormality identification method and system

Publications (1)

Publication Number Publication Date
CN116230234A true CN116230234A (en) 2023-06-06

Family

ID=86582392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310265823.6A Pending CN116230234A (en) 2023-03-20 2023-03-20 Multi-mode feature consistency psychological health abnormality identification method and system

Country Status (1)

Country Link
CN (1) CN116230234A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116933046A (en) * 2023-09-19 2023-10-24 山东大学 Deep learning-based multi-mode health management scheme generation method and system
CN116933046B (en) * 2023-09-19 2023-11-24 山东大学 Deep learning-based multi-mode health management scheme generation method and system
CN117116489A (en) * 2023-10-25 2023-11-24 光大宏远(天津)技术有限公司 Psychological assessment data management method and system
CN117649933A (en) * 2023-11-28 2024-03-05 广州方舟信息科技有限公司 Online consultation assistance method and device, electronic equipment and storage medium
CN117649933B (en) * 2023-11-28 2024-05-28 广州方舟信息科技有限公司 Online consultation assistance method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110507335B (en) Multi-mode information based criminal psychological health state assessment method and system
CN111523462B (en) Video sequence expression recognition system and method based on self-attention enhanced CNN
CN116230234A (en) Multi-mode feature consistency psychological health abnormality identification method and system
CN110464366A (en) A kind of Emotion identification method, system and storage medium
CN112800998A (en) Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA
CN110427881B (en) Cross-library micro-expression recognition method and device based on face local area feature learning
CN111920420B (en) Patient behavior multi-modal analysis and prediction system based on statistical learning
Abtahi et al. Emotion analysis using audio/video, emg and eeg: A dataset and comparison study
CN112101096A (en) Suicide emotion perception method based on multi-mode fusion of voice and micro-expression
Jayanthi et al. An integrated framework for emotion recognition using speech and static images with deep classifier fusion approach
CN110781751A (en) Emotional electroencephalogram signal classification method based on cross-connection convolutional neural network
CN109805944A (en) A kind of children's empathy ability analysis system
Le et al. Dynamic image for micro-expression recognition on region-based framework
CN113243924A (en) Identity recognition method based on electroencephalogram signal channel attention convolution neural network
CN116050892A (en) Intelligent education evaluation supervision method based on artificial intelligence
CN113974627B (en) Emotion recognition method based on brain-computer generated confrontation
Javaid et al. EEG guided multimodal lie detection with audio-visual cues
CN111259759A (en) Cross-database micro-expression recognition method and device based on domain selection migration regression
Liu et al. Facial expression recognition for in-the-wild videos
CN111523461A (en) Expression recognition system and method based on enhanced CNN and cross-layer LSTM
Hou Deep learning-based human emotion detection framework using facial expressions
CN113974625A (en) Emotion recognition method based on brain-computer cross-modal migration
CN112507959A (en) Method for establishing emotion perception model based on individual face analysis in video
CN113974628B (en) Emotion recognition method based on brain-computer modal co-space
CN117198468B (en) Intervention scheme intelligent management system based on behavior recognition and data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination