CN115732076A - Fusion analysis method for multi-modal depression data - Google Patents

Fusion analysis method for multi-modal depression data Download PDF

Info

Publication number
CN115732076A
Authority
CN
China
Prior art keywords
data
depression
value
attention
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211433256.2A
Other languages
Chinese (zh)
Inventor
张健
龚昊然
瞿星
蒋明丰
赵墨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202211433256.2A priority Critical patent/CN115732076A/en
Publication of CN115732076A publication Critical patent/CN115732076A/en
Pending legal-status Critical Current

Landscapes

  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention discloses a fusion analysis method for multi-modal depression data in the field of depression data fusion, aiming to extract and fuse the emotional characteristics of different types of data recorded over multiple stages. For each data type, data features are extracted according to the characteristics of that type. The data features of the different modalities are then passed through three linear layers to obtain K-value, Q-value and V-value representations; the attention A of each modality's data is computed from K and Q according to the fused depression-data attention mechanism, and A·V is taken as the fused feature to serve downstream tasks. Compared with existing techniques that study depression through a single modality, this scheme is less susceptible to factors such as individual differences: it exploits the screening characteristics of the case set with respect to a patient's individual differences, and then fuses the patient's images, movements and sounds according to those characteristics to achieve comprehensive diagnosis.

Description

Fusion analysis method for multi-modal depression data
Technical Field
The invention belongs to the field of multi-modal data fusion, and particularly relates to a multi-modal fusion analysis method applied to emotion recognition.
Background
Owing to its high incidence and serious harm, depression has become an internationally recognized public health problem that gravely threatens physical and mental health, and early recognition and early intervention are crucial to reducing its risk. Traditional depression diagnosis is performed by physicians based on clinical experience and rating scales; it relies mainly on single-modality data and suffers from subjective bias, delay, passivity and limited scope. Jeffery et al. found that multimodal techniques identify depression better than any single modality.
Multimodal techniques process or fit data from multiple modalities simultaneously to improve model performance. Data from different modalities differ in both representation form and meaning, which makes them difficult to align and fuse. For example, in image and speech recognition, image data is usually represented as pictures while language data is represented as characters, so the two are hard to fuse because of their different representation forms; in gene-sequencing analysis, data from different sequencing methods are hard to fuse because their meanings differ.
Existing work has explored multimodal techniques extensively. Dupont, S. et al. used hidden Markov models combined with finite automata to align speech data with image data and used the bimodal data for speech and image recognition. This approach fuses data of different representation forms to some extent, but remains inefficient and does not generalize well. Another approach is to use neural networks for multi-data fusion. Zeng, X. et al. fused ten drug descriptions (e.g., side effects, pathways of action) with a multimodal autoencoder, took the disease type as input to match diseases to drug types, segmented the onset symptoms of the disease, and adjusted the corresponding drug dosage according to how each symptom manifests in different individuals. This fusion method does not sufficiently account for the relationships between modalities and cannot fuse data of different representation forms. In summary, despite many attempts at multimodal techniques, there is still no method that fuses multimodal data well.
Disclosure of Invention
In order to solve the above problems, it is an object of the present invention to provide a fusion analysis method of multimodal depression data.
In order to achieve this purpose, the technical scheme of the invention is as follows: a fusion analysis method for multi-modal depression data that performs multi-stage entry of data of different types, extracts emotional features from the entered data, passes the data features of each modality through three linear layers to obtain K-value, Q-value and V-value representations, computes the attention A of each modality's data from K and Q according to the fused depression-data attention mechanism, and takes A·V as the fused feature to serve downstream tasks. Because the fused depression-data attention mechanism is applied, the fused data features contain multi-modal information and can assist downstream classification tasks.
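For concreteness, this fused-feature computation can be written in standard attention notation as below; the scaled dot-product form and the per-modality weight matrices are editorial assumptions, since the text only states that A is obtained from K and Q and that A·V is the fused feature.

```latex
K_m = X_m W_K^{(m)}, \quad Q_m = X_m W_Q^{(m)}, \quad V_m = X_m W_V^{(m)}, \qquad m \in \{\text{text},\ \text{image},\ \text{audio}\}
A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right), \qquad F_{\text{fused}} = A\,V
```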
Further, the method comprises the following steps:
s1, preprocessing data, namely dividing data components into text data, image data and audio data;
s2, integrating a depression data attention mechanism, and calculating preprocessed data to obtain features containing multi-modal information;
and S3, identifying the depression, namely splicing the characteristics containing the multi-modal information, outputting a fused data characteristic through a linear layer, and outputting a classification prediction result by using a softmax function as an activation function for the neuron of the last layer.
Further, the text data in S1 comprise rating scales and electronic medical records, and the scale and electronic-record data undergo primary feature screening, missing-value processing, feature encoding and normalization.
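A minimal sketch of this preprocessing, assuming pandas and scikit-learn; the column grouping, fill strategies, and the variance threshold used for primary screening are illustrative assumptions rather than details taken from the patent.

```python
# Hypothetical preprocessing of scale / electronic-record text data (assumptions noted above).
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

def preprocess_text_records(df: pd.DataFrame, categorical_cols, numeric_cols):
    # Missing-value processing: median for numeric items, mode for categorical items.
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

    # Feature encoding: one-hot encode categorical scale / record items.
    encoded = OneHotEncoder(sparse_output=False).fit_transform(df[categorical_cols])

    # Normalization: scale numeric items to [0, 1].
    scaled = MinMaxScaler().fit_transform(df[numeric_cols])

    # Primary feature screening: drop near-constant columns.
    features = np.hstack([scaled, encoded])
    return VarianceThreshold(threshold=1e-3).fit_transform(features)
```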
Further, in S1, frames are extracted from the video data at 20 frames per second; after denoising and artifact removal of the obtained image data, face positions are detected in each frame, the images are aligned according to eye positions, and the frames are cropped into 256 × 256-pixel face images.
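A sketch of the frame extraction and face cropping, assuming OpenCV's Haar cascade face detector; the detector choice, the denoising call, and the omission of the eye-based alignment step are simplifications made here for brevity, not choices specified by the patent.

```python
# Hypothetical video preprocessing sketch (library choices are assumptions; eye alignment omitted).
import cv2
import numpy as np

FACE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_frames(video_path: str, fps_out: int = 20, size: int = 256) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or float(fps_out)
    step = max(int(round(src_fps / fps_out)), 1)        # keep roughly 20 frames per second
    faces, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.fastNlMeansDenoisingColored(frame)       # denoising / artifact removal
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            boxes = FACE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)  # face position detection
            if len(boxes):
                x, y, w, h = boxes[0]
                faces.append(cv2.resize(frame[y:y + h, x:x + w], (size, size)))   # 256 x 256 face crop
        idx += 1
    cap.release()
    return np.stack(faces) if faces else np.empty((0, size, size, 3), dtype=np.uint8)
```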
Further, in S1, after the audio data are aligned with the image set obtained by frame extraction, Mel-frequency cepstral coefficients are extracted from each aligned speech segment.
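A sketch of this audio step, assuming librosa: the hop length is set to 1/20 s so that one MFCC vector lines up with each extracted video frame; the sampling rate and number of coefficients are assumptions.

```python
# Hypothetical frame-aligned MFCC extraction (parameter values are assumptions).
import librosa
import numpy as np

def frame_aligned_mfcc(audio_path: str, frame_rate: int = 20, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(audio_path, sr=16000)
    hop = sr // frame_rate                       # one analysis step per extracted video frame (1/20 s)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T                                # (num_frames, n_mfcc), aligned with the image set
```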
Further, in S2, the K, Q and V values corresponding to the text data, the image data and the audio data are calculated; the K and Q values are used to compute the auxiliary attention of the video, the audio and the text respectively; the three auxiliary attentions are concatenated and passed through a Softmax function to form the attention over video, audio and text, which is then multiplied by the V values calculated in the previous step.
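A minimal PyTorch sketch of this fused attention step is given below; the feature dimensions, the scaled dot-product form of the auxiliary attention, and the use of a single feature vector per modality are assumptions made for illustration, not details fixed by the patent.

```python
# Hypothetical sketch of the fused depression-data attention (IDDA); shapes and scaling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ("image", "audio", "text")

class IDDAFusion(nn.Module):
    def __init__(self, dims: dict, d_model: int = 128):
        super().__init__()
        # Three linear layers per modality produce the K, Q and V representations.
        self.proj = nn.ModuleDict({
            m: nn.ModuleDict({k: nn.Linear(dims[m], d_model) for k in ("K", "Q", "V")})
            for m in MODALITIES
        })
        self.out = nn.Linear(len(MODALITIES) * d_model, d_model)   # linear layer after concatenation

    def forward(self, feats):                       # feats: {"image": (B, d_img), ...}
        K = {m: self.proj[m]["K"](feats[m]) for m in MODALITIES}
        Q = {m: self.proj[m]["Q"](feats[m]) for m in MODALITIES}
        V = {m: self.proj[m]["V"](feats[m]) for m in MODALITIES}
        d = next(iter(K.values())).shape[-1]
        Vs = torch.stack([V[n] for n in MODALITIES], dim=1)        # (B, 3, d_model)

        fused = []
        for m in MODALITIES:
            # Auxiliary attention of modality m toward image, audio and text (from K and Q),
            # concatenated and passed through Softmax to form the attention A.
            aux = torch.stack([(Q[m] * K[n]).sum(-1) / d ** 0.5 for n in MODALITIES], dim=-1)
            A = F.softmax(aux, dim=-1)                             # (B, 3)
            fused.append(torch.einsum("bm,bmd->bd", A, Vs))        # A·V as the fused feature
        return self.out(torch.cat(fused, dim=-1))                  # feature containing multi-modal information
```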
Further, in S3, the prediction result is obtained by using a cross-entropy loss function to fit the difference between the predicted value and the true value.
Adopting this scheme yields the following beneficial effects: 1. Compared with prior art that studies depression through a single modality, which is easily affected by factors such as individual differences, this scheme exploits the screening characteristics of the case set with respect to a patient's individual differences and then fuses the patient's images, movements and sounds according to those characteristics to achieve comprehensive diagnosis.
2. Compared with the traditional concatenation-style data fusion, this scheme addresses three problems. The first is representation: representing and combining the information carried by the different modalities. The second is alignment: aligning the information of the different modalities and handling possible dependencies between them. The third is conversion: bringing the information of the multiple modalities into a unified form.
Drawings
FIG. 1 is a multi-modal fusion perinatal depression assessment model framework;
FIG. 2 is a schematic of the fused depression-data attention mechanism.
Detailed Description
The following is further detailed by way of specific embodiments:
The embodiment is substantially as shown in Figures 1 and 2: a fusion analysis method for multi-modal depression data that performs multi-stage entry of data of different types, extracts emotional features from the entered data, passes the data features of each modality through three linear layers to obtain K-value, Q-value and V-value representations, computes the attention A of each modality's data from K and Q according to the fused depression-data attention mechanism, and takes A·V as the fused feature to serve downstream tasks. Because the fused depression-data attention mechanism is applied, the fused data features contain multi-modal information and can assist downstream classification tasks.
The specific implementation process is as follows: the inputs to the invention are video, audio and text data. The method is divided into three main stages: data preprocessing, the integrated depression-data attention mechanism (IDDA), and depression recognition. It comprises the following steps:
s1, preprocessing data, dividing data into text data, image data and audio data, wherein the text data comprises a scale and an electronic case, the scale and the electronic case history data are subjected to feature primary screening, missing value processing, feature coding and normalization, video data are subjected to image extraction according to the frequency of 20 frames per second, after the obtained image data are subjected to noise removal and artifact removal, face position detection is carried out on each frame of image, the image is aligned according to the eye position, then the video image is cut into face images with 256 multiplied by 256 pixels, and after the audio data are aligned with an image set obtained by frame extraction, mel frequency cepstrum coefficients are extracted for each aligned voice segment;
s2, a depression data attention mechanism is fused, preprocessed data are calculated, characteristics containing multi-modal information are obtained, and in order to better fuse the data, a new multi-modal data fusion mechanism (IDDA) is proposed. Firstly, 1, calculating the corresponding K value, Q value and V value of the three data respectively. Then, respectively calculating the auxiliary attention of the video, the audio and the text by using the K value and the Q value;
and splicing the three auxiliary attentions, forming attentions of video, audio and text through a Softmax function, and multiplying the attentions by the V value calculated in the previous step to obtain the characteristics containing multi-modal information.
The features containing multi-modal information are concatenated and passed through a linear layer to output a fused data feature, which serves as the input of the downstream task.
S3, depression recognition: the features containing multi-modal information are concatenated, a fused data feature is output through a linear layer, and an LSTM is selected as the classifier for the downstream task. The model is optimized with the Adam optimizer, a softmax function is used as the activation of the last-layer neurons, and the classification prediction result is output. A cross-entropy loss function fits the difference between the predicted value and the true value, and the learning rate of the model is 0.001.
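A sketch of this downstream stage under the stated settings (LSTM classifier, Adam, learning rate 0.001, cross-entropy, softmax on the last layer), again in PyTorch; the hidden size, number of classes and sequence handling are assumptions.

```python
# Hypothetical downstream depression-recognition stage (sizes and sequence handling are assumptions).
import torch
import torch.nn as nn

class DepressionClassifier(nn.Module):
    def __init__(self, d_fused: int = 128, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(d_fused, hidden, batch_first=True)     # LSTM as the downstream classifier
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, fused_seq):                  # (B, T, d_fused): fused features over the entry stages
        _, (h, _) = self.lstm(fused_seq)
        return self.head(h[-1])                    # logits; softmax is applied by the loss / at inference

model = DepressionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)         # Adam optimizer, learning rate 0.001
loss_fn = nn.CrossEntropyLoss()                                    # cross-entropy between predicted and true labels
                                                                   # (applies the last-layer softmax internally)

def train_step(fused_seq: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(fused_seq), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, the classification prediction is torch.softmax(model(x), dim=-1).
```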
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The foregoing is merely an embodiment of the present invention. Common general knowledge, such as well-known specific structures and characteristics, is not described here in detail; a person skilled in the art, aware of the ordinary technical knowledge in this field as of the filing date or priority date, can combine it with the present teachings through routine experimentation to implement the invention, and typical known structures or methods pose no obstacle to such implementation. It should be noted that a person skilled in the art may make several variations and modifications without departing from the structure of the present invention; these also fall within the protection scope of the invention and do not affect its implementation or the utility of the patent. The scope of protection of this application shall be defined by the claims, and the description of the embodiments in the specification may be used to interpret the contents of the claims.

Claims (7)

1. A fusion analysis method for multi-modal depression data, characterized by comprising the following steps: performing multi-stage data entry for data of different data types; extracting emotional features from the entered data; passing the data features of each modality through three linear layers to obtain K-value, Q-value and V-value representations respectively; computing the attention A of each modality's data from K and Q according to a fused depression-data attention mechanism; and taking A·V as the fused feature to serve a downstream task, whereby, because the fused depression-data attention mechanism is applied, the fused data features contain multi-modal information and can assist the downstream classification task.
2. The fusion analysis method of multi-modal depression data according to claim 1, characterized by comprising the following steps:
s1, preprocessing data, namely dividing data components into text data, image data and audio data;
s2, integrating a depression data attention mechanism, and calculating preprocessed data to obtain features containing multi-modal information;
and S3, identifying the depression, namely splicing the characteristics containing the multi-modal information, outputting a fused data characteristic through a linear layer, and outputting a classification prediction result by using a softmax function as an activation function for the neuron of the last layer.
3. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: the text data in S1 comprise rating scales and electronic medical records, and the scale and electronic-record data undergo primary feature screening, missing-value processing, feature encoding and normalization.
4. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S1, frames are extracted from the video data at 20 frames per second; after denoising and artifact removal of the obtained image data, face positions are detected in each frame, the images are aligned according to eye positions, and the frames are cropped into 256 × 256-pixel face images.
5. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S1, after the audio data are aligned with the image set obtained by frame extraction, Mel-frequency cepstral coefficients are extracted from each aligned speech segment.
6. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S2, the K, Q and V values corresponding to the text data, the image data and the audio data are calculated respectively; the K and Q values are used to compute the auxiliary attention of the video, the audio and the text respectively; and the three auxiliary attentions are concatenated and passed through a Softmax function to form the attention over video, audio and text, which is multiplied by the V values calculated in the previous step.
7. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S3, the prediction result is obtained by using a cross-entropy loss function to fit the difference between the predicted value and the true value.
CN202211433256.2A 2022-11-16 2022-11-16 Fusion analysis method for multi-modal depression data Pending CN115732076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211433256.2A CN115732076A (en) 2022-11-16 2022-11-16 Fusion analysis method for multi-modal depression data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211433256.2A CN115732076A (en) 2022-11-16 2022-11-16 Fusion analysis method for multi-modal depression data

Publications (1)

Publication Number Publication Date
CN115732076A true CN115732076A (en) 2023-03-03

Family

ID=85296043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211433256.2A Pending CN115732076A (en) 2022-11-16 2022-11-16 Fusion analysis method for multi-modal depression data

Country Status (1)

Country Link
CN (1) CN115732076A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563920A (en) * 2023-05-06 2023-08-08 北京中科睿途科技有限公司 Method and device for identifying age in cabin environment based on multi-mode information
CN116563920B (en) * 2023-05-06 2023-10-13 北京中科睿途科技有限公司 Method and device for identifying age in cabin environment based on multi-mode information
CN116259407A (en) * 2023-05-16 2023-06-13 季华实验室 Disease diagnosis method, device, equipment and medium based on multi-mode data
CN118507036A (en) * 2024-07-17 2024-08-16 长春理工大学中山研究院 Emotion semantic multi-mode depression tendency recognition system

Similar Documents

Publication Publication Date Title
CN115732076A (en) Fusion analysis method for multi-modal depression data
Zhang et al. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
Muzammel et al. End-to-end multimodal clinical depression recognition using deep neural networks: A comparative analysis
CN111292765B (en) Bimodal emotion recognition method integrating multiple deep learning models
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN112101096A (en) Suicide emotion perception method based on multi-mode fusion of voice and micro-expression
WO2022227280A1 (en) Smart glasses-based disaster rescue triage and auxiliary diagnosis method
Ilias et al. Detecting dementia from speech and transcripts using transformers
CN111091044B (en) Network appointment-oriented in-vehicle dangerous scene identification method
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN114549946A (en) Cross-modal attention mechanism-based multi-modal personality identification method and system
Tuncer et al. Automatic voice based disease detection method using one dimensional local binary pattern feature extraction network
CN115169507A (en) Brain-like multi-mode emotion recognition network, recognition method and emotion robot
Xia et al. Audiovisual speech recognition: A review and forecast
CN112597841A (en) Emotion analysis method based on door mechanism multi-mode fusion
Renjith et al. Speech based emotion recognition in Tamil and Telugu using LPCC and hurst parameters—A comparitive study using KNN and ANN classifiers
Noroozi et al. Speech-based emotion recognition and next reaction prediction
Sasou Automatic identification of pathological voice quality based on the GRBAS categorization
Mocanu et al. Speech emotion recognition using GhostVLAD and sentiment metric learning
Nemani et al. Speaker independent VSR: A systematic review and futuristic applications
Memari et al. Speech analysis with deep learning to determine speech therapy for learning difficulties
Seddik et al. A computer-aided speech disorders correction system for Arabic language
Kumar et al. Can you hear me now? Clinical applications of audio recordings
CN111312215B (en) Natural voice emotion recognition method based on convolutional neural network and binaural characterization
Poorna et al. Bimodal emotion recognition using audio and facial features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination