CN115732076A - Fusion analysis method for multi-modal depression data - Google Patents
Fusion analysis method for multi-modal depression data
- Publication number
- CN115732076A (application CN202211433256.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- depression
- value
- attention
- image
- Prior art date
- 2022-11-16
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a fusion analysis method for multi-modal depression data in the field of depression data fusion, which aims to extract and fuse emotional features from the different types of data recorded over multiple stages. For each data type, features are extracted according to the characteristics of that data. The features of each modality are then passed through three linear layers to obtain K, Q and V representations; the attention A of each modality's data is computed from K and Q according to the fused depression-data attention mechanism, and A·V is taken as the fused feature serving downstream tasks. Compared with existing techniques that study depression through a single modality, this scheme is less susceptible to factors such as individual differences: it makes better use of the screening features of the case set with respect to a patient's individual differences, and then fuses the patient's images, actions and sounds according to these screening features to achieve a comprehensive diagnosis.
Description
Technical Field
The invention belongs to the field of multi-modal data fusion, and particularly relates to a multi-modal fusion analysis method applied to emotion recognition.
Background
Owing to its high incidence and serious harm, depression has become an internationally recognized public-health problem that threatens people's physical and mental health, and early recognition and early intervention are crucial to reducing its risk. Traditional depression diagnosis is performed by doctors based on clinical experience and rating scales; this approach relies mainly on single-modality data and suffers from subjective bias, delay, passivity and limited coverage. Jeffery et al. found that identifying depression with multimodal techniques outperforms single-modality identification.
Multimodal techniques process or fit data from multiple modalities simultaneously to enhance model performance. Data from different modalities differ both in representation and in meaning, and are therefore difficult to align and fuse. For example, in image and audio recognition tasks, image data is usually represented as pictures while language data is represented as characters, and the differing representations make the two hard to fuse; in gene-sequencing analysis, data produced by different sequencing methods are hard to fuse because their meanings differ.
Existing work has explored multimodal techniques extensively. Dupont, S. et al. used hidden Markov models combined with finite automata to align speech data with picture data and used the bimodal data to recognize speech and pictures. This method fuses data of different representation forms to some extent, but still suffers from low efficiency and poor generalizability. Another approach is to use neural networks for multi-data fusion. Zeng, X. et al. used a multi-modal autoencoder to fuse ten kinds of drug descriptions (e.g., side effects, pathways of action), took the disease type as input to match it to a drug type, segmented the disease's presenting symptoms, and increased or decreased the corresponding drug dosage according to how each symptom manifests in different individuals. This multimodal fusion method does not sufficiently account for the relationships between modalities and offers no way to fuse data of different representation forms. In summary, although there have been many attempts at multimodal techniques, no existing method fuses multimodal data well.
Disclosure of Invention
In order to solve the above problems, it is an object of the present invention to provide a fusion analysis method of multimodal depression data.
To achieve this purpose, the technical scheme of the invention is as follows: a fusion analysis method for multi-modal depression data records data of different types over multiple stages and extracts emotional features from the recorded data; the features of each modality are then passed through three linear layers to obtain K, Q and V representations; the attention A of each modality's data is computed from K and Q according to the fused depression-data attention mechanism, and A·V serves downstream tasks as the fused feature. Because the depression-data attention mechanism fuses the modalities, the fused data features contain multi-modal information and can assist downstream classification tasks.
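The patent states no explicit formulas. As a non-authoritative sketch, writing X_m for the extracted feature of modality m ∈ {t, i, a} (text, image, audio) and W_m^K, W_m^Q, W_m^V for the three per-modality linear layers, the computation can be read as follows; the dot-product form of the auxiliary attention score is an assumption, since the text does not specify how Q and K are paired:

```latex
K_m = X_m W_m^K, \quad Q_m = X_m W_m^Q, \quad V_m = X_m W_m^V, \qquad m \in \{t, i, a\}
\\[4pt]
\tilde{A}_m = Q_m K_m^{\top} \quad \text{(auxiliary attention of modality } m\text{)}
\\[4pt]
A = \operatorname{softmax}\bigl([\tilde{A}_t;\ \tilde{A}_i;\ \tilde{A}_a]\bigr), \qquad
F = A \cdot V \quad \text{(fused feature for the downstream task)}
```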
Further, the method comprises the following steps:
S1, data preprocessing: divide the data into text data, image data and audio data;
S2, fused depression-data attention mechanism: compute features containing multi-modal information from the preprocessed data;
S3, depression identification: concatenate the features containing multi-modal information, output a fused data feature through a linear layer, and use a softmax function as the activation function of the last layer's neurons to output the classification prediction result.
Further, the text data in S1 comprise rating scales and electronic case records, which undergo preliminary feature screening, missing-value processing, feature encoding and normalization.
Further, in S1 images are extracted from the video data at 20 frames per second; after denoising and artifact removal, face position detection is performed on each frame, the images are aligned by eye position, and each frame is then cropped to a 256 × 256-pixel face image.
Further, after the audio data in S1 are aligned with the image set obtained by frame extraction, Mel-frequency cepstral coefficients are extracted for each aligned speech segment.
Further, in S2 the K, Q and V values corresponding to the text data, image data and audio data are computed; the K and Q values are used to compute auxiliary attentions for the video, audio and text respectively; the three auxiliary attentions are concatenated and passed through a Softmax function to form the attentions of the video, audio and text, which are multiplied by the V values computed in the previous step.
Further, the prediction result in S3 is obtained by fitting the difference between the predicted value and the true value with a cross-entropy loss function.
The scheme provides the following beneficial effects: 1. Prior art that studies depression through a single modality is easily affected by factors such as individual differences; this scheme therefore uses the screening features of the case set with respect to a patient's individual differences, and then fuses the patient's images, actions and sounds according to these screening features to achieve a comprehensive diagnosis.
2. Compared with traditional concatenation-style data fusion, this scheme addresses three problems. The first is representation: representing and combining information from different modalities. The second is alignment: aligning information from different modalities and handling their possible dependencies. The last is conversion: bringing information from multiple modalities into a unified form.
Drawings
FIG. 1 is a multi-modal fusion perinatal depression assessment model framework;
FIG. 2 is a method of attention mechanism for fusing depression data.
Detailed Description
The following is further detailed by way of specific embodiments:
The embodiment is substantially as shown in figures 1 and 2: a fusion analysis method for multi-modal depression data records data of different types over multiple stages and extracts emotional features from the recorded data; the features of each modality are then passed through three linear layers to obtain K, Q and V representations; the attention A of each modality's data is computed from K and Q according to the fused depression-data attention mechanism, and A·V serves downstream tasks as the fused feature. Because the depression-data attention mechanism fuses the modalities, the fused data features contain multi-modal information and can assist downstream classification tasks.
The specific implementation process is as follows: the inputs to the invention are video, audio and text data. The method is divided into three main stages: data preprocessing, the integrated depression-data attention mechanism (IDDA), and depression identification. It comprises the following steps:
s1, preprocessing data, dividing data into text data, image data and audio data, wherein the text data comprises a scale and an electronic case, the scale and the electronic case history data are subjected to feature primary screening, missing value processing, feature coding and normalization, video data are subjected to image extraction according to the frequency of 20 frames per second, after the obtained image data are subjected to noise removal and artifact removal, face position detection is carried out on each frame of image, the image is aligned according to the eye position, then the video image is cut into face images with 256 multiplied by 256 pixels, and after the audio data are aligned with an image set obtained by frame extraction, mel frequency cepstrum coefficients are extracted for each aligned voice segment;
s2, a depression data attention mechanism is fused, preprocessed data are calculated, characteristics containing multi-modal information are obtained, and in order to better fuse the data, a new multi-modal data fusion mechanism (IDDA) is proposed. Firstly, 1, calculating the corresponding K value, Q value and V value of the three data respectively. Then, respectively calculating the auxiliary attention of the video, the audio and the text by using the K value and the Q value;
and splicing the three auxiliary attentions, forming attentions of video, audio and text through a Softmax function, and multiplying the attentions by the V value calculated in the previous step to obtain the characteristics containing multi-modal information.
The features containing multi-modal information are concatenated and passed through a linear layer to output a fused data feature, which serves as the input to the downstream task.
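To make the IDDA computation concrete, here is a minimal PyTorch sketch of one plausible reading of the mechanism. The patent gives no exact formulas, so the per-modality dot-product scoring, the √d scaling, and the layer sizes are assumptions for illustration; in particular, how the Q of one modality is paired with the K values is not specified in the text.

```python
# Minimal sketch of one reading of IDDA (assumptions noted above): per-modality K/Q/V
# projections, modality-level attention via softmax over the concatenated auxiliary
# scores, and A.V spliced and projected to a fused feature.
import torch
import torch.nn as nn

class IDDAFusion(nn.Module):
    def __init__(self, dims=(64, 512, 39), d=128, d_out=128):
        """dims: hypothetical feature sizes for (text, image, audio)."""
        super().__init__()
        self.k_proj = nn.ModuleList([nn.Linear(dm, d) for dm in dims])
        self.q_proj = nn.ModuleList([nn.Linear(dm, d) for dm in dims])
        self.v_proj = nn.ModuleList([nn.Linear(dm, d) for dm in dims])
        self.out = nn.Linear(3 * d, d_out)  # linear layer producing the fused feature
        self.d = d

    def forward(self, feats):
        """feats: list of three tensors of shape [batch, dims[m]] (text, image, audio)."""
        ks = [p(x) for p, x in zip(self.k_proj, feats)]
        qs = [p(x) for p, x in zip(self.q_proj, feats)]
        vs = [p(x) for p, x in zip(self.v_proj, feats)]
        # Auxiliary attention score of each modality from its own Q and K (assumed form).
        scores = torch.stack(
            [(q * k).sum(-1) / self.d ** 0.5 for q, k in zip(qs, ks)], dim=-1)
        attn = torch.softmax(scores, dim=-1)  # attention over the three modalities
        # A.V: weight each modality's V by its attention, then splice and project.
        weighted = [attn[:, i:i + 1] * v for i, v in enumerate(vs)]
        return self.out(torch.cat(weighted, dim=-1))  # fused multi-modal feature

# Example usage with a batch of 8:
# fusion = IDDAFusion()
# fused = fusion([torch.randn(8, 64), torch.randn(8, 512), torch.randn(8, 39)])
```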
S3, depression identification: concatenate the features containing multi-modal information, output a fused data feature through a linear layer, and select an LSTM as the classifier for the downstream task. The model is optimized with the Adam optimizer, a softmax function is used as the activation function of the last layer's neurons, and the classification prediction result is output. A cross-entropy loss function is used to fit the difference between the predicted value and the true value, and the learning rate of the model is 0.001.
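Below is a minimal PyTorch sketch of the S3 classifier and training setup as described: an LSTM over the sequence of fused features, a linear output layer, Adam at learning rate 0.001, and cross-entropy loss. The hidden size and number of classes are illustrative assumptions; nn.CrossEntropyLoss applies log-softmax internally, so an explicit softmax is only needed to read out prediction probabilities.

```python
# Sketch of the downstream classifier and one training step (sizes are assumptions).
import torch
import torch.nn as nn

class DepressionClassifier(nn.Module):
    def __init__(self, d_in=128, d_hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_hidden, batch_first=True)
        self.fc = nn.Linear(d_hidden, n_classes)

    def forward(self, fused_seq):
        """fused_seq: [batch, time, d_in] sequence of fused multi-modal features."""
        _, (h_n, _) = self.lstm(fused_seq)
        logits = self.fc(h_n[-1])
        # torch.softmax(logits, dim=-1) gives class probabilities for prediction.
        return logits

model = DepressionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate 0.001
criterion = nn.CrossEntropyLoss()                           # cross-entropy loss

def train_step(fused_seq, labels):
    optimizer.zero_grad()
    loss = criterion(model(fused_seq), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```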
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The foregoing is merely an embodiment of the present invention. Common general knowledge, such as well-known specific structures and characteristics, is not described here in detail; those skilled in the art, being aware of the prior art as of the filing or priority date and having the ability to apply routine experimentation, can combine one or more aspects of the present teachings to complete and implement the invention, and certain typical known structures or known methods should not become obstacles to its implementation by those skilled in the art. It should be noted that those skilled in the art can make several variations and modifications without departing from the structure of the invention; these should also be regarded as within the protection scope of the invention and will not affect the effect of its implementation or the utility of the patent. The scope of protection of this application is defined by the claims, and the description of the embodiments and other parts of the specification may be used to interpret the contents of the claims.
Claims (7)
1. A fusion analysis method for multi-modal depression data, characterized in that: data of different types are recorded over multiple stages and emotional features are extracted from the recorded data; the features of each modality are passed through three linear layers to obtain K, Q and V representations; the attention A of each modality's data is computed from K and Q according to the fused depression-data attention mechanism, and A·V serves downstream tasks as the fused feature; because the depression-data attention mechanism fuses the modalities, the fused data features contain multi-modal information and can assist downstream classification tasks.
2. The fusion analysis method of multi-modal depression data according to claim 1, characterized by comprising the following steps:
S1, data preprocessing: divide the data into text data, image data and audio data;
S2, fused depression-data attention mechanism: compute features containing multi-modal information from the preprocessed data;
S3, depression identification: concatenate the features containing multi-modal information, output a fused data feature through a linear layer, and use a softmax function as the activation function of the last layer's neurons to output the classification prediction result.
3. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: the text data in S1 comprise rating scales and electronic case records, which undergo preliminary feature screening, missing-value processing, feature encoding and normalization.
4. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S1 images are extracted from the video data at 20 frames per second; after denoising and artifact removal, face position detection is performed on each frame, the images are aligned by eye position, and each frame is then cropped to a 256 × 256-pixel face image.
5. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: after the audio data in S1 are aligned with the image set obtained by frame extraction, Mel-frequency cepstral coefficients are extracted for each aligned speech segment.
6. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S2 the K, Q and V values corresponding to the text data, image data and audio data are computed; the K and Q values are used to compute auxiliary attentions for the video, audio and text respectively; the three auxiliary attentions are concatenated and passed through a Softmax function to form the attentions of the video, audio and text, which are multiplied by the V values computed in the previous step.
7. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: the prediction result in S3 is obtained by fitting the difference between the predicted value and the true value with a cross-entropy loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211433256.2A CN115732076A (en) | 2022-11-16 | 2022-11-16 | Fusion analysis method for multi-modal depression data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211433256.2A CN115732076A (en) | 2022-11-16 | 2022-11-16 | Fusion analysis method for multi-modal depression data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115732076A (en) | 2023-03-03
Family
ID=85296043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211433256.2A (Pending) | Fusion analysis method for multi-modal depression data | 2022-11-16 | 2022-11-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115732076A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116563920A (en) * | 2023-05-06 | 2023-08-08 | 北京中科睿途科技有限公司 | Method and device for identifying age in cabin environment based on multi-mode information |
CN116563920B (en) * | 2023-05-06 | 2023-10-13 | 北京中科睿途科技有限公司 | Method and device for identifying age in cabin environment based on multi-mode information |
CN116259407A (en) * | 2023-05-16 | 2023-06-13 | 季华实验室 | Disease diagnosis method, device, equipment and medium based on multi-mode data |
CN118507036A (en) * | 2024-07-17 | 2024-08-16 | 长春理工大学中山研究院 | Emotion semantic multi-mode depression tendency recognition system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |