CN115732076A - Fusion analysis method for multi-modal depression data - Google Patents

Fusion analysis method for multi-modal depression data Download PDF

Info

Publication number
CN115732076A
Authority
CN
China
Prior art keywords
data
depression
value
attention
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211433256.2A
Other languages
Chinese (zh)
Inventor
张健
龚昊然
瞿星
蒋明丰
赵墨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202211433256.2A priority Critical patent/CN115732076A/en
Publication of CN115732076A publication Critical patent/CN115732076A/en
Pending legal-status Critical Current

Landscapes

  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention discloses a fusion analysis method for multi-modal depression data in the field of depression data fusion, aiming to extract and fuse the emotional characteristics of different types of data recorded over multiple stages. For each data type, data features are extracted according to the characteristics of that type. The data features of the different modalities are then passed through three linear layers to obtain K-value, Q-value and V-value representations; the attention A of each modality's data is computed from K and Q according to the fused depression-data attention mechanism, and A·V is taken as the fused feature to serve downstream tasks. Compared with existing techniques that study depression through a single modality, this scheme is less susceptible to factors such as individual differences: it exploits the screening characteristics of the case set with respect to a patient's individual differences, and then fuses the patient's images, movements and sounds according to those characteristics to achieve comprehensive diagnosis.

Description

Fusion analysis method for multi-modal depression data
Technical Field
The invention belongs to the field of multi-modal data fusion, and particularly relates to a multi-modal fusion analysis method applied to emotion recognition.
Background
Owing to its high incidence and serious harm, depression has become an internationally recognized public health problem that gravely threatens physical and mental health, and early recognition and early intervention are crucial to reducing its risk. Traditional depression diagnosis is performed by physicians based on clinical experience and rating scales; it relies mainly on single-modality data and suffers from subjective bias, delay, passivity and limited scope. Jeffery et al. found that multimodal techniques identify depression better than any single modality.
Multimodal techniques process or fit data from multiple modalities simultaneously to improve model performance. Data from different modalities differ in both representation form and meaning, which makes them difficult to align and fuse. For example, in image and speech recognition, image data is usually represented as pictures while language data is represented as characters, so the two are hard to fuse because of their different representation forms; in gene-sequencing analysis, data from different sequencing methods are hard to fuse because their meanings differ.
Existing work has explored multimodal techniques extensively. Dupont, S. et al. used hidden Markov models combined with finite automata to align speech data with image data and used the bimodal data for speech and image recognition. This approach fuses data of different representation forms to some extent, but remains inefficient and does not generalize well. Another approach is to use neural networks for multi-data fusion. Zeng, X. et al. fused ten drug descriptions (e.g., side effects, pathways of action) with a multimodal autoencoder, took the disease type as input to match diseases to drug types, segmented the onset symptoms of the disease, and adjusted the corresponding drug dosage according to how each symptom manifests in different individuals. This fusion method does not sufficiently account for the relationships between modalities and cannot fuse data of different representation forms. In summary, despite many attempts at multimodal techniques, there is still no method that fuses multimodal data well.
Disclosure of Invention
In order to solve the above problems, it is an object of the present invention to provide a fusion analysis method of multimodal depression data.
In order to achieve this purpose, the technical scheme of the invention is as follows: a fusion analysis method for multi-modal depression data that performs multi-stage entry of data of different types, extracts emotional features from the entered data, passes the data features of each modality through three linear layers to obtain K-value, Q-value and V-value representations, computes the attention A of each modality's data from K and Q according to the fused depression-data attention mechanism, and takes A·V as the fused feature to serve downstream tasks. Because the fused depression-data attention mechanism is applied, the fused data features contain multi-modal information and can assist downstream classification tasks.
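For concreteness, this fused-feature computation can be written in standard attention notation as below; the scaled dot-product form and the per-modality weight matrices are editorial assumptions, since the text only states that A is obtained from K and Q and that A·V is the fused feature.

```latex
K_m = X_m W_K^{(m)}, \quad Q_m = X_m W_Q^{(m)}, \quad V_m = X_m W_V^{(m)}, \qquad m \in \{\text{text},\ \text{image},\ \text{audio}\}
A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right), \qquad F_{\text{fused}} = A\,V
```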
Further, the method comprises the following steps:
s1, preprocessing data, namely dividing data components into text data, image data and audio data;
s2, integrating a depression data attention mechanism, and calculating preprocessed data to obtain features containing multi-modal information;
and S3, identifying the depression, namely splicing the characteristics containing the multi-modal information, outputting a fused data characteristic through a linear layer, and outputting a classification prediction result by using a softmax function as an activation function for the neuron of the last layer.
Further, the text data in S1 comprise rating scales and electronic medical records, and the scale and electronic-record data undergo primary feature screening, missing-value processing, feature encoding and normalization.
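A minimal sketch of this preprocessing, assuming pandas and scikit-learn; the column grouping, fill strategies, and the variance threshold used for primary screening are illustrative assumptions rather than details taken from the patent.

```python
# Hypothetical preprocessing of scale / electronic-record text data (assumptions noted above).
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

def preprocess_text_records(df: pd.DataFrame, categorical_cols, numeric_cols):
    # Missing-value processing: median for numeric items, mode for categorical items.
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

    # Feature encoding: one-hot encode categorical scale / record items.
    encoded = OneHotEncoder(sparse_output=False).fit_transform(df[categorical_cols])

    # Normalization: scale numeric items to [0, 1].
    scaled = MinMaxScaler().fit_transform(df[numeric_cols])

    # Primary feature screening: drop near-constant columns.
    features = np.hstack([scaled, encoded])
    return VarianceThreshold(threshold=1e-3).fit_transform(features)
```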
Further, in S1, frames are extracted from the video data at 20 frames per second; after denoising and artifact removal of the obtained image data, face positions are detected in each frame, the images are aligned according to eye positions, and the frames are cropped into 256 × 256-pixel face images.
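A sketch of the frame extraction and face cropping, assuming OpenCV's Haar cascade face detector; the detector choice, the denoising call, and the omission of the eye-based alignment step are simplifications made here for brevity, not choices specified by the patent.

```python
# Hypothetical video preprocessing sketch (library choices are assumptions; eye alignment omitted).
import cv2
import numpy as np

FACE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_frames(video_path: str, fps_out: int = 20, size: int = 256) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or float(fps_out)
    step = max(int(round(src_fps / fps_out)), 1)        # keep roughly 20 frames per second
    faces, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.fastNlMeansDenoisingColored(frame)       # denoising / artifact removal
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            boxes = FACE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)  # face position detection
            if len(boxes):
                x, y, w, h = boxes[0]
                faces.append(cv2.resize(frame[y:y + h, x:x + w], (size, size)))   # 256 x 256 face crop
        idx += 1
    cap.release()
    return np.stack(faces) if faces else np.empty((0, size, size, 3), dtype=np.uint8)
```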
Further, in S1, after the audio data are aligned with the image set obtained by frame extraction, Mel-frequency cepstral coefficients are extracted from each aligned speech segment.
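A sketch of this audio step, assuming librosa: the hop length is set to 1/20 s so that one MFCC vector lines up with each extracted video frame; the sampling rate and number of coefficients are assumptions.

```python
# Hypothetical frame-aligned MFCC extraction (parameter values are assumptions).
import librosa
import numpy as np

def frame_aligned_mfcc(audio_path: str, frame_rate: int = 20, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(audio_path, sr=16000)
    hop = sr // frame_rate                       # one analysis step per extracted video frame (1/20 s)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T                                # (num_frames, n_mfcc), aligned with the image set
```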
Further, in S2, the K, Q and V values corresponding to the text data, the image data and the audio data are calculated; the K and Q values are used to compute the auxiliary attention of the video, the audio and the text respectively; the three auxiliary attentions are concatenated and passed through a Softmax function to form the attention over video, audio and text, which is then multiplied by the V values calculated in the previous step.
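A minimal PyTorch sketch of this fused attention step is given below; the feature dimensions, the scaled dot-product form of the auxiliary attention, and the use of a single feature vector per modality are assumptions made for illustration, not details fixed by the patent.

```python
# Hypothetical sketch of the fused depression-data attention (IDDA); shapes and scaling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ("image", "audio", "text")

class IDDAFusion(nn.Module):
    def __init__(self, dims: dict, d_model: int = 128):
        super().__init__()
        # Three linear layers per modality produce the K, Q and V representations.
        self.proj = nn.ModuleDict({
            m: nn.ModuleDict({k: nn.Linear(dims[m], d_model) for k in ("K", "Q", "V")})
            for m in MODALITIES
        })
        self.out = nn.Linear(len(MODALITIES) * d_model, d_model)   # linear layer after concatenation

    def forward(self, feats):                       # feats: {"image": (B, d_img), ...}
        K = {m: self.proj[m]["K"](feats[m]) for m in MODALITIES}
        Q = {m: self.proj[m]["Q"](feats[m]) for m in MODALITIES}
        V = {m: self.proj[m]["V"](feats[m]) for m in MODALITIES}
        d = next(iter(K.values())).shape[-1]
        Vs = torch.stack([V[n] for n in MODALITIES], dim=1)        # (B, 3, d_model)

        fused = []
        for m in MODALITIES:
            # Auxiliary attention of modality m toward image, audio and text (from K and Q),
            # concatenated and passed through Softmax to form the attention A.
            aux = torch.stack([(Q[m] * K[n]).sum(-1) / d ** 0.5 for n in MODALITIES], dim=-1)
            A = F.softmax(aux, dim=-1)                             # (B, 3)
            fused.append(torch.einsum("bm,bmd->bd", A, Vs))        # A·V as the fused feature
        return self.out(torch.cat(fused, dim=-1))                  # feature containing multi-modal information
```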
Further, in S3, the prediction result is obtained by using a cross-entropy loss function to fit the difference between the predicted value and the true value.
Adopting this scheme yields the following beneficial effects: 1. Compared with prior art that studies depression through a single modality, which is easily affected by factors such as individual differences, this scheme exploits the screening characteristics of the case set with respect to a patient's individual differences and then fuses the patient's images, movements and sounds according to those characteristics to achieve comprehensive diagnosis.
2. Compared with the traditional concatenation-style data fusion, this scheme addresses three problems. The first is representation: representing and combining the information carried by the different modalities. The second is alignment: aligning the information of the different modalities and handling possible dependencies between them. The third is conversion: bringing the information of the multiple modalities into a unified form.
Drawings
FIG. 1 is a multi-modal fusion perinatal depression assessment model framework;
FIG. 2 is a schematic of the fused depression-data attention mechanism.
Detailed Description
The following is further detailed by way of specific embodiments:
The embodiment is substantially as shown in Figures 1 and 2: a fusion analysis method for multi-modal depression data that performs multi-stage entry of data of different types, extracts emotional features from the entered data, passes the data features of each modality through three linear layers to obtain K-value, Q-value and V-value representations, computes the attention A of each modality's data from K and Q according to the fused depression-data attention mechanism, and takes A·V as the fused feature to serve downstream tasks. Because the fused depression-data attention mechanism is applied, the fused data features contain multi-modal information and can assist downstream classification tasks.
The specific implementation process is as follows: the inputs to the invention are video, audio and text data. The method is divided into three main stages: data preprocessing, the integrated depression-data attention mechanism (IDDA), and depression recognition. It comprises the following steps:
s1, preprocessing data, dividing data into text data, image data and audio data, wherein the text data comprises a scale and an electronic case, the scale and the electronic case history data are subjected to feature primary screening, missing value processing, feature coding and normalization, video data are subjected to image extraction according to the frequency of 20 frames per second, after the obtained image data are subjected to noise removal and artifact removal, face position detection is carried out on each frame of image, the image is aligned according to the eye position, then the video image is cut into face images with 256 multiplied by 256 pixels, and after the audio data are aligned with an image set obtained by frame extraction, mel frequency cepstrum coefficients are extracted for each aligned voice segment;
s2, a depression data attention mechanism is fused, preprocessed data are calculated, characteristics containing multi-modal information are obtained, and in order to better fuse the data, a new multi-modal data fusion mechanism (IDDA) is proposed. Firstly, 1, calculating the corresponding K value, Q value and V value of the three data respectively. Then, respectively calculating the auxiliary attention of the video, the audio and the text by using the K value and the Q value;
and splicing the three auxiliary attentions, forming attentions of video, audio and text through a Softmax function, and multiplying the attentions by the V value calculated in the previous step to obtain the characteristics containing multi-modal information.
The features containing multi-modal information are concatenated and passed through a linear layer to output a fused data feature, which serves as the input of the downstream task.
S3, depression recognition: the features containing multi-modal information are concatenated, a fused data feature is output through a linear layer, and an LSTM is selected as the classifier for the downstream task. The model is optimized with the Adam optimizer, a softmax function is used as the activation of the last-layer neurons, and the classification prediction result is output. A cross-entropy loss function fits the difference between the predicted value and the true value, and the learning rate of the model is 0.001.
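A sketch of this downstream stage under the stated settings (LSTM classifier, Adam, learning rate 0.001, cross-entropy, softmax on the last layer), again in PyTorch; the hidden size, number of classes and sequence handling are assumptions.

```python
# Hypothetical downstream depression-recognition stage (sizes and sequence handling are assumptions).
import torch
import torch.nn as nn

class DepressionClassifier(nn.Module):
    def __init__(self, d_fused: int = 128, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(d_fused, hidden, batch_first=True)     # LSTM as the downstream classifier
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, fused_seq):                  # (B, T, d_fused): fused features over the entry stages
        _, (h, _) = self.lstm(fused_seq)
        return self.head(h[-1])                    # logits; softmax is applied by the loss / at inference

model = DepressionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)         # Adam optimizer, learning rate 0.001
loss_fn = nn.CrossEntropyLoss()                                    # cross-entropy between predicted and true labels
                                                                   # (applies the last-layer softmax internally)

def train_step(fused_seq: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(fused_seq), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, the classification prediction is torch.softmax(model(x), dim=-1).
```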
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The foregoing is merely an embodiment of the present invention. Common general knowledge, such as well-known specific structures and characteristics, is not described here in detail; a person skilled in the art, aware of the ordinary technical knowledge in this field as of the filing date or priority date, can combine it with the present teachings through routine experimentation to implement the invention, and typical known structures or methods pose no obstacle to such implementation. It should be noted that a person skilled in the art may make several variations and modifications without departing from the structure of the present invention; these also fall within the protection scope of the invention and do not affect its implementation or the utility of the patent. The scope of protection of this application shall be defined by the claims, and the description of the embodiments in the specification may be used to interpret the contents of the claims.

Claims (7)

1. A fusion analysis method for multi-modal depression data, characterized by comprising the following steps: performing multi-stage data entry for data of different data types; extracting emotional features from the entered data; passing the data features of each modality through three linear layers to obtain K-value, Q-value and V-value representations respectively; computing the attention A of each modality's data from K and Q according to a fused depression-data attention mechanism; and taking A·V as the fused feature to serve a downstream task, whereby, because the fused depression-data attention mechanism is applied, the fused data features contain multi-modal information and can assist the downstream classification task.
2. The fusion analysis method of multi-modal depression data according to claim 1, characterized by comprising the following steps:
s1, preprocessing data, namely dividing data components into text data, image data and audio data;
s2, integrating a depression data attention mechanism, and calculating preprocessed data to obtain features containing multi-modal information;
and S3, identifying the depression, namely splicing the characteristics containing the multi-modal information, outputting a fused data characteristic through a linear layer, and outputting a classification prediction result by using a softmax function as an activation function for the neuron of the last layer.
3. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: the text data in S1 comprise rating scales and electronic medical records, and the scale and electronic-record data undergo primary feature screening, missing-value processing, feature encoding and normalization.
4. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S1, frames are extracted from the video data at 20 frames per second; after denoising and artifact removal of the obtained image data, face positions are detected in each frame, the images are aligned according to eye positions, and the frames are cropped into 256 × 256-pixel face images.
5. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S1, after the audio data are aligned with the image set obtained by frame extraction, Mel-frequency cepstral coefficients are extracted from each aligned speech segment.
6. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S2, the K, Q and V values corresponding to the text data, the image data and the audio data are calculated respectively; the K and Q values are used to compute the auxiliary attention of the video, the audio and the text respectively; and the three auxiliary attentions are concatenated and passed through a Softmax function to form the attention over video, audio and text, which is multiplied by the V values calculated in the previous step.
7. The fusion analysis method of multi-modal depression data according to claim 2, characterized in that: in S3, the prediction result is obtained by using a cross-entropy loss function to fit the difference between the predicted value and the true value.
CN202211433256.2A 2022-11-16 2022-11-16 Fusion analysis method for multi-modal depression data Pending CN115732076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211433256.2A CN115732076A (en) 2022-11-16 2022-11-16 Fusion analysis method for multi-modal depression data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211433256.2A CN115732076A (en) 2022-11-16 2022-11-16 Fusion analysis method for multi-modal depression data

Publications (1)

Publication Number Publication Date
CN115732076A true CN115732076A (en) 2023-03-03

Family

ID=85296043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211433256.2A Pending CN115732076A (en) 2022-11-16 2022-11-16 Fusion analysis method for multi-modal depression data

Country Status (1)

Country Link
CN (1) CN115732076A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563920A (en) * 2023-05-06 2023-08-08 北京中科睿途科技有限公司 Method and device for identifying age in cabin environment based on multi-mode information
CN116563920B (en) * 2023-05-06 2023-10-13 北京中科睿途科技有限公司 Method and device for identifying age in cabin environment based on multi-mode information
CN116259407A (en) * 2023-05-16 2023-06-13 季华实验室 Disease diagnosis method, device, equipment and medium based on multi-mode data
CN118507036A (en) * 2024-07-17 2024-08-16 长春理工大学中山研究院 Emotion semantic multi-mode depression tendency recognition system

Similar Documents

Publication Publication Date Title
CN115732076A (en) Fusion analysis method for multi-modal depression data
Zhang et al. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
Muzammel et al. End-to-end multimodal clinical depression recognition using deep neural networks: A comparative analysis
CN111292765B (en) Bimodal emotion recognition method integrating multiple deep learning models
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN112101096A (en) Suicide emotion perception method based on multi-mode fusion of voice and micro-expression
WO2022227280A1 (en) Smart glasses-based disaster rescue triage and auxiliary diagnosis method
Ilias et al. Detecting dementia from speech and transcripts using transformers
CN111091044B (en) Network appointment-oriented in-vehicle dangerous scene identification method
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN114549946A (en) Cross-modal attention mechanism-based multi-modal personality identification method and system
Tuncer et al. Automatic voice based disease detection method using one dimensional local binary pattern feature extraction network
CN115169507A (en) Brain-like multi-mode emotion recognition network, recognition method and emotion robot
Xia et al. Audiovisual speech recognition: A review and forecast
CN112597841A (en) Emotion analysis method based on door mechanism multi-mode fusion
Renjith et al. Speech based emotion recognition in Tamil and Telugu using LPCC and hurst parameters—A comparitive study using KNN and ANN classifiers
Noroozi et al. Speech-based emotion recognition and next reaction prediction
Sasou Automatic identification of pathological voice quality based on the GRBAS categorization
Mocanu et al. Speech emotion recognition using GhostVLAD and sentiment metric learning
Nemani et al. Speaker independent VSR: A systematic review and futuristic applications
Memari et al. Speech analysis with deep learning to determine speech therapy for learning difficulties
Seddik et al. A computer-aided speech disorders correction system for Arabic language
Kumar et al. Can you hear me now? Clinical applications of audio recordings
CN111312215B (en) Natural voice emotion recognition method based on convolutional neural network and binaural characterization
Poorna et al. Bimodal emotion recognition using audio and facial features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination