WO2021104099A1 - Multimodal depression detection method and system employing context awareness - Google Patents


Info

Publication number
WO2021104099A1
Authority
WO
WIPO (PCT)
Prior art keywords
depression
text
acoustic
channel subsystem
context
Prior art date
Application number
PCT/CN2020/129214
Other languages
French (fr)
Chinese (zh)
Inventor
苏荣锋
王岚
燕楠
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2021104099A1 publication Critical patent/WO2021104099A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques for estimating an emotional state
    • G10L25/66 - Speech or voice analysis techniques for extracting parameters related to health condition

Definitions

  • The present invention relates to the technical field of depression detection, in particular to a multimodal depression detection method and system based on context awareness.
  • Deep learning is a newer field of machine learning that builds high-level abstractions of data by composing multiple layers of non-linear transformations; deep learning algorithms make raw data easier to adapt to learning and training in various directions.
  • One prior approach combines a CNN and an LSTM into a new deep network and then extracts acoustic features from the speech signal for depression detection.
  • Another applies semantic analysis to conversations between doctors and depressed patients, using techniques such as filled-pause extraction, Principal Components Analysis (PCA), and the whitening transform to obtain text features, which are combined with a linear Support Vector Regressor (SVR) classifier for depression classification.
  • The acoustic features used in the prior art are artificially defined 279-dimensional features, and the text features are 100-dimensional word-embedding vectors extracted with the Doc2Vec tool.
  • The existing technology mainly has the following problems: in terms of training data, most existing multimodal depression detection systems based on speech, text, or images are trained on limited depression data, so their performance is low;
  • existing feature extraction methods lack verbal information related to topic and context and are insufficiently expressive for depression detection, which limits the performance of the final system; in terms of depression classification modeling, the prior art does not consider the long-term dependence between speech and text features and the depression diagnosis; in terms of multimodal fusion, the prior art simply concatenates the subsystem outputs obtained from different modalities or channels and then makes a decision, ignoring the relative weight of each modality or channel, so performance is limited.
  • The purpose of the present invention is to overcome the above shortcomings of the prior art and provide a multimodal depression detection method and system based on context awareness.
  • A multimodal depression detection method based on context awareness includes the following steps:
  • Step S1: construct a training sample set that includes topic information, spectrograms, and corresponding text information;
  • Step S2: use a convolutional neural network combined with multi-task learning to perform acoustic feature extraction on the spectrograms of the training sample set, obtaining context-aware acoustic features;
  • Step S3: use the training sample set and a Transformer model to process word embeddings and extract context-aware text features;
  • Step S4: establish an acoustic-channel subsystem for depression detection from the context-aware acoustic features, and a text-channel subsystem for depression detection from the context-aware text features;
  • Step S5: fuse the outputs of the acoustic-channel subsystem and the text-channel subsystem to obtain depression classification information.
  • The context-aware acoustic features are obtained according to the following steps:
  • the convolutional neural network includes an input layer, multiple convolutional layers, multiple fully connected layers, an output layer, and a bottleneck layer located between the last fully connected layer and the output layer;
  • the bottleneck layer has fewer nodes than the convolutional and fully connected layers;
  • the output layer contains a depression classification task and a topic labeling task;
  • the context-aware acoustic features are extracted from the bottleneck layer of the convolutional neural network.
  • The context-aware text features are extracted according to the following steps:
  • the Transformer model includes multiple encoders and decoders with self-attention and a softmax layer as the last layer;
  • after training, the softmax layer is removed and the output of the Transformer model is used as the context-aware text features.
  • Step S5 includes:
  • fusing the outputs of the acoustic-channel subsystem and the text-channel subsystem to obtain a classification score for depression.
  • The classification score for depression is expressed as a weighted sum Score = Σ_i w_i S_i of the subsystem outputs,
  • where the weight w_i = [λ_1, λ_2, ..., λ_c] and c is the number of depression classes.
  • The acoustic-channel subsystem and the text-channel subsystem are established based on a BLSTM network; the network input of the acoustic-channel subsystem is the perceptual linear prediction coefficients of multiple consecutive frames together with the context-aware acoustic features,
  • and its output is a depression classification label;
  • the network input of the text-channel subsystem is text information,
  • and its output is a depression classification label.
  • The topic information in the training sample set includes multiple types of identifiers based on the content of conversations between doctors and depressed patients.
  • A multimodal depression detection system based on context awareness includes:
  • a training sample construction unit for constructing a training sample set that includes topic information, spectrograms, and corresponding text information;
  • an acoustic feature extraction unit for using a convolutional neural network combined with multi-task learning to extract acoustic features from the spectrograms of the training sample set and obtain context-aware acoustic features;
  • a text feature extraction unit for using the training sample set and a Transformer model to process word embeddings and extract context-aware text features;
  • a classification subsystem establishment unit for establishing an acoustic-channel subsystem for depression detection from the context-aware acoustic features and a text-channel subsystem for depression detection from the context-aware text features;
  • a classification fusion unit for fusing the outputs of the acoustic-channel subsystem and the text-channel subsystem to obtain depression classification information.
  • Compared with the prior art, the present invention has the advantage of using data augmentation to expand the depression speech and text training data according to the topic information in free conversations between doctors and depressed patients, and using this data for model training;
  • acquiring verbal information relevant to depression detection, including speaker-independent, depression-related, context-aware acoustic features and depression-related, context-aware text features;
  • establishing depression detection subsystems in the acoustic channel and the text channel;
  • and using a reinforcement learning method to obtain a multi-system fusion framework for robust multimodal automatic depression detection.
  • Fig. 1 is a general framework diagram of a multimodal depression detection method based on context awareness according to an embodiment of the present invention;
  • Fig. 2 is a flowchart of a multimodal depression detection method based on context awareness according to an embodiment of the present invention;
  • Fig. 3 is a schematic diagram of topic-based data augmentation;
  • Fig. 4 is a schematic diagram of the acoustic feature extraction process based on CNN and multi-task learning;
  • Fig. 5 is a schematic diagram of the text feature extraction process based on the multi-head self-attention mechanism;
  • Fig. 6 is a schematic diagram of reinforcement learning.
  • The overall technical solution includes: first, a topic-based data augmentation method is adopted to obtain more topic-related depression speech and text data; then a CNN network combined with multi-task learning extracts context-aware acoustic features from the spectrogram, and a Transformer processes word embeddings to obtain context-aware text features; next, the context-aware acoustic and text features are used to establish depression detection subsystems with BLSTM (bidirectional long short-term memory) models; finally, a reinforcement learning method makes a fusion decision on the output of each subsystem to obtain the final depression classification.
  • The multimodal depression detection method based on context awareness includes the following steps:
  • Step S210: obtain a context-aware training sample set.
  • The training sample set can be expanded from the original training set so that it contains context-awareness information;
  • the original data set usually includes only the correspondence between speech and text.
  • Topic labeling is performed on each pair of speech and text data in the existing training set. For example, the content of conversations between doctors and depressed patients is divided into 7 topics: whether they are interested in things, whether they sleep well, whether they feel depressed, whether they feel like a failure, self-evaluation, whether they have ever been diagnosed with depression, and whether their parents ever suffered from depression.
  • New training samples obtained in this way are spliced together with the original training samples to expand the original data set into a new training sample set.
  • In this step, by defining the content of multiple topics discussed between doctors and depressed patients and expanding the original training data set through random combination, a richer context-aware training sample set can be obtained, including topic information, spectrograms, text information, and corresponding classification labels, thereby improving the accuracy of subsequent training.
  • Step S220: extract context-aware acoustic features based on a CNN (Convolutional Neural Network) and multi-task learning.
  • The present invention combines multi-task learning with the CNN for classification network training.
  • The input of the CNN is the spectrogram of each training sample, and the CNN contains several convolutional layers and several fully connected layers.
  • In the convolutional layers, downsampling is performed using, for example, max pooling.
  • The embodiment of the present invention inserts a bottleneck layer, which contains only a few nodes, for example 39.
  • The output layer of the CNN contains two tasks.
  • The first task is depression classification, for example into categories such as mild, moderate, severe, and normal.
  • The second task is the labeling of different topics (topic identification).
  • The context-aware acoustic features are extracted from the bottleneck layer of the CNN and are concatenated with traditional acoustic features for subsequent classification network training.
  • In this step, a CNN and multi-task learning are used, where the first task is depression classification and the second task is the labeling of different topics; the output of the network bottleneck layer is used as the acoustic feature with topic-context-awareness characteristics.
  • Step S230: extract context-aware text features based on the multi-head self-attention mechanism.
  • A Transformer model based on the multi-head self-attention mechanism is used to analyze sentence semantics and thereby extract context-aware text features.
  • The input of the Transformer model is a traditional word embedding plus a topic ID (identifier), and its main structure consists of multiple encoders and decoders containing self-attention, the so-called multi-head mechanism.
  • Because the Transformer model allows direct connections between data units, it can take into account attention information from different positions and better capture long-term dependencies.
  • In the embodiment of the present invention, the Transformer parameters are first pre-trained on large-scale text corpora (such as Weibo and Wikipedia) using an unsupervised training method; then transfer learning is used to perform adaptive training on the collected depression text data.
  • After training, the last softmax layer in Fig. 5 is removed and the output is used as the text feature, namely the extracted context-aware text feature, which is used for subsequent depression detection model training.
  • The Transformer model can thus be used to extract robust text features.
  • In step S240, depression detection subsystems are established for the context-aware acoustic features and the context-aware text features respectively.
  • The embodiment of the present invention adopts a BLSTM-based method to establish the depression classification sub-networks (or subsystems).
  • A BLSTM can cache the current input and use it in the previous and next computations, implicitly incorporating temporal information into the model and thereby modeling long-term dependencies.
  • The BLSTM network adopted in the embodiment of the present invention has 3 BLSTM layers, each containing 128 nodes.
  • For the acoustic channel, the network input is 11 consecutive frames of PLP (perceptual linear prediction) coefficients together with the context-aware acoustic features, and the output is the depression classification label;
  • for the text channel, the network input is the context-aware text features of a training sample,
  • and the output is the depression classification label.
  • The BLSTM network is used to establish the depression classification models in order to capture the long-term dependence between acoustic or text features and the depression diagnosis.
  • Step S250: use reinforcement learning to fuse the outputs of the depression detection subsystems to obtain the final depression classification.
  • The embodiment of the present invention adopts a reinforcement learning mechanism that adjusts the weight of each subsystem to minimize the difference between the final depression prediction of the combined system and the feedback information.
  • The final depression score is expressed as Score = Σ_i w_i S_i.
  • The decision score function L_t of reinforcement learning at time t is defined as L_t = W(A_{t-1}) D - C, where
  • A_{t-1} represents the feedback at time t-1,
  • D represents the difference between the true and predicted results on the development set,
  • W represents the weights {w_i} of all subsystems,
  • and C represents the global accuracy on the development set. Therefore, L_t is summed over all times and maximized; the resulting W* gives the final subsystem weights, expressed as W* = arg max_W Σ_t L_t.
  • A hidden Markov model or other models can be used for the reinforcement learning.
  • The reinforcement learning method automatically adjusts the weights of the acoustic-channel subsystem score and the text-channel subsystem score so that they are organically fused for the final depression classification.
  • In practical applications, the trained network models can be applied to new data (including topics, speech, text, etc.) using a process similar to training in order to predict the depression classification.
  • Besides the BLSTM, other models containing temporal information can also be used.
  • The present invention also provides a multimodal depression detection system based on context awareness.
  • The system includes: a training sample construction unit for constructing a training sample set that includes topic information, spectrograms, and corresponding text information; an acoustic feature extraction unit for using a convolutional neural network combined with multi-task learning to extract acoustic features from the spectrograms of the training sample set and obtain context-aware acoustic features; a text feature extraction unit for using the training sample set and a Transformer model to process word embeddings and extract context-aware text features; a classification subsystem establishment unit for establishing an acoustic-channel subsystem for depression detection from the context-aware acoustic features and a text-channel subsystem for depression detection from the context-aware text features; and a classification fusion unit for fusing the outputs of the acoustic-channel subsystem and the text-channel subsystem to obtain depression classification information.
  • The present invention combines the information obtained from the acoustic channel and the text channel to achieve high-accuracy multimodal depression detection.
  • The main technical content includes: topic-related data augmentation, which, on the basis of limited depression speech and text data, uses the topic information in free conversations between doctors and depressed patients to expand the depression speech and text training data; robust analysis and extraction of depression-related features, which combines transfer learning and the multi-head self-attention mechanism to extract topic- and context-aware acoustic and text feature descriptions reflecting the characteristics of depressed patients, improving the accuracy of the detection system; a BLSTM-based depression classification model, which uses the strong sequence modeling capability of the BLSTM network to capture the long-term dependence between acoustic and text information and the depression diagnosis; and a multimodal fusion framework, which uses a reinforcement learning method to fuse the depression detection subsystems of the acoustic channel and the text channel.
  • The present invention has the following advantages:
  • existing depression detection methods use only limited depression speech and text data; in contrast, the present invention uses a topic-based data augmentation method to expand the original training data set;
  • the present invention uses a CNN with multi-task learning to extract acoustic features with topic-context-awareness characteristics, and uses a Transformer model to extract in-depth text features with topic-context awareness, which can improve the robustness of depression detection;
  • existing depression detection modeling does not consider the long-term dependence between speech and text features and the depression diagnosis; the present invention uses a BLSTM network to capture this long-term dependence, giving better performance;
  • existing multimodal depression detection simply concatenates the outputs of different subsystems for decision-making; the present invention uses a reinforcement learning method to automatically adjust the subsystem score weights of different channels and make the final classification decision, giving better performance.
  • The present invention may be a system, a method, and/or a computer program product.
  • The computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for enabling a processor to implement various aspects of the present invention.
  • The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device.
  • The computer-readable storage medium may include, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • More specific examples of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, and mechanical encoding devices such as punch cards with instructions stored thereon.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A multimodal depression detection method and system employing context awareness. The method comprises: constructing a training sample set comprising topic information, spectrograms, and corresponding text information; using a convolutional neural network in combination with multi-task learning to perform acoustic feature extraction on the spectrograms of the training sample set and obtain context-aware acoustic features; using a Transformer model to process word embeddings on the basis of the training sample set and extract context-aware textual features; establishing, from the context-aware acoustic features, an acoustic-channel subsystem for depression detection; establishing, from the context-aware textual features, a textual-channel subsystem for depression detection; and fusing the outputs of the acoustic-channel subsystem and the textual-channel subsystem to obtain depression classification information. The present invention can improve the accuracy of depression detection.

Description

A multimodal depression detection method and system based on context awareness

Technical Field

The present invention relates to the technical field of depression detection, and in particular to a multimodal depression detection method and system based on context awareness.

Background
In terms of feature extraction for depression, early speech-based research focused mainly on time-domain features such as pause time, recording duration, response time to questions, and speaking rate. It was later found that no single feature carries enough discriminative information to assist clinical diagnosis. With deeper study of speech signals, many additional signal features were constructed, and researchers tried various combinations of speech features in the hope of building a classification model for detecting depression. These features include pitch, energy, speaking rate, formants, and Mel-frequency cepstral coefficients (MFCC). Text is another kind of depression-related information "hidden" in the speech signal, and it is relatively easy to obtain from it. Studies have shown that depressed patients use significantly more negative-emotion words and anger words than healthy people. Word-frequency statistics are often used as a text feature representation; these are low-level text features. More recently, high-level text features have been preferred for describing the depressive state, namely word-embedding features, which are commonly obtained with network structures such as skip-gram or CBOW (continuous bag-of-words).

In terms of depression detection under limited speech and text data, large-scale collection of speech and text data from depressed patients is difficult, so the speech databases available for depression research are generally small. Researchers therefore generally use relatively simple classification models for depression detection. Traditional speech-based methods include the Support Vector Machine (SVM), decision trees, and the Gaussian Mixture Model (GMM). Deep learning is a newer field of machine learning that builds high-level abstractions of data by composing multiple layers of non-linear transformations, making raw data easier to adapt to learning and training in various directions. For example, a CNN and an LSTM can be combined into a new deep network that extracts acoustic features from the speech signal for depression detection. As another example, semantic analysis of conversations between doctors and depressed patients, using techniques such as filled-pause extraction, Principal Components Analysis (PCA), and the whitening transform, yields text features that are combined with a linear Support Vector Regressor (SVR) classifier for depression classification. In yet another example, separate LSTM layers first process the acoustic channel and the text channel, the resulting features are fed into a fully connected layer, and the depression category is output. The acoustic features used in the prior art are artificially defined 279-dimensional features, and the text features are 100-dimensional word-embedding vectors extracted with the Doc2Vec tool.

In the prior art, detection methods based on biochemical reagents or EEG are usually adopted, while technical solutions based on speech, text, or images mostly rely on speech data and perform depression detection on the basis of feature extraction and classification. In short, the existing technology has the following problems. Regarding training data, most existing multimodal depression detection systems based on speech, text, or images are trained on limited depression data, so their performance is low. Regarding feature extraction, existing methods lack verbal information related to topic and context and are insufficiently expressive for depression detection, which limits the performance of the final system. Regarding depression classification modeling, the prior art does not consider the long-term dependence between speech and text features and the depression diagnosis. Regarding multimodal fusion, the prior art simply concatenates the subsystem outputs obtained from different modalities or channels and makes a decision, ignoring the relative weight of each modality or channel, so performance is limited.
Summary of the Invention

The purpose of the present invention is to overcome the above shortcomings of the prior art and provide a multimodal depression detection method and system based on context awareness.

According to a first aspect of the present invention, a multimodal depression detection method based on context awareness is provided. The method includes the following steps:

Step S1: construct a training sample set that includes topic information, spectrograms, and corresponding text information;

Step S2: use a convolutional neural network combined with multi-task learning to perform acoustic feature extraction on the spectrograms of the training sample set, obtaining context-aware acoustic features;

Step S3: use the training sample set and a Transformer model to process word embeddings and extract context-aware text features;

Step S4: establish an acoustic-channel subsystem for depression detection from the context-aware acoustic features, and a text-channel subsystem for depression detection from the context-aware text features;

Step S5: fuse the outputs of the acoustic-channel subsystem and the text-channel subsystem to obtain depression classification information.
In one embodiment, the context-aware acoustic features are obtained according to the following steps:

construct a convolutional neural network that includes an input layer, multiple convolutional layers, multiple fully connected layers, an output layer, and a bottleneck layer located between the last fully connected layer and the output layer, the bottleneck layer having fewer nodes than the convolutional and fully connected layers;

input the spectrograms of the training sample set into the convolutional neural network, the output layer containing a depression classification task and a topic labeling task;

extract the context-aware acoustic features from the bottleneck layer of the convolutional neural network.

In one embodiment, the context-aware text features are extracted according to the following steps:

construct a Transformer model whose input is a word embedding plus a topic identifier, the model comprising multiple encoders and decoders with self-attention and a softmax layer as the last layer;

pre-train the Transformer model parameters on existing text corpora using an unsupervised training method, then use transfer learning to perform adaptive training on the collected depression text data;

after training is completed, remove the softmax layer and use the output of the Transformer model as the context-aware text features.

In one embodiment, step S5 includes:

using a reinforcement learning mechanism to adjust the weight of the acoustic-channel subsystem and the weight of the text-channel subsystem so that the difference between the final depression classification prediction and the feedback information is minimized;

fusing the outputs of the acoustic-channel subsystem and the text-channel subsystem to obtain a classification score for depression.
In one embodiment, the classification score for depression is expressed as:
Score = Σ_i w_i S_i  (1)
where the weight w_i = [λ_1, λ_2, ..., λ_c] and c is the number of depression classes.

In one embodiment, the acoustic-channel subsystem and the text-channel subsystem are established based on a BLSTM network; the network input of the acoustic-channel subsystem is the perceptual linear prediction coefficients of multiple consecutive frames together with the context-aware acoustic features, and its output is a depression classification label; the network input of the text-channel subsystem is text information, and its output is a depression classification label.

In one embodiment, the topic information in the training sample set includes multiple types of identifiers based on the content of conversations between doctors and depressed patients.
According to a second aspect of the present invention, a multimodal depression detection system based on context awareness is provided. The system includes:

a training sample construction unit for constructing a training sample set that includes topic information, spectrograms, and corresponding text information;

an acoustic feature extraction unit for using a convolutional neural network combined with multi-task learning to perform acoustic feature extraction on the spectrograms of the training sample set and obtain context-aware acoustic features;

a text feature extraction unit for using the training sample set and a Transformer model to process word embeddings and extract context-aware text features;

a classification subsystem establishment unit for establishing an acoustic-channel subsystem for depression detection from the context-aware acoustic features and a text-channel subsystem for depression detection from the context-aware text features;

a classification fusion unit for fusing the outputs of the acoustic-channel subsystem and the text-channel subsystem to obtain depression classification information.

Compared with the prior art, the advantages of the present invention are: using data augmentation to expand the depression speech and text training data according to the topic information in free conversations between doctors and depressed patients, and using this data for model training; acquiring verbal information relevant to depression detection, including speaker-independent, depression-related, context-aware acoustic features and depression-related, context-aware text features; considering the topic context information in free conversations between doctors and depressed patients, establishing depression detection subsystems in the acoustic channel and the text channel; and using a reinforcement learning method to obtain a multi-system fusion framework for robust multimodal automatic depression detection.
Description of the Drawings

The following drawings only schematically illustrate and explain the present invention and are not intended to limit its scope:

Fig. 1 is a general framework diagram of a multimodal depression detection method based on context awareness according to an embodiment of the present invention;

Fig. 2 is a flowchart of a multimodal depression detection method based on context awareness according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of topic-based data augmentation;

Fig. 4 is a schematic diagram of the acoustic feature extraction process based on CNN and multi-task learning;

Fig. 5 is a schematic diagram of the text feature extraction process based on the multi-head self-attention mechanism;

Fig. 6 is a schematic diagram of reinforcement learning.
Detailed Description

In order to make the objectives, technical solutions, design methods, and advantages of the present invention clearer, the present invention is further described in detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit it.

In all examples shown and discussed herein, any specific value should be construed as merely exemplary, not as a limitation; other examples of the exemplary embodiments may therefore have different values.

Technologies, methods, and equipment known to those of ordinary skill in the relevant fields may not be discussed in detail, but where appropriate they should be regarded as part of the specification.
To further understand the present invention, referring first to Fig. 1, the overall technical solution includes: first, a topic-based data augmentation method is adopted to obtain more topic-related depression speech and text data; then a CNN network combined with multi-task learning extracts context-aware acoustic features from the spectrogram, and a Transformer processes word embeddings to obtain context-aware text features; next, the context-aware acoustic features and context-aware text features are used to establish depression detection subsystems with BLSTM (bidirectional long short-term memory) models; finally, a reinforcement learning method makes a fusion decision on the outputs of the subsystems to obtain the final depression classification.

Specifically, referring to Fig. 2, the multimodal depression detection method based on context awareness according to an embodiment of the present invention includes the following steps:
Step S210: obtain a context-aware training sample set.

The training sample set can be expanded from the original training set so that it contains context-awareness information; the original data set usually includes only the correspondence between speech and text.

Specifically, topic labeling is first performed on each pair of speech and text data in the existing training set. For example, the content of conversations between doctors and depressed patients is divided into 7 topics: whether they are interested in things, whether they sleep well, whether they feel depressed, whether they feel like a failure, self-evaluation, whether they have ever been diagnosed with depression, and whether their parents ever suffered from depression.
Next, the original training set is expanded as follows (see the sketch after this list):

For the speech and text belonging to each subject in the training set, count the number of unique topics; if this number is greater than or equal to m, take the subject as a candidate for data augmentation, where m is the minimum required number of topics.

For each candidate subject, randomly select n speech-text data pairs belonging to that subject as a new combination.

For each new combination, randomly shuffle the order of the speech-text data pairs and use the result as a new training sample, as shown in Fig. 3.

New training samples obtained in this way are spliced together with the original training samples to expand the original data set into a new training sample set.
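As a concrete illustration, the augmentation loop described above might look like the following Python sketch. This is a minimal reading of the procedure, not the patented implementation; the data layout, the default values of m and n, and the number of new combinations drawn per subject are assumptions.

```python
import random

def topic_augment(pairs_by_subject, m=3, n=3, new_per_subject=5, seed=0):
    """Topic-based data augmentation (sketch).

    pairs_by_subject maps a subject ID to that subject's list of
    (speech, text, topic) tuples. A subject covering at least m distinct
    topics is an augmentation candidate; for each candidate, n of its
    pairs are drawn and their order shuffled to form one new sample.
    """
    rng = random.Random(seed)
    new_samples = []
    for subject, pairs in pairs_by_subject.items():
        topics = {topic for (_speech, _text, topic) in pairs}
        if len(topics) < m:
            continue  # too few distinct topics: not a candidate
        for _ in range(new_per_subject):
            combo = rng.sample(pairs, min(n, len(pairs)))
            rng.shuffle(combo)  # randomize the order of the selected pairs
            new_samples.append((subject, combo))
    return new_samples  # spliced onto the original training set downstream
```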
In this step, by defining the content of multiple topics discussed between doctors and depressed patients and expanding the original training data set through random combination, a richer context-aware training sample set can be obtained, including topic information, spectrograms, text information, and corresponding classification labels, thereby improving the accuracy of subsequent training.

Step S220: extract context-aware acoustic features based on CNN and multi-task learning.

In traditional methods, the acoustic features used (such as speaking rate, pitch, and pause duration) are designed based on domain-specific human knowledge. Because these traditional features are insufficiently expressive in the depression domain, they affect the accuracy of the final detection results. Biologically, human visual perception proceeds from low-level local perception to high-level global perception, and the Convolutional Neural Network (CNN) simulates exactly this process. In a CNN, after local weight sharing and a series of non-linear transformations, redundant and confusing information in the original visual input is removed and only the most discriminative information of each local region is retained. In other words, the features obtained by a CNN contain only the "common" description of different speakers, and individual information is discarded.

In order to make the final features contain information at different levels, the present invention combines multi-task learning with the CNN for classification network training. As shown in Fig. 4, the input of the CNN is the spectrogram of each training sample, and the CNN contains several convolutional layers and several fully connected layers. In the convolutional layers, downsampling is performed using, for example, max pooling. Between the last fully connected layer and the output layer, the embodiment of the present invention inserts a bottleneck layer, which contains only a few nodes, for example 39. The output layer of the CNN contains two tasks: the first is depression classification, for example into categories such as mild, moderate, severe, and normal; the second is the labeling of different topics (topic identification).

It should be noted that in the embodiment of the present invention, the context-aware acoustic features are extracted from the bottleneck layer of the CNN and are concatenated with traditional acoustic features for subsequent classification network training.

In this step, a CNN and multi-task learning are used, where the first task is depression classification and the second task is the labeling of different topics; the output of the network bottleneck layer is used as the acoustic feature with topic-context-awareness characteristics.
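To make the architecture concrete, here is a minimal PyTorch-style sketch of the multi-task CNN. The 39-node bottleneck and the two output heads (depression class and topic label) follow the description above; the number and sizes of the convolutional and fully connected layers, and the four-class/seven-topic head widths, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskCNN(nn.Module):
    """CNN with a small bottleneck layer and two task heads (sketch)."""

    def __init__(self, n_depression_classes=4, n_topics=7, bottleneck_dim=39):
        super().__init__()
        self.conv = nn.Sequential(              # several convolutional layers,
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )                                       # downsampled by max pooling
        self.fc = nn.Sequential(                # several fully connected layers
            nn.Flatten(), nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(256, bottleneck_dim)
        self.depression_head = nn.Linear(bottleneck_dim, n_depression_classes)
        self.topic_head = nn.Linear(bottleneck_dim, n_topics)

    def forward(self, spectrogram):             # (batch, 1, freq, time)
        z = self.bottleneck(self.fc(self.conv(spectrogram)))
        return self.depression_head(z), self.topic_head(z), z

# Training would minimize the sum of two cross-entropy losses (depression
# classification and topic labeling); at extraction time, z is read from the
# bottleneck and concatenated with traditional acoustic features.
```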
Step S230: extract context-aware text features based on the multi-head self-attention mechanism.

Traditional methods use word embeddings to describe a piece of text; however, this feature has difficulty capturing sentence meaning from a semantic perspective, and on some depression-related topics it seriously lacks the associated semantic-emotional representation. The self-attention mechanism mimics the internal process of biological observation and is good at capturing the internal correlations of data or features.

In the embodiment of the present invention, a Transformer model based on the multi-head self-attention mechanism is used to analyze sentence semantics and thereby extract context-aware text features. As shown in Fig. 5, the input of the Transformer model is a traditional word embedding plus a topic ID (identifier), and its main structure consists of multiple encoders and decoders containing self-attention, the so-called multi-head mechanism. Because the Transformer model allows direct connections between data units, it can take into account attention information from different positions and better capture long-term dependencies. In addition, in order to train the Transformer model sufficiently, in the embodiment of the present invention the Transformer parameters are first pre-trained on large-scale text corpora (such as Weibo and Wikipedia) using an unsupervised training method, and then transfer learning is used to perform adaptive training on the collected depression text data. After training, the last softmax layer in Fig. 5 is removed and the output is used as the text feature, namely the extracted context-aware text feature, which is used for subsequent depression detection model training.

In this step, combining word embeddings and topic context information as input, robust text features can be extracted with the Transformer model.
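The following sketch shows one way such a feature extractor could be set up in PyTorch. For brevity it uses an encoder-only Transformer (the patent describes encoders and decoders) and omits positional encoding; the vocabulary size, model dimensions, pooling method, and head width are assumptions. The removable softmax head follows the description above.

```python
import torch
import torch.nn as nn

class TopicAwareTextEncoder(nn.Module):
    """Transformer-based context-aware text feature extractor (sketch)."""

    def __init__(self, vocab_size=30000, n_topics=7, d_model=256,
                 n_heads=8, n_layers=4, n_classes=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.topic_emb = nn.Embedding(n_topics, d_model)  # topic ID input
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_classes)   # dropped after training

    def forward(self, token_ids, topic_id, return_features=False):
        # Add the topic embedding to every word embedding in the sentence.
        x = self.word_emb(token_ids) + self.topic_emb(topic_id).unsqueeze(1)
        h = self.encoder(x).mean(dim=1)  # pooled sentence representation
        if return_features:
            return h                     # the context-aware text feature
        return self.classifier(h)        # softmax head used only in training
```

In the workflow described above, this model would first be pre-trained on a large corpus, fine-tuned on the depression text data, and then queried with return_features=True.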
Step S240: establish depression detection subsystems for the context-aware acoustic features and the context-aware text features respectively.

The diagnosis of depression is usually not determined by one frame or one sentence at a particular moment, but by the combined information of many sentences over a long time, the so-called long-term dependence. To capture this long-term dependence, the embodiment of the present invention adopts a BLSTM-based method to establish the depression classification sub-networks (or subsystems). A BLSTM can cache the current input and use it in the previous and next computations, implicitly incorporating temporal information into the model and thereby modeling long-term dependencies. The BLSTM network adopted in the embodiment of the present invention has 3 BLSTM layers, each containing 128 nodes. For the acoustic channel, the network input is 11 consecutive frames of PLP (perceptual linear prediction) coefficients together with the context-aware acoustic features, and the output is the depression classification label; for the text channel, the network input is the context-aware text features of a training sample, and the output is the depression classification label.

In this step, the BLSTM network is used to establish the depression classification models in order to capture the long-term dependence between acoustic or text features and the depression diagnosis.
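A minimal sketch of one such subsystem follows. The 3 bidirectional LSTM layers with 128 nodes each follow the description; the PLP dimensionality per frame, the text feature dimension, and the use of the last time step for classification are assumptions.

```python
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    """Depression classification subsystem: 3 BLSTM layers x 128 nodes (sketch)."""

    def __init__(self, input_dim, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, 128, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * 128, n_classes)  # 2x for both directions

    def forward(self, x):          # x: (batch, time, input_dim)
        h, _ = self.blstm(x)
        return self.out(h[:, -1])  # classify from the final time step

# Acoustic channel: 11 stacked PLP frames (e.g. 13-dim each) plus the
# 39-dim bottleneck feature; text channel: one context-aware text feature
# vector per utterance (e.g. 256-dim). These dimensions are assumptions.
acoustic_net = BLSTMClassifier(input_dim=11 * 13 + 39)
text_net = BLSTMClassifier(input_dim=256)
```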
Step S250: use reinforcement learning to fuse the outputs of the depression detection subsystems to obtain the final depression classification.

For the multimodal information fusion strategy, the embodiment of the present invention adopts a reinforcement learning mechanism that adjusts the weight of each subsystem so that the difference between the final depression prediction of the combined system and the feedback information is minimized. The final depression score is expressed as:
Score = Σ_i w_i S_i  (1)
其中,权重w i=[λ 12,…,λ c],c为抑郁症的分类个数,S i对应子系统。而强化学习在t时刻的决策得分函数L t定义为: Wherein the weights w i = [λ 1, λ 2, ..., λ c], c is the number of classification depression, S i the corresponding subsystem. The decision score function L t of reinforcement learning at time t is defined as:
L t=W(A t-1)D-C(2) L t =W(A t-1 )DC(2)
其中A t-1表示在t-1时刻的反馈,D表示开发集中真实和预测结果的差异,W表示所有子系统的权重{w i},C表示在开发集上的全局准确率。因此,需要对所有时刻的L t求和并令其最大化,所得到的W *就是最终的子系统的权重,将其表示为: Where A t-1 represents the feedback at time t-1, D represents the difference between the actual and predicted results of the development set, W represents the weight of all subsystems {w i }, and C represents the global accuracy rate on the development set. Therefore, it is necessary to sum L t at all times and maximize it, and the obtained W * is the weight of the final subsystem, which is expressed as:
$$W^{*} = \arg\max_{W} \sum_{t} L_t \qquad (3)$$
In embodiments of the present invention, the reinforcement learning may employ a hidden Markov model or other models.
In this step, reinforcement learning is used to automatically adjust the weights of the acoustic-channel subsystem score and the text-channel subsystem score, so that the two are organically fused for the final depression classification.
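Because the concrete learner is left open here (a hidden Markov model or other models), the following NumPy sketch is only a simplified stand-in for equations (1) to (3): it fuses per-class subsystem scores with per-subsystem weights and searches for the weight matrix $W^{*}$ that maximizes accuracy on a development set (the $C$ term); the random search takes the place of the full reinforcement-learning update:

```python
import numpy as np

def fuse(weights, subsystem_scores):
    """Equation (1): weighted sum of subsystem scores per class.
    weights: (n_subsystems, c); subsystem_scores: (n_subsystems, n, c)."""
    return np.einsum('sc,snc->nc', weights, subsystem_scores)

def search_weights(subsystem_scores, dev_labels, iters=2000, seed=0):
    """Propose random weight matrices W and keep the one that maximizes
    development-set accuracy -- a crude surrogate for maximizing sum(L_t)."""
    rng = np.random.default_rng(seed)
    n_sub, _, c = subsystem_scores.shape
    best_w, best_acc = None, -1.0
    for _ in range(iters):
        w = rng.random((n_sub, c))
        w /= w.sum(axis=0, keepdims=True)   # normalize across subsystems
        acc = (fuse(w, subsystem_scores).argmax(axis=1) == dev_labels).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy usage: acoustic + text subsystems, 100 dev samples, c = 2 classes.
scores = np.random.rand(2, 100, 2)
labels = np.random.randint(0, 2, size=100)
w_star, dev_acc = search_weights(scores, labels)
```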
It should be understood that although a training process is described herein, in practical applications the trained network models can be applied to new data (including topics, speech, text, etc.) with a procedure analogous to training in order to predict the depression classification. In addition, models other than BLSTM that incorporate temporal information may also be used.
Correspondingly, the present invention further provides a context-aware multimodal depression detection system for implementing one or more aspects of the above method. For example, the system includes: a training sample construction unit for building a training sample set comprising topic information, spectrograms, and the corresponding text information; an acoustic feature extraction unit for extracting acoustic features from the spectrograms of the training sample set using a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features; a text feature extraction unit for processing word embeddings with a Transformer model on the training sample set to extract context-aware text features; a classification subsystem establishment unit for building an acoustic-channel subsystem for depression detection from the context-aware acoustic features and a text-channel subsystem for depression detection from the context-aware text features; and a classification fusion unit for fusing the outputs of the acoustic-channel subsystem and the text-channel subsystem to obtain the depression classification information.
In summary, the present invention combines the information obtained from the acoustic channel and the text channel to achieve high-accuracy multimodal depression detection. The main technical content includes: topic-related data augmentation, in which topic information from free conversations between doctors and depression patients is used to expand the limited depression speech-text training data; robust analysis and extraction of depression-related features, in which transfer learning and a multi-head self-attention mechanism are combined to extract acoustic and text feature descriptions that are topic-context-aware and reflect the characteristics of depression patients, improving the accuracy of the detection system; a BLSTM-based depression classification model, in which the strong temporal modeling capability of the BLSTM network captures the long-term dependencies between acoustic/text information and the diagnosis of depression; and a multimodal fusion framework, in which reinforcement learning fuses the depression detection subsystems of the acoustic channel and the text channel.
Compared with the prior art, the present invention has the following advantages:
1) Existing depression detection methods use only limited depression speech-text data; by contrast, the present invention expands the original training set with a topic-based data augmentation method.
2) Most of the prior art uses features that lack topic-context awareness; by contrast, the present invention uses a CNN with multi-task learning to extract topic-context-aware acoustic features, and a Transformer model to extract topic-context-aware text features. These deep feature descriptions improve the robustness of depression detection.
3) Existing depression detection modeling does not consider the long-term dependencies between speech/text features and the diagnosis of depression; by contrast, the present invention uses a BLSTM network to capture these long-term dependencies, yielding better performance.
4) Existing multimodal depression detection simply concatenates the outputs of different subsystems for decision-making; by contrast, the present invention uses reinforcement learning to automatically adjust the subsystem score weights across channels before making the final classification decision, yielding better performance.
It should be noted that although the steps are described above in a specific order, this does not mean that they must be executed in that order; in fact, some of the steps may be executed concurrently or in a different order, as long as the required functions are achieved.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present invention.
The computer-readable storage medium may be a tangible device that holds and stores instructions for use by an instruction execution device. The computer-readable storage medium may include, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific (non-exhaustive) examples of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
The embodiments of the present invention have been described above; the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

  1. A context-aware multimodal depression detection method, comprising the following steps:
    step S1: constructing a training sample set, the training sample set comprising topic information, spectrograms, and the corresponding text information;
    step S2: performing acoustic feature extraction on the spectrograms of the training sample set using a convolutional neural network combined with multi-task learning, to obtain context-aware acoustic features;
    step S3: using the training sample set, processing word embeddings with a Transformer model to extract context-aware text features;
    step S4: establishing an acoustic channel subsystem for depression detection from the context-aware acoustic features, and establishing a text channel subsystem for depression detection from the context-aware text features;
    step S5: fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
  2. The method according to claim 1, characterized in that the context-aware acoustic features are obtained according to the following steps:
    constructing a convolutional neural network comprising an input layer, a plurality of convolutional layers, a plurality of fully connected layers, an output layer, and a bottleneck layer located between the last fully connected layer and the output layer, the bottleneck layer having fewer nodes than the convolutional layers and the fully connected layers;
    inputting the spectrograms of the training sample set into the convolutional neural network, the output layer covering both a depression classification task and a topic labeling task;
    extracting the context-aware acoustic features from the bottleneck layer of the convolutional neural network.
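For illustration only, a minimal PyTorch sketch of such a multi-task bottleneck CNN follows; the layer counts, channel sizes, and the two-class/ten-topic heads are assumptions of the example rather than limitations of the claim, and the multi-task training loss would simply sum the cross-entropies of the two heads:

```python
import torch
import torch.nn as nn

class BottleneckCNN(nn.Module):
    """CNN over spectrograms with a low-dimensional bottleneck placed
    between the last fully connected layer and two output heads
    (depression classification and topic labeling)."""

    def __init__(self, num_classes=2, num_topics=10, bottleneck_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(512, bottleneck_dim)  # fewer nodes: the feature layer
        self.depression_head = nn.Linear(bottleneck_dim, num_classes)
        self.topic_head = nn.Linear(bottleneck_dim, num_topics)

    def forward(self, spectrogram):            # (batch, 1, freq, time)
        z = self.bottleneck(self.fc(self.conv(spectrogram)))
        return self.depression_head(z), self.topic_head(z), z  # z = feature

# Toy usage: after multi-task training, z is the context-aware acoustic feature.
dep_logits, topic_logits, feat = BottleneckCNN()(torch.randn(2, 1, 64, 64))
```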
  3. The method according to claim 1, characterized in that the context-aware text features are extracted according to the following steps:
    constructing a Transformer model that takes word embeddings plus a topic identifier as input, the Transformer model comprising a plurality of encoders and decoders with self-attention and a softmax layer as the last layer;
    pre-training the parameters of the Transformer model on an existing text corpus using an unsupervised training method, and then applying transfer learning to adaptively train on the collected depression text data;
    after training is complete, removing the softmax layer and taking the output of the Transformer model as the context-aware text features.
  4. The method according to claim 1, characterized in that step S5 comprises:
    adjusting the weight of the acoustic channel subsystem and the weight of the text channel subsystem by means of a reinforcement learning mechanism, so as to minimize the difference between the final depression classification prediction and the feedback information;
    fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain a depression classification score.
  5. The method according to claim 4, characterized in that the depression classification score is expressed as:
    $$\mathrm{Score} = \sum_{i} w_i \cdot S_i$$
    where the weights $w_i = [\lambda_1, \lambda_2, \ldots, \lambda_c]$ and $c$ is the number of depression classes.
  6. The method according to claim 1, characterized in that the acoustic channel subsystem and the text channel subsystem are established on the basis of a BLSTM network; the network input of the acoustic channel subsystem is a plurality of consecutive frames of perceptual linear prediction coefficients together with the context-aware acoustic features and its output is the depression class label, and the network input of the text channel subsystem is the text information and its output is the depression class label.
  7. The method according to claim 1, characterized in that the topic information in the training sample set comprises a plurality of type identifiers divided according to the content of conversations between doctors and depression patients.
  8. A context-aware multimodal depression detection system, comprising:
    a training sample construction unit for constructing a training sample set, the training sample set comprising topic information, spectrograms, and the corresponding text information;
    an acoustic feature extraction unit for performing acoustic feature extraction on the spectrograms of the training sample set using a convolutional neural network combined with multi-task learning, to obtain context-aware acoustic features;
    a text feature extraction unit for using the training sample set to process word embeddings with a Transformer model, extracting context-aware text features;
    a classification subsystem establishment unit for establishing an acoustic channel subsystem for depression detection from the context-aware acoustic features, and establishing a text channel subsystem for depression detection from the context-aware text features;
    a classification fusion unit for fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
  9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2020/129214 2019-11-29 2020-11-17 Multimodal depression detection method and system employing context awareness WO2021104099A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911198356.X 2019-11-29
CN201911198356.XA CN110728997B (en) 2019-11-29 2019-11-29 Multi-modal depression detection system based on context awareness

Publications (1)

Publication Number Publication Date
WO2021104099A1 true WO2021104099A1 (en) 2021-06-03

Family

ID=69225856

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129214 WO2021104099A1 (en) 2019-11-29 2020-11-17 Multimodal depression detection method and system employing context awareness

Country Status (2)

Country Link
CN (1) CN110728997B (en)
WO (1) WO2021104099A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728997B (en) * 2019-11-29 2022-03-22 中国科学院深圳先进技术研究院 Multi-modal depression detection system based on context awareness
CN111150372B (en) * 2020-02-13 2021-03-16 云南大学 Sleep stage staging system combining rapid representation learning and semantic learning
CN111329494B (en) * 2020-02-28 2022-10-28 首都医科大学 Depression reference data acquisition method and device
CN111581470B (en) * 2020-05-15 2023-04-28 上海乐言科技股份有限公司 Multi-mode fusion learning analysis method and system for scene matching of dialogue system
CN112006697B (en) * 2020-06-02 2022-11-01 东南大学 Voice signal-based gradient lifting decision tree depression degree recognition system
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system
CN113269277B (en) * 2020-07-27 2023-07-25 西北工业大学 Continuous dimension emotion recognition method based on transducer encoder and multi-head multi-mode attention
CN112966429A (en) * 2020-08-11 2021-06-15 中国矿业大学 Non-linear industrial process modeling method based on WGANs data enhancement
CN112631147B (en) * 2020-12-08 2023-05-02 国网四川省电力公司经济技术研究院 Intelligent power grid frequency estimation method and system oriented to impulse noise environment
CN112768070A (en) * 2021-01-06 2021-05-07 万佳安智慧生活技术(深圳)有限公司 Mental health evaluation method and system based on dialogue communication
CN112885334A (en) * 2021-01-18 2021-06-01 吾征智能技术(北京)有限公司 Disease recognition system, device, storage medium based on multi-modal features
CN112818892B (en) * 2021-02-10 2023-04-07 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN113012720B (en) * 2021-02-10 2023-06-16 杭州医典智能科技有限公司 Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN115346657B (en) * 2022-07-05 2023-07-28 深圳市镜象科技有限公司 Training method and device for improving identification effect of senile dementia by utilizing transfer learning
CN116843377A (en) * 2023-07-25 2023-10-03 河北鑫考科技股份有限公司 Consumption behavior prediction method, device, equipment and medium based on big data
CN116965817B (en) * 2023-07-28 2024-03-15 长江大学 EEG emotion recognition method based on one-dimensional convolution network and transducer
CN116978409A (en) * 2023-09-22 2023-10-31 苏州复变医疗科技有限公司 Depression state evaluation method, device, terminal and medium based on voice signal

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10204625B2 (en) * 2010-06-07 2019-02-12 Affectiva, Inc. Audio analysis learning using video data
EP3252769B8 (en) * 2016-06-03 2020-04-01 Sony Corporation Adding background sound to speech-containing audio data
WO2019017462A1 (en) * 2017-07-21 2019-01-24 日本電信電話株式会社 Satisfaction estimation model learning device, satisfaction estimation device, satisfaction estimation model learning method, satisfaction estimation method, and program
CN107316654A (en) * 2017-07-24 2017-11-03 湖南大学 Emotion identification method based on DIS NV features
GB2567826B (en) * 2017-10-24 2023-04-26 Cambridge Cognition Ltd System and method for assessing physiological state
CN108764010A (en) * 2018-03-23 2018-11-06 姜涵予 Emotional state determines method and device
WO2019225801A1 (en) * 2018-05-23 2019-11-28 한국과학기술원 Method and system for simultaneously recognizing emotion, age, and gender on basis of voice signal of user
CN109389992A (en) * 2018-10-18 2019-02-26 天津大学 A kind of speech-emotion recognition method based on amplitude and phase information
CN109841231B (en) * 2018-12-29 2020-09-04 深圳先进技术研究院 Early AD (AD) speech auxiliary screening system for Chinese mandarin
CN110047516A (en) * 2019-03-12 2019-07-23 天津大学 A kind of speech-emotion recognition method based on gender perception

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016028495A1 (en) * 2014-08-22 2016-02-25 Sri International Systems for speech-based assessment of a patient's state-of-mind
JP2018121749A (en) * 2017-01-30 2018-08-09 株式会社リコー Diagnostic apparatus, program, and diagnostic system
CN107133481A (en) * 2017-05-22 2017-09-05 西北工业大学 The estimation of multi-modal depression and sorting technique based on DCNN DNN and PV SVM
CN107657964A (en) * 2017-08-15 2018-02-02 西北大学 Depression aided detection method and grader based on acoustic feature and sparse mathematics
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks
CN110728997A (en) * 2019-11-29 2020-01-24 中国科学院深圳先进技术研究院 Multi-modal depression detection method and system based on context awareness

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220180056A1 (en) * 2020-12-09 2022-06-09 Here Global B.V. Method and apparatus for translation of a natural language query to a service execution language
CN113627377A (en) * 2021-08-18 2021-11-09 福州大学 Cognitive radio frequency spectrum sensing method and system Based on Attention-Based CNN
CN113822192A (en) * 2021-09-18 2021-12-21 山东大学 Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion
CN113822192B (en) * 2021-09-18 2023-06-30 山东大学 Method, equipment and medium for identifying emotion of on-press personnel based on multi-mode feature fusion of Transformer
CN114118200A (en) * 2021-09-24 2022-03-01 杭州电子科技大学 Multi-modal emotion classification method based on attention-guided bidirectional capsule network
CN113674767A (en) * 2021-10-09 2021-11-19 复旦大学 Depression state identification method based on multi-modal fusion
CN114464182A (en) * 2022-03-03 2022-05-10 慧言科技(天津)有限公司 Voice recognition fast self-adaption method assisted by audio scene classification
CN114464182B (en) * 2022-03-03 2022-10-21 慧言科技(天津)有限公司 Voice recognition fast self-adaption method assisted by audio scene classification
CN114973120A (en) * 2022-04-14 2022-08-30 山东大学 Behavior identification method and system based on multi-dimensional sensing data and monitoring video multi-mode heterogeneous fusion
CN114973120B (en) * 2022-04-14 2024-03-12 山东大学 Behavior recognition method and system based on multi-dimensional sensing data and monitoring video multimode heterogeneous fusion
CN115346561A (en) * 2022-08-15 2022-11-15 南京脑科医院 Method and system for estimating and predicting depression mood based on voice characteristics
CN115346561B (en) * 2022-08-15 2023-11-24 南京医科大学附属脑科医院 Depression emotion assessment and prediction method and system based on voice characteristics
CN115481681A (en) * 2022-09-09 2022-12-16 武汉中数医疗科技有限公司 Artificial intelligence-based breast sampling data processing method
CN115481681B (en) * 2022-09-09 2024-02-06 武汉中数医疗科技有限公司 Mammary gland sampling data processing method based on artificial intelligence
CN115969381A (en) * 2022-11-16 2023-04-18 西北工业大学 Electroencephalogram signal analysis method based on multi-band fusion and space-time Transformer
CN115969381B (en) * 2022-11-16 2024-04-30 西北工业大学 Electroencephalogram signal analysis method based on multi-band fusion and space-time transducer
CN117497140A (en) * 2023-10-09 2024-02-02 合肥工业大学 Multi-level depression state detection method based on fine granularity prompt learning
CN117497140B (en) * 2023-10-09 2024-05-31 合肥工业大学 Multi-level depression state detection method based on fine granularity prompt learning
CN117137488B (en) * 2023-10-27 2024-01-26 吉林大学 Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images
CN117137488A (en) * 2023-10-27 2023-12-01 吉林大学 Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images

Also Published As

Publication number Publication date
CN110728997A (en) 2020-01-24
CN110728997B (en) 2022-03-22

Similar Documents

Publication Publication Date Title
WO2021104099A1 (en) Multimodal depression detection method and system employing context awareness
Shou et al. Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis
Mirheidari et al. Detecting Signs of Dementia Using Word Vector Representations.
Schuller et al. Cross-corpus acoustic emotion recognition: Variances and strategies
Batliner et al. The automatic recognition of emotions in speech
Gu et al. Speech intention classification with multimodal deep learning
Atmaja et al. Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM
Wang et al. Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition.
Qin et al. An end-to-end approach to automatic speech assessment for Cantonese-speaking people with aphasia
Harati et al. Speech-based depression prediction using encoder-weight-only transfer learning and a large corpus
Saha et al. Emotion aided dialogue act classification for task-independent conversations in a multi-modal framework
Sechidis et al. A machine learning perspective on the emotional content of Parkinsonian speech
Zhang et al. Deep cross-corpus speech emotion recognition: Recent advances and perspectives
CN115640530A (en) Combined analysis method for dialogue sarcasm and emotion based on multi-task learning
CN116130092A (en) Method and device for training multi-language prediction model and predicting Alzheimer's disease
Prabhakaran et al. Detecting institutional dialog acts in police traffic stops
Özkanca et al. Multi-lingual depression-level assessment from conversational speech using acoustic and text features
Pérez-Espinosa et al. Using acoustic paralinguistic information to assess the interaction quality in speech-based systems for elderly users
Jia et al. A deep learning system for sentiment analysis of service calls
JP6992725B2 (en) Para-language information estimation device, para-language information estimation method, and program
Johar Paralinguistic profiling using speech recognition
Ryumina et al. Emotional speech recognition based on lip-reading
Akhtiamov et al. Gaze, prosody and semantics: relevance of various multimodal signals to addressee detection in human-human-computer conversations
Ohta et al. Response type selection for chat-like spoken dialog systems based on LSTM and multi-task learning
Meddeb et al. Content-based arabic speech similarity search and emotion detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20892740

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20892740

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 240123)

122 Ep: pct application non-entry in european phase

Ref document number: 20892740

Country of ref document: EP

Kind code of ref document: A1