WO2024040793A1 - Multi-modal emotion recognition method combining a hierarchical strategy - Google Patents

Multi-modal emotion recognition method combining a hierarchical strategy

Info

Publication number
WO2024040793A1
WO2024040793A1 PCT/CN2022/136487 CN2022136487W
Authority
WO
WIPO (PCT)
Prior art keywords
emotion
text
model
emotion recognition
speech
Prior art date
Application number
PCT/CN2022/136487
Other languages
English (en)
French (fr)
Inventor
刘波
孙芃
徐小龙
Original Assignee
天翼电子商务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天翼电子商务有限公司
Publication of WO2024040793A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present invention relates to the field of emotion recognition, and in particular to a multi-modal emotion recognition method combining hierarchical strategies.
  • the technical problem to be solved by the present invention is to overcome the shortcomings of the prior art and provide a multi-modal emotion recognition method combined with a hierarchical strategy; compared with single-text and single-speech emotion recognition, the effect of emotion recognition is further improved, and the hierarchical strategy is further applied so that easier-to-predict samples are inferred in the shallow model and harder-to-predict samples are placed in the deep model for inference, thereby improving the overall response speed of multi-modal emotion recognition while ensuring accuracy.
  • the present invention provides a multi-modal emotion recognition method combined with a hierarchical strategy, which includes the following steps:
  • the input of the multi-modal emotion recognition method combined with the hierarchical strategy is speech and the text corresponding to the speech;
  • the shallow model of the multi-modal emotion recognition method combined with hierarchical strategies consists of the speech emotion recognition model CNN and a text emotion recognition framework.
  • the text emotion recognition framework consists of high-frequency sentence matching, regular expression matching and a BiGRU-Attention model; the deep model is a multi-modal emotion recognition model, Transformer-based joint-encoding (TBJE);
  • the text emotion recognition framework is divided into high-frequency sentence matching, regular expression matching and a BiGRU-Attention model.
  • the BiGRU-Attention model is a bidirectional GRU model combined with an attention mechanism; the model is relatively small and its inference is fast; the GRU unit is updated as follows:
  • z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})
  • r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})
  • \tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1}))
  • h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
  • where z_t denotes the update gate, r_t denotes the reset gate, \sigma is the sigmoid activation function, x_t is the input at time t, h_{t-1} is the hidden state at time t-1, and h_t is the hidden state at time t;
  • the forward and backward hidden states are calculated for each text and concatenated to obtain the target text sequence H;
  • the attention weight coefficients are calculated as a = \mathrm{softmax}(W^{T} \tanh(H))
  • where H is the target text sequence, softmax is the normalized exponential function, a is the attention weight coefficient, and W^{T} is a trainable parameter
  • the attention weight coefficients are used to calculate the context sequence of the target text sequence as M = \tanh(H a^{T})
  • where H is the target text sequence and M is the context sequence
  • a hierarchical text emotion recognition framework is used.
  • high-frequency sentence matching is performed first; if the input text matches a high-frequency sentence, text_emotion is set to the emotion label corresponding to that sentence and the text emotion recognition process ends; otherwise the text is passed to the regular expression matching layer, and if the text successfully matches a regular expression, text_emotion is set to the emotion label corresponding to that expression and the process ends; otherwise the text is input into the BiGRU-Attention model, with corresponding thresholds set for the classification emotion labels of the BiGRU-Attention model, and if the probability of the emotion category predicted by the BiGRU-Attention model exceeds the threshold corresponding to that category, the variable text_emotion is set to that category, otherwise the value of the variable text_emotion is null;
  • Transformer-based joint-encoding (TBJE) is a multi-modal emotion recognition model whose input is speech and the text corresponding to the speech.
  • the text is passed through the Embedding layer and the LSTM layer to obtain text feature a.
  • the features extracted from the speech are input to a fully connected layer (Full Connected Layer) to obtain speech feature b.
  • text feature a and speech feature b are input simultaneously into the multi-layer Multimodal Transformer, which outputs two features; after these two features pass through the Flatten, Add and Norm layers, a feature c fusing speech and text is obtained, and feature c is input to the fully connected layer (Full Connected Layer) to obtain the emotion recognition result of the current round, which is output;
  • this multi-modal emotion recognition method combined with a hierarchical strategy performs inference on easier-to-predict samples at the shallow layer, and a relatively small speech emotion recognition model and a text emotion recognition framework are set up at the shallow layer; only when the emotion labels predicted by the two are the same is the emotion recognition result output directly, otherwise the harder-to-predict samples are input into the deeper model, ensuring the accuracy of the shallow model's emotion recognition.
  • the deep model is the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE); the harder-to-predict samples are input into this model to obtain and output the emotion recognition result; because most commonly used phrases or ordinary expressions can be predicted and output by the shallow model, the overall response speed of multi-modal emotion recognition is improved while ensuring accuracy.
  • the present invention proposes a multi-modal emotion recognition method combined with a hierarchical strategy.
  • this emotion recognition method combines speech features and text features and, compared with single-speech and single-text emotion recognition methods, further improves the accuracy of emotion recognition;
  • Multi-modal emotion recognition models are generally larger, which makes the model's reasoning and prediction speed slower, affecting the model's response efficiency and concurrency. Therefore, the present invention proposes a multi-modal emotion recognition method combined with a hierarchical strategy.
  • the easy-to-predict samples are inferred and predicted in the shallow model, and the more difficult-to-predict samples are inferred and predicted in the deep model, thereby improving the overall response speed of multi-modal emotion recognition while ensuring accuracy.
  • Figure 1 is an overall architecture diagram of the present invention
  • Figure 2 is a schematic diagram of the architecture of the speech emotion recognition model CNN of the present invention.
  • FIG. 3 is a schematic architectural diagram of the BiGRU-Attention model in the text emotion recognition framework of the present invention
  • Figure 4 is an overall architecture diagram of the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE) of the present invention.
  • the shallow model of the multi-modal emotion recognition method combined with the hierarchical strategy consists of a smaller speech emotion recognition model (such as a CNN) and a text emotion recognition framework, where the text emotion recognition framework is composed of high-frequency sentence matching, regular expression matching and a smaller model (such as BiGRU-Attention).
  • the speech emotion recognition model and text emotion recognition framework are relatively small and have fast reasoning speed.
  • its deep model is a multi-modal emotion recognition model (e.g. Transformer-based joint-encoding).
  • the input of this invention is speech and the text corresponding to the speech, where the speech and text are input at the same time.
  • high-frequency sentence matching is performed first; if the input text matches a high-frequency sentence in the high-frequency sentence library, text_emotion is set to the emotion label corresponding to that sentence and the text emotion recognition process ends; otherwise the text is passed to the regular expression matching layer, and if the text successfully matches a regular expression, text_emotion is set to the emotion label corresponding to that expression and the process ends; otherwise the text is input into a smaller model (such as BiGRU-Attention), with corresponding thresholds set for the model's classification emotion labels, and if the probability of the emotion category predicted by the model exceeds the threshold of the corresponding category, the variable text_emotion is set to that category, otherwise the value of the variable text_emotion is null.
  • the deep model is a multi-modal emotion recognition model (such as Transformer-based joint-encoding); the harder-to-predict samples are input into this model, and its emotion recognition results are obtained and output. Because most commonly used phrases or ordinary expressions can be predicted and output by the shallow model, the overall response speed of multi-modal emotion recognition is improved while ensuring accuracy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a multi-modal emotion recognition method combining a hierarchical strategy. The method combines speech features and text features and, compared with single-speech and single-text emotion recognition methods, further improves the accuracy of emotion recognition. Multi-modal emotion recognition models are generally large, which makes inference and prediction slow and affects the model's response efficiency and concurrency; the present invention therefore proposes a multi-modal emotion recognition method combining a hierarchical strategy in which samples that are easier to predict are inferred in a shallow model and samples that are harder to predict are placed in a deep model for inference, thereby improving the overall response speed of multi-modal emotion recognition while ensuring accuracy.

Description

A multi-modal emotion recognition method combining a hierarchical strategy
Technical Field
The present invention relates to the field of emotion recognition, and in particular to a multi-modal emotion recognition method combining a hierarchical strategy.
Background Art
Emotion, as a psychological state, influences human behaviour; a good emotional state helps people communicate better and work more efficiently. Monitoring and recognizing changes in emotion therefore plays an important role in human-computer dialogue and in person-to-person dialogue. Emotion recognition technology has been on the rise in recent years and is gradually being applied to scenarios such as customer-service dialogue and intelligent robots.
The most common form of emotion recognition at present is text emotion recognition, but text emotion recognition can only judge emotional changes from textual semantics and cannot incorporate acoustic information such as intonation and tone. Multi-modal emotion recognition can fuse text and speech features to further improve recognition performance, but current multi-modal emotion recognition models are generally large and slow to infer, which degrades the response speed and the concurrency of real services. In practical scenarios, users produce many common phrases or simple, ordinary expressions that a fairly simple model can recognize accurately; only more complex expressions require a large model.
Summary of the Invention
The technical problem to be solved by the present invention is to overcome the shortcomings of the prior art and provide a multi-modal emotion recognition method combining a hierarchical strategy. Compared with single-text and single-speech emotion recognition, it further improves recognition performance, and it further applies a hierarchical strategy: samples that are easier to predict are inferred in a shallow model, while samples that are harder to predict are inferred in a deep model, thereby improving the overall response speed of multi-modal emotion recognition while maintaining accuracy.
The present invention provides the following technical solution:
The present invention provides a multi-modal emotion recognition method combining a hierarchical strategy, comprising the following steps:
S1. The input of the multi-modal emotion recognition method combining the hierarchical strategy is speech and the text corresponding to that speech;
S2. The shallow model of the multi-modal emotion recognition method combining the hierarchical strategy consists of a CNN speech emotion recognition model and a text emotion recognition framework, where the text emotion recognition framework is composed of high-frequency sentence matching, regular expression matching and a BiGRU-Attention model; the deep model is a multi-modal emotion recognition model, Transformer-based joint-encoding (TBJE);
S3. The speech data are input into the CNN speech emotion recognition model for inference; this model is relatively small and its inference is fast;
S4. A threshold is set for each emotion label of the speech emotion recognition model. If the probability predicted by the speech emotion recognition model for an emotion category exceeds the threshold of that category, the variable audio_emotion is set to that category; otherwise the value of audio_emotion is null;
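For illustration only, a minimal Python sketch of the threshold gating in S4, assuming the CNN speech model already returns one probability per emotion category; the category names, threshold values and helper name are illustrative and not part of the patent:

```python
# Hedged sketch of S4: threshold-gated speech emotion label (illustrative names and values).
from typing import Dict, Optional

def gate_by_threshold(probs: Dict[str, float],
                      thresholds: Dict[str, float]) -> Optional[str]:
    """Return the top emotion label if its probability exceeds that label's threshold, else None (null)."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p > thresholds.get(label, 1.0) else None

# Example with the probabilities used later in the patent's worked example.
audio_probs = {"neutral": 0.21, "happy": 0.60, "angry": 0.19}
audio_emotion = gate_by_threshold(audio_probs, {"neutral": 0.5, "happy": 0.5, "angry": 0.5})  # -> "happy"
```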
S5. At the same time, the text data are input into a hierarchical text emotion recognition framework, which is divided into high-frequency sentence matching, regular expression matching and a BiGRU-Attention model. The BiGRU-Attention model is a bidirectional GRU model combined with an attention mechanism; the model is relatively small and its inference is fast. The GRU unit is updated as follows:
z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})
r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})
\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1}))
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
where z_t denotes the update gate, r_t denotes the reset gate, \sigma is the sigmoid activation function, x_t is the input at time t, h_{t-1} is the hidden state at time t-1, and h_t is the hidden state at time t;
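For illustration, a minimal sketch of a single GRU update step following the equations above; it uses the standard GRU form, omits bias terms as the formulas do, and is not the patent's implementation:

```python
import torch

def gru_step(x_t, h_prev, W_xz, W_hz, W_xr, W_hr, W_xh, W_hh):
    """One GRU update with update gate z_t, reset gate r_t, candidate state and new hidden state h_t."""
    z_t = torch.sigmoid(x_t @ W_xz + h_prev @ W_hz)            # update gate
    r_t = torch.sigmoid(x_t @ W_xr + h_prev @ W_hr)            # reset gate
    h_tilde = torch.tanh(x_t @ W_xh + (r_t * h_prev) @ W_hh)   # candidate hidden state
    return (1 - z_t) * h_prev + z_t * h_tilde                  # hidden state at time t
```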
The BiGRU structure is adopted: for each text, the forward and backward hidden states are computed and concatenated to obtain the target text sequence H;
The attention mechanism is then used to compute the attention weight coefficients, as follows:
a = \mathrm{softmax}(W^{T} \tanh(H))
where H is the target text sequence, softmax is the normalized exponential function, a is the attention weight coefficient, and W^{T} is a trainable parameter;
Further, the attention weight coefficients are used to compute the context sequence of the target text sequence:
M = \tanh(H a^{T})
where a is the attention weight coefficient, H is the target text sequence and M is the context sequence;
The context sequence M is fed into a fully connected layer (Full Connected Layer) and a softmax function to obtain the classification result;
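For illustration, a minimal PyTorch sketch of a BiGRU-Attention classifier of the kind described above (embedding, bidirectional GRU, attention pooling and a fully connected softmax layer). All dimensions and layer sizes are assumptions rather than values from the patent:

```python
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=64, num_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.w = nn.Linear(2 * hidden, 1, bias=False)    # plays the role of the parameter W^T
        self.fc = nn.Linear(2 * hidden, num_classes)     # fully connected classification layer

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        H, _ = self.bigru(self.emb(token_ids))           # target text sequence H: (batch, seq_len, 2*hidden)
        a = torch.softmax(self.w(torch.tanh(H)), dim=1)  # attention weight coefficients a
        M = torch.tanh((a * H).sum(dim=1))               # context representation M from the weighted sum
        return torch.softmax(self.fc(M), dim=-1)         # class probabilities
```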
The above is the inference process of the BiGRU-Attention model. The present invention adopts a hierarchical text emotion recognition framework. When a text is fed into this framework, high-frequency sentence matching is performed first. If the input text matches a high-frequency sentence in the high-frequency sentence library, text_emotion is set to the emotion label corresponding to that sentence and the text emotion recognition process ends; otherwise the text is passed to the regular expression matching layer. If the text successfully matches a regular expression, text_emotion is set to the emotion label corresponding to that expression and the process ends; otherwise the text is fed into the BiGRU-Attention model, for whose classification emotion labels corresponding thresholds are set. If the probability predicted by the BiGRU-Attention model for an emotion category exceeds the threshold of that category, the variable text_emotion is set to that category; otherwise the value of text_emotion is null;
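For illustration, a minimal sketch of the hierarchical text flow just described (high-frequency sentence lookup, then regular expressions, then the thresholded classifier). The lookup table, patterns and the `model_predict` callable are hypothetical placeholders:

```python
import re
from typing import Callable, Dict, Optional

def text_emotion_pipeline(text: str,
                          hf_sentences: Dict[str, str],      # high-frequency sentence -> emotion label
                          regex_rules: Dict[str, str],       # regular expression -> emotion label
                          model_predict: Callable[[str], Dict[str, float]],
                          thresholds: Dict[str, float]) -> Optional[str]:
    # Layer 1: high-frequency sentence matching.
    if text in hf_sentences:
        return hf_sentences[text]
    # Layer 2: regular expression matching.
    for pattern, label in regex_rules.items():
        if re.search(pattern, text):
            return label
    # Layer 3: BiGRU-Attention (or any small classifier) with per-label thresholds.
    probs = model_predict(text)
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p > thresholds.get(label, 1.0) else None  # None stands for the null value
```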
S6. The results of speech emotion recognition and text emotion recognition are compared, i.e. the emotion label values audio_emotion and text_emotion are compared. If the two values are the same, that emotion label is output as the final emotion recognition result and this round of prediction ends; if audio_emotion and text_emotion differ, or either audio_emotion or text_emotion is null, the speech and its corresponding text are fed into the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE);
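For illustration, a minimal sketch of the comparison in S6; `tbje_predict` is a hypothetical callable standing in for the deep TBJE model:

```python
def recognize_emotion(audio_emotion, text_emotion, speech, text, tbje_predict):
    """Output the shallow result only when both shallow labels agree; otherwise escalate to the deep model."""
    if audio_emotion is not None and audio_emotion == text_emotion:
        return audio_emotion                    # easy sample: answered by the shallow layer
    return tbje_predict(speech, text)           # harder sample: deferred to the deep TBJE model
```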
S7. Transformer-based joint-encoding (TBJE) is a multi-modal emotion recognition model whose input is speech and the text corresponding to that speech. The speech and the text are first fed into the TBJE model simultaneously. The text passes through an Embedding layer and an LSTM layer to obtain text feature a; features extracted from the speech are fed into a fully connected layer (Full Connected Layer) to obtain speech feature b. Text feature a and speech feature b are then fed simultaneously into a multi-layer Multimodal Transformer, which outputs two features; after these two features pass through the Flatten, Add and Norm layers, a feature c fusing speech and text is obtained, and feature c is fed into a fully connected layer (Full Connected Layer) to obtain the emotion recognition result of this round, which is output;
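For illustration, a minimal PyTorch sketch of a TBJE-style fusion step as described in S7 (Embedding + LSTM text branch, fully connected speech branch, a joint Transformer encoder, then pooling, Add and Norm, and a final fully connected layer). It is a simplified stand-in for the published TBJE architecture, and every dimension and layer size is an assumption:

```python
import torch
import torch.nn as nn

class TBJESketch(nn.Module):
    def __init__(self, vocab=10000, d=128, speech_dim=40, num_classes=3, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)                 # text branch -> feature a
        self.speech_fc = nn.Linear(speech_dim, d)                   # speech branch -> feature b
        enc = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.joint = nn.TransformerEncoder(enc, num_layers=layers)  # multi-layer joint encoder
        self.norm = nn.LayerNorm(d)
        self.out = nn.Linear(d, num_classes)                        # fully connected output layer

    def forward(self, token_ids, speech_frames):
        a, _ = self.lstm(self.emb(token_ids))                       # text feature a: (batch, T_text, d)
        b = self.speech_fc(speech_frames)                           # speech feature b: (batch, T_speech, d)
        z = self.joint(torch.cat([a, b], dim=1))                    # jointly encoded sequence
        za, zb = z[:, :a.size(1)], z[:, a.size(1):]                 # the two output features
        c = self.norm(za.mean(dim=1) + zb.mean(dim=1))              # pool ("Flatten"), Add and Norm -> feature c
        return self.out(c)                                          # emotion result for this round
```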
S8. This multi-modal emotion recognition method combining a hierarchical strategy performs inference on easier-to-predict samples at the shallow layer, where a relatively small speech emotion recognition model and a text emotion recognition framework are deployed. Only when the emotion labels predicted by the two are the same is the emotion recognition result output directly; otherwise the harder-to-predict sample is fed into the deeper model, which guarantees the accuracy of the shallow model's emotion recognition. The deep model is the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE); the harder-to-predict samples are fed into this model, and the emotion recognition result is obtained and output. Because most common phrases or ordinary expressions can be predicted and output by the shallow model, the overall response speed of multi-modal emotion recognition is improved while accuracy is guaranteed.
Compared with the prior art, the beneficial effects of the present invention are as follows:
1. The present invention proposes a multi-modal emotion recognition method combining a hierarchical strategy. This emotion recognition method combines speech features and text features and, compared with single-speech and single-text emotion recognition methods, further improves the accuracy of emotion recognition;
2. Multi-modal emotion recognition models are generally large, which makes inference and prediction slow and affects the model's response efficiency and concurrency. The present invention therefore proposes a multi-modal emotion recognition method combining a hierarchical strategy: samples that are easier to predict are inferred in the shallow model, while samples that are harder to predict are inferred in the deep model, thereby improving the overall response speed of multi-modal emotion recognition while maintaining accuracy.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the present invention and constitute a part of the specification; together with the embodiments of the present invention, they serve to explain the present invention and do not limit it. In the drawings:
Figure 1 is an overall architecture diagram of the present invention;
Figure 2 is a schematic architecture diagram of the CNN speech emotion recognition model of the present invention;
Figure 3 is a schematic architecture diagram of the BiGRU-Attention model in the text emotion recognition framework of the present invention;
Figure 4 is an overall architecture diagram of the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE) of the present invention.
Detailed Description of the Embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention and are not intended to limit it. Identical reference numerals in the drawings all refer to identical components.
Embodiment 1
As shown in Figures 1 to 4, the present invention provides a multi-modal emotion recognition method combining a hierarchical strategy, comprising the following steps:
S1. The input of the multi-modal emotion recognition method combining the hierarchical strategy is speech and the text corresponding to that speech;
S2. The shallow model of the multi-modal emotion recognition method combining the hierarchical strategy consists of a CNN speech emotion recognition model and a text emotion recognition framework, where the text emotion recognition framework is composed of high-frequency sentence matching, regular expression matching and a BiGRU-Attention model; the deep model is a multi-modal emotion recognition model, Transformer-based joint-encoding (TBJE);
S3. The speech data are input into the CNN speech emotion recognition model for inference; this model is relatively small and its inference is fast;
S4. A threshold is set for each emotion label of the speech emotion recognition model. If the probability predicted by the speech emotion recognition model for an emotion category exceeds the threshold of that category, the variable audio_emotion is set to that category; otherwise the value of audio_emotion is null;
S5. At the same time, the text data are input into a hierarchical text emotion recognition framework, which is divided into high-frequency sentence matching, regular expression matching and a BiGRU-Attention model. The BiGRU-Attention model is a bidirectional GRU model combined with an attention mechanism; the model is relatively small and its inference is fast. The GRU unit is updated as follows:
z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})
r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})
\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1}))
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
where z_t denotes the update gate, r_t denotes the reset gate, \sigma is the sigmoid activation function, x_t is the input at time t, h_{t-1} is the hidden state at time t-1, and h_t is the hidden state at time t;
The present invention adopts the BiGRU structure: for each text, the forward and backward hidden states are computed and concatenated to obtain the target text sequence H;
The attention mechanism is then used to compute the attention weight coefficients, as follows:
a = \mathrm{softmax}(W^{T} \tanh(H))
where H is the target text sequence, softmax is the normalized exponential function, a is the attention weight coefficient, and W^{T} is a trainable parameter.
Further, the attention weight coefficients are used to compute the context sequence of the target text sequence:
M = \tanh(H a^{T})
where a is the attention weight coefficient, H is the target text sequence and M is the context sequence.
The context sequence M is fed into a fully connected layer (Full Connected Layer) and a softmax function to obtain the classification result.
The above is the inference process of the BiGRU-Attention model. The present invention adopts a hierarchical text emotion recognition framework. When a text is fed into this framework, high-frequency sentence matching is performed first. If the input text matches a high-frequency sentence in the high-frequency sentence library, text_emotion is set to the emotion label corresponding to that sentence and the text emotion recognition process ends; otherwise the text is passed to the regular expression matching layer. If the text successfully matches a regular expression, text_emotion is set to the emotion label corresponding to that expression and the process ends; otherwise the text is fed into the BiGRU-Attention model, for whose classification emotion labels corresponding thresholds are set. If the probability predicted by the BiGRU-Attention model for an emotion category exceeds the threshold of that category, the variable text_emotion is set to that category; otherwise the value of text_emotion is null.
S6. The results of speech emotion recognition and text emotion recognition are compared, i.e. the emotion label values audio_emotion and text_emotion are compared. If the two values are the same, that emotion label is output as the final emotion recognition result and this round of prediction ends. If audio_emotion and text_emotion differ, or either audio_emotion or text_emotion is null, the speech and its corresponding text are fed into the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE).
S7. Transformer-based joint-encoding (TBJE) is a multi-modal emotion recognition model whose input is speech and the text corresponding to that speech. The speech and the text are first fed into the TBJE model simultaneously. The text passes through an Embedding layer and an LSTM layer to obtain text feature a; features extracted from the speech are fed into a fully connected layer (Full Connected Layer) to obtain speech feature b. Text feature a and speech feature b are then fed simultaneously into a multi-layer Multimodal Transformer, which outputs two features; after these two features pass through the Flatten, Add and Norm layers, a feature c fusing speech and text is obtained, and feature c is fed into a fully connected layer (Full Connected Layer) to obtain the emotion recognition result of this round, which is output.
S8. This multi-modal emotion recognition method combining a hierarchical strategy performs inference on easier-to-predict samples at the shallow layer, where a relatively small speech emotion recognition model and a text emotion recognition framework are deployed. Only when the emotion labels predicted by the two are the same is the emotion recognition result output directly; otherwise the harder-to-predict sample is fed into the deeper model, which guarantees the accuracy of the shallow model's emotion recognition. In the solution of the present invention, the deep model is the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE); the harder-to-predict samples are fed into this model, and the emotion recognition result is obtained and output. Because most common phrases or ordinary expressions can be predicted and output by the shallow model, the overall response speed of multi-modal emotion recognition is improved while accuracy is guaranteed.
Specifically, an example is as follows:
1. Suppose the emotion recognition scenario has three emotion categories: neutral, happy and angry.
2. Suppose the thresholds of the CNN speech emotion recognition model for the neutral, happy and angry categories are all 0.5.
3. Suppose the thresholds of the BiGRU-Attention model in the text emotion recognition framework for the neutral, happy and angry categories are all 0.5.
4. The input sample is speech and the text corresponding to that speech. The speech is fed into the CNN speech emotion recognition model. Suppose the speech emotion recognition model predicts probabilities of 0.21, 0.6 and 0.19 for the three categories neutral, happy and angry; because the probability 0.6 of the emotion label happy is greater than the threshold 0.5, audio_emotion = happy. Conversely, if the predicted probabilities of all three categories neutral, happy and angry are below 0.5, then audio_emotion = null.
5. The text is fed into the text emotion recognition framework. If the text matches a high-frequency sentence, text_emotion = the emotion category corresponding to that high-frequency sentence and the text emotion recognition process ends. If the text does not match a high-frequency sentence, it is passed to the regular expression matching layer; if it matches a regular expression, text_emotion = the emotion category corresponding to that regular expression and the text emotion recognition process ends. If the text does not match any regular expression, it is fed into the BiGRU-Attention model. Suppose this text matched successfully at neither the high-frequency sentence layer nor the regular expression layer, and the BiGRU-Attention model predicts probabilities of 0.05, 0.7 and 0.25 for the three categories neutral, happy and angry; because the probability of the emotion label happy is greater than the threshold 0.5, text_emotion = happy. Conversely, if the predicted probabilities of all three categories neutral, happy and angry are below 0.5, then text_emotion = null.
6. The emotion label values audio_emotion and text_emotion are compared. If the two values are equal, that emotion label value is output and this round of prediction ends. If the two values are not equal, or either audio_emotion or text_emotion is null, the speech and the text corresponding to that speech are fed into the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE) for inference, and its prediction is output as the result of this round of emotion recognition; a combined sketch of this example is given below.
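For illustration, the worked example above expressed as a short, self-contained Python sketch; the category names and probabilities follow the example, while the helper itself is hypothetical:

```python
thresholds = {"neutral": 0.5, "happy": 0.5, "angry": 0.5}

def gate(probs):
    """Top label if its probability beats that label's threshold, else None (null)."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p > thresholds[label] else None

audio_emotion = gate({"neutral": 0.21, "happy": 0.60, "angry": 0.19})  # -> "happy" (0.6 > 0.5)
text_emotion = gate({"neutral": 0.05, "happy": 0.70, "angry": 0.25})   # -> "happy" (0.7 > 0.5)

# The two labels agree, so the shallow layer outputs "happy" and the deep TBJE model is not called.
final = audio_emotion if audio_emotion is not None and audio_emotion == text_emotion else "defer to TBJE"
```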
The present invention has the following features:
1. Multi-modal emotion recognition models are generally large, which makes inference and prediction slow and affects the model's response efficiency and concurrency. The present invention therefore proposes a multi-modal emotion recognition method combining a hierarchical strategy: samples that are easier to predict are inferred in the shallow model, while samples that are harder to predict are inferred in the deep model, thereby improving the overall response speed of multi-modal emotion recognition while maintaining accuracy.
2. Specifically, the shallow model of the multi-modal emotion recognition method combining the hierarchical strategy consists of a relatively small speech emotion recognition model (e.g. a CNN) and a text emotion recognition framework, where the text emotion recognition framework is composed of high-frequency sentence matching, regular expression matching and a relatively small model (e.g. BiGRU-Attention). The speech emotion recognition model and the text emotion recognition framework are both relatively small and fast to infer. The deep model is a multi-modal emotion recognition model (e.g. Transformer-based joint-encoding).
3. The input of the invention is speech and the text corresponding to that speech, where the speech and the text are input at the same time. The speech is fed into the speech emotion recognition model, and a threshold is set for each emotion label of that model. If the probability predicted by the speech emotion recognition model for an emotion category exceeds the threshold of that category, the variable audio_emotion is set to that category; otherwise the value of audio_emotion is null.
4. When the text is fed into the text emotion recognition framework, high-frequency sentence matching is performed first. If the input text matches a high-frequency sentence in the high-frequency sentence library, text_emotion is set to the emotion label corresponding to that sentence and the text emotion recognition process ends; otherwise the text is passed to the regular expression matching layer. If the text successfully matches a regular expression, text_emotion is set to the emotion label corresponding to that expression and the process ends; otherwise the text is fed into a relatively small model (e.g. BiGRU-Attention), for whose classification emotion labels corresponding thresholds are set. If the probability predicted by that model for an emotion category exceeds the threshold of the corresponding category, the variable text_emotion is set to that category; otherwise the value of text_emotion is null.
5. When the emotion label value audio_emotion of speech emotion recognition in the shallow model equals the emotion label value text_emotion of text emotion recognition, the emotion recognition result is output directly; otherwise the harder-to-predict sample is fed into the deeper model, which guarantees the accuracy of this multi-modal emotion recognition method combining a hierarchical strategy. In the solution of the present invention, the deep model is a multi-modal emotion recognition model (e.g. Transformer-based joint-encoding); the harder-to-predict samples are fed into this model, and the emotion recognition result is obtained and output. Because most common phrases or ordinary expressions can be predicted and output by the shallow model, the overall response speed of multi-modal emotion recognition is improved while accuracy is guaranteed.
Finally, it should be noted that the above are only preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (1)

  1. A multi-modal emotion recognition method combining a hierarchical strategy, characterized in that it comprises the following steps:
    S1. The input of the multi-modal emotion recognition method combining the hierarchical strategy is speech and the text corresponding to that speech;
    S2. The shallow model of the multi-modal emotion recognition method combining the hierarchical strategy consists of a CNN speech emotion recognition model and a text emotion recognition framework, where the text emotion recognition framework is composed of high-frequency sentence matching, regular expression matching and a BiGRU-Attention model; the deep model is a multi-modal emotion recognition model, Transformer-based joint-encoding (TBJE);
    S3. The speech data are input into the CNN speech emotion recognition model for inference; this model is relatively small and its inference is fast;
    S4. A threshold is set for each emotion label of the speech emotion recognition model. If the probability predicted by the speech emotion recognition model for an emotion category exceeds the threshold of that category, the variable audio_emotion is set to that category; otherwise the value of audio_emotion is null;
    S5. At the same time, the text data are input into a hierarchical text emotion recognition framework, which is divided into high-frequency sentence matching, regular expression matching and a BiGRU-Attention model. The BiGRU-Attention model is a bidirectional GRU model combined with an attention mechanism; the model is relatively small and its inference is fast. The GRU unit is updated as follows:
    z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})
    r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})
    \tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1}))
    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
    where z_t denotes the update gate, r_t denotes the reset gate, \sigma is the sigmoid activation function, x_t is the input at time t, h_{t-1} is the hidden state at time t-1, and h_t is the hidden state at time t;
    The BiGRU structure is adopted: for each text, the forward and backward hidden states are computed and concatenated to obtain the target text sequence H;
    The attention mechanism is then used to compute the attention weight coefficients, as follows:
    a = \mathrm{softmax}(W^{T} \tanh(H))
    where H is the target text sequence, softmax is the normalized exponential function, a is the attention weight coefficient, and W^{T} is a trainable parameter;
    Further, the attention weight coefficients are used to compute the context sequence of the target text sequence:
    M = \tanh(H a^{T})
    where a is the attention weight coefficient, H is the target text sequence and M is the context sequence;
    The context sequence M is fed into a fully connected layer (Full Connected Layer) and a softmax function to obtain the classification result;
    The above is the inference process of the BiGRU-Attention model. The present invention adopts a hierarchical text emotion recognition framework. When a text is fed into this framework, high-frequency sentence matching is performed first. If the input text matches a high-frequency sentence in the high-frequency sentence library, text_emotion is set to the emotion label corresponding to that sentence and the text emotion recognition process ends; otherwise the text is passed to the regular expression matching layer. If the text successfully matches a regular expression, text_emotion is set to the emotion label corresponding to that expression and the process ends; otherwise the text is fed into the BiGRU-Attention model, for whose classification emotion labels corresponding thresholds are set. If the probability predicted by the BiGRU-Attention model for an emotion category exceeds the threshold of that category, the variable text_emotion is set to that category; otherwise the value of text_emotion is null;
    S6. The results of speech emotion recognition and text emotion recognition are compared, i.e. the emotion label values audio_emotion and text_emotion are compared. If the two values are the same, that emotion label is output as the final emotion recognition result and this round of prediction ends; if audio_emotion and text_emotion differ, or either audio_emotion or text_emotion is null, the speech and its corresponding text are fed into the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE);
    S7. Transformer-based joint-encoding (TBJE) is a multi-modal emotion recognition model whose input is speech and the text corresponding to that speech. The speech and the text are first fed into the TBJE model simultaneously. The text passes through an Embedding layer and an LSTM layer to obtain text feature a; features extracted from the speech are fed into a fully connected layer (Full Connected Layer) to obtain speech feature b. Text feature a and speech feature b are then fed simultaneously into a multi-layer Multimodal Transformer, which outputs two features; after these two features pass through the Flatten, Add and Norm layers, a feature c fusing speech and text is obtained, and feature c is fed into a fully connected layer (Full Connected Layer) to obtain the emotion recognition result of this round, which is output;
    S8. This multi-modal emotion recognition method combining a hierarchical strategy performs inference on easier-to-predict samples at the shallow layer, where a relatively small speech emotion recognition model and a text emotion recognition framework are deployed. Only when the emotion labels predicted by the two are the same is the emotion recognition result output directly; otherwise the harder-to-predict sample is fed into the deeper model, which guarantees the accuracy of the shallow model's emotion recognition. The deep model is the multi-modal emotion recognition model Transformer-based joint-encoding (TBJE); the harder-to-predict samples are fed into this model, and the emotion recognition result is obtained and output. Because most common phrases or ordinary expressions can be predicted and output by the shallow model, the overall response speed of multi-modal emotion recognition is improved while accuracy is guaranteed.
PCT/CN2022/136487 2022-08-26 2022-12-05 一种结合分层策略的多模态情绪识别方法 WO2024040793A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211037654.2 2022-08-26
CN202211037654.2A CN115641878A (zh) 2022-08-26 2022-08-26 一种结合分层策略的多模态情绪识别方法

Publications (1)

Publication Number Publication Date
WO2024040793A1 true WO2024040793A1 (zh) 2024-02-29

Family

ID=84939393

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/136487 WO2024040793A1 (zh) 2022-08-26 2022-12-05 一种结合分层策略的多模态情绪识别方法

Country Status (2)

Country Link
CN (1) CN115641878A (zh)
WO (1) WO2024040793A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117828537A (zh) * 2024-03-04 2024-04-05 北京建筑大学 一种基于cba模型的音乐情感识别方法和装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110364185A (zh) * 2019-07-05 2019-10-22 平安科技(深圳)有限公司 一种基于语音数据的情绪识别方法、终端设备及介质
WO2021068843A1 (zh) * 2019-10-08 2021-04-15 平安科技(深圳)有限公司 一种情绪识别方法及装置、电子设备和可读存储介质
WO2021174757A1 (zh) * 2020-03-03 2021-09-10 深圳壹账通智能科技有限公司 语音情绪识别方法、装置、电子设备及计算机可读存储介质
CN114120978A (zh) * 2021-11-29 2022-03-01 中国平安人寿保险股份有限公司 情绪识别模型训练、语音交互方法、装置、设备及介质
CN114882522A (zh) * 2022-04-01 2022-08-09 浙江西图盟数字科技有限公司 基于多模态融合的行为属性识别方法、装置及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110364185A (zh) * 2019-07-05 2019-10-22 平安科技(深圳)有限公司 一种基于语音数据的情绪识别方法、终端设备及介质
WO2021068843A1 (zh) * 2019-10-08 2021-04-15 平安科技(深圳)有限公司 一种情绪识别方法及装置、电子设备和可读存储介质
WO2021174757A1 (zh) * 2020-03-03 2021-09-10 深圳壹账通智能科技有限公司 语音情绪识别方法、装置、电子设备及计算机可读存储介质
CN114120978A (zh) * 2021-11-29 2022-03-01 中国平安人寿保险股份有限公司 情绪识别模型训练、语音交互方法、装置、设备及介质
CN114882522A (zh) * 2022-04-01 2022-08-09 浙江西图盟数字科技有限公司 基于多模态融合的行为属性识别方法、装置及存储介质

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WU LIANGQING, LIU QIYUAN, ZHANG DONG, WANG JIANCHENG, LI SHOUSHAN, ZHOU GUODONG: "Multimodal Emotion Recognition with Auxiliary Sentiment Information", BEIJING DAXUE XUEBAO ZERAN KEXUE BAN - ACTA SCIENTIARUMNATURALIUM UNIVERSITATIS PEKINENSIS, BEIJING DAXUE CHUBANSHE, BEIJING, CN, vol. 56, no. 1, 20 January 2020 (2020-01-20), CN , pages 75 - 81, XP093143266, ISSN: 0479-8023, DOI: 10.13209/j.0479-8023.2019.105 *
ZOU JIYUN, XU YUNFENG: "Emotion recognition neural network based on auxiliary modal supervised training", JOURNAL OF HEBEI UNIVERSITY OF SCIENCE AND TECHNOLOGY., vol. 41, no. 5, 1 October 2020 (2020-10-01), pages 424 - 432, XP093143267, DOI: 1008-1542(2020)05-0424-09 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117828537A (zh) * 2024-03-04 2024-04-05 北京建筑大学 一种基于cba模型的音乐情感识别方法和装置
CN117828537B (zh) * 2024-03-04 2024-05-17 北京建筑大学 一种基于cba模型的音乐情感识别方法和装置

Also Published As

Publication number Publication date
CN115641878A (zh) 2023-01-24

Similar Documents

Publication Publication Date Title
CN108717856B (zh) 一种基于多尺度深度卷积循环神经网络的语音情感识别方法
CN108255805B (zh) 舆情分析方法及装置、存储介质、电子设备
CN110321563B (zh) 基于混合监督模型的文本情感分析方法
CN114694076A (zh) 基于多任务学习与层叠跨模态融合的多模态情感分析方法
CN111191450B (zh) 语料清洗方法、语料录入设备及计算机可读存储介质
CN109299267B (zh) 一种文本对话的情绪识别与预测方法
CN113987179B (zh) 基于知识增强和回溯损失的对话情绪识别网络模型、构建方法、电子设备及存储介质
CN109214006A (zh) 图像增强的层次化语义表示的自然语言推理方法
CN113641822B (zh) 一种基于图神经网络的细粒度情感分类方法
WO2024040793A1 (zh) 一种结合分层策略的多模态情绪识别方法
CN111651973A (zh) 一种基于句法感知的文本匹配方法
CN113435211A (zh) 一种结合外部知识的文本隐式情感分析方法
CN115640530A (zh) 一种基于多任务学习的对话讽刺和情感联合分析方法
CN114528387A (zh) 基于对话流自举的深度学习对话策略模型构建方法和系统
CN111737467B (zh) 一种基于分段卷积神经网络的对象级情感分类方法
CN116108856B (zh) 基于长短回路认知与显隐情感交互的情感识别方法及系统
CN116795971A (zh) 一种基于生成式语言模型的人机对话场景构建系统
CN112257432A (zh) 一种自适应意图识别方法、装置及电子设备
Dave et al. Emotion Detection in Conversation Using Class Weights
Huang et al. Research on Man-Machine Conversation System Based on GRU seq2seq Model
Li et al. A joint multi-task learning framework for spoken language understanding
CN114662503B (zh) 一种基于lstm和语法距离的方面级情感分析方法
Rajan et al. Graph-Based Transfer Learning for Conversational Agents
US20230136527A1 (en) Intent detection
CN116628203A (zh) 基于动态互补图卷积网络的对话情感识别方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22956326

Country of ref document: EP

Kind code of ref document: A1