CN115033695A - Long-dialog emotion detection method and system based on common sense knowledge graph - Google Patents

Long-dialog emotion detection method and system based on common sense knowledge graph

Info

Publication number
CN115033695A
CN115033695A (application CN202210676215.XA)
Authority
CN
China
Prior art keywords
speech
utterance
attention
time
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210676215.XA
Other languages
Chinese (zh)
Inventor
聂为之
鲍玉茹
刘安安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210676215.XA priority Critical patent/CN115033695A/en
Publication of CN115033695A publication Critical patent/CN115033695A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of emotion analysis and discloses a long-dialog emotion detection method and system based on a common sense knowledge graph. A graph model g_t(V_t, E_t) is constructed from the utterance features at time t; the prior knowledge of the utterance is added to the graph model g_t(V_t, E_t) as new nodes to obtain an updated graph model g_t*(V_t, E_t); the updated graph model g_t*(V_t, E_t) is processed with a GNN model to obtain updated utterance features; the final utterance topic f_t at time t is determined from the first m utterance elements preceding the utterance element u_i at time t; and the long-dialog emotion label is determined from the updated utterance features and the final utterance topic f_t. By adding prior knowledge as new nodes to the graph model g_t(V_t, E_t), the invention exploits the role of prior knowledge in dialogue emotion recognition and the relationships between speakers, integrates the influence of self-emotion and mutual emotion, and recognizes the emotion of a dialogue quickly and effectively.

Description

Long-dialog emotion detection method and system based on common sense knowledge graph
Technical Field
The invention relates to the technical field of emotion analysis, and in particular to a long-dialog emotion detection method and system based on a common sense knowledge graph.
Background
In recent years, with the rapid development of computer technology and social networks, conversation robots have gradually replaced human workers in many application fields, such as healthcare, early education, and court proceedings. To ensure that a conversation robot communicates coherently, not only the setting of a conversation but also the emotion of the human participants at each stage of the conversation needs to be assessed, so research on dialogue emotion detection has become increasingly important.
For a conversation, capturing human emotion within its rapidly changing context is a central problem in dialogue emotion detection. Classic methods such as LSTMs and RNNs can extract only part of the context information and cannot achieve fast and effective recognition. Li et al. propose a bidirectional recurrent unit for dialogue emotion analysis, which includes a generalized neural tensor block for obtaining context information and a two-channel classifier for emotion classification. However, a conversation often involves multiple speakers, each affected both by their own emotion and by the emotions of other speakers; the above model mainly considers temporal influence, ignores the relationship information between speakers, and does not integrate the influences of self-emotion and mutual emotion. Hazarika et al. propose a deep neural network called the Conversational Memory Network (CMN) and discuss these two factors, self-emotion and mutual emotion, in depth: each speaker's past utterances are modeled as memories, these memories are merged using attention-based hops, and context and sequence information is propagated using long short-term memory networks (LSTM) and gated recurrent units (GRU). However, the extracted context information remains limited and the practical performance is poor.
In the prior art of dialogue emotion detection, the extracted context information is limited and the emotion of a dialogue cannot be identified quickly and effectively; the role of prior knowledge in dialogue emotion recognition and the relationship information between speakers are generally ignored, and the influences of self-emotion and mutual emotion are not integrated.
Disclosure of Invention
The invention aims to provide a long-dialog emotion detection method and system based on a common sense knowledge graph, which can quickly identify the emotion of a dialogue by combining prior knowledge and can integrate the influence of self-emotion and mutual emotion.
In order to achieve the purpose, the invention provides the following scheme:
a long-conversation emotion detection method of a common sense knowledge graph, comprising:
constructing a graph model g_t(V_t, E_t) based on the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, with i ∈ 1…n and j ∈ 1…n, and the graph model g_t(V_t, E_t) represents the dialogue data at time t;
extracting the prior knowledge of the utterance at time t;
adding the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain an updated graph model g_t*(V_t, E_t);
updating the graph model g_t*(V_t, E_t) with a GNN model to obtain updated utterance features;
determining the final utterance topic f_t at time t based on the first m utterance elements preceding the utterance element u_i at time t, wherein m = 1…i-1 and i = 1…n;
determining a long-dialog emotion label based on the updated utterance features and the final utterance topic f_t at time t; the long-dialog emotion labels comprise: happiness, frustration, surprise, neutral, fear, disgust, and anger.
Optionally, extracting the prior knowledge of the utterance at time t specifically comprises:
obtaining the retrieval results of each utterance element u_i, the retrieval results being the top k results in triple format;
obtaining the retrieval result set of the utterance at time t based on the retrieval results of the utterance elements u_i;
merging each retrieval result in the set based on the retrieval result set to obtain the prior knowledge of the utterance at time t.
Optionally, determining the final utterance topic f_t at time t based on the first m utterance elements preceding the utterance element u_i at time t specifically comprises:
obtaining the first m utterance elements preceding the utterance element u_i at time t;
converting the first m utterance elements into m utterance features;
inputting the first m utterance elements of 3×N different utterance elements u_i at time t into a self-attention layer to obtain N sample utterance topics f_t, N positive-sample utterance topics f_p, and N negative-sample utterance topics f_n, wherein consecutive topic extractions are separated by a time interval of at least T_interval, and the N sample utterance topics f_t, N positive-sample utterance topics f_p, and N negative-sample utterance topics f_n form triplets (f_t, f_p, f_n);
constructing a loss function based on the triplet (f_t, f_p, f_n), wherein sim(f_t, f_p) is the dot product, i.e. cosine similarity, between topics f_t and f_p, sim(f_t, f_n) is the dot product, i.e. cosine similarity, between topics f_t and f_n, f_t is a sample utterance topic, f_p is the positive-sample utterance topic of f_t, f_n is the negative-sample utterance topic of f_t, N is the number of training samples, n ∈ 1…N, and α is a model parameter;
determining the final utterance topic f_t at time t based on the loss function.
Optionally, determining the long-dialog emotion label based on the updated utterance features and the final utterance topic f_t at time t specifically comprises:
inputting the updated utterance features and the final utterance topic f_t at time t into a transformer decoder structure to obtain final utterance features, wherein l is the maximum number of layers in the transformer decoder structure;
inputting the final utterance features into a softmax function to obtain the long-dialog emotion label.
Optionally, one layer of the transformer decoder structure comprises:
a self-attention module for fusing and updating the utterance features to obtain first utterance features, wherein k = 1…l;
a cross-attention module connected with the self-attention module for fusing and updating the first utterance features to obtain second utterance features, the cross-attention module being further configured to fuse and update the second utterance features to obtain third utterance features, wherein k = 1…l;
a feed-forward neural network module connected with the cross-attention module for updating the third utterance features to obtain updated fourth utterance features.
Optionally, the self-attention module fuses and updates the utterance features through a multi-head self-attention computation, in which the k-th layer self-attention module learns a weight parameter for the attention Key and a first weight parameter for the attention Value; MultiHead is the multi-head self-attention function, Attention is the self-attention function, softmax is the classification function, and the computation also uses the variance of the corresponding projected features.
Optionally, the cross-attention module fuses and updates the first utterance features through a multi-head attention computation in which the k-th layer cross-attention module learns a second weight parameter for the attention Value; MultiHead is the multi-head self-attention function, Attention is the self-attention function, softmax is the classification function, and the computation also uses the variance of the corresponding projected features.
The cross-attention module is further configured to fuse and update the second utterance features through a computation of the same form, in which the k-th layer cross-attention module learns a third weight parameter for the attention Value.
Optionally, the feed-forward neural network module updates the third utterance features to obtain the updated fourth utterance features constructed by the k-th layer transformer decoder.
Based on the above method, the invention further provides a long-dialog emotion detection system based on a common sense knowledge graph, the long-dialog emotion detection system comprising:
a graph model construction module for constructing a graph model g_t(V_t, E_t) from the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, with i ∈ 1…n and j ∈ 1…n, and the graph model g_t(V_t, E_t) represents the dialogue data at time t;
a prior knowledge acquisition module for extracting the prior knowledge of the utterance at time t;
a graph model update module connected with the prior knowledge acquisition module for adding the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain an updated graph model g_t*(V_t, E_t);
an utterance feature update module connected with the graph model update module for updating the graph model g_t*(V_t, E_t) with a GNN model to obtain updated utterance features;
an utterance topic determination module for determining the final utterance topic f_t at time t based on the first m utterance elements preceding the utterance element u_i at time t, wherein m = 1…i-1 and i = 1…n;
a long-dialog emotion label determination module connected with the utterance feature update module and the utterance topic determination module for determining a long-dialog emotion label based on the updated utterance features and the final utterance topic f_t at time t; the long-dialog emotion labels comprise: happiness, frustration, surprise, neutral, fear, disgust, and anger.
Optionally, the prior knowledge acquisition module specifically comprises:
a retrieval result acquisition unit for obtaining the retrieval results of each utterance element u_i, the retrieval results being the top k results in triple format;
a retrieval result set acquisition unit connected with the retrieval result acquisition unit for obtaining the retrieval result set of the utterance at time t based on the retrieval results of the utterance elements u_i;
a prior knowledge acquisition unit connected with the retrieval result set acquisition unit for merging each retrieval result in the set based on the retrieval result set to obtain the prior knowledge of the utterance at time t.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a long-dialog emotion detection method and system of a common knowledge map, which passes the speech characteristics at the time t
Figure BDA0003694725350000074
Construction of graph model g t (V t ,E t ) The words are combined
Figure BDA0003694725350000075
As new nodes are added to the graph model g t (V t ,E t ) Get the updated graph model g t * (V t ,E t ) (ii) a Updating graph model g with GNN model t * (V t ,E t ) Obtaining updated utterance features; speech element u based on time t i The first m utterances, determining the final t-time utterance subject f t Based on the updated speech feature and the final t-time speech subject f t Determining a long conversation emotion label; the long dialog emotion tag comprises: happiness, depression, surprise, emotional colorfulness, fear, disgust and anger. The invention adds the prior knowledge as a new node to the graph model g t (V t ,E t ) In the method, the function of dialogue emotion recognition and the relation between speakers are described, the influence of self emotion and mutual emotion is integrated, and the emotion of the dialogue is recognized quickly and effectively.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
FIG. 1 is a flowchart of the long-dialog emotion detection method based on a common sense knowledge graph according to an embodiment of the present invention;
FIG. 2 is a flowchart of updating the utterance features according to an embodiment of the present invention;
FIG. 3 is a flowchart of determining the final utterance topic at time t according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of one layer of the transformer decoder structure according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the long-dialog emotion detection system based on a common sense knowledge graph.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a long-dialog emotion detection method and system based on a common sense knowledge graph, which can integrate the influence of self-emotion and mutual emotion and can quickly and effectively identify the emotion of a dialogue.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to FIG. 1 of the drawings, the invention provides a long-dialog emotion detection method based on a common sense knowledge graph, which comprises the following steps:
S1. Construct a graph model g_t(V_t, E_t) based on the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, with i ∈ 1…n and j ∈ 1…n, and the graph model g_t(V_t, E_t) represents the dialogue data at time t.
Specifically, E_t contains two types of edges. The first is the temporal edge, which represents the order in which utterances v_i and v_j occur in the conversation: when speaker A produces utterance feature v_i and speaker B then produces utterance feature v_j, there is a chronological order between nodes v_i and v_j; if v_i precedes v_j, the edge exists and its weight is defined as 1, otherwise the edge is absent and its weight is 0. The second is the semantic edge, which measures the similarity between utterances v_i and v_j by cosine distance.
S2. Extract the prior knowledge of the utterance at time t. The specific steps are as follows:
S21. Obtain the retrieval results of each utterance element u_i; the retrieval results are the top k results in triple format, wherein each similar utterance in the top k results consists of triples whose relations include intent, reaction, and the like.
Specifically, the prior knowledge of the utterance is retrieved by applying a common sense knowledge retrieval method to the common sense knowledge corpus ATOMIC and the common sense knowledge base ConceptNet.
S22. Obtain the retrieval result set of the utterance at time t based on the retrieval results of the utterance elements u_i.
S23. Based on the retrieval result set, merge each retrieval result in the set to obtain the prior knowledge of the utterance at time t.
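Steps S21–S23 can be pictured with the following sketch. The patent does not publish a retrieval interface for ATOMIC/ConceptNet, so retrieve_topk_triples and its tiny in-memory table are hypothetical stand-ins; only the top-k collection and merging logic follows the steps above.
# Hypothetical stand-in for the ATOMIC/ConceptNet lookup in S21.
MOCK_KNOWLEDGE = {
    "i lost my keys": [("PersonX loses keys", "xReact", "annoyed"),
                       ("PersonX loses keys", "xIntent", "to find them")],
}

def retrieve_topk_triples(utterance_element, k=5):
    # Return the top-k (head, relation, tail) triples for one utterance element u_i.
    return MOCK_KNOWLEDGE.get(utterance_element.lower(), [])[:k]

def prior_knowledge_for_utterance(utterance_elements, k=5):
    """S21-S22: collect the top-k triples for every utterance element u_i,
    then S23: merge the retrieval results into the prior knowledge of the utterance."""
    result_set = [retrieve_topk_triples(u, k) for u in utterance_elements]
    merged = []
    for results in result_set:          # merge every retrieval result in the set
        for triple in results:
            if triple not in merged:    # drop duplicates while merging
                merged.append(triple)
    return merged

# Example: prior_knowledge_for_utterance(["I lost my keys"])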
Specifically, this prior knowledge is encoded into feature vectors by a BiLSTM and added to the graph as new nodes, serving as auxiliary information to assist utterance embedding learning.
Specifically, the utterances at time t are converted into the utterance features at time t through a bidirectional long short-term memory (BiLSTM) network model.
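A BiLSTM encoder of the kind described above can be sketched with PyTorch as follows; the embedding and hidden dimensions and the use of the concatenated final hidden states as the utterance feature are illustrative assumptions, not values stated in the patent.
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Encode the token-embedding sequence of one utterance into a single feature vector."""
    def __init__(self, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_embeddings):              # (batch, seq_len, emb_dim)
        outputs, (h_n, _) = self.bilstm(token_embeddings)
        # Concatenate the final hidden states of the forward and backward directions.
        feature = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
        return feature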
S3. Add the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain an updated graph model g_t*(V_t, E_t).
S4. Update the graph model g_t*(V_t, E_t) with the GNN model to obtain the updated utterance features.
Specifically, referring to FIG. 2 of the drawings, the utterances at time t are converted into the utterance features at time t through a BiLSTM (bidirectional long short-term memory) layer; the graph model g_t(V_t, E_t) is constructed, and prior knowledge is retrieved from the common sense knowledge corpus ATOMIC and the common sense knowledge base ConceptNet; the prior knowledge is used to update the graph model, and the node features in the graph are updated through a classical GNN model to obtain the updated utterance features.
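The patent calls for a "classical GNN model" without fixing a particular one. A single mean-aggregation message-passing layer is one minimal choice, sketched below in plain NumPy; the aggregation rule, self-loops, and ReLU are assumptions of this sketch rather than details given in the patent.
import numpy as np

def gnn_layer(node_features, adjacency, weight):
    """One message-passing update for the nodes of g_t*(V_t, E_t).

    node_features: (n, d) array, one row per node (utterance + prior-knowledge nodes).
    adjacency:     (n, n) array of edge weights (temporal/semantic edges).
    weight:        (d, d_out) learnable projection matrix.
    """
    adj = adjacency + np.eye(adjacency.shape[0])                  # add self-loops
    degree = adj.sum(axis=1, keepdims=True)
    messages = adj @ node_features / np.maximum(degree, 1e-8)     # mean aggregation over neighbours
    return np.maximum(messages @ weight, 0.0)                     # linear projection + ReLU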
S5. Determine the final utterance topic f_t at time t based on the first m utterance elements preceding the utterance element u_i at time t, wherein m = 1…i-1 and i = 1…n. The specific steps are as follows:
S51. Obtain the first m utterance elements preceding the utterance element u_i at time t.
S52. Convert the first m utterance elements into m utterance features.
S53. Input the first m utterance elements of 3×N different utterance elements u_i at time t into the self-attention layer to obtain N sample utterance topics f_t, N positive-sample utterance topics f_p, and N negative-sample utterance topics f_n, wherein consecutive topic extractions are separated by a time interval of at least T_interval, and the N sample utterance topics f_t, N positive-sample utterance topics f_p, and N negative-sample utterance topics f_n form triplets (f_t, f_p, f_n).
S54. Construct a loss function based on the triplet (f_t, f_p, f_n), wherein sim(f_t, f_p) is the dot product, i.e. cosine similarity, between topics f_t and f_p, sim(f_t, f_n) is the dot product, i.e. cosine similarity, between topics f_t and f_n, f_t is a sample utterance topic, f_p is the positive-sample utterance topic of f_t, f_n is the negative-sample utterance topic of f_t, N is the number of training samples, n ∈ 1…N, and α is a model parameter.
Specifically, the objective of the loss function is to reduce the difference between a sample utterance topic f_t and its positive-sample utterance topic f_p while increasing the difference between the sample utterance topic f_t and its negative-sample utterance topic f_n.
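The formula itself is reproduced in the published text only as an image. Given the definitions of sim(·,·), the margin-like parameter α, and the stated objective, a margin-based triplet loss of the following form is one plausible reading (the exact formula in the patent may differ):
L = \frac{1}{N}\sum_{n=1}^{N}\max\!\left(0,\ \operatorname{sim}\!\left(f_t^{(n)}, f_n^{(n)}\right) - \operatorname{sim}\!\left(f_t^{(n)}, f_p^{(n)}\right) + \alpha\right)
Here the superscript (n) indexes the n-th training triplet, so minimizing L pushes sim(f_t, f_p) above sim(f_t, f_n) by at least the margin α.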
Specifically, contrastive learning is performed on the triplet (f_t, f_p, f_n); the task of contrastive learning is to make the sample and its positive sample have the same or similar feature vectors while keeping the representation of its negative sample dissimilar. Through contrastive learning on (f_t, f_p, f_n), the changing utterance topic is obtained.
S55. Determine the final utterance topic f_t at time t based on the loss function.
Specifically, referring to FIG. 3 of the drawings, the first m utterances of the current conversation are input into a BiLSTM (bidirectional long short-term memory) layer to obtain m utterance features; the m utterance features are input into the self-attention mechanism to obtain N sample utterance topics f_t, N positive-sample utterance topics f_p, and N negative-sample utterance topics f_n; and contrastive learning is performed on the triplets (f_t, f_p, f_n) to obtain the current topic.
Specifically, conversation topic extraction mainly focuses on conversation background topics in different periods and guides speech emotion recognition.
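A minimal sketch of this topic-extraction step is given below: the m utterance features are pooled into one topic vector through a self-attention layer. The learned pooling query and the specific PyTorch module are assumptions of this sketch; the patent only states that a self-attention layer produces the topic.
import torch
import torch.nn as nn

class TopicExtractor(nn.Module):
    """Pool the first m utterance features into one topic vector f_t."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))   # learned pooling query

    def forward(self, utterance_features):                   # (batch, m, dim)
        batch = utterance_features.size(0)
        q = self.query.expand(batch, -1, -1)
        topic, _ = self.attn(q, utterance_features, utterance_features)
        return topic.squeeze(1)                               # (batch, dim), used as f_t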
S6. Determine the long-dialog emotion label based on the updated utterance features and the final utterance topic f_t at time t; the long-dialog emotion labels comprise: happiness, frustration, surprise, neutral, fear, disgust, and anger.
Specifically, referring to FIG. 4 of the drawings, determining the long-dialog emotion label based on the updated utterance features and the final utterance topic f_t at time t specifically comprises:
inputting the updated utterance features and the final utterance topic f_t at time t into a transformer decoder structure to obtain the final utterance features, wherein l is the maximum number of layers in the transformer decoder structure;
inputting the final utterance features into a softmax function to obtain the long-dialog emotion label.
Specifically, the softmax classifier is trained with a cross-entropy loss, where L_final is the cross-entropy loss value computed over the softmax outputs, the inputs are the utterance features updated by the k-th layer of the transformer decoder structure (k = 1…l), and M is the number of emotion labels.
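This loss formula also appears only as an image in the published text. Under the usual convention of a linear projection followed by softmax over M emotion classes, the standard form would read as follows; the projection parameters W and b and the averaging over N samples are assumptions of this reconstruction:
L_{final} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{M} y_{n,c}\,\log \hat{y}_{n,c},
\qquad \hat{y}_{n} = \operatorname{softmax}\!\left(W\,\hat{c}_{n}^{(l)} + b\right)
where \hat{c}_{n}^{(l)} denotes the final utterance feature produced by the l-th decoder layer and y_{n,c} the one-hot ground-truth emotion label.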
Specifically, the transformer decoder structure has l layers in total, wherein one layer of the transformer decoder structure comprises: a self-attention module, a cross-attention module, and a feed-forward neural network module.
the self-attention module is used for fusing and updating speech features
Figure BDA0003694725350000114
Obtaining a first speech feature
Figure BDA0003694725350000115
Wherein k is 1 … l;
in particular, the self-attention module applies the following formula specifically to the utterance feature
Figure BDA0003694725350000116
And (3) performing fusion and updating:
Figure BDA0003694725350000117
Figure BDA0003694725350000118
Figure BDA0003694725350000119
wherein,
Figure BDA00036947253500001110
for the weight parameter of the attention Key index that the k-th self-attention module needs to learn,
Figure BDA00036947253500001111
is the first weight parameter of the Attention Value that the k-th self-Attention module needs to learn, MultiHead is the multi-headed self-Attention function, Attention is the self-Attention function, softmax is the classification function,
Figure BDA00036947253500001112
is composed of
Figure BDA00036947253500001113
The variance of (c).
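The self-attention formulas are likewise reproduced only as images. The symbol glossary above (Key/Value weights, MultiHead, Attention, softmax, and a variance-like scaling term) points to the standard scaled dot-product form, reproduced here under the usual Transformer conventions as an assumed reconstruction, with d_k the Key dimension:
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad \mathrm{head}_i = \operatorname{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})
For the self-attention module, Q, K, and V would all be the utterance features entering the k-th layer; the √d_k scaling is the likely counterpart of the "variance" term in the glossary.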
The cross-attention module is connected with the self-attention module and is used for fusing and updating the first utterance features to obtain the second utterance features; the cross-attention module is further used for fusing and updating the second utterance features to obtain the third utterance features, wherein k = 1…l.
specifically, the cross-attention module specifically employs the following formula for the firstA speech feature
Figure BDA0003694725350000122
And (3) performing fusion and updating:
Figure BDA0003694725350000123
Figure BDA0003694725350000124
wherein,
Figure BDA0003694725350000125
is a second weight parameter of Attention Value that needs to be learned by the k-th layer cross Attention module, MultiHead is a multi-head self-Attention function, Attention is a self-Attention function, softmax is a classification function,
Figure BDA0003694725350000126
is composed of
Figure BDA0003694725350000127
The variance of (a);
the cross attention module is specifically configured to apply the following formula to the second speech feature
Figure BDA0003694725350000128
And (3) performing fusion and updating:
Figure BDA0003694725350000129
Figure BDA00036947253500001210
wherein,
Figure BDA00036947253500001211
is the k-th layer cross attention module learningThe third weight parameter of the learned Attention Value, MultiHead, Attention, softmax, classification function,
Figure BDA0003694725350000131
is composed of
Figure BDA0003694725350000132
The variance of (c).
The feed-forward neural network module is connected with the cross-attention module and is used for updating the third utterance features to obtain the updated fourth utterance features.
Specifically, the feed-forward neural network module updates the third utterance features to obtain the updated fourth utterance features constructed by the k-th layer transformer decoder.
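Putting the three modules together, one layer of the decoder described above can be sketched as follows. The residual connections, layer normalisation, dimension choices, and the collapsing of the two-step cross-attention into a single call are standard Transformer conventions assumed here for brevity rather than details fixed by the patent.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Self-attention over utterance features, cross-attention with the topic f_t, then an FFN."""
    def __init__(self, dim=256, heads=4, ffn_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, utterances, topic):
        # utterances: (batch, n, dim) updated utterance features; topic: (batch, 1, dim) topic f_t.
        x, _ = self.self_attn(utterances, utterances, utterances)   # first utterance features
        utterances = self.norm1(utterances + x)
        x, _ = self.cross_attn(utterances, topic, topic)            # second/third utterance features
        utterances = self.norm2(utterances + x)
        return self.norm3(utterances + self.ffn(utterances))        # fourth utterance features
Stacking l such layers and feeding the output into a linear-plus-softmax classifier would mirror the pipeline of steps S6 above.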
Referring to FIG. 5 of the drawings, based on the above method, the invention further provides a long-dialog emotion detection system based on a common sense knowledge graph, the long-dialog emotion detection system comprising: a graph model construction module 1, a prior knowledge acquisition module 2, a graph model update module 3, an utterance feature update module 4, an utterance topic determination module 5, and a long-dialog emotion label determination module 6.
The graph model construction module 1 is used for constructing a graph model g_t(V_t, E_t) from the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, with i ∈ 1…n and j ∈ 1…n, and the graph model g_t(V_t, E_t) represents the dialogue data at time t.
the priori knowledge acquisition module 2 is used for utterances based on t time
Figure BDA00036947253500001310
Extracting utterances
Figure BDA00036947253500001311
A priori knowledge of;
the graph model updating module 3 is connected with the prior knowledge obtaining module 2, and the graph model updating module 3 is used for updating the utterance
Figure BDA00036947253500001312
As new nodes are added to the graph model g t (V t ,E t ) To obtain an updated graph model g t * (V t ,E t );
The speech feature updating module 4 is connected with the graph model updating module 3, and the speech feature updating module 4 is used for updating the graph model g by adopting a GNN model t * (V t ,E t ) Obtaining updated speech features according to the speech features;
the speaking theme determining module 5 is used for determining the speaking element u based on the t time i The first m utterances, determining the final t-time utterance subject f t Wherein m is 1 … i-1, i is 1 … n;
the long-dialog emotion label determination module 6 and the utterance feature updating module 4A topic determination module 5 connected to said long-dialog emotion tag determination module for determining a topic f of said final t-time utterance based on said updated utterance feature t Determining a long conversation emotion label; the long dialog emotion tag comprises: happy (happy), depressed (frustrated), surprised (surrise), anhedonic (neutral), fear (fear), disgust (distust), anger (anger).
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the description of the method part.
The invention provides a long-dialog emotion detection method and system based on a common sense knowledge graph, which preserve not only the correlations between utterances but also the temporal change information. By retrieving each utterance feature from the common sense knowledge graph, relevant prior knowledge is learned, global semantic information is obtained, the updating of the utterance features is guided, and the final emotion detection performance is improved. Emotion is detected using the transformer decoder structure, which achieves a deep fusion of the prior-knowledge-enriched utterance features and the utterance topic, introduces latent dialogue semantic information and prior knowledge, and improves the emotion recognition capability.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A long-dialog emotion detection method based on a common sense knowledge graph, the long-dialog emotion detection method comprising:
constructing a graph model g_t(V_t, E_t) based on the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, with i ∈ 1…n and j ∈ 1…n, and the graph model g_t(V_t, E_t) represents the dialogue data at time t;
extracting the prior knowledge of the utterance at time t;
adding the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain an updated graph model g_t*(V_t, E_t);
updating the graph model g_t*(V_t, E_t) with a GNN model to obtain updated utterance features;
determining the final utterance topic f_t at time t based on the first m utterance elements preceding the utterance element u_i at time t, wherein m = 1…i-1 and i = 1…n;
determining a long-dialog emotion label based on the updated utterance features and the final utterance topic f_t at time t, wherein the long-dialog emotion labels comprise: happiness, frustration, surprise, neutral, fear, disgust, and anger.
2. The long-dialog emotion detection method based on a common sense knowledge graph according to claim 1, wherein extracting the prior knowledge of the utterance at time t specifically comprises:
obtaining the retrieval results of each utterance element u_i, the retrieval results being the top k results in triple format;
obtaining the retrieval result set of the utterance at time t based on the retrieval results of the utterance elements u_i;
merging each retrieval result in the set based on the retrieval result set to obtain the prior knowledge of the utterance at time t.
3. The long-dialog emotion detection method based on a common sense knowledge graph according to claim 1, wherein determining the final utterance topic f_t at time t based on the first m utterance elements preceding the utterance element u_i at time t specifically comprises:
obtaining the first m utterance elements preceding the utterance element u_i at time t;
converting the first m utterance elements into m utterance features;
inputting the first m utterance elements of 3×N different utterance elements u_i at time t into a self-attention layer to obtain N sample utterance topics f_t, N positive-sample utterance topics f_p, and N negative-sample utterance topics f_n, wherein consecutive topic extractions are separated by a time interval of at least T_interval, and the N sample utterance topics f_t, N positive-sample utterance topics f_p, and N negative-sample utterance topics f_n form triplets (f_t, f_p, f_n);
constructing a loss function based on the triplet (f_t, f_p, f_n), wherein sim(f_t, f_p) is the dot product, i.e. cosine similarity, between topics f_t and f_p, sim(f_t, f_n) is the dot product, i.e. cosine similarity, between topics f_t and f_n, f_t is a sample utterance topic, f_p is the positive-sample utterance topic of f_t, f_n is the negative-sample utterance topic of f_t, N is the number of training samples, n ∈ 1…N, and α is a model parameter;
determining the final utterance topic f_t at time t based on the loss function.
4. The long-dialog emotion detection method based on a common sense knowledge graph according to claim 1, wherein determining the long-dialog emotion label based on the updated utterance features and the final utterance topic f_t at time t specifically comprises:
inputting the updated utterance features and the final utterance topic f_t at time t into a transformer decoder structure to obtain final utterance features, wherein l is the maximum number of layers in the transformer decoder structure;
inputting the final utterance features into a softmax function to obtain the long-dialog emotion label.
5. The long-dialog emotion detection method based on a common sense knowledge graph according to claim 4, wherein one layer of the transformer decoder structure comprises:
a self-attention module for fusing and updating the utterance features to obtain first utterance features, wherein k = 1…l;
a cross-attention module connected with the self-attention module for fusing and updating the first utterance features to obtain second utterance features, the cross-attention module being further configured to fuse and update the second utterance features to obtain third utterance features, wherein k = 1…l;
a feed-forward neural network module connected with the cross-attention module for updating the third utterance features to obtain updated fourth utterance features.
6. The long-dialog emotion detection method based on a common sense knowledge graph according to claim 5, wherein the self-attention module fuses and updates the utterance features through a multi-head self-attention computation in which the k-th layer self-attention module learns a weight parameter for the attention Key and a first weight parameter for the attention Value, MultiHead is the multi-head self-attention function, Attention is the self-attention function, and softmax is the classification function.
7. The long-dialog emotion detection method based on a common sense knowledge graph according to claim 5, wherein the cross-attention module fuses and updates the first utterance features through a multi-head attention computation in which the k-th layer cross-attention module learns a second weight parameter for the attention Value, MultiHead is the multi-head self-attention function, Attention is the self-attention function, and softmax is the classification function; and the cross-attention module is further configured to fuse and update the second utterance features through a computation of the same form, in which the k-th layer cross-attention module learns a third weight parameter for the attention Value.
8. The long-dialog emotion detection method based on a common sense knowledge graph according to claim 5, wherein the feed-forward neural network module updates the third utterance features to obtain the updated fourth utterance features constructed by the k-th layer transformer decoder.
9. A long-dialog emotion detection system based on a common sense knowledge graph, comprising:
a graph model construction module for constructing a graph model g_t(V_t, E_t) from the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, with i ∈ 1…n and j ∈ 1…n, and the graph model g_t(V_t, E_t) represents the dialogue data at time t;
a prior knowledge acquisition module for extracting the prior knowledge of the utterance at time t;
a graph model update module connected with the prior knowledge acquisition module for adding the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain an updated graph model g_t*(V_t, E_t);
an utterance feature update module connected with the graph model update module for updating the graph model g_t*(V_t, E_t) with a GNN model to obtain updated utterance features;
an utterance topic determination module for determining the final utterance topic f_t at time t based on the first m utterance elements preceding the utterance element u_i at time t, wherein m = 1…i-1 and i = 1…n;
a long-dialog emotion label determination module connected with the utterance feature update module and the utterance topic determination module for determining a long-dialog emotion label based on the updated utterance features and the final utterance topic f_t at time t, wherein the long-dialog emotion labels comprise: happiness, frustration, surprise, neutral, fear, disgust, and anger.
10. The long-dialog emotion detection system based on a common sense knowledge graph according to claim 9, wherein the prior knowledge acquisition module specifically comprises:
a retrieval result acquisition unit for obtaining the retrieval results of each utterance element u_i, the retrieval results being the top k results in triple format;
a retrieval result set acquisition unit connected with the retrieval result acquisition unit for obtaining the retrieval result set of the utterance at time t based on the retrieval results of the utterance elements u_i;
a prior knowledge acquisition unit connected with the retrieval result set acquisition unit for merging each retrieval result in the set based on the retrieval result set to obtain the prior knowledge of the utterance at time t.
CN202210676215.XA 2022-06-15 2022-06-15 Long-dialog emotion detection method and system based on common sense knowledge graph Pending CN115033695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210676215.XA CN115033695A (en) 2022-06-15 2022-06-15 Long-dialog emotion detection method and system based on common sense knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210676215.XA CN115033695A (en) 2022-06-15 2022-06-15 Long-dialog emotion detection method and system based on common sense knowledge graph

Publications (1)

Publication Number Publication Date
CN115033695A true CN115033695A (en) 2022-09-09

Family

ID=83124422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210676215.XA Pending CN115033695A (en) 2022-06-15 2022-06-15 Long-dialog emotion detection method and system based on common sense knowledge graph

Country Status (1)

Country Link
CN (1) CN115033695A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905518A (en) * 2022-10-17 2023-04-04 华南师范大学 Emotion classification method, device and equipment based on knowledge graph and storage medium
CN115905518B (en) * 2022-10-17 2023-10-20 华南师范大学 Emotion classification method, device, equipment and storage medium based on knowledge graph

Similar Documents

Publication Publication Date Title
CN110532355B (en) Intention and slot position joint identification method based on multitask learning
US10347244B2 (en) Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
CN114973062B (en) Multimode emotion analysis method based on Transformer
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
Santhanavijayan et al. A semantic-aware strategy for automatic speech recognition incorporating deep learning models
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN109446331A (en) A kind of text mood disaggregated model method for building up and text mood classification method
CN113094578A (en) Deep learning-based content recommendation method, device, equipment and storage medium
CN112101044B (en) Intention identification method and device and electronic equipment
CN113178193A (en) Chinese self-defined awakening and Internet of things interaction method based on intelligent voice chip
CN115640530A (en) Combined analysis method for dialogue sarcasm and emotion based on multi-task learning
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
Bluche et al. Predicting detection filters for small footprint open-vocabulary keyword spotting
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
Huang et al. Speech emotion recognition using convolutional neural network with audio word-based embedding
CN115033695A (en) Long-dialog emotion detection method and system based on common sense knowledge graph
Farooq et al. Mispronunciation detection in articulation points of Arabic letters using machine learning
Ananthi et al. Speech recognition system and isolated word recognition based on Hidden Markov model (HMM) for Hearing Impaired
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN114254096A (en) Multi-mode emotion prediction method and system based on interactive robot conversation
CN117056506A (en) Public opinion emotion classification method based on long-sequence text data
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN116434786A (en) Text-semantic-assisted teacher voice emotion recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination