CN115033695A - Long-dialog emotion detection method and system based on common sense knowledge graph - Google Patents
- Publication number
- CN115033695A CN115033695A CN202210676215.XA CN202210676215A CN115033695A CN 115033695 A CN115033695 A CN 115033695A CN 202210676215 A CN202210676215 A CN 202210676215A CN 115033695 A CN115033695 A CN 115033695A
- Authority
- CN
- China
- Prior art keywords
- speech
- utterance
- attention
- time
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
Abstract
The invention relates to the technical field of emotion analysis and discloses a long-dialogue emotion detection method and system based on a common sense knowledge graph. A graph model g_t(V_t, E_t) is constructed from the utterance features at time t; the prior knowledge of the utterance is added to the graph model g_t(V_t, E_t) as new nodes, yielding an updated graph model g_t*(V_t, E_t); the updated graph model g_t*(V_t, E_t) is processed with a GNN model to obtain updated utterance features; the final utterance topic f_t at time t is determined based on the first m utterance elements before utterance element u_i at time t; and the long-dialogue emotion label is determined from the updated utterance features and the final utterance topic f_t. By adding prior knowledge as new nodes to the graph model g_t(V_t, E_t), the invention exploits the role of prior knowledge in dialogue emotion recognition and the relationship between speakers, integrates the influence of a speaker's own emotion and the emotions of other speakers, and recognizes the emotion of a dialogue quickly and effectively.
Description
Technical Field
The invention relates to the technical field of emotion analysis, and in particular to a long-dialogue emotion detection method and system based on a common sense knowledge graph.
Background
In recent years, with the rapid development of computer technology and social networks, conversational robots have gradually replaced human workers in many application fields, such as healthcare, early education, and court proceedings. To ensure that a conversational robot communicates logically, not only the setting of a conversation but also the emotion of the human participant at each stage of the conversation needs to be assessed, so research on dialogue emotion detection has become increasingly important.
For a conversation, capturing human emotion in a rapidly changing context is a prominent problem in dialogue emotion detection. Classic methods such as LSTMs and RNNs can extract only part of the context information and cannot achieve fast and effective recognition. Li et al. proposed a bidirectional recurrent unit for dialogue emotion analysis, which includes a generalized neural tensor block for obtaining context information and a two-channel classifier for emotion classification. However, a conversation often involves multiple speakers, each affected both by their own emotion and by the emotions of the other speakers; the above model mainly considers temporal influence, ignores the relationship information between speakers, and does not integrate the influence of self emotion and mutual emotion. Hazarika et al. proposed a deep neural network called the Conversational Memory Network (CMN) and discussed these two factors, self emotion and mutual emotion, at length: each speaker's past utterances are modeled as memories, these memories are merged using attention-based hops, and context and sequence information is propagated using long short-term memory networks (LSTM) and gated recurrent units (GRU). However, the extracted context information is limited and the actual performance is poor.
In the prior art of dialogue emotion detection, the extracted context information is limited and the emotion of a dialogue cannot be recognized quickly and effectively; the role of prior knowledge in dialogue emotion recognition and the relationship information between speakers are generally ignored, and the influence of self emotion and mutual emotion is not integrated.
Disclosure of Invention
The invention aims to provide a long-dialogue emotion detection method and system based on a common sense knowledge graph, which can quickly recognize the emotion of a dialogue by incorporating prior knowledge and can integrate the influence of self emotion and mutual emotion.
In order to achieve the purpose, the invention provides the following scheme:
A long-dialogue emotion detection method based on a common sense knowledge graph, comprising:
constructing a graph model g_t(V_t, E_t) based on the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, i ∈ 1…n, j ∈ 1…n; the graph model g_t(V_t, E_t) represents the dialogue data at time t;
adding the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain an updated graph model g_t*(V_t, E_t);
updating the graph model g_t*(V_t, E_t) with a GNN model to obtain updated utterance features;
determining the final utterance topic f_t at time t based on the first m utterance elements before utterance element u_i at time t, wherein m ∈ 1…i-1, i ∈ 1…n;
determining a long-dialogue emotion label based on the updated utterance features and the final utterance topic f_t at time t; the long-dialogue emotion labels comprise: happy, frustrated, surprised, neutral, fearful, disgusted and angry.
Optionally, extracting the prior knowledge of the utterance based on the utterance at time t specifically comprises:
obtaining the retrieval result of each utterance element u_i, the retrieval result being the first k results in triple format;
obtaining the retrieval result set of the utterance based on the retrieval results of the utterance elements u_i;
merging the retrieval results in the set (taking their union) to obtain the prior knowledge of the utterance.
Optionally, determining the final utterance topic f_t at time t based on the first m utterance elements before utterance element u_i at time t specifically comprises:
obtaining the first m utterance elements before utterance element u_i at time t;
converting the first m utterance elements before utterance element u_i at time t into m utterance features;
inputting the first m utterance elements of 3×N different utterance elements u_i at time t into a self-attention layer to obtain N sample utterance topics f_t, N positive-sample utterance topics f_p and N negative-sample utterance topics f_n, wherein the time interval between any two topic extractions is greater than or equal to T_interval; the N sample utterance topics f_t, N positive-sample utterance topics f_p and N negative-sample utterance topics f_n constitute triplets (f_t, f_p, f_n);
constructing a loss function based on the triplets (f_t, f_p, f_n), wherein sim(f_t, f_p) is the dot product, i.e. the cosine similarity, between topics f_t and f_p, sim(f_t, f_n) is the dot product, i.e. the cosine similarity, between topics f_t and f_n, f_t is a sample utterance topic, f_p is the positive-sample utterance topic of f_t, f_n is the negative-sample utterance topic of f_t, N is the number of training samples, n ∈ 1…N, and α is a model parameter;
determining the final utterance topic f_t at time t based on the loss function.
Optionally, determining the long-dialogue emotion label based on the updated utterance features and the final utterance topic f_t at time t specifically comprises:
inputting the updated utterance features and the final utterance topic f_t at time t into a Transformer decoder structure to obtain the final utterance features, wherein l is the maximum number of layers in the Transformer decoder structure;
inputting the final utterance features into a softmax function to obtain the long-dialogue emotion label.
Optionally, one layer of the Transformer decoder structure comprises:
a self-attention module for fusing and updating the utterance features to obtain first utterance features, wherein k ∈ 1…l;
a cross-attention module connected with the self-attention module for fusing and updating the first utterance features to obtain second utterance features; the cross-attention module is also for fusing and updating the second utterance features to obtain third utterance features, wherein k ∈ 1…l;
a feed-forward neural network module connected with the cross-attention module for updating the third utterance features to obtain updated fourth utterance features.
Optionally, the self-attention module fuses and updates the utterance features by the following formula:
wherein the attention Key weight parameter and the first attention Value weight parameter are the parameters that the k-th-layer self-attention module needs to learn; MultiHead is the multi-head self-attention function, Attention is the self-attention function, softmax is the classification function, and the scaling term in the denominator is the variance of the Key features.
Optionally, the cross-attention module fuses and updates the first utterance features by the following formula:
wherein the second attention Value weight parameter is the parameter that the k-th-layer cross-attention module needs to learn; MultiHead is the multi-head self-attention function, Attention is the self-attention function, softmax is the classification function, and the scaling term is the corresponding variance.
The cross-attention module fuses and updates the second utterance features by the following formula:
wherein the third attention Value weight parameter is the parameter that the k-th-layer cross-attention module needs to learn; MultiHead is the multi-head self-attention function, Attention is the self-attention function, softmax is the classification function, and the scaling term is the corresponding variance.
Optionally, the feed-forward neural network module updates the third utterance features by the following formula, the result being the updated fourth utterance feature constructed by the k-th-layer Transformer decoder.
Based on the above method, the invention also provides a long-dialogue emotion detection system based on a common sense knowledge graph, the long-dialogue emotion detection system comprising:
a graph model construction module for constructing a graph model g_t(V_t, E_t) from the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, i ∈ 1…n, j ∈ 1…n; the graph model g_t(V_t, E_t) represents the dialogue data at time t;
a prior knowledge acquisition module for extracting the prior knowledge of the utterance based on the utterance at time t;
a graph model update module connected with the prior knowledge acquisition module for adding the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain an updated graph model g_t*(V_t, E_t);
an utterance feature update module connected with the graph model update module for updating the graph model g_t*(V_t, E_t) with a GNN model to obtain updated utterance features;
an utterance topic determination module for determining the final utterance topic f_t at time t based on the first m utterance elements before utterance element u_i at time t, wherein m ∈ 1…i-1, i ∈ 1…n;
a long-dialogue emotion label determination module connected with the utterance feature update module and the utterance topic determination module for determining a long-dialogue emotion label based on the updated utterance features and the final utterance topic f_t at time t; the long-dialogue emotion labels comprise: happy, frustrated, surprised, neutral, fearful, disgusted and angry.
Optionally, the prior knowledge acquisition module specifically comprises:
a retrieval result acquisition unit for obtaining the retrieval result of each utterance element u_i, the retrieval result being the first k results in triple format;
a retrieval result set acquisition unit connected with the retrieval result acquisition unit for obtaining the retrieval result set of the utterance based on the retrieval results of the utterance elements u_i;
a prior knowledge acquisition unit connected with the retrieval result set acquisition unit for merging (taking the union of) the retrieval results in the set to obtain the prior knowledge of the utterance.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
The invention provides a long-dialogue emotion detection method and system based on a common sense knowledge graph. A graph model g_t(V_t, E_t) is constructed from the utterance features at time t; the prior knowledge of the utterance is added to the graph model g_t(V_t, E_t) as new nodes, yielding an updated graph model g_t*(V_t, E_t); the updated graph model g_t*(V_t, E_t) is processed with a GNN model to obtain updated utterance features; the final utterance topic f_t at time t is determined based on the first m utterance elements before utterance element u_i at time t; and the long-dialogue emotion label is determined from the updated utterance features and the final utterance topic f_t. The long-dialogue emotion labels comprise: happy, frustrated, surprised, neutral, fearful, disgusted and angry. By adding prior knowledge as new nodes to the graph model g_t(V_t, E_t), the invention exploits the role of prior knowledge in dialogue emotion recognition and the relationship between speakers, integrates the influence of self emotion and mutual emotion, and recognizes the emotion of a dialogue quickly and effectively.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of the long-dialogue emotion detection method based on a common sense knowledge graph according to an embodiment of the present invention;
FIG. 2 is a flowchart of updating the utterance features according to an embodiment of the present invention;
FIG. 3 is a flowchart of determining the final utterance topic at time t according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of one layer of the Transformer decoder structure according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the long-dialogue emotion detection system based on a common sense knowledge graph.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a long-dialogue emotion detection method and system based on a common sense knowledge graph, which can integrate the influence of self emotion and mutual emotion and recognize the emotion of a dialogue quickly and effectively.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to FIG. 1 of the drawings, the invention provides a long-dialogue emotion detection method based on a common sense knowledge graph, comprising the following steps:
S1: constructing a graph model g_t(V_t, E_t) based on the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, i ∈ 1…n, j ∈ 1…n; the graph model g_t(V_t, E_t) represents the dialogue data at time t.
Specifically, E_t comprises two types of edges. One is the temporal edge, which represents the order in which utterances v_i and v_j occur in the conversation: when speaker A utters utterance feature v_i and speaker B then utters utterance feature v_j, there is a chronological order between nodes v_i and v_j; if v_i occurs before v_j, the edge exists and its weight is defined as 1, otherwise the edge does not exist and its weight is 0. The other is the semantic edge, which measures the similarity between utterances v_i and v_j by cosine distance.
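The two edge types described above can be sketched as follows; this is a minimal illustration assuming dense adjacency matrices and unit-normalized features, not the patent's exact implementation.

```python
import numpy as np

def build_dialogue_graph(features, order):
    """Build the temporal and semantic edges for one time step.

    features: (n, d) array of utterance feature vectors.
    order:    order[i] is the position of utterance i in the conversation.
    Returns (temporal, semantic) adjacency matrices.
    """
    n = len(features)
    temporal = np.zeros((n, n))
    semantic = np.zeros((n, n))
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Temporal edge: weight 1 iff utterance i precedes utterance j.
            temporal[i, j] = 1.0 if order[i] < order[j] else 0.0
            # Semantic edge: cosine similarity between the two features.
            semantic[i, j] = float(unit[i] @ unit[j])
    return temporal, semantic
```

The temporal matrix is directional in value (1 vs 0) even though the claim text calls the edge set undirected; how the two matrices are combined into E_t is left open here.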
S2: extracting the prior knowledge of the utterance based on the utterance at time t. The specific steps are as follows:
S21: obtaining the retrieval result of each utterance element u_i; the retrieval result is the first k results in triple format, wherein each similar utterance in the first k results consists of triples, and the relations in these triples include intent, reaction and the like.
Specifically, the prior knowledge of the utterance is retrieved with a common sense knowledge retrieval method based on the common sense knowledge corpus ATOMIC and the common sense knowledge base ConceptNet.
S22: obtaining the retrieval result set of the utterance based on the retrieval results of the utterance elements u_i.
S23: merging the retrieval results in the set (taking their union) to obtain the prior knowledge of the utterance.
Specifically, the prior knowledge is encoded into feature vectors through a BiLSTM and added to the graph as new nodes, serving as auxiliary information to help the utterance embedding learning.
Specifically, the utterances at time t are converted into the utterance features at time t through a bidirectional long short-term memory (BiLSTM) network model.
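Steps S21–S23 can be sketched as below; the knowledge lookup table and its (head, relation, tail) triples are hypothetical stand-ins for the ATOMIC/ConceptNet retrieval, which is not shown.

```python
def top_k_triples(knowledge, utterance_element, k):
    """S21: first k retrieval results (triples) for one utterance element."""
    return knowledge.get(utterance_element, [])[:k]

def prior_knowledge(knowledge, utterance_elements, k=3):
    """S22-S23: collect per-element retrieval results, then merge by union."""
    result_sets = [top_k_triples(knowledge, u, k) for u in utterance_elements]
    merged = []
    seen = set()
    for triples in result_sets:
        for t in triples:
            if t not in seen:  # union: each distinct triple kept once
                seen.add(t)
                merged.append(t)
    return merged
```

The merged triples would then be encoded by the BiLSTM and attached to the graph as new nodes.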
S3: adding the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain the updated graph model g_t*(V_t, E_t).
S4: updating the graph model g_t*(V_t, E_t) with a GNN model to obtain the updated utterance features.
Specifically, referring to FIG. 2 of the drawings, the utterances at time t are converted into the utterance features at time t through a BiLSTM layer, the graph model g_t(V_t, E_t) is constructed, and prior knowledge is retrieved based on the common sense knowledge corpus ATOMIC and the common sense knowledge base ConceptNet; the prior knowledge is used to update the graph model, and the node features in the graph are updated through a classical GNN model to obtain the updated utterance features.
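The text only says "classical GNN model" without naming one; the following is a generic single message-passing layer (mean aggregation over neighbours, a learned linear map, ReLU) as one plausible instance of the S4 node update.

```python
import numpy as np

def gnn_layer(node_feats, adj, weight):
    """One generic message-passing update over the updated graph g_t*.

    node_feats: (n, d) node features (utterance + prior-knowledge nodes).
    adj:        (n, n) adjacency matrix of g_t*.
    weight:     (d, d_out) learned parameter matrix.
    """
    deg = np.clip(adj.sum(axis=1, keepdims=True), 1.0, None)
    neighbour_mean = adj @ node_feats / deg      # aggregate neighbour messages
    h = (node_feats + neighbour_mean) @ weight   # combine self + messages
    return np.maximum(h, 0.0)                    # ReLU non-linearity
```

Stacking several such layers lets the prior-knowledge nodes propagate their information into the utterance nodes.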
S5: determining the final utterance topic f_t at time t based on the first m utterance elements before utterance element u_i at time t, wherein m ∈ 1…i-1, i ∈ 1…n. The specific steps are as follows:
S51: obtaining the first m utterance elements before utterance element u_i at time t.
S52: converting the first m utterance elements before utterance element u_i at time t into m utterance features.
S53: inputting the first m utterance elements of 3×N different utterance elements u_i at time t into a self-attention layer to obtain N sample utterance topics f_t, N positive-sample utterance topics f_p and N negative-sample utterance topics f_n, wherein the time interval between any two topic extractions is greater than or equal to T_interval; the N sample utterance topics f_t, N positive-sample utterance topics f_p and N negative-sample utterance topics f_n constitute triplets (f_t, f_p, f_n).
S54: constructing a loss function based on the triplets (f_t, f_p, f_n), wherein sim(f_t, f_p) is the dot product, i.e. the cosine similarity, between topics f_t and f_p, sim(f_t, f_n) is the dot product, i.e. the cosine similarity, between topics f_t and f_n, f_t is a sample utterance topic, f_p is the positive-sample utterance topic of f_t, f_n is the negative-sample utterance topic of f_t, N is the number of training samples, n ∈ 1…N, and α is a model parameter.
Specifically, the objective of the loss function is to reduce the difference between a sample utterance topic f_t and its positive-sample utterance topic f_p, and to enlarge the difference between a sample utterance topic f_t and its negative-sample utterance topic f_n.
Specifically, contrastive learning is performed on the triplets (f_t, f_p, f_n); the task of contrastive learning is to make positive pairs have the same or similar feature vectors and negative pairs have dissimilar representations. Through contrastive learning on (f_t, f_p, f_n), the changing utterance topic is obtained.
S55: determining the final utterance topic f_t at time t based on the loss function.
Specifically, referring to FIG. 3 of the drawings, the first m utterances of the current conversation are input into a BiLSTM layer to obtain m utterance features; the m utterance features are input into the self-attention mechanism to obtain N sample utterance topics f_t, N positive-sample utterance topics f_p and N negative-sample utterance topics f_n; contrastive learning is performed on the triplets (f_t, f_p, f_n) to obtain the current topic.
Specifically, dialogue topic extraction mainly focuses on the background topics of the conversation in different periods and guides utterance emotion recognition.
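The S54 loss formula is not reproduced in this text (it was a figure); given the stated objective and the margin parameter α, a standard triplet margin loss over cosine similarities, L = (1/N) Σ max(0, sim(f_t, f_n) − sim(f_t, f_p) + α), is one consistent reconstruction, sketched here under that assumption.

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product of unit-normalized vectors, i.e. cosine similarity."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Assumed form of the S54 loss: pull f_t toward f_p, push it from f_n.

    anchors, positives, negatives: lists of N topic vectors (f_t, f_p, f_n).
    alpha: the margin (the model parameter alpha in the text).
    """
    losses = []
    for f_t, f_p, f_n in zip(anchors, positives, negatives):
        losses.append(max(0.0, cosine_sim(f_t, f_n) - cosine_sim(f_t, f_p) + alpha))
    return sum(losses) / len(losses)
```

The loss is zero when every anchor is at least α more similar to its positive topic than to its negative topic, which matches the stated objective.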
S6: determining the long-dialogue emotion label based on the updated utterance features and the final utterance topic f_t at time t; the long-dialogue emotion labels comprise: happy, frustrated, surprised, neutral, fearful, disgusted and angry.
Specifically, referring to FIG. 4 of the drawings, determining the long-dialogue emotion label based on the updated utterance features and the final utterance topic f_t at time t comprises:
inputting the updated utterance features and the final utterance topic f_t at time t into a Transformer decoder structure to obtain the final utterance features, wherein l is the maximum number of layers in the Transformer decoder structure;
inputting the final utterance features into the softmax function to obtain the long-dialogue emotion label.
Specifically, the loss function of the softmax classifier is the cross-entropy loss, wherein L_final is the cross-entropy loss value of the softmax function, M is the number of emotion labels, and the utterance features are those updated by the k-th-layer Transformer decoder structure, k ∈ 1…l.
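The softmax classification and cross-entropy loss can be sketched as follows; the 7-way label head matches the label set above (M = 7), while the logit values are made up for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    """L_final for one utterance: negative log-probability of the true label."""
    return -np.log(softmax(logits)[label])

# Hypothetical emotion head over one final utterance feature's logits.
LABELS = ["happy", "frustrated", "surprised", "neutral",
          "fearful", "disgusted", "angry"]
logits = np.array([2.0, 0.1, -1.0, 0.5, -0.5, -2.0, 0.0])
predicted = LABELS[int(np.argmax(softmax(logits)))]
```

The predicted label is simply the argmax of the softmax distribution; the cross-entropy term is what is minimized during training.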
Specifically, the Transformer decoder structure has l layers in total, wherein one layer of the Transformer decoder structure comprises a self-attention module, a cross-attention module and a feed-forward neural network module.
The self-attention module is used for fusing and updating the utterance features to obtain the first utterance features, wherein k ∈ 1…l.
Specifically, the self-attention module fuses and updates the utterance features by the following formula:
wherein the attention Key weight parameter and the first attention Value weight parameter are the parameters that the k-th-layer self-attention module needs to learn; MultiHead is the multi-head self-attention function, Attention is the self-attention function, softmax is the classification function, and the scaling term in the denominator is the variance of the Key features.
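Since the attention formula images are not reproduced here, the following sketch shows the standard scaled dot-product form, softmax(Q K^T / √d_k) V, that the surrounding description (Key/Value projections, softmax, variance scaling) points to; a single head is shown, and the multi-head case would concatenate several such heads.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

def self_attention_module(X, W_K, W_V):
    """Self-attention over utterance features X, with learned Key/Value
    projections W_K and W_V as the text describes (hypothetical shapes)."""
    return attention(X, X @ W_K, X @ W_V)
```

The cross-attention module would use the same `attention` function, with the query taken from one feature stream (e.g. the utterance features) and the Key/Value from the other (e.g. the topic features).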
The cross attention module is connected with the self attention module, and the cross attention module is used for fusing and updating the first speech featureObtaining second speech characteristicsThe cross attention module is also for fusing and updating the second spoken language featureDeriving a third speech featureWherein k is 1 … l;
specifically, the cross-attention module specifically employs the following formula for the firstA speech featureAnd (3) performing fusion and updating:
wherein,is a second weight parameter of Attention Value that needs to be learned by the k-th layer cross Attention module, MultiHead is a multi-head self-Attention function, Attention is a self-Attention function, softmax is a classification function,is composed ofThe variance of (a);
the cross attention module is specifically configured to apply the following formula to the second speech featureAnd (3) performing fusion and updating:
wherein,is the k-th layer cross attention module learningThe third weight parameter of the learned Attention Value, MultiHead, Attention, softmax, classification function,is composed ofThe variance of (c).
The feed-forward neural network module is connected with the cross-attention module and is used for updating the third utterance features to obtain the updated fourth utterance features.
Specifically, the feed-forward neural network module updates the third utterance features by the following formula, the result being the updated fourth utterance feature constructed by the k-th-layer Transformer decoder.
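The feed-forward formula image is likewise not reproduced; the sketch below assumes the standard position-wise form used in Transformer decoder layers, a two-layer MLP with ReLU plus a residual connection, with hypothetical parameter names.

```python
import numpy as np

def feed_forward_module(x, W1, b1, W2, b2):
    """Assumed position-wise feed-forward update producing the fourth
    utterance feature from the third: MLP with ReLU and a residual add."""
    hidden = np.maximum(x @ W1 + b1, 0.0)  # inner layer with ReLU
    return x + hidden @ W2 + b2            # residual connection
```

With identity weights and zero biases the module reduces to x + ReLU(x), which makes the residual structure easy to verify.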
Referring to FIG. 5 of the drawings, based on the above method, the invention further provides a long-dialogue emotion detection system based on a common sense knowledge graph, the long-dialogue emotion detection system comprising: a graph model construction module 1, a prior knowledge acquisition module 2, a graph model update module 3, an utterance feature update module 4, an utterance topic determination module 5 and a long-dialogue emotion label determination module 6.
The graph model construction module 1 is used for constructing the graph model g_t(V_t, E_t) from the utterance features at time t, wherein V_t is the set of nodes, E_t is the set of undirected edges between different nodes v_i and v_j, i ∈ 1…n, j ∈ 1…n; the graph model g_t(V_t, E_t) represents the dialogue data at time t.
The prior knowledge acquisition module 2 is used for extracting the prior knowledge of the utterance based on the utterance at time t.
The graph model update module 3 is connected with the prior knowledge acquisition module 2 and is used for adding the prior knowledge of the utterance to the graph model g_t(V_t, E_t) as new nodes to obtain the updated graph model g_t*(V_t, E_t).
The utterance feature update module 4 is connected with the graph model update module 3 and is used for updating the graph model g_t*(V_t, E_t) with a GNN model to obtain the updated utterance features.
The utterance topic determination module 5 is used for determining the final utterance topic f_t at time t based on the first m utterance elements before utterance element u_i at time t, wherein m ∈ 1…i-1, i ∈ 1…n.
The long-dialogue emotion label determination module 6 is connected with the utterance feature update module 4 and the utterance topic determination module 5 and is used for determining the long-dialogue emotion label based on the updated utterance features and the final utterance topic f_t at time t; the long-dialogue emotion labels comprise: happy, frustrated, surprised, neutral, fearful, disgusted and angry.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be cross-referenced. Since the system disclosed by the embodiment corresponds to the method disclosed by the embodiment, its description is relatively brief, and relevant details can be found in the description of the method.
The invention provides a long-dialog emotion detection method and system based on a common sense knowledge graph, which preserve both the correlations between utterances and the information about their change over time. By retrieving each utterance feature from the common sense knowledge graph, relevant prior knowledge is learned, global semantic information is obtained, and the updating of the utterance features is guided, improving the final emotion detection performance. Emotion is detected using a Transformer decoder structure, realizing deep fusion of the utterance features containing prior knowledge with the utterance topic, introducing latent dialogue semantic information and prior knowledge, and improving emotion recognition capability.
The principles and embodiments of the present invention are explained herein using specific examples; the above description of the embodiments is only intended to help in understanding the method and core idea of the present invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (10)
1. A long-conversation emotion detection method of a common sense knowledge graph, the long-conversation emotion detection method comprising:
constructing a graph model g_t(V_t, E_t) based on the utterance features at time t; wherein V_t is the set of nodes and E_t is the set of undirected edges between different nodes v_i and v_j, with i ∈ 1…n and j ∈ 1…n; the graph model g_t(V_t, E_t) represents the dialogue data at time t;
adding the time-t utterance as a new node to the graph model g_t(V_t, E_t) to obtain an updated graph model g_t*(V_t, E_t);
updating the graph model g_t*(V_t, E_t) with a GNN model to obtain updated utterance features;
determining the final time-t utterance topic f_t based on the first m utterance elements preceding the time-t utterance element u_i, wherein m ∈ 1…i-1 and i ∈ 1…n;
determining the long-dialog emotion label based on the updated utterance features and the final time-t utterance topic f_t; the long-dialog emotion label comprises: happy, frustrated, surprised, neutral, fear, disgust and anger.
2. The long-dialog emotion detection method of the common sense knowledge graph according to claim 1, wherein extracting the prior knowledge of the time-t utterance based on the utterances at time t specifically comprises:
obtaining the retrieval result of the utterance element u_i; the retrieval result is the top k results in triple format, expressed as:
obtaining the retrieval result set of the time-t utterance based on the retrieval results of the utterance element u_i, expressed as:
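The retrieval step of claim 2 can be illustrated with a toy sketch: each utterance element queries a common-sense store for its top-k triples, and the per-element results are pooled into the utterance's result set. The triple store, the matching rule, and the element names below are hypothetical placeholders, not the patent's actual knowledge graph:

```python
# Hypothetical toy common-sense store; the patent retrieves the top-k
# triples per utterance element from a real common-sense knowledge graph.
TRIPLES = [
    ("exam", "causes", "stress"),
    ("exam", "requires", "study"),
    ("exam", "related_to", "school"),
    ("gift", "causes", "joy"),
]

def retrieve(element, k=2):
    """Return the first k triples whose head entity matches the element."""
    return [t for t in TRIPLES if t[0] == element][:k]

def utterance_knowledge(elements, k=2):
    """Pool per-element retrieval results into the utterance's result set."""
    return {t for e in elements for t in retrieve(e, k)}

print(sorted(utterance_knowledge(["exam", "gift"])))
```

A real system would rank candidate triples by relevance before truncating to k; simple head-entity matching stands in for that ranking here.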
3. The method of claim 1, wherein determining the final time-t utterance topic f_t based on the first m utterance elements preceding the time-t utterance element u_i specifically comprises:
obtaining the first m utterance elements preceding the time-t utterance element u_i;
converting the first m utterance elements preceding the time-t utterance element u_i into m utterance features;
inputting 3×N different sets of the first m utterance elements preceding the time-t utterance element u_i into the self-attention mechanism layer to obtain N sample utterance topics f_t, N positive sample utterance topics f_p and N negative sample utterance topics f_n; wherein the time interval between successive topic extractions is no less than T_interval; the N sample utterance topics f_t, the N positive sample utterance topics f_p and the N negative sample utterance topics f_n constitute triplets (f_t, f_p, f_n);
constructing a loss function based on the triplets (f_t, f_p, f_n); the loss function is specifically:
wherein sim(f_t, f_p) is the dot product, i.e. cosine similarity, between topics f_t and f_p, sim(f_t, f_n) is the dot product, i.e. cosine similarity, between topics f_t and f_n, f_t is a sample utterance topic, f_p is a positive sample utterance topic of f_t, f_n is a negative sample utterance topic of f_t, N is the number of training samples, n ∈ 1…N, and α is a model parameter;
determining the final time-t utterance topic f_t based on the loss function.
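The loss formula itself is not reproduced in this text. One reading consistent with the surrounding definitions (cosine similarities sim(f_t, f_p) and sim(f_t, f_n), a margin-like parameter α, N training samples) is a standard triplet margin loss, sketched below as an assumption rather than the patent's exact form:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity of two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_loss(ft, fp, fn, alpha=0.2):
    """Mean over N triplets of max(0, sim(f_t,f_n) - sim(f_t,f_p) + alpha).
    A hedged reconstruction: the patent's exact formula is not shown here."""
    losses = [max(0.0, cos(t, n) - cos(t, p) + alpha)
              for t, p, n in zip(ft, fp, fn)]
    return float(np.mean(losses))

ft = [np.array([1.0, 0.0])]   # sample topic
fp = [np.array([1.0, 0.1])]   # positive: nearly parallel to f_t
fn = [np.array([0.0, 1.0])]   # negative: orthogonal to f_t
print(triplet_loss(ft, fp, fn))
```

Minimizing this loss pulls each sample topic toward its positive topic and pushes it away from its negative topic by at least the margin α.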
4. The method of claim 1, wherein determining the long-dialog emotion label based on the updated utterance features and the final time-t utterance topic f_t specifically comprises:
inputting the updated utterance features and the final time-t utterance topic f_t into a Transformer decoder structure to obtain the final utterance features; wherein l is the maximum number of layers in the Transformer decoder structure;
5. The method of claim 4, wherein one layer of the Transformer decoder structure comprises:
a self-attention module for fusing and updating the utterance features to obtain a first utterance feature, wherein k ∈ 1…l;
a cross-attention module connected with the self-attention module for fusing and updating the first utterance feature to obtain a second utterance feature; the cross-attention module is further configured to fuse and update the second utterance feature to obtain a third utterance feature, wherein k ∈ 1…l;
6. The method of claim 5, wherein the self-attention module is configured to fuse and update the utterance features by applying the following formula:
wherein W_K^k is the Attention Key weight parameter to be learned by the k-th self-attention module, W_V^k is the first Attention Value weight parameter to be learned by the k-th self-attention module, MultiHead is the multi-head self-attention function, Attention is the self-attention function, softmax is the classification function, and the scaling term in the denominator is the variance of the corresponding features.
7. The method of claim 5, wherein the cross-attention module is configured to fuse and update the first utterance feature by applying the following formula:
wherein W_V^k is the second Attention Value weight parameter to be learned by the k-th cross-attention module, MultiHead is the multi-head self-attention function, Attention is the self-attention function, softmax is the classification function, and the scaling term in the denominator is the variance of the corresponding features;
the cross-attention module is specifically configured to fuse and update the second utterance feature by applying the following formula:
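Claims 5-7 fuse features with self-attention and cross-attention inside Transformer decoder layers. The core operation both share is scaled dot-product attention, sketched minimally below; the standard 1/sqrt(d_k) scaling is used here, whereas the claims describe a variance term, and the query/key assignments are illustrative assumptions rather than the patent's exact wiring:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    dk = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(dk)) @ V

# Self-attention: features attend to themselves (Q = K = V).
# Cross-attention: the topic query attends to knowledge-enriched
# utterance features (variable names are illustrative, not the patent's).
X = np.array([[1.0, 0.0], [0.0, 1.0]])  # utterance features
f = np.array([[1.0, 1.0]])              # topic feature as query
out = attention(f, X, X)
print(out.shape)  # (1, 2)
```

Since the topic query is equidistant from both utterance features, the attention weights are uniform and the output is their average; learned W_K and W_V projections (omitted here) would reshape K and V before this step.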
9. A long-dialog emotion detection system of a common-sense knowledge graph, comprising:
a graph model construction module for constructing a graph model g_t(V_t, E_t) based on the utterance features at time t; wherein V_t is the set of nodes and E_t is the set of undirected edges between different nodes v_i and v_j, with i ∈ 1…n and j ∈ 1…n; the graph model g_t(V_t, E_t) represents the dialogue data at time t;
a prior knowledge acquisition module for extracting the prior knowledge of the time-t utterance based on the utterances at time t;
a graph model updating module connected with the prior knowledge acquisition module, for adding the time-t utterance as a new node to the graph model g_t(V_t, E_t) to obtain an updated graph model g_t*(V_t, E_t);
an utterance feature updating module connected with the graph model updating module, for updating the graph model g_t*(V_t, E_t) with a GNN model to obtain updated utterance features;
an utterance topic determination module for determining the final time-t utterance topic f_t based on the first m utterance elements preceding the time-t utterance element u_i, wherein m ∈ 1…i-1 and i ∈ 1…n;
a long-dialog emotion label determination module connected with the utterance feature updating module and the utterance topic determination module, for determining the long-dialog emotion label based on the updated utterance features and the final time-t utterance topic f_t; the long-dialog emotion label comprises: happy, frustrated, surprised, neutral, fear, disgust and anger.
10. The long-dialog emotion detection system of the common sense knowledge graph according to claim 9, wherein the prior knowledge acquisition module specifically comprises:
a retrieval result acquisition unit for obtaining the retrieval result of the utterance element u_i; the retrieval result is the top k results in triple format, expressed as:
a retrieval result set acquisition unit connected with the retrieval result acquisition unit, for obtaining the retrieval result set of the time-t utterance based on the retrieval results of the utterance element u_i, expressed as:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210676215.XA CN115033695A (en) | 2022-06-15 | 2022-06-15 | Long-dialog emotion detection method and system based on common sense knowledge graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115033695A true CN115033695A (en) | 2022-09-09 |
Family
ID=83124422
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115033695A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN115905518A (en) * | 2022-10-17 | 2023-04-04 | 华南师范大学 | Emotion classification method, device and equipment based on knowledge graph and storage medium
CN115905518B (en) * | 2022-10-17 | 2023-10-20 | 华南师范大学 | Emotion classification method, device, equipment and storage medium based on knowledge graph
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110532355B (en) | Intention and slot position joint identification method based on multitask learning | |
US10347244B2 (en) | Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response | |
CN114973062B (en) | Multimode emotion analysis method based on Transformer | |
CN107993665B (en) | Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system | |
Santhanavijayan et al. | A semantic-aware strategy for automatic speech recognition incorporating deep learning models | |
CN114694076A (en) | Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion | |
CN110211594B (en) | Speaker identification method based on twin network model and KNN algorithm | |
CN109446331A (en) | A kind of text mood disaggregated model method for building up and text mood classification method | |
CN113094578A (en) | Deep learning-based content recommendation method, device, equipment and storage medium | |
CN112101044B (en) | Intention identification method and device and electronic equipment | |
CN113178193A (en) | Chinese self-defined awakening and Internet of things interaction method based on intelligent voice chip | |
CN115640530A (en) | Combined analysis method for dialogue sarcasm and emotion based on multi-task learning | |
CN116304973A (en) | Classroom teaching emotion recognition method and system based on multi-mode fusion | |
Bluche et al. | Predicting detection filters for small footprint open-vocabulary keyword spotting | |
CN112417132A (en) | New intention recognition method for screening negative samples by utilizing predicate guest information | |
CN115910066A (en) | Intelligent dispatching command and operation system for regional power distribution network | |
Huang et al. | Speech emotion recognition using convolutional neural network with audio word-based embedding | |
CN115033695A (en) | Long-dialog emotion detection method and system based on common sense knowledge graph | |
Farooq et al. | Mispronunciation detection in articulation points of Arabic letters using machine learning | |
Ananthi et al. | Speech recognition system and isolated word recognition based on Hidden Markov model (HMM) for Hearing Impaired | |
CN114360584A (en) | Phoneme-level-based speech emotion layered recognition method and system | |
CN114254096A (en) | Multi-mode emotion prediction method and system based on interactive robot conversation | |
CN117056506A (en) | Public opinion emotion classification method based on long-sequence text data | |
CN115376547B (en) | Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium | |
CN116434786A (en) | Text-semantic-assisted teacher voice emotion recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||