WO2023078370A1 - Conversation sentiment analysis method and apparatus, and computer-readable storage medium - Google Patents

Conversation sentiment analysis method and apparatus, and computer-readable storage medium Download PDF

Info

Publication number
WO2023078370A1
Authority
WO
WIPO (PCT)
Prior art keywords
utterance
dialogue
features
node
feature
Prior art date
Application number
PCT/CN2022/129655
Other languages
French (fr)
Chinese (zh)
Inventor
夏睿
肖德斌
屠要峰
董修岗
周祥生
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2023078370A1 publication Critical patent/WO2023078370A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present application relate to but are not limited to the technical field of data processing, and in particular relate to a dialogue sentiment analysis method, device and computer-readable storage medium.
  • sentiment analysis is an important direction in the field of natural language processing.
  • however, current sentiment analysis methods for natural language are usually aimed at an individual's text expression; performing sentiment analysis through the text modality alone ignores other important features such as intonation, which distorts the analysis results;
  • in addition, sentiment analysis aimed at individuals is mainly based on time-series models, which are not suitable for sentiment analysis of multi-round conversations between multiple people and ignore the influence of the speakers on sentiment analysis.
  • in a first aspect, an embodiment of the present application provides a dialogue sentiment analysis method, the method comprising: acquiring text data and voice data of a target dialogue; and inputting the text data and the voice data into a classification model based on a graph structure to perform emotion classification and obtain emotion classification information, wherein the graph structure includes a plurality of nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data.
  • in a second aspect, an embodiment of the present application further provides a dialogue sentiment analysis apparatus, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the dialogue sentiment analysis method described in the first aspect.
  • in a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the dialogue sentiment analysis method described in the first aspect.
  • the embodiment of the present application includes: obtaining text data and speech data of a target dialogue; and inputting the text data and speech data into a graph-structure-based classification model for emotion classification to obtain emotion classification information, wherein the graph structure includes multiple nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data.
  • Fig. 1 is a flowchart of a dialogue sentiment analysis method according to an embodiment of the present application;
  • Fig. 2 is a detailed flowchart of step S200 in Fig. 1;
  • Fig. 3 is a detailed flowchart of step S240 in Fig. 2;
  • Fig. 4 is a detailed flowchart of step S241 in Fig. 3;
  • Fig. 5 is a structural diagram of a dialog sentiment analysis device according to an embodiment of the present application.
  • the embodiment of the present application provides a dialogue sentiment analysis method, apparatus, and computer-readable storage medium; text data and voice data are input into a graph-structure-based classification model for emotion classification to obtain emotion classification information, wherein the graph structure includes multiple nodes and corresponding node features, and the nodes correspond one-to-one to the utterances in the target dialogue.
  • the node features include text features of the text data and speech features of the voice data; the method can combine the bimodal text and speech information of a multi-person dialogue for emotion classification.
  • the sequence features of the text data and of the voice data are captured separately; the target dialogue is modelled with a graph structure, and a graph attention network captures the important global features, so that the classification model attends to the sequence features of the two modalities while also attending to their semantic features; this strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification for multi-person dialogue.
  • An embodiment of the present application provides a dialogue sentiment analysis method.
  • FIG. 1 is a flow chart of a method for dialogue sentiment analysis.
  • the dialogue sentiment analysis method includes but is not limited to the following steps:
  • Step S100 acquiring text data and voice data of the target dialogue.
  • the target dialogue is a dialogue between multiple persons, and the target dialogue is composed of multiple sentences.
  • the text data and the voice data are extracted from the target dialogue, and the text data and the voice data are in one-to-one correspondence.
  • the text data corresponds to the entire target dialogue, and the voice data also corresponds to the entire target dialogue.
  • the voice data can be converted from the text data, and the text data can also be converted from the voice data.
  • the target dialogue may be a voice chat input by multiple people through a microphone, or a voice record, or dialogue text data recognized from chat picture records, or dialogue text data input through a keyboard, etc.
  • the voice data is preprocessed as follows: the conversation is divided into an ordered set of utterances according to the speaker's breaths or pauses, and the audio data is converted into a preset audio format; in this embodiment the preset audio format is the .wav format, although other formats may also be used in other embodiments.
  • the preprocessing method of the text data is as follows: the division of the text data and the audio data is aligned at the utterance level; characters such as stop words and symbols in the text data are removed.
  • in step S200, the text data and voice data are input into a graph-structure-based classification model for emotion classification, and emotion classification information is obtained.
  • FIG. 2 is a specific flowchart of step S200.
  • step S200 includes but is not limited to the following steps:
  • Step S210 for each utterance of the target dialogue, perform feature extraction on the text data to obtain multiple text features corresponding to the utterance, and perform feature extraction on the voice data to obtain multiple voice features corresponding to the utterance.
  • the text features of the text data include the first utterance-level features of the text data
  • the phonetic features of the voice data include the second utterance-level features of the voice data.
  • each word of the text data is mapped to a word vector to obtain a word vector matrix, and utterance-level feature extraction is performed on the word vector matrix to obtain the first utterance-level feature; for each utterance of the target dialogue, utterance-level feature extraction is also performed on the speech data to obtain a second utterance-level feature.
  • the first utterance-level feature is extracted from the text data as follows: each word of each utterance of the target dialogue in the text data is mapped to a corresponding word vector by the Global Vectors (GloVe) algorithm, and the word vectors form a word vector matrix; the word vector matrix is input to a convolutional neural network for utterance-level feature extraction, and the convolutional neural network outputs the first utterance-level feature.
  • the convolutional neural network has three convolution kernels. The dimensions of the three convolution kernels are 3, 4, and 5 respectively. Each convolution kernel corresponds to 50 output channels, and the results of the convolution calculation are maximally pooled.
  • the result of the maximum pooling layer passes through the fully connected layer to obtain the first utterance-level features of fixed dimensions.
  • during training of the convolutional neural network, the emotion category label of each utterance is used as the training label, cross entropy is used as the loss function, and the network is trained through the back-propagation algorithm to update and fine-tune the network parameters.
  • feature extraction networks with other structures may also be used to extract utterance-level features from text data.
  • the second utterance-level feature is extracted from the speech data as follows: a configuration file, such as IS09_emotion.conf, is selected in the openSMILE software; the speech data is then input to openSMILE, which outputs a second utterance-level feature of fixed dimension; the second utterance-level feature includes the Mel cepstral coefficients, frequencies, and other attributes of the speech data.
  • of course, in other embodiments, other methods may also be used to extract utterance-level features from the speech data.
  • the text feature of the text data also includes the first hidden layer state of the text data
  • the speech feature of the speech data also includes the second hidden layer state of the speech data.
  • for each utterance of the target dialogue, the first utterance-level features are input in time order into an encoding network to obtain the first hidden layer state corresponding to the text data; likewise, the second utterance-level features are input in time order into an encoding network to obtain the second hidden layer state corresponding to the speech data.
  • the first utterance-level feature and the second utterance-level feature constitute a bimodal utterance-level feature $u_i = [u_i^T, u_i^A]$, where $u_i^T$ denotes the first utterance-level feature of the i-th utterance in dialogue d and $u_i^A$ denotes the second utterance-level feature of the i-th utterance in d.
  • the text data is encoded to obtain the first hidden layer state as follows: the first utterance-level features are fed in time order as the input of each time step of a Long Short-Term Memory (LSTM) network, and the LSTM network encodes them according to $h_i^T = [\overrightarrow{h}_i^T ; \overleftarrow{h}_i^T]$ to obtain the first hidden layer state corresponding to the text data, where $\overrightarrow{h}_i^T$ is the forward representation of the text data in the bidirectional LSTM network, $\overleftarrow{h}_i^T$ is the backward representation, and $h_i^T$ is the first hidden layer state obtained by concatenating the forward and backward representations.
  • the speech data is encoded to obtain the second hidden layer state in the same way: the second utterance-level features are fed in time order as the input of each time step of an LSTM network, which encodes them according to $h_i^A = [\overrightarrow{h}_i^A ; \overleftarrow{h}_i^A]$ to obtain the second hidden layer state corresponding to the speech data, where $\overrightarrow{h}_i^A$ and $\overleftarrow{h}_i^A$ are the forward and backward representations of the speech data in the bidirectional LSTM network and $h_i^A$ is their concatenation.
  • the text feature of the text data also includes a first predicted probability distribution of the text data
  • the voice feature of the voice data further includes a second predicted probability distribution of the voice data.
  • the first hidden layer state is classified to obtain the first predicted probability distribution of the text data as follows: the first hidden layer state is input to a softmax layer, and the first predicted probability distribution is expressed as $P_i^T = \mathrm{softmax}(W_T h_i^T + b_T)$, where $P_i^T$ denotes the first predicted probability distribution of the text data of the i-th utterance in the target dialogue, $W_T$ is the weight, and $b_T$ is the configuration parameter.
  • a cross-entropy loss is computed during the emotion classification of the text data, $\mathrm{loss}_T = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}^T$, where $\mathrm{loss}_T$ is the prediction loss of the text data, $y_{i,c}$ is the ground-truth label of the i-th utterance for category c, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
  • the second hidden layer state is input into the softmax layer, and the second predicted probability distribution is expressed as $P_i^A = \mathrm{softmax}(W_A h_i^A + b_A)$, where $P_i^A$ denotes the second predicted probability distribution of the speech data of the i-th utterance in the target dialogue, $W_A$ is the weight, and $b_A$ is the configuration parameter.
  • a cross-entropy loss is likewise computed during the emotion classification of the speech data, $\mathrm{loss}_A = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}^A$, where $\mathrm{loss}_A$ is the prediction loss of the speech data, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
  • that is, in this embodiment, the multiple text features of the text data include the first utterance-level feature, the first hidden layer state, and the first predicted probability distribution of the text data; the multiple voice features of the voice data include the second utterance-level feature, the second hidden layer state, and the second predicted probability distribution.
  • other types of features may also be included.
  • Step S220 for each utterance of the target dialogue, perform feature fusion on multiple text features and multiple voice features to obtain fused features corresponding to the utterance.
  • in step S220, in an embodiment, for each utterance of the target dialogue, the first utterance-level feature, the second utterance-level feature, the first hidden layer state, the second hidden layer state, the first predicted probability distribution, and the second predicted probability distribution are spliced into one vector to obtain the fused feature corresponding to the utterance.
  • the first utterance-level features of the text data and the second utterance-level features of the speech data of the same speaker in the target dialogue are initialized to a fixed-dimensional vector, that is, speaker features.
  • step S230 a graph structure is constructed with utterances as nodes and fusion features as node features.
  • step S230 includes but is not limited to the following steps: taking each utterance of the target dialogue as a node of the graph structure; and taking the fused feature corresponding to the utterance as the node feature of that node;
  • since the information of each utterance in the target dialogue is related to the global information, the nodes are connected pairwise to obtain a complete undirected graph as the graph structure, and one complete undirected graph corresponds to one dialogue.
  • Step S240 perform emotion classification on the target dialogue according to the graph structure, and obtain emotion classification information.
  • FIG. 3 is a specific flowchart of step S240.
  • step S240 includes but is not limited to the following steps:
  • Step S241 update the node features of the graph structure based on the attention mechanism, and obtain new node features fused with global feature information.
  • FIG. 4 is a specific flowchart of step S241.
  • step S241 includes but is not limited to the following steps:
  • Step S2411 perform linear mapping on all nodes of the graph structure to obtain new nodes.
  • each time the graph structure is mapped, one layer of the network is obtained; the entire graph attention network can be a multi-layer network structure.
  • the linear mapping can be expressed as $z_i^{(l)} = W^{(l)} h_i^{(l)}$, where $h_i^{(l)}$ is the input of the linear mapping of the l-th layer of the graph attention network, $W^{(l)}$ is the weight of the linear mapping of the l-th layer, and $z_i^{(l)}$ is the output of the linear mapping of the l-th layer.
  • Step S2412: determine a central node from the new nodes, and determine the adjacent nodes of the central node; it should be noted that, in this embodiment, each new node is taken in turn as the central node, and the adjacent nodes of the central node are the nodes directly connected to it.
  • Step S2413: respectively calculate the first attention weight from the central node to each adjacent node; the node features of the two nodes are spliced into one vector, the spliced vector is multiplied (dot product) with a learnable weight vector, and the result of the dot product is passed through the LeakyReLU activation function for nonlinear activation to obtain the attention score between the two nodes.
  • the attention score can be expressed as $e_{ij}^{(l)} = \mathrm{LeakyReLU}\big(a^{(l)\top} [z_i^{(l)} \,\|\, z_j^{(l)}]\big)$, where $e_{ij}^{(l)}$ is the attention score from the i-th node to the j-th node in the l-th layer of the graph attention network, $a^{(l)}$ is the learnable weight vector of the l-th layer, $z_i^{(l)}$ is the i-th node of the l-th layer, and $z_j^{(l)}$ is the j-th node of the l-th layer; the attention scores from the central node to each adjacent node are calculated in this way, and the first attention weights are obtained by normalizing the attention scores according to $\alpha_{ij}^{(l)} = \frac{\exp\!\big(e_{ij}^{(l)}\big)}{\sum_{k \in N_i} \exp\!\big(e_{ik}^{(l)}\big)}$, where $\alpha_{ij}^{(l)}$ is the first attention weight from the i-th node to the j-th node in the l-th layer, and $N_i$ is the set of all adjacent nodes.
  • Step S2414: weight and sum over all the first attention weights to obtain a second attention weight, and apply the ReLU activation function to the second attention weight for nonlinear activation to obtain the new node feature fused with global feature information;
  • the node feature update can be expressed as $h_i^{(l+1)} = \mathrm{ReLU}\Big(\sum_{j \in N_i} \alpha_{ij}^{(l)} z_j^{(l)}\Big)$, where $h_i^{(l+1)}$ is the new node feature and $\sum_{j \in N_i} \alpha_{ij}^{(l)} z_j^{(l)}$ is the second attention weight.
  • the new node features are used as the output of the current layer network.
  • the graph attention network obtained by updating the node features of the graph structure based on the attention mechanism has a multi-layer structure, and the input of each layer is the output of the previous layer.
  • Step S242 perform emotion classification on the target dialogue according to the new node features, and obtain emotion classification information.
  • the classifier adopts a softmax classifier; of course, other types of classifiers may also be used in other embodiments.
  • the softmax classifier predicts the probability distribution over emotion categories, where $P_i$ is the predicted probability distribution of the i-th utterance in the target dialogue;
  • the category corresponding to the maximum probability in $P_i$ is the emotion classification information.
  • the loss of the main emotion classification task is calculated with the cross-entropy loss, and the prediction loss of the main task is expressed as $\mathrm{loss}_B = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}$, where $\mathrm{loss}_B$ is the prediction loss of the main task, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
  • the bimodal text and speech information of the multi-person dialogue is used for emotion classification, the sequence features of the two modalities are captured separately, and the two modalities are independently predicted and compared.
  • the embodiment of the present application also provides a dialog sentiment analysis device.
  • FIG. 5 is a structural diagram of a dialogue emotion analysis device.
  • the dialog sentiment analysis device includes: a memory 20 , a processor 10 and a computer program stored in the memory 20 and operable on the processor 10 .
  • the processor 10 executes the computer program, the above dialog sentiment analysis method is realized.
  • the processor 10 and the memory 20 may be connected through a bus 30 or other means.
  • the memory 20 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 20 includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the non-transitory software programs and instructions required to implement the information processing method of the above embodiment are stored in the memory 20; when executed by the processor, the dialogue sentiment analysis method of the above embodiment is performed, for example, steps S100 to S200, steps S210 to S240, steps S241 to S242, and steps S2411 to S2414 described above are executed.
  • the apparatus embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being executed by a processor or a controller; for example, execution by a processor may cause the processor to perform the dialogue sentiment analysis method in the above embodiment, for example, to execute steps S100 to S200, S210 to S240, S241 to S242, and S2411 to S2414 described above.
  • the embodiment of the present application can combine the bimodal text and speech information of a multi-person dialogue for emotion classification and capture the respective sequence features of the text data and voice data; the graph structure is used to model the target dialogue,
  • and the graph attention network is used to capture important global features, so that the classification model attends not only to the sequence features of the two modalities but also to their semantic features; for multi-person dialogue, this strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

Provided in the embodiments of the present application are a conversation sentiment analysis method and apparatus, and a computer-readable storage medium. The method comprises: acquiring text data and speech data of a target conversation (S100); and inputting the text data and the speech data into a classification model based on a graph structure to perform sentiment classification by means of the classification model, wherein the graph structure comprises a plurality of nodes and node features, each node corresponds to an utterance, and each node feature comprises a text feature and a speech feature (S200).

Description

Dialogue sentiment analysis method, apparatus and computer-readable storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, the Chinese patent application with application number 202111295920.7 filed on November 3, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to, but are not limited to, the technical field of data processing, and in particular to a dialogue sentiment analysis method, apparatus and computer-readable storage medium.
Background
On the Internet, users are willing to share their experiences and express their opinions through text and voice, and sentiment analysis is therefore an important direction in the field of natural language processing. However, current sentiment analysis methods for natural language are usually aimed at an individual's text expression; performing sentiment analysis through the text modality alone ignores other important features such as intonation and distorts the analysis results. In addition, sentiment analysis aimed at individuals is mainly based on time-series models, which are not suitable for sentiment analysis of multi-round conversations between multiple people and ignore the influence of the speakers on sentiment analysis.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
In a first aspect, an embodiment of the present application provides a dialogue sentiment analysis method, the method comprising: acquiring text data and voice data of a target dialogue; and inputting the text data and the voice data into a classification model based on a graph structure to perform emotion classification and obtain emotion classification information, wherein the graph structure includes a plurality of nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data.
In a second aspect, an embodiment of the present application further provides a dialogue sentiment analysis apparatus, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the dialogue sentiment analysis method described in the first aspect.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the dialogue sentiment analysis method described in the first aspect.
The embodiment of the present application includes: obtaining text data and speech data of a target dialogue; and inputting the text data and speech data into a graph-structure-based classification model for emotion classification to obtain emotion classification information, wherein the graph structure includes multiple nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data.
Additional features and advantages of the present application will be set forth in the description which follows and will in part be apparent from the description or be learned by practice of the present application. The objectives and other advantages of the present application can be realized and attained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification; together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not limit it.
Fig. 1 is a flowchart of a dialogue sentiment analysis method according to an embodiment of the present application;
Fig. 2 is a detailed flowchart of step S200 in Fig. 1;
Fig. 3 is a detailed flowchart of step S240 in Fig. 2;
Fig. 4 is a detailed flowchart of step S241 in Fig. 3;
Fig. 5 is a structural diagram of a dialogue sentiment analysis apparatus according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solution and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the schematic diagram of the apparatus and a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed with a module division different from that in the apparatus or in an order different from that in the flowcharts. The terms "first", "second" and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The embodiment of the present application provides a dialogue sentiment analysis method, apparatus, and computer-readable storage medium; text data and voice data are input into a graph-structure-based classification model for emotion classification to obtain emotion classification information, wherein the graph structure includes multiple nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data. The method can combine the bimodal text and speech information of a multi-person dialogue for emotion classification and capture the sequence features of the two modalities separately; the target dialogue is modelled with a graph structure, and a graph attention network captures the important global features, so that the classification model attends to the sequence features of the two modalities while also attending to their semantic features. This strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification for multi-person dialogue.
The embodiments of the present application are further described below with reference to the accompanying drawings.
An embodiment of the present application provides a dialogue sentiment analysis method.
Referring to Fig. 1, Fig. 1 is a flowchart of the dialogue sentiment analysis method. The dialogue sentiment analysis method includes but is not limited to the following steps:
Step S100: acquiring text data and voice data of the target dialogue.
Regarding step S100, the target dialogue is acquired. It should be noted that the target dialogue is a dialogue between multiple persons and is composed of multiple utterances.
The text data and the voice data are extracted from the target dialogue and correspond to each other one-to-one. The text data corresponds to the entire target dialogue, and the voice data likewise corresponds to the entire target dialogue; the voice data can be converted from the text data, and the text data can also be converted from the voice data.
It should be noted that the target dialogue may be a voice chat input by multiple people through microphones, a voice recording, dialogue text recognized from chat screenshots, dialogue text input through a keyboard, and so on.
In addition, after the text data and voice data of the target dialogue are acquired, they need to be preprocessed to facilitate the subsequent emotion classification.
The voice data is preprocessed as follows: the conversation is divided into an ordered set of utterances according to the speaker's breaths or pauses, and the audio data is converted into a preset audio format; in this embodiment the preset audio format is the .wav format, although other formats may also be used in other embodiments.
The text data is preprocessed as follows: the division of the text data is aligned with that of the audio data at the utterance level, and stop words, symbols, and other such characters are removed from the text data.
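As a non-limiting illustration of this preprocessing, the sketch below segments a recording on pauses and cleans the aligned transcripts; pydub is only one possible tool, and the silence thresholds, file names and stop-word list are assumptions rather than values taken from the application.
```python
import re
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Hypothetical stop-word list; a real system would use a full list for the target language.
STOP_WORDS = {"the", "a", "an", "uh", "um"}

def preprocess_dialogue(audio_path, transcripts):
    """Split a dialogue recording into utterance-level .wav clips (an ordered set of
    utterances) and clean the transcript aligned with each utterance."""
    audio = AudioSegment.from_file(audio_path)
    # Segment on speaker pauses; the thresholds below are illustrative only.
    clips = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)
    utterances = []
    for i, (clip, text) in enumerate(zip(clips, transcripts)):
        wav_name = f"utterance_{i}.wav"
        clip.export(wav_name, format="wav")           # preset .wav audio format
        tokens = re.findall(r"[\w']+", text.lower())  # drop symbols
        cleaned = [t for t in tokens if t not in STOP_WORDS]
        utterances.append({"wav": wav_name, "tokens": cleaned})
    return utterances
```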
Step S200: inputting the text data and voice data into a graph-structure-based classification model for emotion classification to obtain emotion classification information.
Referring to Fig. 2, Fig. 2 is a detailed flowchart of step S200. Step S200 includes but is not limited to the following steps:
Step S210: for each utterance of the target dialogue, performing feature extraction on the text data to obtain multiple text features corresponding to the utterance, and performing feature extraction on the voice data to obtain multiple voice features corresponding to the utterance.
Regarding step S210, the text features of the text data include a first utterance-level feature of the text data, and the voice features of the voice data include a second utterance-level feature of the voice data. For each utterance of the target dialogue, each word of the text data is mapped to a word vector to obtain a word vector matrix, and utterance-level feature extraction is performed on the word vector matrix to obtain the first utterance-level feature; utterance-level feature extraction is also performed on the speech data to obtain the second utterance-level feature.
In an embodiment, the first utterance-level feature is extracted from the text data as follows: each word of each utterance of the target dialogue in the text data is mapped to a corresponding word vector by the Global Vectors (GloVe) algorithm, and the word vectors form a word vector matrix; the word vector matrix is input to a convolutional neural network for utterance-level feature extraction, and the convolutional neural network outputs the first utterance-level feature. The convolutional neural network has three kinds of convolution kernels, of sizes 3, 4 and 5, each corresponding to 50 output channels; the convolution results pass through a max-pooling layer and then a fully connected layer to obtain a first utterance-level feature of fixed dimension. During training, the emotion category label of each utterance is used as the training label, cross entropy is used as the loss function, and the convolutional neural network is trained through the back-propagation algorithm to update and fine-tune the network parameters. Of course, in other embodiments, feature extraction networks of other structures may also be used to extract utterance-level features from the text data.
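A minimal PyTorch sketch of such a text encoder (word-vector matrix in, fixed-size utterance feature out), using the kernel sizes 3, 4 and 5 with 50 channels each described above; the embedding and output dimensions are assumptions.
```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Utterance-level text feature extractor: GloVe word-vector matrix -> fixed-size feature."""
    def __init__(self, emb_dim=300, out_dim=100, kernel_sizes=(3, 4, 5), channels=50):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, channels, k) for k in kernel_sizes)
        self.fc = nn.Linear(channels * len(kernel_sizes), out_dim)

    def forward(self, word_vectors):                 # (batch, seq_len, emb_dim)
        x = word_vectors.transpose(1, 2)             # Conv1d expects (batch, emb_dim, seq_len)
        # Convolution, then max pooling over the time dimension for each kernel size
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))     # first utterance-level feature
```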
The second utterance-level feature is extracted from the speech data as follows: a configuration file, such as IS09_emotion.conf, is selected in the openSMILE software; the speech data is then input to openSMILE, and openSMILE outputs a second utterance-level feature of fixed dimension, which includes the Mel cepstral coefficients, frequencies and other attributes of the speech data. Of course, in other embodiments, other methods may also be used to extract utterance-level features from the speech data.
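The acoustic features might be extracted by calling the openSMILE command-line tool roughly as below; the paths to the SMILExtract binary and to the IS09_emotion.conf configuration depend on the local installation and are assumptions here.
```python
import subprocess

def extract_is09_features(wav_path, out_path="features.arff"):
    """Run openSMILE on one utterance; the fixed-dimensional second utterance-level
    feature (MFCC-based functionals, pitch, energy, etc.) is written to out_path."""
    subprocess.run(
        ["SMILExtract", "-C", "IS09_emotion.conf", "-I", wav_path, "-O", out_path],
        check=True,
    )
    return out_path
```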
In addition, the text features of the text data also include a first hidden layer state of the text data, and the voice features of the voice data also include a second hidden layer state of the voice data. For each utterance of the target dialogue, the first utterance-level features are input in time order into an encoding network for encoding to obtain the first hidden layer state corresponding to the text data; for each utterance of the target dialogue, the second utterance-level features are input in time order into an encoding network for encoding to obtain the second hidden layer state corresponding to the voice data.
The first utterance-level feature and the second utterance-level feature constitute a bimodal utterance-level feature $u_i = [u_i^T, u_i^A]$, where $u_i^T$ denotes the first utterance-level feature of the i-th utterance in dialogue d and $u_i^A$ denotes the second utterance-level feature of the i-th utterance in d.
In an embodiment, the text data is encoded to obtain the first hidden layer state as follows: the first utterance-level features are fed in time order as the input of each time step of a Long Short-Term Memory (LSTM) network, and the LSTM network encodes them according to $h_i^T = [\overrightarrow{h}_i^T ; \overleftarrow{h}_i^T]$ to obtain the first hidden layer state corresponding to the text data, where $\overrightarrow{h}_i^T$ is the forward representation of the text data in the bidirectional LSTM network, $\overleftarrow{h}_i^T$ is the backward representation of the text data in the bidirectional LSTM network, and $h_i^T$ is the first hidden layer state obtained by concatenating the forward and backward representations.
The speech data is encoded to obtain the second hidden layer state as follows: the second utterance-level features are fed in time order as the input of each time step of a Long Short-Term Memory (LSTM) network, and the LSTM network encodes them according to $h_i^A = [\overrightarrow{h}_i^A ; \overleftarrow{h}_i^A]$ to obtain the second hidden layer state corresponding to the speech data, where $\overrightarrow{h}_i^A$ is the forward representation of the speech data in the bidirectional LSTM network, $\overleftarrow{h}_i^A$ is the backward representation of the speech data in the bidirectional LSTM network, and $h_i^A$ is the second hidden layer state obtained by concatenating the forward and backward representations.
In addition, the text features of the text data also include a first predicted probability distribution of the text data, and the voice features of the voice data also include a second predicted probability distribution of the voice data. For each utterance of the target dialogue, probability prediction is performed according to the first hidden layer state to obtain the first predicted probability distribution corresponding to the text data; for each utterance of the target dialogue, probability prediction is performed according to the second hidden layer state to obtain the second predicted probability distribution corresponding to the voice data.
In an embodiment, the first hidden layer state is classified to obtain the first predicted probability distribution of the text data as follows: the first hidden layer state is input to a softmax layer, and the first predicted probability distribution is expressed as $P_i^T = \mathrm{softmax}(W_T h_i^T + b_T)$, where $P_i^T$ denotes the first predicted probability distribution of the text data of the i-th utterance in the target dialogue, $W_T$ is the weight, and $b_T$ is the configuration parameter. A cross-entropy loss is computed during the emotion classification of the text data and is expressed as $\mathrm{loss}_T = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}^T$, where $\mathrm{loss}_T$ is the prediction loss of the text data, $y_{i,c}$ is the ground-truth label of the i-th utterance for category c, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
The second hidden layer state is classified to obtain the second predicted probability distribution of the speech data as follows: the second hidden layer state is input to the softmax layer, and the second predicted probability distribution is expressed as $P_i^A = \mathrm{softmax}(W_A h_i^A + b_A)$, where $P_i^A$ denotes the second predicted probability distribution of the speech data of the i-th utterance in the target dialogue, $W_A$ is the weight, and $b_A$ is the configuration parameter. A cross-entropy loss is likewise computed during the emotion classification of the speech data and is expressed as $\mathrm{loss}_A = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}^A$, where $\mathrm{loss}_A$ is the prediction loss of the speech data, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
That is, in this embodiment, the multiple text features of the text data include the first utterance-level feature, the first hidden layer state, and the first predicted probability distribution of the text data; the multiple voice features of the voice data include the second utterance-level feature, the second hidden layer state, and the second predicted probability distribution. Of course, in other embodiments, other kinds of features may also be included.
Step S220: for each utterance of the target dialogue, performing feature fusion on the multiple text features and the multiple voice features to obtain a fused feature corresponding to the utterance.
Regarding step S220, in an embodiment, for each utterance of the target dialogue, the first utterance-level feature, the second utterance-level feature, the first hidden layer state, the second hidden layer state, the first predicted probability distribution, and the second predicted probability distribution are spliced into one vector to obtain the fused feature corresponding to the utterance.
The first utterance-level features of the text data and the second utterance-level features of the speech data of the same speaker in the target dialogue are initialized into a fixed-dimensional vector, namely the speaker feature. The speaker features are arranged according to the speaking order to obtain the sequence $s = [\ldots, s_i, \ldots]$, where $s_i$ denotes the speaker feature corresponding to the i-th utterance in the target dialogue.
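One possible realization of the speaker features is a small learnable embedding table indexed by speaker identity, as sketched below; the number of speakers, the embedding dimension, and the use of nn.Embedding are assumptions made for illustration.
```python
import torch
import torch.nn as nn

num_speakers, speaker_dim = 2, 32                  # assumed values
speaker_embedding = nn.Embedding(num_speakers, speaker_dim)

# speaker_ids[i] is the speaker of the i-th utterance, in speaking order
speaker_ids = torch.tensor([0, 1, 0, 1, 1])
s = speaker_embedding(speaker_ids)                 # sequence s = [..., s_i, ...], one row per utterance
```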
The speaker feature, the first hidden layer state, the second hidden layer state, the first predicted probability distribution, and the second predicted probability distribution are spliced along the feature dimension, expressed as $h_i = [s_i; h_i^T; h_i^A; P_i^T; P_i^A]$, where $h_i$ denotes the fused feature corresponding to the i-th utterance in the target dialogue. In terms of the fused features, the dialogue can then be expressed as $d = [h_1, \ldots, h_i, \ldots]$.
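A minimal sketch of this fusion step: for each utterance, the speaker feature, the two hidden layer states and the two predicted probability distributions are concatenated along the feature dimension; the argument order mirrors the description above.
```python
import torch

def fuse_utterance(s_i, h_text_i, h_audio_i, p_text_i, p_audio_i):
    """Vector splicing of the per-utterance features into the fused node feature h_i."""
    return torch.cat([s_i, h_text_i, h_audio_i, p_text_i, p_audio_i], dim=-1)

# Applied to every utterance, a dialogue becomes the sequence d = [h_1, ..., h_i, ...]
```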
Step S230: constructing a graph structure with the utterances as nodes and the fused features as node features.
Step S230 includes but is not limited to the following steps: taking each utterance of the target dialogue as a node of the graph structure; taking the fused feature corresponding to the utterance as the node feature of that node; and, since the information of each utterance in the target dialogue is related to the global information, connecting the nodes pairwise to obtain a complete undirected graph as the graph structure, where one complete undirected graph corresponds to one dialogue.
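The complete undirected graph over the utterances of one dialogue might be represented with an edge list as below; the tensor-based representation and the exclusion of self-loops are assumptions made for illustration.
```python
import torch

def build_complete_graph(node_features):
    """node_features: (num_utterances, feat_dim) fused features h_i.
    Returns the node feature matrix and the edge index of a complete undirected graph."""
    n = node_features.size(0)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]   # every pair of nodes
    edge_index = torch.tensor(edges, dtype=torch.long).t()           # shape (2, n * (n - 1))
    return node_features, edge_index
```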
Step S240: performing emotion classification on the target dialogue according to the graph structure to obtain emotion classification information.
Referring to Fig. 3, Fig. 3 is a detailed flowchart of step S240. Step S240 includes but is not limited to the following steps:
Step S241: updating the node features of the graph structure based on an attention mechanism to obtain new node features fused with global feature information.
Referring to Fig. 4, Fig. 4 is a detailed flowchart of step S241. Step S241 includes but is not limited to the following steps:
Step S2411: performing a linear mapping on all nodes of the graph structure to obtain new nodes. Each time the graph structure is mapped, one layer of the network is obtained, and the entire graph attention network can be a multi-layer network structure. The linear mapping can be expressed as $z_i^{(l)} = W^{(l)} h_i^{(l)}$, where $h_i^{(l)}$ is the input of the linear mapping of the l-th layer of the graph attention network, $W^{(l)}$ is the weight of the linear mapping of the l-th layer, and $z_i^{(l)}$ is the output of the linear mapping of the l-th layer.
Step S2412: determining a central node from the new nodes and determining the adjacent nodes of the central node. It should be noted that, in this embodiment, each new node is taken in turn as the central node, and the adjacent nodes of the central node are the nodes directly connected to it.
Step S2413: compute the first attention weight from the central node to each adjacent node. The node features of the two nodes are concatenated into one vector, the concatenated vector is dot-multiplied with a learnable weight vector, and the result of the dot product is passed through a LeakyReLU activation function for non-linear activation to obtain the attention score between the two nodes. The attention score can be expressed as

e_ij^(l) = LeakyReLU(a^(l) · [g_i^(l) ; g_j^(l)])

where e_ij^(l) is the attention score from the i-th node to the j-th node at the l-th layer of the graph attention network, a^(l) is the learnable weight vector of the l-th layer, g_i^(l) is the i-th node of the l-th layer, and g_j^(l) is the j-th node of the l-th layer. The attention scores from the central node to each adjacent node are computed in this way. The first attention weight from the central node to each adjacent node is then obtained by normalizing the attention scores, i.e.

α_ij^(l) = exp(e_ij^(l)) / Σ_{k ∈ N_i} exp(e_ik^(l))

where α_ij^(l) is the first attention weight from the i-th node to the j-th node at the l-th layer, and N_i is the set of all adjacent nodes.
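Under the same assumed shapes, the attention scoring and its normalization into the first attention weights can be sketched as below; since the dialogue graph is a complete undirected graph, the softmax here runs over all nodes, which is an assumption about how self-connections are handled:

```python
import torch
import torch.nn.functional as F

# Sketch of step S2413: concatenate the features of the central node and each
# adjacent node, dot-multiply with a learnable vector a^(l), apply LeakyReLU to get
# the attention scores e_ij, then normalize with softmax to get alpha_ij.
N, d = 5, 32
g = torch.randn(N, d)          # mapped node features g_i^(l)
a = torch.randn(2 * d)         # learnable weight vector a^(l)
pairs = torch.cat([g.unsqueeze(1).expand(N, N, d),
                   g.unsqueeze(0).expand(N, N, d)], dim=-1)   # [g_i ; g_j] for every pair
scores = F.leaky_relu(pairs @ a)             # attention scores e_ij^(l)
alpha = torch.softmax(scores, dim=-1)        # first attention weights alpha_ij^(l)
```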
Step S2414: take a weighted sum using all the first attention weights to obtain the second attention weight, and pass the second attention weight through a ReLU activation function for non-linear activation to obtain the new node features that fuse global feature information. This node-feature update can be expressed as

h_i^(l+1) = ReLU( Σ_{j ∈ N_i} α_ij^(l) g_j^(l) )

where h_i^(l+1) is the new node feature and Σ_{j ∈ N_i} α_ij^(l) g_j^(l) is the second attention weight. The new node features are also used as the output of the current layer.
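The weighted aggregation and ReLU activation of this step can be sketched as follows, with placeholder inputs standing in for the quantities computed in the previous steps:

```python
import torch
import torch.nn.functional as F

# Sketch of step S2414: each neighbor's mapped feature g_j is weighted by the first
# attention weight alpha_ij, the weighted sum (the "second attention weight" in the
# text) is passed through ReLU, and the result is the updated node feature that
# becomes the output of the current layer.
N, d = 5, 32
g = torch.randn(N, d)                                # mapped node features g_j^(l)
alpha = torch.softmax(torch.randn(N, N), dim=-1)     # stand-in first attention weights
h_new = F.relu(alpha @ g)                            # new node features h_i^(l+1)
```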
In addition, the graph attention network obtained by updating the node features of the graph structure based on the attention mechanism has a multi-layer structure, and the input of each layer is the output of the previous layer; that is, the new node features h_i^(l+1) output by layer l serve directly as the input node features of layer l+1.
Step S242: perform emotion classification on the target dialogue according to the new node features to obtain the emotion classification information.
For step S242, the output h_i^(2) of the graph attention network with a two-layer structure is taken as the input of the classifier. In this embodiment the classifier is a softmax classifier; other types of classifiers may of course be used in other embodiments. The softmax classifier predicts the probability distribution over emotion categories for h_i^(2), i.e.

ŷ_i = softmax(W_c h_i^(2) + b_c)

where ŷ_i is the predicted probability distribution of the i-th utterance in the target dialogue, and W_c and b_c are the learnable parameters of the softmax classifier. The category corresponding to the maximum probability of ŷ_i is the emotion classification information.
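A sketch of such a classification head follows; the linear layer is an assumption, since the text only specifies a softmax classifier over the graph attention network output:

```python
import torch
import torch.nn as nn

# Sketch of step S242: the two-layer GAT output for each utterance is mapped to C
# emotion classes, softmax turns the logits into a probability distribution, and the
# class with the maximum probability is the emotion classification information.
N, d, C = 5, 32, 6                           # hypothetical sizes
h_out = torch.randn(N, d)                    # GAT outputs h_i^(2), one row per utterance
classifier = nn.Linear(d, C)                 # softmax classifier parameters (assumed)
probs = torch.softmax(classifier(h_out), dim=-1)   # predicted distributions y_hat_i
pred = probs.argmax(dim=-1)                  # emotion classification information
```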
For step S242, the loss of the main emotion-classification task is computed with the cross-entropy loss, and the prediction loss of the main task is expressed as

loss_B = − Σ_{d ∈ Corpus} Σ_{i=1}^{I} Σ_{c=1}^{C} y_{i,c} log(ŷ_{i,c})

where loss_B is the prediction loss of the main task, Corpus denotes all dialogues in the data set, I denotes the number of utterances in a dialogue, C is the dimension of the output vector, i.e. the number of emotion categories, y_{i,c} is the ground-truth label of the i-th utterance for category c, and ŷ_{i,c} is the corresponding predicted probability.
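With made-up shapes and labels, the cross-entropy computation for a single dialogue can be sketched as:

```python
import torch
import torch.nn.functional as F

# Sketch of the main-task loss: cross-entropy between the predicted per-utterance
# distributions and the gold emotion labels, summed over the utterances of one
# dialogue (and, during training, over all dialogues in Corpus).
logits = torch.randn(5, 6)              # classifier logits for I = 5 utterances, C = 6 classes
gold = torch.tensor([0, 2, 1, 5, 3])    # gold emotion labels
loss_B = F.cross_entropy(logits, gold, reduction='sum')
```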
For the whole classification model, the final loss is the weighted sum of the prediction loss of the text data, the prediction loss of the speech data, the prediction loss of the main task and a regularization term:

loss_all = λ_T · loss_T + λ_A · loss_A + λ_B · loss_B + λ_r · ‖θ‖²

where loss_all is the final loss, loss_T is the prediction loss of the text data, loss_A is the prediction loss of the speech data, loss_B is the prediction loss of the main task, λ_T, λ_A, λ_B and λ_r are the weights of loss_T, loss_A, loss_B and the L2 regularization term ‖θ‖², respectively, and θ denotes the set of all adjustable parameters in the classification model. The parameters of the whole classification model are fine-tuned by means of the final loss, making the emotion classification of the model more accurate.
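For illustration, the weighted combination could be computed as below; the loss values, weights and parameter stand-ins are all hypothetical:

```python
import torch

# Sketch of the final loss: a weighted sum of the text-modality loss, the
# speech-modality loss, the main-task loss and an L2 regularization term over all
# adjustable parameters theta.
loss_T = torch.tensor(0.7)              # example auxiliary loss on text data
loss_A = torch.tensor(0.9)              # example auxiliary loss on speech data
loss_B = torch.tensor(1.2)              # example main-task loss
lam_T, lam_A, lam_B, lam_r = 0.3, 0.3, 1.0, 1e-4   # hypothetical weights
params = [torch.randn(10, requires_grad=True)]      # stand-in for model parameters theta
l2 = sum((p ** 2).sum() for p in params)            # L2 regularization ||theta||^2
loss_all = lam_T * loss_T + lam_A * loss_A + lam_B * loss_B + lam_r * l2
```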
In this embodiment, emotion classification is performed by combining the bimodal information of the text data and the speech data of a multi-party dialogue: the sequential features of the text modality and of the speech modality are captured separately, and independent emotion prediction on each modality serves as an auxiliary task to strengthen the feature representation of the two modalities. The target dialogue is modeled with a graph structure, and a graph attention network captures globally important features, so that the classification model attends to the semantic features of the two modalities as well as to their sequential features. This strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification for multi-party dialogue.
In addition, an embodiment of the present application further provides a dialogue emotion analysis apparatus.
Referring to FIG. 5, which is a structural diagram of the dialogue emotion analysis apparatus, the apparatus includes a memory 20, a processor 10 and a computer program that is stored in the memory 20 and executable on the processor 10. When the processor 10 executes the computer program, the dialogue emotion analysis method described above is implemented.
The processor 10 and the memory 20 may be connected by a bus 30 or in other ways.
As a non-transitory computer-readable storage medium, the memory 20 may be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 20 may include high-speed random access memory and may also include non-transitory memory, for example at least one magnetic disk storage device, flash memory device or other non-transitory solid-state storage device. In some implementations, the memory 20 includes memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The non-transitory software programs and instructions required to implement the information processing method of the above embodiments are stored in the memory 20; when executed by the processor, they carry out the dialogue emotion analysis method of the above embodiments, for example steps S100 to S200, steps S210 to S240, steps S241 to S242 and steps S2411 to S2414 described above.
The embodiments described above are merely illustrative; units described as separate components may or may not be physically separated, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor or controller, for example by a processor, they cause the processor to perform the dialogue emotion analysis method of the above embodiments, for example steps S100 to S200, steps S210 to S240, steps S241 to S242 and steps S2411 to S2414 described above.
The embodiments of the present application can perform emotion classification by combining the bimodal information of the text data and the speech data of a multi-party dialogue, capturing the respective sequential features of the two modalities. The target dialogue is modeled with a graph structure, and a graph attention network captures globally important features, so that the classification model attends to the semantic features of the two modalities as well as to their sequential features. For multi-party dialogue, this strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification.
A person of ordinary skill in the art can understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor or a microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program units or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program units or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The above is a detailed description of implementations of the present application, but the present application is not limited to the above embodiments. Those skilled in the art may make various equivalent modifications or replacements without departing from the spirit of the present application, and such equivalent modifications or replacements all fall within the scope defined by the claims of the present application.

Claims (11)

  1. A dialogue emotion analysis method, comprising:
    acquiring text data and speech data of a target dialogue; and
    inputting the text data and the speech data into a classification model based on a graph structure for emotion classification to obtain emotion classification information, wherein the graph structure comprises a plurality of nodes and corresponding node features, the nodes correspond one-to-one to utterances in the target dialogue, and the node features comprise text features of the text data and speech features of the speech data.
  2. The dialogue emotion analysis method according to claim 1, wherein the inputting the text data and the speech data into a classification model based on a graph structure for emotion classification to obtain emotion classification information comprises:
    for each utterance of the target dialogue, performing feature extraction on the text data to obtain a plurality of the text features corresponding to the utterance, and performing feature extraction on the speech data to obtain a plurality of the speech features corresponding to the utterance;
    for each utterance of the target dialogue, performing feature fusion on the plurality of text features and the plurality of speech features to obtain a fusion feature corresponding to the utterance;
    constructing the graph structure with the utterances as the nodes and the fusion features as the node features; and
    performing emotion classification on the target dialogue according to the graph structure to obtain the emotion classification information.
  3. The dialogue emotion analysis method according to claim 2, wherein the performing, for each utterance of the target dialogue, feature extraction on the text data to obtain a plurality of the text features corresponding to the utterance and feature extraction on the speech data to obtain a plurality of the speech features corresponding to the utterance comprises:
    for each utterance of the target dialogue, mapping each word of the text data to a word vector to obtain a word vector matrix, and performing utterance-level feature extraction on the word vector matrix to obtain a first utterance-level feature; and
    for each utterance of the target dialogue, performing utterance-level feature extraction on the speech data to obtain a second utterance-level feature.
  4. The dialogue emotion analysis method according to claim 3, wherein the performing, for each utterance of the target dialogue, feature extraction on the text data to obtain a plurality of the text features corresponding to the utterance and feature extraction on the speech data to obtain a plurality of the speech features corresponding to the utterance further comprises:
    for each utterance of the target dialogue, inputting the first utterance-level features into an encoding network in time order for encoding, to obtain a first hidden layer state corresponding to the text data; and
    for each utterance of the target dialogue, inputting the second utterance-level features into an encoding network in time order for encoding, to obtain a second hidden layer state corresponding to the speech data.
  5. The dialogue emotion analysis method according to claim 4, wherein the performing, for each utterance of the target dialogue, feature extraction on the text data to obtain a plurality of the text features corresponding to the utterance and feature extraction on the speech data to obtain a plurality of the speech features corresponding to the utterance further comprises:
    for each utterance of the target dialogue, performing probability prediction according to the first hidden layer state to obtain a first predicted probability distribution corresponding to the text data; and
    for each utterance of the target dialogue, performing probability prediction according to the second hidden layer state to obtain a second predicted probability distribution corresponding to the speech data.
  6. The dialogue emotion analysis method according to claim 5, wherein the performing, for each utterance of the target dialogue, feature fusion on the plurality of text features and the plurality of speech features to obtain the fusion feature corresponding to the utterance comprises:
    for each utterance of the target dialogue, performing vector concatenation on the first utterance-level feature, the second utterance-level feature, the first hidden layer state, the second hidden layer state, the first predicted probability distribution and the second predicted probability distribution to obtain the fusion feature corresponding to the utterance.
  7. The dialogue emotion analysis method according to claim 2, wherein the constructing the graph structure with the utterances as the nodes and the fusion features as the node features comprises:
    taking each utterance of the target dialogue as a node of the graph structure;
    taking the fusion feature corresponding to an utterance as the node feature of the node corresponding to the utterance; and
    connecting the nodes pairwise to obtain a complete undirected graph as the graph structure.
  8. The dialogue emotion analysis method according to claim 2 or 7, wherein the performing emotion classification on the target dialogue according to the graph structure comprises:
    updating the node features of the graph structure based on an attention mechanism to obtain new node features fusing global feature information; and
    performing emotion classification on the target dialogue according to the new node features.
  9. The dialogue emotion analysis method according to claim 8, wherein the updating the node features of the graph structure based on an attention mechanism to obtain new node features fusing global feature information comprises:
    performing a linear mapping on all the nodes to obtain new nodes;
    determining a central node from the new nodes, and determining adjacent nodes adjacent to the central node;
    computing first attention weights from the central node to each of the adjacent nodes; and
    performing a weighted summation on all the first attention weights to obtain a second attention weight, and obtaining, according to the second attention weight, the new node features fusing global feature information.
  10. A dialogue emotion analysis apparatus, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the dialogue emotion analysis method according to any one of claims 1 to 9.
  11. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, cause the processor to perform the dialogue emotion analysis method according to any one of claims 1 to 9.
PCT/CN2022/129655 2021-11-03 2022-11-03 Conversation sentiment analysis method and apparatus, and computer-readable storage medium WO2023078370A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111295920.7 2021-11-03
CN202111295920.7A CN116090474A (en) 2021-11-03 2021-11-03 Dialogue emotion analysis method, dialogue emotion analysis device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023078370A1 true WO2023078370A1 (en) 2023-05-11

Family

ID=86205014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/129655 WO2023078370A1 (en) 2021-11-03 2022-11-03 Conversation sentiment analysis method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN116090474A (en)
WO (1) WO2023078370A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361442A (en) * 2023-06-02 2023-06-30 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence
CN117409780A (en) * 2023-12-14 2024-01-16 浙江宇宙奇点科技有限公司 AI digital human voice interaction method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270921A1 (en) * 2016-03-15 2017-09-21 SESTEK Ses ve Iletisim Bilgisayar Tekn. San. Ve Tic. A.S. Dialog management system
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN108877801A (en) * 2018-06-14 2018-11-23 南京云思创智信息科技有限公司 More wheel dialog semantics based on multi-modal Emotion identification system understand subsystem
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN113254648A (en) * 2021-06-22 2021-08-13 暨南大学 Text emotion analysis method based on multilevel graph pooling

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270921A1 (en) * 2016-03-15 2017-09-21 SESTEK Ses ve Iletisim Bilgisayar Tekn. San. Ve Tic. A.S. Dialog management system
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN108877801A (en) * 2018-06-14 2018-11-23 南京云思创智信息科技有限公司 More wheel dialog semantics based on multi-modal Emotion identification system understand subsystem
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN113254648A (en) * 2021-06-22 2021-08-13 暨南大学 Text emotion analysis method based on multilevel graph pooling

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361442A (en) * 2023-06-02 2023-06-30 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence
CN116361442B (en) * 2023-06-02 2023-10-17 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence
CN117409780A (en) * 2023-12-14 2024-01-16 浙江宇宙奇点科技有限公司 AI digital human voice interaction method and system
CN117409780B (en) * 2023-12-14 2024-02-27 浙江宇宙奇点科技有限公司 AI digital human voice interaction method and system

Also Published As

Publication number Publication date
CN116090474A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
US11043205B1 (en) Scoring of natural language processing hypotheses
US10347244B2 (en) Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
KR102289917B1 (en) Method for processing dialogue using dialogue act information and Apparatus thereof
WO2023078370A1 (en) Conversation sentiment analysis method and apparatus, and computer-readable storage medium
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
US11823678B2 (en) Proactive command framework
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
WO2019118254A1 (en) Chatbot integrating derived user intent
US11081104B1 (en) Contextual natural language processing
US20240153489A1 (en) Data driven dialog management
CN109903750B (en) Voice recognition method and device
US10963819B1 (en) Goal-oriented dialog systems and methods
Masumura et al. Large context end-to-end automatic speech recognition via extension of hierarchical recurrent encoder-decoder models
JP2007256342A (en) Clustering system, clustering method, clustering program and attribute estimation system using clustering program and clustering system
CN111081230A (en) Speech recognition method and apparatus
CN115329779A (en) Multi-person conversation emotion recognition method
Chen et al. Sequence-to-sequence modelling for categorical speech emotion recognition using recurrent neural network
CN112669845A (en) Method and device for correcting voice recognition result, electronic equipment and storage medium
CN115497465A (en) Voice interaction method and device, electronic equipment and storage medium
US11615787B2 (en) Dialogue system and method of controlling the same
CN113853651A (en) Apparatus and method for speech-emotion recognition using quantized emotional states
CN116361442B (en) Business hall data analysis method and system based on artificial intelligence
JP2017167938A (en) Learning device, learning method, and program
KR20230120790A (en) Speech Recognition Healthcare Service Using Variable Language Model
JP7420211B2 (en) Emotion recognition device, emotion recognition model learning device, methods thereof, and programs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22889391

Country of ref document: EP

Kind code of ref document: A1