WO2021134417A1 - Interactive behavior prediction method, intelligent device and computer-readable storage medium - Google Patents

Interactive behavior prediction method, intelligent device and computer-readable storage medium Download PDF

Info

Publication number
WO2021134417A1
WO2021134417A1 PCT/CN2019/130367 CN2019130367W WO2021134417A1 WO 2021134417 A1 WO2021134417 A1 WO 2021134417A1 CN 2019130367 W CN2019130367 W CN 2019130367W WO 2021134417 A1 WO2021134417 A1 WO 2021134417A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
neural network
speech
piece
emotion
Prior art date
Application number
PCT/CN2019/130367
Other languages
English (en)
French (fr)
Inventor
丁万
黄东延
李柏
邵池
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to PCT/CN2019/130367 priority Critical patent/WO2021134417A1/zh
Priority to CN201980003374.XA priority patent/CN111344717B/zh
Publication of WO2021134417A1 publication Critical patent/WO2021134417A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the field of artificial intelligence technology, in particular to an interactive behavior prediction method, an intelligent device and a computer-readable storage medium.
  • the existing emotional interaction behavior theory shows that the change of emotional state in the interaction process has a high correlation with the type of interaction behavior.
  • the prior art recognizes emotions and predicts behaviors based on speech.
  • in actual scenes, however, the emotions in an interaction are expressed collaboratively through multiple modalities (such as face, voice, and text).
  • Speech-based emotion interaction behavior prediction ignores the important features contained in other modal information, which will lead to inaccurate prediction results.
  • An interactive behavior prediction method, comprising: obtaining multiple rounds of dialogue data, and extracting at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data; extracting multi-modal feature data from each piece of the speech data, and generating a multi-modal feature sequence from the multi-modal feature data; and inputting the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network, and obtaining the output of the classification neural network as the predicted interaction behavior of the designated speaker.
  • An intelligent device includes: an acquisition module for acquiring multiple rounds of dialogue data and extracting at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data; an extraction module for extracting the multi-modal feature data in each piece of the speech data and generating a multi-modal feature sequence based on the multi-modal feature data; and an interaction module configured to input the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network and obtain the output of the classification neural network as the predicted interactive behavior of the designated speaker.
  • An intelligent device includes: an acquisition circuit, a processor, and a memory, the processor is coupled to the memory and the acquisition circuit, a computer program is stored in the memory, and the processor executes the computer program to implement The method described above.
  • a computer-readable storage medium stores a computer program, and the computer program can be executed by a processor to implement the above-mentioned method.
  • the present invention extracts the multi-modal feature data in each piece of said speech data, and generates a multi-modal feature sequence based on the multi-modal feature data.
  • the multi-modal feature sequence of at least one piece of speech data is input into a pre-trained classification neural network to obtain the predicted interaction behavior of the designated speaker; emotion recognition is performed through multi-modal features, and the behavior type is then predicted from the emotional changes during the interaction, which can effectively improve prediction accuracy.
  • Figure 1 is an application environment diagram of an interactive behavior prediction method in an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of the first embodiment of the interactive behavior prediction method provided by the present invention.
  • FIG. 3 is a schematic flowchart of a second embodiment of the interactive behavior prediction method provided by the present invention.
  • FIG. 4 is a schematic flowchart of a third embodiment of the interactive behavior prediction method provided by the present invention.
  • FIG. 5 is a schematic flowchart of an embodiment of a method for acquiring multi-modal feature data of each piece of speech data in the interactive behavior prediction method provided by the present invention
  • FIG. 6 is a schematic structural diagram of the first embodiment of the smart device provided by the present invention.
  • Fig. 7 is a schematic structural diagram of a second embodiment of a smart device provided by the present invention.
  • FIG. 8 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided by the present invention.
  • the prior art recognizes emotions and predicts behavior based on speech, but in actual scenes the emotions in an interaction are expressed collaboratively through multiple modalities (such as face, voice, and text). Speech-based emotional interaction behavior prediction ignores the important features contained in other modal information, which leads to inaccurate prediction results.
  • an interactive behavior prediction method is provided, which can improve the accuracy of the interactive behavior prediction.
  • FIG. 1 is an application environment diagram of an interactive behavior prediction method in an embodiment of the present invention.
  • the interactive behavior prediction method is applied to an interactive behavior prediction system.
  • the interactive behavior prediction system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, and a notebook computer.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is used to obtain multiple rounds of dialogue data
  • the server 120 is used to extract at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data, extract the multi-modal feature data in each piece of speech data, and generate a multi-modal feature sequence based on the multi-modal feature data.
  • the multi-modal feature sequence corresponding to at least one piece of speech data is input into the pre-trained classification neural network, and the output of the classification neural network is obtained as the predicted interaction behavior of the designated speaker.
  • FIG. 2 is a schematic flowchart of the first embodiment of the method for predicting interactive behaviors provided by the present invention.
  • the method for predicting interactive behavior provided by the present invention includes the following steps:
  • S101 Obtain multiple rounds of dialogue data, and extract at least one piece of speech data of a designated speaker in the multiple rounds of dialogue data.
  • data of multiple rounds of dialogue is obtained.
  • the dialogue may include two or more speakers, and different speakers can be identified according to different voices of the speakers.
  • the user can select one person from the different speakers as the designated speaker, or select several speakers as designated speakers, and each speaker can be analyzed separately afterwards.
  • the speech data is sorted according to the order in which the designated speaker speaks.
  • S102 Extract multi-modal feature data in each piece of speech data, and generate a multi-modal feature sequence according to the multi-modal feature data.
  • the multi-modal feature data in each piece of speech data is extracted.
  • the multi-modal feature data includes video feature data, audio feature data, and text feature data.
  • the multi-modal feature data of each piece of speech is a multi-dimensional vector,
  • so the multiple pieces of speech data of a designated speaker each correspond to a multi-dimensional vector.
  • These multi-dimensional vectors are arranged according to the chronological order of the corresponding pieces of speech data to generate the multi-modal feature sequence.
  • the multi-modal feature data of each piece of speech can be obtained by inputting each piece of speech data into a pre-trained feature extraction neural network.
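To make the sequence construction concrete, the sketch below shows one way the per-utterance multi-modal vectors could be assembled in chronological order. It is an illustration only: the extractor functions, field names, and dimensions are hypothetical and are not taken from the patent.

```python
# Illustrative sketch (not the patent's implementation): assembling the
# multi-modal feature sequence for one designated speaker.
import numpy as np

def build_feature_sequence(speech_segments, video_net, audio_net, text_net):
    """speech_segments: list of per-utterance dicts, sorted by speaking time."""
    vectors = []
    for seg in speech_segments:
        v = video_net(seg["video"])   # e.g. facial-expression embedding
        a = audio_net(seg["audio"])   # e.g. acoustic/prosody embedding
        t = text_net(seg["text"])     # e.g. transcript embedding
        # concatenate the three modalities into one multi-modal vector
        vectors.append(np.concatenate([v, a, t]))
    # arrange the per-utterance vectors in chronological order -> sequence
    return np.stack(vectors)          # shape: (num_utterances, feature_dim)
```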
  • S103 Input the multi-modal feature sequence corresponding to at least one piece of speech data into the pre-trained classification neural network, and obtain the output result of the classification neural network as the predicted interaction behavior of the designated speaker.
  • the multi-modal feature sequence corresponding to at least one piece of speech data is input into the pre-trained classification neural network, and the output result of the classification neural network is obtained as the predicted interaction behavior of the designated speaker.
  • before training the classification neural network, multiple training multi-modal feature sequences are prepared, the interaction behavior of each training multi-modal feature sequence is labeled, and the network structure of the classification neural network is defined.
  • the number of layers of the classification neural network can be defined, for example, as 19 layers. The type of the classification neural network can also be defined, for example, a convolutional neural network or a fully connected neural network, and so on.
  • the loss function of the classification neural network is defined, as is the condition for terminating its training, for example, stopping after 2000 training iterations.
  • the multi-modal feature sequence corresponding to at least one piece of speech data is input into the classification neural network, and the classification neural network will output the predicted interaction behavior corresponding to the multi-modal feature sequence.
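As a hedged illustration of the training setup just described (labeled training sequences, a defined network structure and loss function, and a termination condition such as stopping after 2000 iterations), the sketch below assumes PyTorch; the patent names neither a framework nor this particular recurrent architecture.

```python
# Minimal sketch under stated assumptions, not the patent's implementation.
import torch
import torch.nn as nn

class ClassificationNet(nn.Module):
    def __init__(self, feat_dim, hidden, num_behaviors):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_behaviors)

    def forward(self, seq):              # seq: (batch, time, feat_dim)
        _, (h, _) = self.rnn(seq)
        return self.fc(h[-1])            # interaction-behavior logits

def train(model, batches, max_steps=2000):
    loss_fn = nn.CrossEntropyLoss()      # the defined loss function
    opt = torch.optim.Adam(model.parameters())
    for step, (seq, label) in enumerate(batches):
        if step >= max_steps:            # termination condition: stop after 2000 steps
            break
        opt.zero_grad()
        loss = loss_fn(model(seq), label)
        loss.backward()
        opt.step()
```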
  • the multi-modal feature data in each piece of speech data is extracted, and a multi-modal feature sequence is generated from the multi-modal feature data.
  • the multi-modal feature sequence of at least one piece of speech data is input into a pre-trained classification neural network to obtain the predicted interaction behavior of the designated speaker; emotion recognition is performed through multi-modal features, and the predicted interactive behavior is then obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
  • FIG. 3 is a schematic flowchart of a second embodiment of the interactive behavior prediction method provided by the present invention.
  • the interactive behavior prediction method provided by the present invention includes the following steps:
  • S201 Obtain multiple rounds of dialogue data, and extract at least one piece of speech data of a designated speaker in the multiple rounds of dialogue data.
  • S202 Extract multi-modal feature data in each piece of speech data, and generate a multi-modal feature sequence according to the multi-modal feature data.
  • steps S201-S202 are basically the same as steps S101-S102 in the first embodiment of the interactive behavior prediction method provided by the present invention, and will not be repeated here.
  • S203 Input the multi-modal feature data of each piece of speech data into the pre-trained emotion recognition neural network, and obtain the output result of the emotion recognition neural network as the emotion data of each piece of speech data.
  • the multi-modal feature data of each piece of speech data is input into the pre-trained emotion recognition neural network, and the output result of the emotion recognition neural network is used as the emotion data of each piece of speech data.
  • the emotion data may be an emotion category corresponding to the multi-modal feature data, or an emotion combination.
  • the emotion recognition neural network needs to be trained, and multiple training multi-modal feature data can be prepared in advance, and the emotion data of each multi-modal feature data can be labeled.
  • the semantic features of the different modalities of each training multi-modal feature datum are obtained; for example, each multi-modal feature datum can be input into a pre-trained semantic feature extraction neural network to obtain the semantic features of its different modalities. Alternatively, multiple sets of semantic features of different modalities can be prepared in advance, with each set corresponding to one training multi-modal feature datum.
  • the emotion recognition neural network includes a convolutional network layer and a long short-term memory (LSTM) network layer.
  • using a two-layer neural network structure can further improve the accuracy of the output emotion data.
  • the emotion recognition neural network may instead include only a single-layer structure; for example, the emotion recognition neural network may be a long short-term memory network.
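The bullets above describe a two-layer structure with a convolutional layer followed by a long short-term memory layer. The sketch below is one plausible reading of that structure; it assumes PyTorch and treats the per-utterance multi-modal feature data as a short frame sequence, both of which are assumptions the patent leaves open.

```python
# Hedged sketch of the conv + LSTM idea; sizes are hypothetical.
import torch
import torch.nn as nn

class EmotionRecognitionNet(nn.Module):
    def __init__(self, feat_dim, conv_channels, hidden, num_emotions):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_emotions)

    def forward(self, x):                # x: (batch, time, feat_dim)
        # convolve over the time axis, then model the sequence with the LSTM
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])           # emotion-category logits per utterance
```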
  • the emotion data of at least one piece of speech data is formed into a speech emotion sequence in a chronological order, the speech emotion sequence is input to the pre-trained behavior prediction neural network, and the output result of the behavior prediction neural network is obtained as the predicted interactive behavior.
  • the emotion data of at least one piece of speech data is composed into a speech emotion sequence in chronological order. For example, if there are three pieces of speech data whose emotion data are A, B, and B, the speech emotion sequence composed in the chronological order of these three pieces of speech data is ABB.
  • the speech emotion sequence is input into the pre-trained behavior prediction neural network, and the output of the behavior prediction neural network is used as the predicted interactive behavior. For example, the predicted interaction behavior corresponding to ABB is frustration.
  • the behavior prediction neural network needs to be trained.
  • Multiple training speech emotion sequences can be prepared in advance, and each training speech emotion sequence can be labeled with its interactive behavior, and the network structure of the behavior prediction neural network can be defined.
  • the number of layers of the behavior prediction neural network can be defined, for example, as 19 layers.
  • the multi-modal feature sequence corresponding to at least one piece of speech data is input into the behavior prediction neural network, and the behavior prediction neural network will output the predicted interactive behavior corresponding to the multi-modal feature sequence.
  • the interactive behavior includes at least one of acceptance, blame, positivity, negativity, and frustration.
  • the behavior prediction neural network is a fully connected neural network.
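Since the behavior prediction neural network is described as fully connected and maps a speech emotion sequence (for example the emotion data A, B, B of three utterances) to an interaction behavior, a minimal sketch of such a mapping is given below. The encoding of emotion data as vectors, the layer sizes, and the fixed maximum utterance count are all assumptions.

```python
# Hedged sketch of a fully connected behavior-prediction network.
import torch
import torch.nn as nn

BEHAVIORS = ["acceptance", "blame", "positive", "negative", "frustration"]

class BehaviorPredictionNet(nn.Module):
    def __init__(self, emotion_dim, max_utterances, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emotion_dim * max_utterances, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(BEHAVIORS)),
        )

    def forward(self, emotion_seq):      # (batch, max_utterances, emotion_dim)
        # flatten the chronological emotion sequence and classify the behavior
        return self.net(emotion_seq.flatten(1))
```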
  • the emotion data of each piece of speech data is obtained, and the emotion data of at least one piece of speech data is
  • the time sequence composes the speech emotion sequence, and the speech emotion sequence is input into the pre-trained behavior prediction neural network to obtain the predicted interaction behavior.
  • the predicted interaction behavior can be obtained according to the emotional changes in the interaction process, which can effectively improve the accuracy of prediction.
  • FIG. 4 is a schematic flowchart of a third embodiment of the interactive behavior prediction method provided by the present invention.
  • the interactive behavior prediction method provided by the present invention includes the following steps:
  • S301 Obtain multiple rounds of dialogue data, and extract at least one piece of speech data of a designated speaker in the multiple rounds of dialogue data.
  • S302 Extract multi-modal feature data in each piece of speech data, and generate a multi-modal feature sequence according to the multi-modal feature data.
  • S303 Input the multi-modal feature data of each piece of speech data into the pre-trained emotion recognition neural network, and obtain the output result of the emotion recognition neural network as the emotion data of each piece of speech data.
  • steps S301-S303 are basically the same as steps S201-S203 in the second embodiment of the interactive behavior prediction method provided by the present invention, and will not be repeated here.
  • S304 Obtain the weight of each emotion data in the speech emotion sequence, multiply each emotion data with its corresponding weight, and input the calculated speech emotion sequence into a pre-trained behavior prediction neural network.
  • the weight of each emotion datum in the speech emotion sequence is obtained, and each emotion datum is multiplied point-wise by its corresponding weight. Within the at least one piece of speech data, the pieces of speech data influence one another: some pieces of speech data are sentences in which the designated speaker expresses his or her own views, while others are merely perfunctory answers by the designated speaker, so different pieces of speech data have different degrees of influence on the predicted interaction behavior of the at least one piece of speech data.
  • the weight of each emotion datum is obtained by performing an attention calculation over the speech emotion sequence: a, the weight of each emotion datum, is produced by a softmax over scores computed from the speech emotion sequence (the full attention expressions are given as formula images in the original publication), where the softmax function is computed as softmax(z_i) = e^{z_i} / Σ_j e^{z_j}.
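The attention step above weights each emotion datum by a softmax-normalized score. Because the exact scoring expressions appear only as formula images in the original publication, the sketch below assumes a simple dot-product score against a learned query vector, purely for illustration.

```python
# Hedged sketch of the attention weighting step; the scoring rule is assumed.
import torch

def attention_weights(emotion_seq, query):
    """emotion_seq: (time, emotion_dim) tensor; query: (emotion_dim,) tensor."""
    scores = emotion_seq @ query               # one score per emotion datum
    a = torch.softmax(scores, dim=0)           # softmax(z_i) = e^{z_i} / sum_j e^{z_j}
    # point-wise multiply each emotion datum by its weight before prediction
    return a.unsqueeze(-1) * emotion_seq
```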
  • S305 Combine the emotion data of at least one piece of speech data into a speech emotion sequence in chronological order, input the speech emotion sequence into a pre-trained behavior prediction neural network, and obtain an output result of the behavior prediction neural network as a predicted interactive behavior.
  • this step is basically the same as step S204 in the second embodiment of the interactive behavior prediction method provided by the present invention, and will not be repeated here.
  • the degree to which the emotion data of different pieces of speech data influence the predicted interaction behavior can thus be taken into account when predicting the interaction behavior, thereby effectively improving prediction accuracy.
  • FIG. 5 is a schematic flowchart of an embodiment of a method for obtaining multi-modal feature data of each piece of speech data in the interactive behavior prediction method provided by the present invention.
  • the method for obtaining the multi-modal feature data of each piece of speech data includes the following steps:
  • S401 Input each piece of speech data into a pre-trained feature extraction neural network, and obtain video feature data, audio feature data, and text feature data of each piece of speech data.
  • S402 Combine the video feature data, audio feature data, and text feature data of each piece of speech data to obtain multi-modal feature data of each piece of speech data.
  • the video feature data, audio feature data, and text feature data of each piece of speech data are concatenated to obtain the multi-modal feature data of each piece of speech data.
  • for example, if the video feature data, audio feature data, and text feature data are each a two-dimensional vector, the multi-modal feature data obtained after concatenation is a six-dimensional vector.
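A toy illustration of the concatenation described above: three 2-dimensional modality vectors joined into one 6-dimensional multi-modal vector. The numeric values are arbitrary placeholders.

```python
# Illustration only; the values are made up.
import numpy as np

video_feat = np.array([0.3, 0.7])    # 2-dimensional video feature
audio_feat = np.array([0.1, 0.9])    # 2-dimensional audio feature
text_feat  = np.array([0.5, 0.2])    # 2-dimensional text feature

multimodal = np.concatenate([video_feat, audio_feat, text_feat])
print(multimodal.shape)              # (6,) -> a six-dimensional vector
```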
  • FIG. 6 is a schematic structural diagram of the first embodiment of the smart device provided by the present invention.
  • the smart device 10 includes an acquisition module 11, an extraction module 12, and an interaction module 13.
  • the obtaining module 11 is used to obtain multiple rounds of dialogue data, and extract at least one piece of speech data of a designated speaker in the multiple rounds of dialogue data.
  • the extraction module 12 is used to extract multi-modal feature data in each piece of speech data, and generate a multi-modal feature sequence according to the multi-modal feature data.
  • the interaction module 13 is configured to input the multi-modal feature sequence corresponding to at least one piece of speech data into the pre-trained classification neural network, and obtain the output result of the classification neural network as the predicted interaction behavior of the designated speaker.
  • the interactive behavior includes at least one of acceptance, blame, positivity, negativity, and frustration. The multi-modal feature data includes video feature data, audio feature data, and text feature data.
  • after the smart device extracts at least one piece of speech data of the designated speaker from the multiple rounds of dialogue data, it extracts the multi-modal feature data in each piece of speech data and generates a multi-modal feature sequence based on the multi-modal feature data.
  • the multi-modal feature sequence of at least one piece of speech data is input into a pre-trained classification neural network to obtain the predicted interaction behavior of the designated speaker; emotion recognition is performed through multi-modal features, and the predicted interactive behaviors are then obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
  • the interaction module 13 includes an emotion data sub-module 131 and an interaction sub-module 132.
  • the emotion data sub-module 131 is used to input the multi-modal feature data of each piece of speech data into the pre-trained emotion recognition neural network, and obtain the output result of the emotion recognition neural network as the emotion data of each piece of speech data.
  • the interaction sub-module 132 is configured to compose the emotion data of at least one piece of speech data into a speech emotion sequence in chronological order, input the speech emotion sequence into a pre-trained behavior prediction neural network, and obtain the output of the behavior prediction neural network as the predicted interaction behavior.
  • the emotion data sub-module 131 includes a weight unit 1311.
  • the weight unit 1311 is used to obtain the weight of each emotion data in the speech emotion sequence, multiply each emotion data with its corresponding weight, and input the calculated speech emotion sequence into the pre-trained behavior prediction neural network.
  • specifically, the weight unit 1311 is used to perform an attention calculation on the speech emotion sequence to obtain the weight of each emotion datum in the speech emotion sequence.
  • the emotion recognition neural network includes a convolutional network layer and a long short-term memory (LSTM) network layer.
  • the behavior prediction neural network is a fully connected neural network.
  • the acquisition module 11 includes a feature extraction sub-module 111, which is used to input each piece of speech data into a pre-trained feature extraction neural network to obtain multi-modal feature data of each piece of speech data.
  • the feature extraction sub-module 111 includes a feature extraction unit 1111 and a fusion unit 1112.
  • the feature extraction unit 1111 is used to input each piece of speech data into a pre-trained video feature extraction neural network to obtain the video feature data of each piece of speech data; input each piece of speech data into a pre-trained audio feature extraction neural network to obtain the audio feature data of each piece of speech data; and input each piece of speech data into a pre-trained text feature extraction neural network to obtain the text feature data of each piece of speech data.
  • the fusion unit 1112 is used to fuse the video feature data, audio feature data, and text feature data of each piece of speech data to obtain multi-modal feature data of each piece of speech data.
  • the smart device 10 also includes a training module 14 for training the classification neural network.
  • the training module 14 includes a preparation sub-module 141, a definition sub-module 142, and an input sub-module 143.
  • the preparation sub-module 141 is used to prepare a plurality of training multi-modal feature sequences and label each training multi-modal feature sequence with its interaction behavior.
  • the definition sub-module 142 is used to define the structure, loss function and termination conditions of the trained classification neural network.
  • the input sub-module 143 is configured to input multiple multi-modal feature sequences and their corresponding labeled interaction behaviors into the classification neural network for training.
  • the smart device in this embodiment inputs the multi-modal feature data of each piece of speech data into the pre-trained emotion recognition neural network to obtain the emotion data of each piece of speech data, composes the emotion data of at least one piece of speech data into a speech emotion sequence in chronological order, inputs the speech emotion sequence into the pre-trained behavior prediction neural network, and predicts the interaction behavior while taking into account how strongly the emotion data of the different pieces of speech data influence the predicted interaction behavior, thereby effectively improving prediction accuracy.
  • FIG. 7 is a schematic structural diagram of a second embodiment of a smart device provided by the present invention.
  • the smart device 20 includes a processor 21, a memory 22, and an acquisition circuit 23.
  • the processor 21 is coupled to the memory 22 and the acquisition circuit 23.
  • a computer program is stored in the memory 22, and the processor 21 executes the computer program in operation to implement the methods shown in FIGS. 2-5. The detailed methods are described above and are not repeated here.
  • after the smart device extracts at least one piece of speech data of the designated speaker from the multiple rounds of dialogue data, it extracts the multi-modal feature data in each piece of speech data and generates a multi-modal feature sequence based on the multi-modal feature data.
  • the multi-modal feature sequence of at least one piece of speech data is input into a pre-trained classification neural network to obtain the predicted interaction behavior of the designated speaker; emotion recognition is performed through multi-modal features, and the predicted interaction behavior is then obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
  • FIG. 8 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided by the present invention.
  • At least one computer program 31 is stored in the computer-readable storage medium 30, and the computer program 31 is executed by a processor to implement the methods shown in FIGS. 2-5.
  • the computer-readable storage medium 30 may be a storage chip in a terminal, a hard disk, or a mobile hard disk, or other readable and writable storage tools such as a USB flash drive, or an optical disk, and may also be a server or the like.
  • the computer program stored in the storage medium in this embodiment can be used to, after at least one piece of speech data of a designated speaker is extracted from multiple rounds of dialogue data, extract the multi-modal feature data in each piece of speech data, generate a multi-modal feature sequence based on the multi-modal feature data, and input the multi-modal feature sequence of at least one piece of speech data into a pre-trained classification neural network to obtain the predicted interaction behavior of the designated speaker. Emotion recognition is performed through the multi-modal feature data, and the predicted interactive behavior is then obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
  • the present invention obtains the multi-modal feature data of the speech data of the designated speaker, performs emotion recognition through the multi-modal features, and then obtains the predicted interaction behavior from the emotional changes during the interaction, which can effectively improve prediction accuracy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

An interactive behavior prediction method, comprising: obtaining multiple rounds of dialogue data, and extracting at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data (S101); extracting multi-modal feature data from each piece of speech data, and generating a multi-modal feature sequence from the multi-modal feature data (S102); and inputting the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network, and obtaining the output of the classification neural network as the predicted interaction behavior of the designated speaker (S103). An intelligent device (20) and a computer-readable storage medium (30) are also disclosed. The method can effectively improve prediction accuracy.

Description

Interactive behavior prediction method, intelligent device and computer-readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular to an interactive behavior prediction method, an intelligent device and a computer-readable storage medium.
Background
Existing emotional interaction behavior theory shows that the change of emotional state during an interaction is highly correlated with the type of interaction behavior. The prior art recognizes emotions and predicts behavior on the basis of speech alone; in actual scenes, however, the emotions in an interaction are expressed collaboratively through multiple modalities (such as face, voice and text). Speech-only emotional interaction behavior prediction ignores the important features contained in the other modalities, which leads to inaccurate prediction results.
Summary of the Application
In view of this, it is necessary to address the above problem by providing an interactive behavior prediction method, an intelligent device and a computer-readable storage medium.
An interactive behavior prediction method, the method comprising: obtaining multiple rounds of dialogue data, and extracting at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data; extracting multi-modal feature data from each piece of the speech data, and generating a multi-modal feature sequence from the multi-modal feature data; and inputting the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network, and obtaining the output of the classification neural network as the predicted interaction behavior of the designated speaker.
An intelligent device, comprising: an acquisition module for obtaining multiple rounds of dialogue data and extracting at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data; an extraction module for extracting multi-modal feature data from each piece of the speech data and generating a multi-modal feature sequence from the multi-modal feature data; and an interaction module for inputting the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network and obtaining the output of the classification neural network as the predicted interaction behavior of the designated speaker.
An intelligent device, comprising an acquisition circuit, a processor and a memory, wherein the processor is coupled to the memory and the acquisition circuit, the memory stores a computer program, and the processor executes the computer program to implement the method described above.
A computer-readable storage medium storing a computer program, wherein the computer program can be executed by a processor to implement the method described above.
The embodiments of the present invention provide the following beneficial effects:
After extracting at least one piece of speech data of the designated speaker from the multiple rounds of dialogue data, the present invention extracts the multi-modal feature data from each piece of the speech data, generates a multi-modal feature sequence from the multi-modal feature data, inputs the multi-modal feature sequence of the at least one piece of speech data into a pre-trained classification neural network, and obtains the predicted interaction behavior of the designated speaker. Emotion recognition is performed through multi-modal features, and the behavior type is then predicted from the emotional changes during the interaction, which can effectively improve prediction accuracy.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
In the drawings:
FIG. 1 is an application environment diagram of an interactive behavior prediction method in an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a first embodiment of the interactive behavior prediction method provided by the present invention;
FIG. 3 is a schematic flowchart of a second embodiment of the interactive behavior prediction method provided by the present invention;
FIG. 4 is a schematic flowchart of a third embodiment of the interactive behavior prediction method provided by the present invention;
FIG. 5 is a schematic flowchart of an embodiment of a method for obtaining the multi-modal feature data of each piece of speech data in the interactive behavior prediction method provided by the present invention;
FIG. 6 is a schematic structural diagram of a first embodiment of the intelligent device provided by the present invention;
FIG. 7 is a schematic structural diagram of a second embodiment of the intelligent device provided by the present invention;
FIG. 8 is a schematic structural diagram of an embodiment of the computer-readable storage medium provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The prior art recognizes emotions and predicts behavior on the basis of speech; in actual scenes, however, the emotions in an interaction are expressed collaboratively through multiple modalities (such as face, voice and text). Speech-only emotional interaction behavior prediction ignores the important features contained in the other modalities, which leads to inaccurate prediction results.
In this embodiment, to solve the above problem, an interactive behavior prediction method is provided that can improve the accuracy of interactive behavior prediction.
Please refer to FIG. 1, which is an application environment diagram of an interactive behavior prediction method in an embodiment of the present invention. Referring to FIG. 1, the interactive behavior prediction method is applied to an interactive behavior prediction system. The interactive behavior prediction system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer and a notebook computer. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 110 is used to obtain multiple rounds of dialogue data. The server 120 is used to extract at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data, extract the multi-modal feature data from each piece of speech data, generate a multi-modal feature sequence from the multi-modal feature data, input the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network, and obtain the output of the classification neural network as the predicted interaction behavior of the designated speaker.
Please refer to FIG. 2, which is a schematic flowchart of the first embodiment of the interactive behavior prediction method provided by the present invention. The interactive behavior prediction method provided by the present invention includes the following steps:
S101: Obtain multiple rounds of dialogue data, and extract at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data.
In a specific implementation scenario, data of multiple rounds of dialogue is obtained. The dialogue may include two or more speakers, and the different speakers can be identified from their different voices. The user may select one of the speakers as the designated speaker, or select several speakers as designated speakers, and each speaker can then be analyzed separately.
After the designated speaker is determined, at least one piece of speech data of the designated speaker in the multiple rounds of dialogue data is extracted. In this implementation scenario, all the speech data of the designated speaker is obtained; in other implementation scenarios, only the designated speaker's speech data whose duration exceeds a preset threshold, or other speech data of the designated speaker satisfying a preset condition, may be obtained.
In this implementation scenario, after all the speech data of the designated speaker is obtained, the pieces of speech data are sorted according to the order in which the designated speaker spoke.
S102: Extract the multi-modal feature data from each piece of speech data, and generate a multi-modal feature sequence from the multi-modal feature data.
In this implementation scenario, the multi-modal feature data in each piece of speech data is extracted; the multi-modal feature data includes video feature data, audio feature data and text feature data. The multi-modal feature data of the pieces of speech data are arranged in chronological order to generate the multi-modal feature sequence. For example, the multi-modal feature data of each piece of speech is a multi-dimensional vector, so the pieces of speech data of the designated speaker each correspond to a multi-dimensional vector; these multi-dimensional vectors are arranged according to the chronological order of the corresponding pieces of speech data to generate the multi-modal feature sequence.
In this implementation scenario, the multi-modal feature data of each piece of speech can be obtained by inputting each piece of speech data into a pre-trained feature extraction neural network. Several different feature extraction neural networks may be used to extract the video feature data, audio feature data and text feature data of each piece of speech data separately, or a single feature extraction neural network may be used to extract the video feature data, audio feature data and text feature data of each piece of speech data.
S103: Input the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network, and obtain the output of the classification neural network as the predicted interaction behavior of the designated speaker.
In this implementation scenario, the multi-modal feature sequence corresponding to the at least one piece of speech data is input into the pre-trained classification neural network, and the output of the classification neural network is obtained as the predicted interaction behavior of the designated speaker.
In this implementation scenario, the classification neural network needs to be trained. Before training, multiple training multi-modal feature sequences are prepared, and each training multi-modal feature sequence is labeled with its interaction behavior. The network structure of the classification neural network is defined: the number of layers can be defined, for example 19 layers, and the type of the classification neural network can also be defined, for example a convolutional neural network or a fully connected neural network. The loss function of the classification neural network and the condition for terminating its training are defined, for example stopping after 2000 training iterations. After training succeeds, the multi-modal feature sequence corresponding to the at least one piece of speech data is input into the classification neural network, and the classification neural network outputs the predicted interaction behavior corresponding to the multi-modal feature sequence.
It can be seen from the above description that in this embodiment, after at least one piece of speech data of the designated speaker is extracted from the multiple rounds of dialogue data, the multi-modal feature data in each piece of speech data is extracted, a multi-modal feature sequence is generated from the multi-modal feature data, the multi-modal feature sequence of the at least one piece of speech data is input into the pre-trained classification neural network, and the predicted interaction behavior of the designated speaker is obtained. Emotion recognition is performed through multi-modal features, and the predicted interaction behavior is then obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
Please refer to FIG. 3, which is a schematic flowchart of the second embodiment of the interactive behavior prediction method provided by the present invention. The interactive behavior prediction method provided by the present invention includes the following steps:
S201: Obtain multiple rounds of dialogue data, and extract at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data.
S202: Extract the multi-modal feature data from each piece of speech data, and generate a multi-modal feature sequence from the multi-modal feature data.
In a specific implementation scenario, steps S201-S202 are substantially the same as steps S101-S102 in the first embodiment of the interactive behavior prediction method provided by the present invention, and are not repeated here.
S203: Input the multi-modal feature data of each piece of speech data into a pre-trained emotion recognition neural network, and obtain the output of the emotion recognition neural network as the emotion data of each piece of speech data.
In this implementation scenario, the multi-modal feature data of each piece of speech data is input into the pre-trained emotion recognition neural network, and the output of the emotion recognition neural network is used as the emotion data of that piece of speech data. The emotion data may be an emotion category corresponding to the multi-modal feature data, or a combination of emotions.
In this implementation scenario, the emotion recognition neural network needs to be trained. Multiple training multi-modal feature data can be prepared in advance, and the emotion data of each multi-modal feature datum is labeled. The semantic features of the different modalities of each training multi-modal feature datum are obtained; for example, each multi-modal feature datum can be input into a pre-trained semantic feature extraction neural network to obtain the semantic features of its different modalities. Alternatively, multiple sets of semantic features of different modalities can be prepared in advance, with each set corresponding to one training multi-modal feature datum.
The network structure of the emotion recognition neural network can also be defined: the number of layers can be defined, for example 19 layers, and the type of the emotion recognition neural network can be defined, for example a convolutional neural network or a fully connected neural network. The loss function of the emotion recognition neural network and the condition for terminating its training are defined, for example stopping after 2000 training iterations. After training succeeds, the multi-modal feature data corresponding to each piece of speech data is input into the emotion recognition neural network, and the emotion recognition neural network outputs the corresponding emotion data.
In this implementation scenario, the emotion recognition neural network includes a convolutional network layer and a long short-term memory network layer. Using a two-layer neural network structure can further improve the accuracy of the output emotion data. In other implementation scenarios, the emotion recognition neural network may include only a single-layer structure; for example, the emotion recognition neural network may be a long short-term memory network.
S204: Compose the emotion data of the at least one piece of speech data into a speech emotion sequence in chronological order, input the speech emotion sequence into a pre-trained behavior prediction neural network, and obtain the output of the behavior prediction neural network as the predicted interaction behavior.
In this implementation scenario, the emotion data of the at least one piece of speech data is composed into a speech emotion sequence in chronological order. For example, if there are three pieces of speech data whose emotion data are A, B and B respectively, the speech emotion sequence composed in the chronological order of these three pieces of speech data is ABB. The speech emotion sequence is input into the pre-trained behavior prediction neural network, and the output of the behavior prediction neural network is used as the predicted interaction behavior; for example, the predicted interaction behavior corresponding to ABB is frustration.
In this implementation scenario, the behavior prediction neural network needs to be trained. Multiple training speech emotion sequences can be prepared in advance, and each training speech emotion sequence is labeled with its interaction behavior. The network structure of the behavior prediction neural network is defined: the number of layers can be defined, for example 19 layers, and the type of the behavior prediction neural network can be defined, for example a convolutional neural network or a fully connected neural network. The loss function of the behavior prediction neural network and the condition for terminating its training are defined, for example stopping after 2000 training iterations. After training succeeds, the multi-modal feature sequence corresponding to the at least one piece of speech data is input into the behavior prediction neural network, and the behavior prediction neural network outputs the corresponding predicted interaction behavior.
In this implementation scenario, the interaction behavior includes at least one of acceptance, blame, positivity, negativity and frustration. The behavior prediction neural network is a fully connected neural network.
It can be seen from the above description that in this embodiment, the multi-modal feature data of each piece of speech data is input into the pre-trained emotion recognition neural network to obtain the emotion data of each piece of speech data, the emotion data of the at least one piece of speech data is composed into a speech emotion sequence in chronological order, and the speech emotion sequence is input into the pre-trained behavior prediction neural network to obtain the predicted interaction behavior. The predicted interaction behavior can thus be obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
Please refer to FIG. 4, which is a schematic flowchart of the third embodiment of the interactive behavior prediction method provided by the present invention. The interactive behavior prediction method provided by the present invention includes the following steps:
S301: Obtain multiple rounds of dialogue data, and extract at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data.
S302: Extract the multi-modal feature data from each piece of speech data, and generate a multi-modal feature sequence from the multi-modal feature data.
S303: Input the multi-modal feature data of each piece of speech data into a pre-trained emotion recognition neural network, and obtain the output of the emotion recognition neural network as the emotion data of each piece of speech data.
In a specific implementation scenario, steps S301-S303 are substantially the same as steps S201-S203 in the second embodiment of the interactive behavior prediction method provided by the present invention, and are not repeated here.
S304: Obtain the weight of each emotion datum in the speech emotion sequence, multiply each emotion datum point-wise by its corresponding weight, and input the resulting speech emotion sequence into the pre-trained behavior prediction neural network.
In this implementation scenario, the weight of each emotion datum in the speech emotion sequence is obtained, and each emotion datum is multiplied point-wise by its corresponding weight. Within the at least one piece of speech data, the pieces of speech data influence one another: some pieces of speech data are sentences in which the designated speaker expresses his or her own views, while others are merely perfunctory answers by the designated speaker, so different pieces of speech data have different degrees of influence on the predicted interaction behavior of the at least one piece of speech data.
In this implementation scenario, the weight of each emotion datum is obtained by performing an attention calculation. In this implementation scenario, the attention calculation applies a softmax over scores computed from the speech emotion sequence, where a is the weight of each emotion datum and the operand is the speech emotion sequence (the full attention expressions are given as formula images PCTCN2019130367-appb-000001 to -000003 in the original publication); the softmax function is computed as softmax(z_i) = e^{z_i} / Σ_j e^{z_j}.
S305: Compose the emotion data of the at least one piece of speech data into a speech emotion sequence in chronological order, input the speech emotion sequence into the pre-trained behavior prediction neural network, and obtain the output of the behavior prediction neural network as the predicted interaction behavior.
In this implementation scenario, this step is substantially the same as step S204 in the second embodiment of the interactive behavior prediction method provided by the present invention, and is not repeated here.
It can be seen from the above description that in this embodiment, by obtaining the weight of each emotion datum and composing the speech emotion sequence from the emotion data multiplied point-wise by their corresponding weights, the interaction behavior is predicted while taking into account how strongly the emotion data of the different pieces of speech data influence the predicted interaction behavior, thereby effectively improving prediction accuracy.
Please refer to FIG. 5, which is a schematic flowchart of an embodiment of a method for obtaining the multi-modal feature data of each piece of speech data in the interactive behavior prediction method provided by the present invention. The method for obtaining the multi-modal feature data of each piece of speech data includes the following steps:
S401: Input each piece of speech data into pre-trained feature extraction neural networks, and obtain the video feature data, audio feature data and text feature data of each piece of speech data respectively.
In a specific implementation scenario, each piece of speech data is input into a pre-trained video feature extraction neural network to obtain the video feature data of that piece of speech data; each piece of speech data is input into a pre-trained audio feature extraction neural network to obtain its audio feature data; and each piece of speech data is input into a pre-trained text feature extraction neural network to obtain its text feature data. These steps may be performed sequentially or in parallel, which is not limited here.
S402: Fuse the video feature data, audio feature data and text feature data of each piece of speech data to obtain the multi-modal feature data of each piece of speech data.
In this implementation scenario, the video feature data, audio feature data and text feature data of each piece of speech data are concatenated to obtain the multi-modal feature data of that piece of speech data. For example, if the video feature data, audio feature data and text feature data are each a 2-dimensional vector, the multi-modal feature data obtained after concatenation is a 6-dimensional vector.
It can be seen from the above description that in this embodiment, each piece of speech data is input into pre-trained feature extraction neural networks to obtain its video feature data, audio feature data and text feature data respectively, and these feature data are concatenated to obtain the multi-modal feature data, which improves the accuracy of the extracted feature data and thereby effectively improves prediction accuracy.
Please refer to FIG. 6, which is a schematic structural diagram of the first embodiment of the intelligent device provided by the present invention. The intelligent device 10 includes an acquisition module 11, an extraction module 12 and an interaction module 13.
The acquisition module 11 is used to obtain multiple rounds of dialogue data and extract at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data. The extraction module 12 is used to extract the multi-modal feature data from each piece of speech data and generate a multi-modal feature sequence from the multi-modal feature data. The interaction module 13 is used to input the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network and obtain the output of the classification neural network as the predicted interaction behavior of the designated speaker.
The interaction behavior includes at least one of acceptance, blame, positivity, negativity and frustration. The multi-modal feature data includes video feature data, audio feature data and text feature data.
It can be seen from the above description that in this embodiment, after the intelligent device extracts at least one piece of speech data of the designated speaker from the multiple rounds of dialogue data, it extracts the multi-modal feature data in each piece of speech data, generates a multi-modal feature sequence from the multi-modal feature data, inputs the multi-modal feature sequence of the at least one piece of speech data into the pre-trained classification neural network, and obtains the predicted interaction behavior of the designated speaker. Emotion recognition is performed through multi-modal features, and the predicted interaction behavior is then obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
Still referring to FIG. 6, the interaction module 13 includes an emotion data sub-module 131 and an interaction sub-module 132. The emotion data sub-module 131 is used to input the multi-modal feature data of each piece of speech data into a pre-trained emotion recognition neural network and obtain the output of the emotion recognition neural network as the emotion data of each piece of speech data. The interaction sub-module 132 is used to compose the emotion data of the at least one piece of speech data into a speech emotion sequence in chronological order, input the speech emotion sequence into a pre-trained behavior prediction neural network, and obtain the output of the behavior prediction neural network as the predicted interaction behavior.
The emotion data sub-module 131 includes a weight unit 1311. The weight unit 1311 is used to obtain the weight of each emotion datum in the speech emotion sequence, multiply each emotion datum point-wise by its corresponding weight, and input the resulting speech emotion sequence into the pre-trained behavior prediction neural network.
Specifically, the weight unit 1311 is used to perform an attention calculation on the speech emotion sequence to obtain the weight of each emotion datum in the speech emotion sequence.
The emotion recognition neural network includes a convolutional network layer and a long short-term memory network layer. The behavior prediction neural network is a fully connected neural network.
The acquisition module 11 includes a feature extraction sub-module 111, which is used to input each piece of speech data into a pre-trained feature extraction neural network to obtain the multi-modal feature data of each piece of speech data.
The feature extraction sub-module 111 includes a feature extraction unit 1111 and a fusion unit 1112. The feature extraction unit 1111 is used to input each piece of speech data into a pre-trained video feature extraction neural network to obtain the video feature data of each piece of speech data, input each piece of speech data into a pre-trained audio feature extraction neural network to obtain the audio feature data of each piece of speech data, and input each piece of speech data into a pre-trained text feature extraction neural network to obtain the text feature data of each piece of speech data. The fusion unit 1112 is used to fuse the video feature data, audio feature data and text feature data of each piece of speech data to obtain the multi-modal feature data of each piece of speech data.
The intelligent device 10 further includes a training module 14 for training the classification neural network.
The training module 14 includes a preparation sub-module 141, a definition sub-module 142 and an input sub-module 143. The preparation sub-module 141 is used to prepare multiple training multi-modal feature sequences and label each training multi-modal feature sequence with its interaction behavior. The definition sub-module 142 is used to define the structure, loss function and termination condition of the classification neural network to be trained. The input sub-module 143 is used to input the multiple multi-modal feature sequences and their corresponding labeled interaction behaviors into the classification neural network for training.
It can be seen from the above description that in this embodiment, the intelligent device inputs the multi-modal feature data of each piece of speech data into the pre-trained emotion recognition neural network to obtain the emotion data of each piece of speech data, composes the emotion data of the at least one piece of speech data into a speech emotion sequence in chronological order, inputs the speech emotion sequence into the pre-trained behavior prediction neural network, and predicts the interaction behavior while taking into account how strongly the emotion data of the different pieces of speech data influence the predicted interaction behavior, thereby effectively improving prediction accuracy.
Please refer to FIG. 7, which is a schematic structural diagram of the second embodiment of the intelligent device provided by the present invention. The intelligent device 20 includes a processor 21, a memory 22 and an acquisition circuit 23. The processor 21 is coupled to the memory 22 and the acquisition circuit 23. The memory 22 stores a computer program, and the processor 21 executes the computer program in operation to implement the methods shown in FIG. 2 to FIG. 5. The detailed methods are described above and are not repeated here.
It can be seen from the above description that in this embodiment, after the intelligent device extracts at least one piece of speech data of the designated speaker from the multiple rounds of dialogue data, it extracts the multi-modal feature data in each piece of speech data, generates a multi-modal feature sequence from the multi-modal feature data, inputs the multi-modal feature sequence of the at least one piece of speech data into the pre-trained classification neural network, and obtains the predicted interaction behavior of the designated speaker. Emotion recognition is performed through multi-modal features, and the predicted interaction behavior is then obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
Please refer to FIG. 8, which is a schematic structural diagram of an embodiment of the computer-readable storage medium provided by the present invention. The computer-readable storage medium 30 stores at least one computer program 31, and the computer program 31 is executed by a processor to implement the methods shown in FIG. 2 to FIG. 5; the detailed methods are described above and are not repeated here. In an embodiment, the computer-readable storage medium 30 may be a storage chip in a terminal, a hard disk, a removable hard disk, a USB flash drive, an optical disc or another readable and writable storage device, or may be a server or the like.
It can be seen from the above description that in this embodiment, the computer program stored in the storage medium can be used to, after at least one piece of speech data of a designated speaker is extracted from multiple rounds of dialogue data, extract the multi-modal feature data in each piece of speech data, generate a multi-modal feature sequence from the multi-modal feature data, input the multi-modal feature sequence of the at least one piece of speech data into a pre-trained classification neural network, and obtain the predicted interaction behavior of the designated speaker. Emotion recognition is performed through the multi-modal feature data, and the predicted interaction behavior is then obtained from the emotional changes during the interaction, which can effectively improve prediction accuracy.
Differing from the prior art, the present invention obtains the multi-modal feature data of the speech data of a designated speaker, performs emotion recognition through multi-modal features, and then obtains the predicted interaction behavior from the emotional changes during the interaction, which can effectively improve prediction accuracy.
What is disclosed above is only a preferred embodiment of the present invention, which certainly cannot be used to limit the scope of rights of the present invention. Equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (20)

  1. An interactive behavior prediction method, characterized by comprising:
    obtaining multiple rounds of dialogue data, and extracting at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data;
    extracting multi-modal feature data from each piece of the speech data, and generating a multi-modal feature sequence from the multi-modal feature data;
    inputting the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network, and obtaining the output of the classification neural network as the predicted interaction behavior of the designated speaker.
  2. The interactive behavior prediction method according to claim 1, characterized in that the step of inputting the multi-modal feature sequence corresponding to the at least one piece of speech data into the pre-trained classification neural network and obtaining the output of the classification neural network as the predicted interaction behavior of the designated speaker comprises:
    inputting the multi-modal feature data of each piece of the speech data into a pre-trained emotion recognition neural network, and obtaining the output of the emotion recognition neural network as the emotion data of each piece of the speech data;
    composing the emotion data of at least one piece of speech data into a speech emotion sequence in chronological order, inputting the speech emotion sequence into a pre-trained behavior prediction neural network, and obtaining the output of the behavior prediction neural network as the predicted interaction behavior.
  3. The interactive behavior prediction method according to claim 2, characterized in that the step of inputting the speech emotion sequence into the pre-trained behavior prediction neural network comprises:
    obtaining the weight of each emotion datum in the speech emotion sequence, multiplying each emotion datum point-wise by its corresponding weight, and inputting the resulting speech emotion sequence into the pre-trained behavior prediction neural network.
  4. The interactive behavior prediction method according to claim 3, characterized in that the step of obtaining the weight of each emotion datum in the speech emotion sequence comprises:
    performing an attention calculation on the speech emotion sequence to obtain the weight of each emotion datum in the speech emotion sequence.
  5. The interactive behavior prediction method according to claim 2, characterized in that the emotion recognition neural network includes a convolutional network layer and a long short-term memory network layer;
    the behavior prediction neural network is a fully connected neural network.
  6. The interactive behavior prediction method according to claim 1, characterized in that
    the predicted interaction behavior includes at least one of acceptance, blame, positivity, negativity and frustration;
    the multi-modal feature data includes video feature data, audio feature data and text feature data.
  7. The interactive behavior prediction method according to claim 1, characterized in that the step of extracting the multi-modal feature data from each piece of the speech data comprises:
    inputting each piece of the speech data into a pre-trained feature extraction neural network, and obtaining the multi-modal feature data of each piece of the speech data respectively.
  8. The interactive behavior prediction method according to claim 7, characterized in that the step of inputting each piece of the speech data into the pre-trained feature extraction neural network and obtaining the multi-modal feature data of each piece of the speech data respectively comprises:
    inputting each piece of the speech data into a pre-trained video feature extraction neural network to obtain the video feature data of each piece of the speech data; inputting each piece of the speech data into a pre-trained audio feature extraction neural network to obtain the audio feature data of each piece of the speech data; inputting each piece of the speech data into a pre-trained text feature extraction neural network to obtain the text feature data of each piece of the speech data;
    fusing the video feature data, the audio feature data and the text feature data of each piece of the speech data to obtain the multi-modal feature data of each piece of the speech data.
  9. The interactive behavior prediction method according to claim 1, characterized in that, before the step of inputting the multi-modal feature sequence corresponding to the at least one piece of speech data into the pre-trained classification neural network, the method comprises:
    training the classification neural network;
    the step of training the classification neural network comprises:
    preparing multiple training multi-modal feature sequences, and labeling each training multi-modal feature sequence with its interaction behavior;
    defining the structure, loss function and termination condition of the classification neural network to be trained;
    inputting the multiple multi-modal feature sequences and their corresponding labeled interaction behaviors into the classification neural network for training.
  10. An intelligent device, characterized by comprising:
    an acquisition module for obtaining multiple rounds of dialogue data and extracting at least one piece of speech data of a designated speaker from the multiple rounds of dialogue data;
    an extraction module for extracting multi-modal feature data from each piece of the speech data and generating a multi-modal feature sequence from the multi-modal feature data;
    an interaction module for inputting the multi-modal feature sequence corresponding to the at least one piece of speech data into a pre-trained classification neural network and obtaining the output of the classification neural network as the predicted interaction behavior of the designated speaker.
  11. The intelligent device according to claim 10, characterized in that the interaction module comprises:
    an emotion data sub-module for inputting the multi-modal feature data of each piece of the speech data into a pre-trained emotion recognition neural network and obtaining the output of the emotion recognition neural network as the emotion data of each piece of the speech data;
    an interaction sub-module for composing the emotion data of at least one piece of speech data into a speech emotion sequence in chronological order, inputting the speech emotion sequence into a pre-trained behavior prediction neural network, and obtaining the output of the behavior prediction neural network as the predicted interaction behavior.
  12. The intelligent device according to claim 11, characterized in that the emotion data sub-module comprises:
    a weight unit for obtaining the weight of each emotion datum in the speech emotion sequence, multiplying each emotion datum point-wise by its corresponding weight, and inputting the resulting speech emotion sequence into the pre-trained behavior prediction neural network.
  13. The intelligent device according to claim 12, characterized in that
    the weight unit is used to perform an attention calculation on the speech emotion sequence to obtain the weight of each emotion datum in the speech emotion sequence.
  14. The intelligent device according to claim 11, characterized in that
    the emotion recognition neural network includes a convolutional network layer and a long short-term memory network layer;
    the behavior prediction neural network is a fully connected neural network.
  15. The intelligent device according to claim 10, characterized in that
    the predicted interaction behavior includes at least one of acceptance, blame, positivity, negativity and frustration;
    the multi-modal feature data includes video feature data, audio feature data and text feature data.
  16. The intelligent device according to claim 10, characterized in that the acquisition module comprises:
    a feature extraction sub-module for inputting each piece of the speech data into a pre-trained feature extraction neural network and obtaining the multi-modal feature data of each piece of the speech data.
  17. The intelligent device according to claim 16, characterized in that the feature extraction sub-module comprises:
    a feature extraction unit for inputting each piece of the speech data into a pre-trained video feature extraction neural network to obtain the video feature data of each piece of the speech data, inputting each piece of the speech data into a pre-trained audio feature extraction neural network to obtain the audio feature data of each piece of the speech data, and inputting each piece of the speech data into a pre-trained text feature extraction neural network to obtain the text feature data of each piece of the speech data;
    a fusion unit for fusing the video feature data, the audio feature data and the text feature data of each piece of the speech data to obtain the multi-modal feature data of each piece of the speech data.
  18. The intelligent device according to claim 10, characterized in that the intelligent device further comprises:
    a training module for training the classification neural network;
    the training module comprises:
    a preparation sub-module for preparing multiple training multi-modal feature sequences and labeling each training multi-modal feature sequence with its interaction behavior;
    a definition sub-module for defining the structure, loss function and termination condition of the classification neural network to be trained;
    an input sub-module for inputting the multiple multi-modal feature sequences and their corresponding labeled interaction behaviors into the classification neural network for training.
  19. An intelligent device, characterized by comprising an acquisition circuit, a processor and a memory, wherein the processor is coupled to the memory and the acquisition circuit, the memory stores a computer program, and the processor executes the computer program to implement the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, characterized in that it stores a computer program, and the computer program can be executed by a processor to implement the method according to any one of claims 1 to 9.
PCT/CN2019/130367 2019-12-31 2019-12-31 交互行为预测方法、智能装置和计算机可读存储介质 WO2021134417A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/130367 WO2021134417A1 (zh) 2019-12-31 2019-12-31 交互行为预测方法、智能装置和计算机可读存储介质
CN201980003374.XA CN111344717B (zh) 2019-12-31 2019-12-31 交互行为预测方法、智能装置和计算机可读存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130367 WO2021134417A1 (zh) 2019-12-31 2019-12-31 交互行为预测方法、智能装置和计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2021134417A1 true WO2021134417A1 (zh) 2021-07-08

Family

ID=71187715

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130367 WO2021134417A1 (zh) 2019-12-31 2019-12-31 交互行为预测方法、智能装置和计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN111344717B (zh)
WO (1) WO2021134417A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019237A (zh) * 2022-06-30 2022-09-06 中国电信股份有限公司 多模态情感分析方法、装置、电子设备及存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899738A (zh) * 2020-07-29 2020-11-06 北京嘀嘀无限科技发展有限公司 对话生成方法、装置及存储介质
CN111950275B (zh) * 2020-08-06 2023-01-17 平安科技(深圳)有限公司 基于循环神经网络的情绪识别方法、装置及存储介质
CN117215415B (zh) * 2023-11-07 2024-01-26 山东经鼎智能科技有限公司 基于mr录播技术的多人协同虚拟交互方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597541A (zh) * 2018-04-28 2018-09-28 南京师范大学 一种增强愤怒与开心识别的语音情感识别方法及系统
CN105426365B (zh) * 2014-08-01 2018-11-02 阿里巴巴集团控股有限公司 区分交互行为的方法及装置
CN109284506A (zh) * 2018-11-29 2019-01-29 重庆邮电大学 一种基于注意力卷积神经网络的用户评论情感分析系统及方法
CN109547332A (zh) * 2018-11-22 2019-03-29 腾讯科技(深圳)有限公司 通讯会话交互方法、装置、计算机设备
CN109766476A (zh) * 2018-12-27 2019-05-17 西安电子科技大学 视频内容情感分析方法、装置、计算机设备及存储介质
US20190282153A1 (en) * 2009-03-24 2019-09-19 The Nielsen Company (Us), Llc Presentation Measure Using Neurographics

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016004425A1 (en) * 2014-07-04 2016-01-07 Intelligent Digital Avatars, Inc. Systems and methods for assessing, verifying and adjusting the affective state of a user
JP6823809B2 (ja) * 2016-08-09 2021-02-03 パナソニックIpマネジメント株式会社 対話行為推定方法、対話行為推定装置およびプログラム
US11120353B2 (en) * 2016-08-16 2021-09-14 Toyota Jidosha Kabushiki Kaisha Efficient driver action prediction system based on temporal fusion of sensor data using deep (bidirectional) recurrent neural network
CN109986553B (zh) * 2017-12-29 2021-01-08 深圳市优必选科技有限公司 一种主动交互的机器人、系统、方法及存储装置
US10860858B2 (en) * 2018-06-15 2020-12-08 Adobe Inc. Utilizing a trained multi-modal combination model for content and text-based evaluation and distribution of digital video content to client devices

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190282153A1 (en) * 2009-03-24 2019-09-19 The Nielsen Company (Us), Llc Presentation Measure Using Neurographics
CN105426365B (zh) * 2014-08-01 2018-11-02 阿里巴巴集团控股有限公司 区分交互行为的方法及装置
CN108597541A (zh) * 2018-04-28 2018-09-28 南京师范大学 一种增强愤怒与开心识别的语音情感识别方法及系统
CN109547332A (zh) * 2018-11-22 2019-03-29 腾讯科技(深圳)有限公司 通讯会话交互方法、装置、计算机设备
CN109284506A (zh) * 2018-11-29 2019-01-29 重庆邮电大学 一种基于注意力卷积神经网络的用户评论情感分析系统及方法
CN109766476A (zh) * 2018-12-27 2019-05-17 西安电子科技大学 视频内容情感分析方法、装置、计算机设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019237A (zh) * 2022-06-30 2022-09-06 中国电信股份有限公司 多模态情感分析方法、装置、电子设备及存储介质
CN115019237B (zh) * 2022-06-30 2023-12-08 中国电信股份有限公司 多模态情感分析方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN111344717B (zh) 2023-07-18
CN111344717A (zh) 2020-06-26

Similar Documents

Publication Publication Date Title
WO2021134277A1 (zh) 情感识别方法、智能装置和计算机可读存储介质
WO2021134417A1 (zh) 交互行为预测方法、智能装置和计算机可读存储介质
CN108255805B (zh) 舆情分析方法及装置、存储介质、电子设备
CN110444198B (zh) 检索方法、装置、计算机设备和存储介质
CN106658129B (zh) 基于情绪的终端控制方法、装置及终端
EP3617946B1 (en) Context acquisition method and device based on voice interaction
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN107391575B (zh) 一种基于词向量模型的隐式特征识别方法
JP7394809B2 (ja) ビデオを処理するための方法、装置、電子機器、媒体及びコンピュータプログラム
CN111274372A (zh) 用于人机交互的方法、电子设备和计算机可读存储介质
CN111931482B (zh) 文本分段方法和装置
CN110765294B (zh) 图像搜索方法、装置、终端设备及存储介质
CN111159358A (zh) 多意图识别训练和使用方法及装置
Gogate et al. A novel brain-inspired compression-based optimised multimodal fusion for emotion recognition
CN110633475A (zh) 基于计算机场景的自然语言理解方法、装置、系统和存储介质
US20230154172A1 (en) Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network
CN113505198A (zh) 关键词驱动的生成式对话回复方法、装置及电子设备
CN117980991A (zh) 利用约束谱聚类的基于说话者转换的在线说话者日志化
JP2021081713A (ja) 音声信号を処理するための方法、装置、機器、および媒体
CN115828889A (zh) 文本分析方法、情感分类模型、装置、介质、终端及产品
KR20190074508A (ko) 챗봇을 위한 대화 모델의 데이터 크라우드소싱 방법
WO2020227968A1 (en) Adversarial multi-binary neural network for multi-class classification
CN112910761B (zh) 即时通讯方法、装置、设备、存储介质以及程序产品
CN108538292B (zh) 一种语音识别方法、装置、设备及可读存储介质
CN111224863B (zh) 会话任务生成方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958080

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19958080

Country of ref document: EP

Kind code of ref document: A1