WO2021134277A1 - Emotion recognition method, intelligent device and computer-readable storage medium - Google Patents

Emotion recognition method, intelligent device and computer-readable storage medium

Info

Publication number
WO2021134277A1
WO2021134277A1 (PCT application PCT/CN2019/130065)
Authority
WO
WIPO (PCT)
Prior art keywords
semantic feature
sequence
emotion recognition
data
neural network
Prior art date
Application number
PCT/CN2019/130065
Other languages
English (en)
French (fr)
Inventor
丁万
黄东延
李柏
邵池
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to PCT/CN2019/130065 priority Critical patent/WO2021134277A1/zh
Priority to CN201980003314.8A priority patent/CN111164601B/zh
Publication of WO2021134277A1 publication Critical patent/WO2021134277A1/zh

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the field of artificial intelligence technology, in particular to an emotion recognition method, an intelligent device and a computer-readable storage medium.
  • An emotion recognition method, comprising: obtaining a multi-modal data group to be recognized, the multi-modal data group including at least two of video data, audio data, and/or text data; extracting a video semantic feature sequence of the video data, extracting an audio semantic feature sequence of the audio data, and/or extracting a text semantic feature sequence from the text data; aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence; fusing the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence; and inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network, the output of which is taken as the target emotion corresponding to the data group to be recognized.
  • An intelligent device, comprising: an acquisition module for acquiring a data group to be recognized, the data group including video data, audio data, and text data; an extraction module for extracting a video semantic feature sequence of the video data, an audio semantic feature sequence of the audio data, and a text semantic feature sequence from the text data; an alignment module for aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence; a concatenation module for concatenating the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence; and an emotion module for inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network and taking the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.
  • An intelligent device includes: an acquisition circuit, a processor, and a memory, the processor is coupled to the memory and the acquisition circuit, a computer program is stored in the memory, and the processor executes the computer program to implement The method described above.
  • a computer-readable storage medium stores a computer program, and the computer program can be executed by a processor to implement the above-mentioned method.
  • After the multi-modal data group to be recognized is obtained, the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data are extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence. Because semantic features rather than low-level features are acquired, the emotional characteristics of the multi-modal data group can be represented more accurately, and the alignment and fusion preserve the multi-modal spatio-temporal relationships. The target emotion obtained from this multi-modal semantic feature sequence is therefore more accurate, which effectively improves the accuracy of emotion recognition.
  • Figure 1 is an application environment diagram of an emotion recognition method in an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of the first embodiment of the emotion recognition method provided by the present invention.
  • FIG. 3 is a schematic flowchart of a second embodiment of the emotion recognition method provided by the present invention.
  • FIG. 4 is a schematic flowchart of a third embodiment of the emotion recognition method provided by the present invention.
  • Figure 5 is a schematic structural diagram of a first embodiment of a smart device provided by the present invention.
  • Fig. 6 is a schematic structural diagram of a second embodiment of a smart device provided by the present invention.
  • FIG. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided by the present invention.
  • decision-level fusion ignores the spatiotemporal relationship between multimodal semantic features. Since the different spatio-temporal distributions of multi-modal semantic features correspond to different emotional information, ignoring the spatio-temporal relationship will cause the accuracy of emotion recognition to be low.
  • an emotion recognition method is provided, which can effectively improve the accuracy of emotion recognition.
  • FIG. 1 is an application environment diagram of an emotion recognition method in an embodiment of the present invention.
  • the emotion recognition system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, and a notebook computer.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is configured to obtain a multi-modal data group to be recognized, and the multi-modal data group to be recognized includes at least two of video data, audio data and/or text data
  • the server 120 is configured to extract the video semantic feature sequence of the video data, extract the audio semantic feature sequence of the audio data, and/or extract the text semantic feature sequence from the text data; align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence; fuse the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence; and input the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network to obtain the target emotion corresponding to the data group to be recognized.
  • FIG. 2 is a schematic flowchart of the first embodiment of the emotion recognition method provided by the present invention.
  • the emotion recognition method provided by the present invention includes the following steps:
  • S101 Acquire a multi-modal data group to be identified, where the multi-modal data group to be identified includes at least two of video data, audio data, and/or text data.
  • a multi-modal data group to be identified is acquired, and the multi-modal data group to be identified includes at least two of video data, audio data, and/or text data.
  • the multi-modal data group to be identified includes video data, audio data, and text data.
  • the multi-modal data group to be identified may be provided by the user, or obtained from a database, or may be generated by on-site recording.
  • the video data, audio data, and text data correspond to the same speaker in the same time period.
  • S102 Extract the video semantic feature sequence of the video data, extract the audio semantic feature sequence of the audio data, and/or extract the text semantic feature sequence in the text data.
  • the video semantic feature sequence of video data is extracted
  • the audio semantic feature sequence of audio data is extracted
  • the text semantic feature sequence of text data is extracted.
  • the video semantic feature sequence, the audio semantic feature sequence of the audio data, and the text semantic feature sequence can be obtained by inputting the multi-modal data group to be recognized into the pre-trained feature extraction neural network.
  • Alternatively, the audio data is input into a pre-trained audio feature extraction neural network to obtain the audio semantic feature sequence, and the text data is input into a pre-trained text feature extraction neural network to obtain the text semantic feature sequence.
  • the video data is input into the pre-trained video feature extraction neural network, and the video feature extraction neural network needs to be trained before acquiring the video semantic feature sequence.
  • Prepare facial video data and mark the facial action units in the facial video data.
  • the text data is input into the pre-trained text feature extraction neural network, and the text feature extraction neural network needs to be trained before obtaining the text semantic feature sequence.
  • Prepare training text data, label the training text data with positive/negative sentiment, count the word frequencies of the training text data, and segment the text according to the words with the highest frequency.
  • Train the conditional probability function p(w_i | w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}) with the word2vec method to extract word features from the text data.
  • Define the structure of the text feature extraction neural network as a Transformer+Attention+RNN structure, define the loss function, input the word features of the text data and the positive/negative sentiment labels into the text feature extraction neural network for training, and terminate training when the loss function meets the preset condition.
  • both audio data and video data have a time dimension, while text data does not have a time dimension. Therefore, the audio semantic feature sequence and the video semantic feature sequence both have a time dimension, while the text semantic feature sequence does not have a time dimension.
  • the text semantic feature sequence is aligned to the time dimension of the audio data. In other implementation scenarios, the text semantic feature sequence can also be aligned to the time dimension of the video data.
  • Each pronunciation phoneme in the audio data can be obtained through speech recognition, the text semantic feature data corresponding to that phoneme can be found in the text semantic feature sequence, and each text semantic feature datum in the sequence is aligned with the time dimension of its pronunciation phoneme to generate the text semantic temporal sequence.
  • S104 Fuse the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence.
  • Taking the time dimension of the audio semantic feature sequence as the reference, the time dimension of the video semantic feature sequence is aligned with that of the audio semantic feature sequence, and the text semantic temporal sequence is already aligned with the audio semantic feature sequence in the time dimension.
  • the semantic feature units at each moment are arranged in time sequence to generate a multi-modal semantic feature sequence.
  • S105 Input the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network, and use the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.
  • the multi-modal semantic feature sequence is input to the pre-trained emotion recognition neural network, and the output of the emotion recognition neural network is used as the target emotion corresponding to the data group to be recognized.
  • The emotion recognition neural network needs to be trained. Before training, prepare multiple training multi-modal semantic feature sequences and label each with emotion data. Define the network structure of the emotion recognition neural network, for example its number of layers (e.g. 19 layers), and optionally its type, such as a convolutional neural network or a fully connected neural network. Define the loss function of the emotion recognition neural network and the condition for terminating training, for example stopping after 2,000 training iterations. After training succeeds, the multi-modal semantic feature sequence is input into the emotion recognition neural network, which outputs the corresponding target emotion.
  • After the multi-modal data group to be recognized is obtained in this embodiment, the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data are extracted.
  • The text semantic feature sequence is aligned with the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence. Because semantic features rather than low-level features are acquired, the emotional characteristics of the multi-modal data group are represented more accurately, and the feature alignment and fusion preserve the multi-modal spatio-temporal relationships. The target emotion obtained from the multi-modal semantic feature sequence is therefore more accurate, effectively improving the accuracy of emotion recognition.
  • FIG. 3 is a schematic flowchart of a second embodiment of the emotion recognition method provided by the present invention.
  • the emotion recognition method provided by the present invention includes the following steps:
  • S201 Acquire a multi-modal data group to be identified, where the multi-modal data group to be identified includes at least two of video data, audio data, and/or text data.
  • S202 Extract the video semantic feature sequence of the video data, extract the audio semantic feature sequence of the audio data, and/or extract the text semantic feature sequence in the text data.
  • steps S201-S202 are basically the same as steps S101-S102 of the first embodiment of the emotion recognition method provided by the present invention, and will not be repeated here.
  • S203 Acquire at least one pronunciation phoneme of the audio data, and acquire text semantic feature data in the text semantic feature sequence corresponding to each pronunciation phoneme.
  • At least one pronunciation phoneme of audio data is acquired through ASR (Automatic Speech Recognition) technology, and the text semantic feature data corresponding to each pronunciation phoneme is found in the text semantic feature sequence.
  • S204 Obtain the time position of each pronunciation phoneme, and align the text semantic feature data with the time position of the corresponding pronunciation phoneme.
  • The time position of each pronunciation phoneme is obtained, and the text semantic feature data in the text semantic feature sequence is aligned with the time position of the corresponding pronunciation phoneme. For example, if the time position of the pronunciation phoneme "Ah" is 1 minute 32 seconds, the text semantic feature data corresponding to "Ah" is aligned with the time position 1 minute 32 seconds.
  • S205 Obtain the video semantic feature data, audio semantic feature data, and text semantic feature data at each moment of the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence, respectively.
  • the video semantic feature sequence also has a time dimension, and the video semantic feature data at each moment can be obtained.
  • the audio semantic feature data at each moment can be obtained.
  • After the text semantic feature data in the text semantic temporal sequence has been aligned with the time dimension of the audio data in step S204, the text semantic feature data at each moment can be obtained.
  • S206 Concatenate the video semantic feature data, audio semantic feature data and/or text semantic feature data at the same moment into a semantic feature unit.
  • the video semantic feature data, audio semantic feature data, and text semantic feature data are all vectors.
  • The video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment are concatenated into a semantic feature unit, that is, the three vectors are concatenated into one vector.
  • For example, if the video, audio, and text semantic feature data are each 2-dimensional vectors, the semantic feature unit generated after concatenation is a 6-dimensional vector.
  • S207 Arrange the semantic feature units at each moment in chronological order to generate a multi-modal semantic feature sequence.
  • The semantic feature units at each moment are arranged in chronological order to generate the multi-modal semantic feature sequence.
  • the time sequence is the time dimension of the audio semantic feature sequence.
  • S208 Input the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network, and use the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.
  • step S208 is basically the same as step S105 of the first embodiment of the emotion recognition method provided by the present invention, and will not be repeated here.
  • In this embodiment, the text semantic feature data corresponding to each pronunciation phoneme of the audio data is obtained from the text semantic feature sequence, the moment corresponding to that text semantic feature data is obtained, and the video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment are concatenated into a semantic feature unit.
  • the semantic feature unit at each moment is arranged in chronological order to generate a multi-modal semantic feature sequence, and the feature alignment and fusion of the multi-modal spatio-temporal relationship are retained.
  • the accuracy of the target emotion obtained by the multi-modal semantic feature sequence is higher, so the accuracy of emotion recognition is effectively improved.
  • FIG. 4 is a schematic flowchart of a third embodiment of an emotion recognition method provided by the present invention.
  • the emotion recognition method provided by the present invention includes the following steps:
  • S301 Acquire a multi-modal data group to be identified, where the multi-modal data group to be identified includes at least two of video data, audio data, and/or text data.
  • S302 Extract the video semantic feature sequence of the video data, extract the audio semantic feature sequence of the audio data, and/or extract the text semantic feature sequence in the text data.
  • S303 Align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence.
  • S304 Fuse the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence.
  • steps S301-S304 are basically the same as steps S101-S104 of the first embodiment of the emotion recognition method provided by the present invention, and will not be repeated here.
  • S305 Input the semantic feature unit at each moment into the pre-trained unit recognition neural network, and use the output result of the unit recognition neural network as the emotion recognition result at each moment.
  • the semantic feature unit at each moment is input to the pre-trained unit recognition neural network, and the output result of the unit recognition neural network is used as the emotion recognition result at each moment.
  • The unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory (LSTM) layer.
  • The convolutional neural network defines a sensing window of width 2d centered on the current element x_i and performs a fully connected computation on the input elements inside the window. Taking one-dimensional input [x_1, x_2, …, x_{n-1}, x_n] as an example, the convolutional model is y_i = σ(∑_{k=-d}^{d} w_k · x_{i+k}),
  • where σ is a nonlinear activation function
  • and w_k is a shared weight, i.e. inputs with different i but the same k share the same weight.
  • CNN layers are often used together with a pooling layer.
  • The pooling layer provides spatial invariance; common choices are max-pooling, y_i = max_{-d ≤ k ≤ d} x_{i+k}, and average-pooling, y_i = (1 / (2d + 1)) ∑_{k=-d}^{d} x_{i+k}.
  • The long short-term memory network (LSTM) is a sequence labeling model: the output h_t at the current moment t is a function of the current input x_t and the previous output h_{t-1}.
  • Let x_t be the current input vector, h_{t-1} the output vector at the previous moment, c_{t-1} the cell state vector at the previous moment, and h_t the output vector at the current moment.
  • h_t is calculated as: f_t = σ(W_f x_t + U_f h_{t-1}), i_t = σ(W_i x_t + U_i h_{t-1}), o_t = σ(W_o x_t + U_o h_{t-1}), c̃_t = tanh(W_c x_t + U_c h_{t-1}), c_t = f_t ∗ c_{t-1} + i_t ∗ c̃_t, h_t = o_t ∗ tanh(c_t), where W and U denote different weight matrices and tanh is the nonlinear activation function tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}).
  • the unit recognition neural network may also include only one layer of neural network, such as LSTM.
  • S306 Sort the emotion recognition results at each moment in time to generate an emotion recognition sequence.
  • the emotion recognition results at each moment are sorted according to time to generate an emotion recognition sequence.
  • Multiple unit recognition neural networks can be set, which can output the emotion recognition results at each moment at the same time, or one unit recognition neural network can be set up to input the semantic feature units at each time in turn, and output the emotion recognition results at each time in turn.
  • S307 Obtain the weight of the emotion recognition result at each moment, perform a point-wise multiplication of each moment's emotion recognition result with its corresponding weight, input the emotion recognition sequence after the point-wise multiplication into the pre-trained emotion recognition neural network, and take the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.
  • The weight of the emotion recognition result at each moment in the emotion recognition sequence is obtained, and each moment's emotion recognition result is multiplied by its corresponding weight. Within the emotion recognition sequence, the results at different moments influence one another: some results are subconscious reactions while others carry stronger emotion, so different emotion recognition results contribute differently to the target emotion corresponding to the sequence.
  • the attention calculation is performed on the emotion recognition sequence to obtain the weight of the emotion recognition result at each moment.
  • where a denotes the weights of the emotion recognition results at each moment, obtained by applying the attention operation to the emotion recognition sequence.
  • The softmax function is computed as softmax(z_i) = e^{z_i} / ∑_j e^{z_j}.
  • the emotion recognition neural network is a fully connected neural network.
  • The fully connected neural network by default establishes weighted connections between all inputs and outputs. Taking one-dimensional input [x_1, x_2, …, x_n] as an example, the fully connected model is y = σ(∑_i w_i · x_i),
  • where w_i is a network parameter and σ is a nonlinear activation function, a common example being the Sigmoid function σ(x) = 1 / (1 + e^{-x}).
  • In this embodiment, the video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment are concatenated into semantic feature units, and the semantic feature unit at each moment is input into the unit recognition neural network to obtain the emotion recognition result at each moment.
  • The unit recognition neural network includes a convolutional neural network layer and a bidirectional LSTM layer, which can improve the accuracy of the emotion recognition results.
  • FIG. 5 is a schematic structural diagram of the first embodiment of the smart device provided by the present invention.
  • The smart device 10 includes an acquisition module 11, an extraction module 12, an alignment module 13, a concatenation module 14, and an emotion module 15.
  • the acquiring module 11 acquires a data group to be identified, and the data group to be identified includes video data, audio data, and text data.
  • the extraction module 12 is used to extract the video semantic feature sequence of video data, extract the audio semantic feature sequence of audio data, and extract the text semantic feature sequence of text data.
  • The alignment module 13 is used to align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence.
  • The concatenation module 14 is used to concatenate the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence.
  • the emotion module 15 is used to input the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network to obtain the emotion included in the data group to be recognized.
  • After acquiring the multi-modal data group to be recognized, the smart device extracts the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data.
  • The text semantic feature sequence is aligned with the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence.
  • This preserves the feature alignment and fusion of the multi-modal spatio-temporal relationships, and the target emotion obtained from the multi-modal semantic feature sequence is more accurate, effectively improving the accuracy of emotion recognition.
  • the alignment module 13 includes a first acquisition sub-module 131 and an alignment sub-module 132.
  • the first acquiring submodule 131 is configured to acquire at least one pronunciation phoneme of the audio data, and acquire the text semantic feature data corresponding to each pronunciation phoneme.
  • the alignment sub-module 132 is used to obtain the time position of each pronunciation phoneme, and align the text semantic feature data with the time position of the corresponding pronunciation phoneme.
  • The concatenation module 14 includes a second acquisition sub-module 141 and a concatenation sub-module 142.
  • The second acquisition sub-module 141 is used to separately acquire the video semantic feature data, audio semantic feature data, and text semantic feature data at each moment of the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence.
  • the concatenation sub-module 142 is used to concatenate the video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment into a semantic feature unit.
  • the emotion module 15 includes an emotion recognition sub-module 151, an arrangement sub-module 152, and an emotion sub-module 153.
  • the emotion recognition sub-module 151 is used to input the semantic feature unit at each moment into the pre-trained unit recognition neural network to obtain emotion recognition data at each moment.
  • the arrangement sub-module 152 is used to arrange the emotion recognition data at each moment in time to generate an emotion recognition sequence.
  • the emotion sub-module 153 is used to input the emotion recognition sequence into the pre-trained emotion recognition neural network to obtain the emotion included in the data group to be recognized.
  • the emotion sub-module 153 includes a weight unit 1531.
  • the weight unit 1531 is used to obtain the weight of the emotion recognition data at each moment, perform a dot multiplication operation on the emotion recognition data at each moment and its corresponding weight, and input the calculated emotion recognition sequence into the pre-trained emotion recognition neural network.
  • the weight unit 1531 is used to perform attention calculation on the emotion recognition sequence to obtain the weight of the emotion recognition data at each moment.
  • The unit recognition neural network includes a convolutional neural network layer and a bidirectional LSTM layer.
  • the emotion recognition neural network is a fully connected neural network.
  • the smart device 10 also includes a training module 16 for training an emotion recognition neural network.
  • the training module 16 includes a preparation sub-module 161, a definition sub-module 162, and an input sub-module 163.
  • the preparation sub-module 161 is used to prepare a plurality of training multi-modal feature sequences, and annotate the target emotion of each training multi-modal feature sequence.
  • the definition sub-module 162 is used to define the structure, loss function and termination conditions of the trained emotion recognition neural network.
  • The input sub-module 163 is used to input the plurality of training multi-modal feature sequences and their corresponding target emotions into the emotion recognition neural network for training.
  • In this embodiment, the semantic feature units at each moment are arranged in chronological order to generate a multi-modal semantic feature sequence; because semantic features rather than low-level features are acquired, the emotional characteristics of the multi-modal data group to be recognized are represented more accurately, and the feature alignment and fusion of the multi-modal spatio-temporal relationships are preserved.
  • The target emotion obtained from the multi-modal semantic feature sequence is therefore more accurate, effectively improving the accuracy of emotion recognition.
  • The video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment are concatenated into a semantic feature unit, and the semantic feature unit at each moment is input into the unit recognition neural network to obtain the emotion recognition result at each moment; the unit recognition neural network includes a convolutional neural network layer and a bidirectional LSTM layer, which can improve the accuracy of the emotion recognition results.
  • FIG. 6 is a schematic structural diagram of a second embodiment of a smart device provided by the present invention.
  • the smart device 20 includes a processor 21, a memory 22, and an acquisition circuit 23.
  • the processor 21 is coupled to the memory 22 and the acquisition circuit 23.
  • a computer program is stored in the memory 22, and the processor 21 executes the computer program when it is working to implement the method shown in FIGS. 2-4. The detailed method can be referred to the above, and will not be repeated here.
  • After the smart device in this embodiment obtains the multi-modal data group to be recognized, it extracts the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data.
  • The text semantic feature sequence is aligned with the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence. Because semantic features rather than low-level features are acquired, the emotional characteristics of the multi-modal data group are represented more accurately and the feature alignment and fusion of the multi-modal spatio-temporal relationships are preserved, so the target emotion obtained from the multi-modal semantic feature sequence is more accurate and the accuracy of emotion recognition is effectively improved.
  • FIG. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided by the present invention.
  • At least one computer program 31 is stored in the computer-readable storage medium 30, and the computer program 31 is executed by a processor to implement the methods shown in FIGS. 2 to 4.
  • the computer-readable storage medium 30 may be a storage chip in a terminal, a hard disk, or a mobile hard disk, or other readable and writable storage tools such as a USB flash drive, or an optical disk, and may also be a server or the like.
  • The computer program stored in the storage medium in this embodiment can be used, after the multi-modal data group to be recognized is obtained, to extract the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data.
  • The text semantic feature sequence is aligned with the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence; semantic features rather than low-level features are acquired, which more accurately represent the emotional characteristics of the multi-modal data group and preserve the feature alignment and fusion of the multi-modal spatio-temporal relationships, so the target emotion obtained from the sequence is more accurate and the accuracy of emotion recognition is effectively improved.
  • In contrast to the prior art, the present invention acquires semantic features rather than low-level features, which more accurately represent the emotional characteristics of the multi-modal data group to be recognized and preserve the feature alignment and fusion of the multi-modal spatio-temporal relationships; the target emotion obtained from the multi-modal semantic feature sequence is more accurate, so the accuracy of emotion recognition is effectively improved.

Abstract

An emotion recognition method, an intelligent device (10), and a computer-readable storage medium (30). The emotion recognition method includes: obtaining a multi-modal data group to be recognized that includes at least two of video data, audio data, and/or text data (S101); extracting a video semantic feature sequence of the video data, extracting an audio semantic feature sequence of the audio data, and/or extracting a text semantic feature sequence from the text data (S102); aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence (S103); fusing the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence (S104); and inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network, the output of which is taken as the target emotion corresponding to the data group to be recognized (S105). The method can effectively improve the accuracy of emotion recognition.

Description

Emotion recognition method, intelligent device and computer-readable storage medium

Technical Field

The present invention relates to the field of artificial intelligence technology, and in particular to an emotion recognition method, an intelligent device, and a computer-readable storage medium.

Background

Human emotion in a natural state triggers responses in multiple modalities (such as facial movements, speaking tone, language, heartbeat, etc.). Traditional multi-modal emotion recognition methods are based on low-level feature fusion or decision-level fusion. The limitations of these two approaches are that (a) the human brain processes the low-level information of different modalities (physical features such as pixel brightness, sound-wave spectra, or word spelling) through mutually independent mechanisms, and (b) decision-level fusion ignores the spatio-temporal relationships between multi-modal semantic features. Different spatio-temporal distributions of multi-modal semantic features correspond to different emotional information. For example, in case A a smile appears at the same time as the word "good" is spoken, while in case B the smile appears after "good" is spoken. A and B differ in the temporal order of the two semantic features (the smile and saying "good"), and this difference in order changes the emotion expressed; in case B, for instance, the speaker is more likely being perfunctory or resigned.
Summary of the Application

In view of the above, it is necessary to provide an emotion recognition method, an intelligent device, and a computer-readable storage medium that address the problems described above.

An emotion recognition method, the method comprising: obtaining a multi-modal data group to be recognized, the multi-modal data group including at least two of video data, audio data, and/or text data; extracting a video semantic feature sequence of the video data, extracting an audio semantic feature sequence of the audio data, and/or extracting a text semantic feature sequence from the text data; aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence; fusing the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence; and inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network, the output of the emotion recognition neural network being taken as the target emotion corresponding to the data group to be recognized.

An intelligent device, comprising: an acquisition module for acquiring a data group to be recognized, the data group including video data, audio data, and text data; an extraction module for extracting a video semantic feature sequence of the video data, an audio semantic feature sequence of the audio data, and a text semantic feature sequence from the text data; an alignment module for aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence; a concatenation module for concatenating the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence; and an emotion module for inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network and taking the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.

An intelligent device, comprising an acquisition circuit, a processor, and a memory, the processor being coupled to the memory and the acquisition circuit, the memory storing a computer program, and the processor executing the computer program to implement the method described above.

A computer-readable storage medium storing a computer program, the computer program being executable by a processor to implement the method described above.

Embodiments of the present invention provide the following beneficial effects:

After the multi-modal data group to be recognized is obtained, the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data are extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence. What is acquired are semantic features rather than low-level features, which can represent the emotional characteristics of the multi-modal data group more accurately, and the feature alignment and fusion preserve the multi-modal spatio-temporal relationships; the target emotion obtained from this multi-modal semantic feature sequence is therefore more accurate, which effectively improves the accuracy of emotion recognition.
Brief Description of the Drawings

In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; for a person of ordinary skill in the art, other drawings can be obtained from them without creative effort.

In the drawings:

Figure 1 is an application environment diagram of an emotion recognition method in an embodiment of the present invention;

Figure 2 is a schematic flowchart of a first embodiment of the emotion recognition method provided by the present invention;

Figure 3 is a schematic flowchart of a second embodiment of the emotion recognition method provided by the present invention;

Figure 4 is a schematic flowchart of a third embodiment of the emotion recognition method provided by the present invention;

Figure 5 is a schematic structural diagram of a first embodiment of the intelligent device provided by the present invention;

Figure 6 is a schematic structural diagram of a second embodiment of the intelligent device provided by the present invention;

Figure 7 is a schematic structural diagram of an embodiment of the computer-readable storage medium provided by the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

In the prior art, decision-level fusion ignores the spatio-temporal relationships between multi-modal semantic features. Since different spatio-temporal distributions of multi-modal semantic features correspond to different emotional information, ignoring the spatio-temporal relationship results in low emotion recognition accuracy.

In this embodiment, to solve the above problem, an emotion recognition method is provided that can effectively improve the accuracy of emotion recognition.

Please refer to Figure 1, which is an application environment diagram of an emotion recognition method in an embodiment of the present invention. The emotion recognition method is applied to an emotion recognition system. The emotion recognition system includes a terminal 110 and a server 120, connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 110 is used to obtain the multi-modal data group to be recognized, which includes at least two of video data, audio data, and/or text data. The server 120 is used to extract the video semantic feature sequence of the video data, extract the audio semantic feature sequence of the audio data, and/or extract the text semantic feature sequence from the text data; align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence; fuse the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence; and input the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network to obtain the target emotion corresponding to the data group to be recognized.
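To make the server-side flow concrete, the following is a minimal Python sketch of steps S101-S105; the feature extractors, the alignment routine, and the emotion network are passed in as placeholders and are not taken from the application itself.

```python
# Hypothetical end-to-end sketch of the flow in Fig. 1 (S101-S105).
# All callables are placeholders supplied by the caller, standing in for the
# pre-trained extraction networks and emotion recognition network described above.
import numpy as np

def recognize_emotion(video, audio, text,
                      extract_video_features, extract_audio_features,
                      extract_text_features, align_text_to_audio,
                      emotion_net):
    v_seq = extract_video_features(video)        # (T, Dv) video semantic feature sequence
    a_seq = extract_audio_features(audio)        # (T, Da) audio semantic feature sequence
    t_feats = extract_text_features(text)        # per-token text semantic features
    t_seq = align_text_to_audio(t_feats, audio)  # (T, Dt) text semantic temporal sequence
    fused = np.concatenate([v_seq, a_seq, t_seq], axis=-1)  # (T, Dv + Da + Dt)
    return emotion_net(fused)                    # target emotion of the data group
```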
Please refer to Figure 2, which is a schematic flowchart of the first embodiment of the emotion recognition method provided by the present invention. The emotion recognition method provided by the present invention includes the following steps:

S101: Obtain a multi-modal data group to be recognized, the multi-modal data group including at least two of video data, audio data, and/or text data.

In a specific implementation scenario, a multi-modal data group to be recognized is obtained, which includes at least two of video data, audio data, and/or text data. In this implementation scenario, the multi-modal data group includes video data, audio data, and text data. The multi-modal data group may be provided by a user, obtained from a database, or generated by on-site recording. The video data, audio data, and text data correspond to the same speaker in the same time period.

S102: Extract a video semantic feature sequence of the video data, extract an audio semantic feature sequence of the audio data, and/or extract a text semantic feature sequence from the text data.

In this implementation scenario, the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and the text semantic feature sequence of the text data are extracted. The video semantic feature sequence, the audio semantic feature sequence, and the text semantic feature sequence can be obtained by inputting the multi-modal data group into a pre-trained feature extraction neural network. In other implementation scenarios, the video data may be input into a pre-trained video feature extraction neural network to obtain the video semantic feature sequence, the audio data input into a pre-trained audio feature extraction neural network to obtain the audio semantic feature sequence, and the text data input into a pre-trained text feature extraction neural network to obtain the text semantic feature sequence.
Specifically, before the video data is input into the pre-trained video feature extraction neural network to obtain the video semantic feature sequence, the video feature extraction neural network needs to be trained. Facial video data is prepared and the facial action units in the facial video data are annotated. Before training, the structure of the video feature extraction network is defined as a CNN-RNN structure, the initial iteration value is defined as Epoch = 0, and a loss function is defined. The facial video data and the corresponding facial action units are input into the video feature extraction neural network to obtain training results, the training results are randomly divided into batches, the loss function is computed, and the weights of the CNN-RNN are updated by a backward gradient propagation algorithm according to the computed loss values. After all the training data have been traversed, the iteration value becomes Epoch + 1, and training terminates when Epoch = 2000.
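As a rough illustration of this recipe (CNN-RNN structure, random mini-batches, backward gradient propagation, termination at Epoch = 2000), a PyTorch-style loop might look as follows; the model and the facial action-unit dataset are assumed to be supplied by the caller and are not the filed implementation.

```python
# Illustrative training loop for the video feature extraction network, assuming a
# multi-label annotation of facial action units per clip.
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_video_extractor(model: nn.Module, dataset, epochs: int = 2000):
    loader = DataLoader(dataset, batch_size=16, shuffle=True)   # random mini-batches
    criterion = nn.BCEWithLogitsLoss()       # multi-label facial action-unit loss
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for epoch in range(epochs):              # terminate training when Epoch = 2000
        for frames, au_labels in loader:     # frames: (B, T, C, H, W); labels: (B, n_aus) floats
            optimizer.zero_grad()
            loss = criterion(model(frames), au_labels)
            loss.backward()                  # backward gradient propagation
            optimizer.step()                 # update the CNN-RNN weights
    return model
```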
Before the text data is input into the pre-trained text feature extraction neural network to obtain the text semantic feature sequence, the text feature extraction neural network needs to be trained. Training text data is prepared and labeled with positive/negative sentiment, the word frequencies of the training text data are counted, and the text is segmented according to the words with the highest frequency. The conditional probability function p(w_i | w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}) is trained based on the word2vec method to extract the word features of the text data. The structure of the text feature extraction neural network is defined as a Transformer+Attention+RNN structure, a loss function is defined, the word features of the text data and the positive/negative sentiment labels are input into the text feature extraction neural network for training, and training terminates when the loss function meets a preset condition.
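The conditional probability p(w_i | w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}) can be approximated with a CBOW-style model in the spirit of word2vec; the sketch below is illustrative only, and the vocabulary indexing is assumed to have been built from the segmented training text.

```python
# Minimal CBOW-style sketch: predict the centre word w_i from its four context words.
import torch
from torch import nn

class ContextWordModel(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context):              # context: (B, 4) indices of w_{i±1}, w_{i±2}
        h = self.emb(context).mean(dim=1)    # average the four context embeddings
        return self.out(h)                   # logits over the vocabulary for w_i

def training_step(model, context, target, optimizer):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(context), target)  # -log p(w_i | context)
    loss.backward()
    optimizer.step()
    return loss.item()
```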
S103: Align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence.

In this implementation scenario, both the audio data and the video data carry a time dimension, while the text data does not; therefore the audio semantic feature sequence and the video semantic feature sequence have a time dimension, while the text semantic feature sequence does not. The text semantic feature sequence is aligned to the time dimension of the audio data. In other implementation scenarios, the text semantic feature sequence may instead be aligned to the time dimension of the video data.

In this implementation scenario, each pronunciation phoneme in the audio data can be obtained by speech recognition, the text semantic feature data corresponding to each pronunciation phoneme can be found in the text semantic feature sequence, and each text semantic feature datum in the text semantic feature sequence is aligned with the time dimension of its pronunciation phoneme to generate the text semantic temporal sequence.
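A plain-Python sketch of this phoneme-based alignment, assuming the speech recognizer returns each phoneme together with its time position and the index of the text token it belongs to (a simplification of a real forced-alignment output):

```python
# Align token-level text semantic features onto the audio timeline (cf. S203-S204).
def align_text_to_audio(text_features, phonemes):
    """
    text_features: list of per-token feature vectors (text semantic feature sequence)
    phonemes:      list of (phoneme, time_position, token_index) from speech recognition
    returns:       list of (time_position, feature vector), ordered by time,
                   i.e. the text semantic temporal sequence
    """
    aligned = []
    for _, time_position, token_index in phonemes:
        aligned.append((time_position, text_features[token_index]))
    aligned.sort(key=lambda item: item[0])   # order by the phoneme's time position
    return aligned
```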
S104: Fuse the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence.

In this implementation scenario, taking the time dimension of the audio semantic feature sequence as the reference, the time dimension of the video semantic feature sequence is aligned with that of the audio semantic feature sequence, and the text semantic temporal sequence is already aligned with the audio semantic feature sequence in the time dimension.

The video semantic feature data, audio semantic feature data, and text semantic feature data at each moment are obtained, and the video, audio, and text semantic feature data at the same moment are concatenated into a semantic feature unit. The semantic feature units at each moment are arranged in time order to generate the multi-modal semantic feature sequence.
S105: Input the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network, and take the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.

In this implementation scenario, the multi-modal semantic feature sequence is input into the pre-trained emotion recognition neural network, and the output of the emotion recognition neural network is taken as the target emotion corresponding to the data group to be recognized.

In this implementation scenario, the emotion recognition neural network needs to be trained. Before training, multiple training multi-modal semantic feature sequences are prepared and each is labeled with emotion data. The network structure of the emotion recognition neural network is defined; its number of layers may be defined, for example 19 layers, and its type may also be defined, for example a convolutional neural network or a fully connected neural network. The loss function of the emotion recognition neural network is defined, as well as the condition for terminating training, for example stopping after 2000 training iterations. After training succeeds, the multi-modal semantic feature sequence is input into the emotion recognition neural network, which outputs the target emotion corresponding to the multi-modal semantic feature sequence.
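A hedged sketch of this training procedure; `model` stands for whatever emotion recognition network structure is defined (for example 19 layers), and the loss, optimizer, and stopping condition follow the description rather than any disclosed code.

```python
# Illustrative training of the emotion recognition network on labelled
# multi-modal semantic feature sequences, stopping after 2000 iterations.
import torch
from torch import nn

def train_emotion_net(model: nn.Module, sequences, labels, iterations: int = 2000):
    # sequences: list of (T, D) float tensors; labels: 1-D long tensor of emotion ids
    criterion = nn.CrossEntropyLoss()                  # emotion classification loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for step in range(iterations):                     # stop after 2000 updates
        idx = step % len(sequences)
        x = sequences[idx].unsqueeze(0)                # (1, T, D) feature sequence
        y = labels[idx].unsqueeze(0)                   # (1,) target emotion id
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    return model
```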
It can be seen from the above description that, in this embodiment, after the multi-modal data group to be recognized is obtained, the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data are extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence. Semantic features rather than low-level features are acquired, which can represent the emotional characteristics of the multi-modal data group more accurately, and the feature alignment and fusion preserve the multi-modal spatio-temporal relationships; the target emotion obtained from this multi-modal semantic feature sequence is more accurate, so the accuracy of emotion recognition is effectively improved.
Please refer to Figure 3, which is a schematic flowchart of the second embodiment of the emotion recognition method provided by the present invention. The emotion recognition method provided by the present invention includes the following steps:

S201: Obtain a multi-modal data group to be recognized, the multi-modal data group including at least two of video data, audio data, and/or text data.

S202: Extract a video semantic feature sequence of the video data, extract an audio semantic feature sequence of the audio data, and/or extract a text semantic feature sequence from the text data.

In a specific implementation scenario, steps S201-S202 are basically the same as steps S101-S102 of the first embodiment of the emotion recognition method provided by the present invention and are not repeated here.

S203: Obtain at least one pronunciation phoneme of the audio data, and obtain the text semantic feature data in the text semantic feature sequence corresponding to each pronunciation phoneme.

In this implementation scenario, at least one pronunciation phoneme of the audio data is obtained through ASR (Automatic Speech Recognition) technology, and the text semantic feature data corresponding to each pronunciation phoneme is found in the text semantic feature sequence.

S204: Obtain the time position of each pronunciation phoneme, and align the text semantic feature data with the time position of the corresponding pronunciation phoneme.

In this implementation scenario, the time position of each pronunciation phoneme is obtained, and the text semantic feature data in the text semantic feature sequence is aligned with the time position of the corresponding pronunciation phoneme. For example, if the time position of the pronunciation phoneme "Ah" is 1 minute 32 seconds, the text semantic feature data corresponding to "Ah" in the text semantic feature sequence is aligned with the time position 1 minute 32 seconds.
S205: Separately obtain the video semantic feature data, audio semantic feature data, and text semantic feature data at each moment of the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence.

In this implementation scenario, the video semantic feature sequence also has a time dimension, so the video semantic feature data at each moment can be obtained. Likewise, the audio semantic feature data at each moment can be obtained, and after the text semantic feature data in the text semantic temporal sequence has been aligned with the time dimension of the audio data in step S204, the text semantic feature data at each moment can also be obtained.

S206: Concatenate the video semantic feature data, audio semantic feature data, and/or text semantic feature data at the same moment into a semantic feature unit.

In this implementation scenario, the video semantic feature data, audio semantic feature data, and text semantic feature data are all vectors; concatenating the video, audio, and text semantic feature data at the same moment into a semantic feature unit means concatenating the three vectors into one vector. For example, if the video, audio, and text semantic feature data are each 2-dimensional vectors, the semantic feature unit generated after concatenation is a 6-dimensional vector.
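The concatenation in S206-S207 can be illustrated with numpy; the 2-dimensional feature vectors are the example values used above.

```python
# Per-moment concatenation of video, audio and text semantic feature data.
import numpy as np

video_feats = np.random.rand(10, 2)   # (T, 2) video semantic feature data per moment
audio_feats = np.random.rand(10, 2)   # (T, 2) audio semantic feature data per moment
text_feats  = np.random.rand(10, 2)   # (T, 2) aligned text semantic feature data

units = np.concatenate([video_feats, audio_feats, text_feats], axis=1)
print(units.shape)                    # (10, 6): one 6-dimensional semantic feature unit
                                      # per moment, already arranged in time order
```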
S207: Arrange the semantic feature units at each moment in chronological order to generate the multi-modal semantic feature sequence.

In this implementation scenario, the semantic feature units at each moment are arranged in chronological order to generate the multi-modal semantic feature sequence. The chronological order is the time dimension of the audio semantic feature sequence.

S208: Input the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network, and take the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.

In a specific implementation scenario, step S208 is basically the same as step S105 of the first embodiment of the emotion recognition method provided by the present invention and is not repeated here.

It can be seen from the above description that, in this embodiment, the text semantic feature data corresponding to each pronunciation phoneme of the audio data is obtained from the text semantic feature sequence, the moment corresponding to the text semantic feature data is obtained, the video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment are concatenated into a semantic feature unit, and the semantic feature units at each moment are arranged in chronological order to generate a multi-modal semantic feature sequence. The feature alignment and fusion of the multi-modal spatio-temporal relationships are preserved, and the target emotion obtained from the multi-modal semantic feature sequence is more accurate, so the accuracy of emotion recognition is effectively improved.
Please refer to Figure 4, which is a schematic flowchart of the third embodiment of the emotion recognition method provided by the present invention. The emotion recognition method provided by the present invention includes the following steps:

S301: Obtain a multi-modal data group to be recognized, the multi-modal data group including at least two of video data, audio data, and/or text data.

S302: Extract a video semantic feature sequence of the video data, extract an audio semantic feature sequence of the audio data, and/or extract a text semantic feature sequence from the text data.

S303: Align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence.

S304: Fuse the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence.

In a specific implementation scenario, steps S301-S304 are basically the same as steps S101-S104 of the first embodiment of the emotion recognition method provided by the present invention and are not repeated here.

S305: Input the semantic feature unit at each moment into a pre-trained unit recognition neural network, and take the output of the unit recognition neural network as the emotion recognition result at each moment.

In this implementation scenario, the semantic feature unit at each moment is input into the pre-trained unit recognition neural network, and the output of the unit recognition neural network is taken as the emotion recognition result at each moment.
In this implementation scenario, the unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory (LSTM) layer. The convolutional neural network defines a sensing window of width 2d centered on the current element x_i and performs a fully connected computation on the input elements inside the window. Taking one-dimensional data as an example, let the input be [x_1, x_2, …, x_{n-1}, x_n]; the model of the convolutional neural network is

y_i = σ(∑_{k=-d}^{d} w_k · x_{i+k}),

where σ is a nonlinear activation function and w_k is a shared weight, i.e. inputs with different i but the same k share the same weight.

CNN layers are often used together with a pooling layer. The pooling function is characterized by spatial invariance; common choices are:

Max-pooling: y_i = max_{-d ≤ k ≤ d} x_{i+k}

Average-pooling: y_i = (1 / (2d + 1)) ∑_{k=-d}^{d} x_{i+k}
The long short-term memory network (LSTM) is a sequence labeling model: the output h_t at the current moment t is a function of the current input x_t and the previous output h_{t-1}. One implementation of the LSTM is shown below.

Let x_t be the current input vector, h_{t-1} the output vector at the previous moment, c_{t-1} the cell state vector at the previous moment, and h_t the output vector at the current moment; h_t is calculated as:

f_t = σ(W_f x_t + U_f h_{t-1})

i_t = σ(W_i x_t + U_i h_{t-1})

o_t = σ(W_o x_t + U_o h_{t-1})

c̃_t = tanh(W_c x_t + U_c h_{t-1})

c_t = f_t ∗ c_{t-1} + i_t ∗ c̃_t

h_t = o_t ∗ tanh(c_t)

where W and U denote different weight matrices and tanh is the nonlinear activation function tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}).
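The formulas above can be transcribed almost directly into numpy; the zero-padding at the window borders is an assumption made so the outputs keep the input length.

```python
# Numpy transcription of the windowed convolution, pooling, and a single LSTM step.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_window(x, w, d):
    """y_i = sigma( sum_{k=-d..d} w_k * x_{i+k} ); w has length 2d + 1."""
    x_pad = np.pad(x, d)
    return sigmoid(np.array([np.dot(w, x_pad[i:i + 2 * d + 1]) for i in range(len(x))]))

def max_pool(x, d):
    x_pad = np.pad(x, d, constant_values=-np.inf)
    return np.array([x_pad[i:i + 2 * d + 1].max() for i in range(len(x))])

def avg_pool(x, d):
    x_pad = np.pad(x, d)
    return np.array([x_pad[i:i + 2 * d + 1].mean() for i in range(len(x))])

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM update following the f/i/o, c~, c and h equations; W, U are dicts of matrices."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev)
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```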
In other implementation scenarios, the unit recognition neural network may also include only a single neural network layer, for example an LSTM.
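For reference, a compact PyTorch sketch of a unit recognition network of this general shape (one convolutional layer followed by a bidirectional LSTM); the layer sizes and the number of emotion classes are illustrative and are not taken from the application.

```python
# Sketch: convolutional layer + bidirectional LSTM producing an emotion
# recognition result for the semantic feature unit at every moment.
import torch
from torch import nn

class UnitRecognitionNet(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64, n_emotions: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, units):                  # units: (B, T, in_dim) semantic feature units
        x = self.conv(units.transpose(1, 2))   # Conv1d expects (B, C, T)
        x, _ = self.bilstm(x.transpose(1, 2))  # back to (B, T, hidden) for the LSTM
        return self.head(x)                    # (B, T, n_emotions): result per moment
```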
S306: Sort the emotion recognition results at each moment by time to generate an emotion recognition sequence.

In this implementation scenario, the emotion recognition results at each moment are sorted by time to generate an emotion recognition sequence. Multiple unit recognition neural networks may be set up to output the emotion recognition results of all moments simultaneously, or a single unit recognition neural network may be set up into which the semantic feature units of each moment are input in turn, outputting the emotion recognition results of each moment in turn.

S307: Obtain the weight of the emotion recognition result at each moment, perform a point-wise multiplication of each moment's emotion recognition result with its corresponding weight, input the emotion recognition sequence after this multiplication into the pre-trained emotion recognition neural network, and take the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.

In this implementation scenario, the weight of the emotion recognition result at each moment in the emotion recognition sequence is obtained, and each moment's emotion recognition result is multiplied by its corresponding weight. Within the emotion recognition sequence, the results at different moments influence one another: some results are subconscious reactions while others carry stronger emotion, so different emotion recognition results contribute differently to the target emotion corresponding to the sequence.

In this implementation scenario, an attention operation is performed on the emotion recognition sequence to obtain the weight of the emotion recognition result at each moment:
The weight vector a is obtained by applying a softmax-based attention operation over the emotion recognition sequence E = [e_1, e_2, …, e_n],

where a is the weight of the emotion recognition result at each moment, E is the emotion recognition sequence, and the softmax function is computed as

softmax(z_i) = e^{z_i} / ∑_j e^{z_j}
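A numpy sketch of the weighting step in S307; the linear scoring vector used to produce the attention scores is an assumption, since only the softmax-based attention operation itself is specified.

```python
# Softmax attention weights over the emotion recognition sequence, followed by
# point-wise weighting of each moment's result.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def weight_emotion_sequence(E, score_vector):
    """
    E:            (T, n_emotions) emotion recognition results for each moment
    score_vector: (n_emotions,) assumed learnable scoring vector
    returns:      (T, n_emotions) weighted sequence fed to the emotion network
    """
    scores = E @ score_vector          # one attention score per moment
    a = softmax(scores)                # weight of each moment's result
    return E * a[:, None]              # point-wise multiply each result by its weight
```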
In this implementation scenario, the emotion recognition neural network is a fully connected neural network. A fully connected neural network by default establishes weighted connections between all inputs and outputs. Taking one-dimensional data as an example, let the input be [x_1, x_2, …, x_{n-1}, x_n]; the model of the fully connected network is

y = σ(∑_i w_i · x_i),

where w_i is a network parameter and σ is a nonlinear activation function, a common example being the Sigmoid function σ(x) = 1 / (1 + e^{-x}).
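A minimal PyTorch sketch of a fully connected emotion recognition network of the kind described; the sequence length, hidden size, and number of emotion classes are placeholders.

```python
# Fully connected emotion recognition network: the weighted emotion recognition
# sequence is flattened and mapped to emotion classes.
import torch
from torch import nn

class FullyConnectedEmotionNet(nn.Module):
    def __init__(self, seq_len: int, n_emotions: int = 7, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                              # (B, T, n_emotions) -> (B, T * n_emotions)
            nn.Linear(seq_len * n_emotions, hidden),
            nn.Sigmoid(),                              # sigma(x) = 1 / (1 + e^-x)
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, weighted_sequence):
        return self.net(weighted_sequence)             # logits for the target emotion
```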
It can be seen from the above description that, in this embodiment, the video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment are concatenated into a semantic feature unit, and the semantic feature unit at each moment is input into the unit recognition neural network to obtain the emotion recognition result at each moment. The unit recognition neural network includes a convolutional neural network layer and a bidirectional LSTM layer, which can improve the accuracy of the emotion recognition results.

Please refer to Figure 5, which is a schematic structural diagram of the first embodiment of the intelligent device provided by the present invention. The intelligent device 10 includes an acquisition module 11, an extraction module 12, an alignment module 13, a concatenation module 14, and an emotion module 15. The acquisition module 11 acquires the data group to be recognized, which includes video data, audio data, and text data. The extraction module 12 is used to extract the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and the text semantic feature sequence of the text data. The alignment module 13 is used to align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence. The concatenation module 14 is used to concatenate the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence. The emotion module 15 is used to input the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network to obtain the emotion contained in the data group to be recognized.

It can be seen from the above description that, in this embodiment, after the intelligent device obtains the multi-modal data group to be recognized, it extracts the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence; the feature alignment and fusion of the multi-modal spatio-temporal relationships are preserved, and the target emotion obtained from the multi-modal semantic feature sequence is more accurate, so the accuracy of emotion recognition is effectively improved.
Please continue to refer to Figure 5. The alignment module 13 includes a first acquisition sub-module 131 and an alignment sub-module 132. The first acquisition sub-module 131 is used to obtain at least one pronunciation phoneme of the audio data and the text semantic feature data corresponding to each pronunciation phoneme. The alignment sub-module 132 is used to obtain the time position of each pronunciation phoneme and align the text semantic feature data with the time position of the corresponding pronunciation phoneme.

The concatenation module 14 includes a second acquisition sub-module 141 and a concatenation sub-module 142. The second acquisition sub-module 141 is used to separately obtain the video semantic feature data, audio semantic feature data, and text semantic feature data at each moment of the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence. The concatenation sub-module 142 is used to concatenate the video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment into a semantic feature unit.

The emotion module 15 includes an emotion recognition sub-module 151, an arrangement sub-module 152, and an emotion sub-module 153. The emotion recognition sub-module 151 is used to input the semantic feature unit at each moment into the pre-trained unit recognition neural network to obtain the emotion recognition data at each moment. The arrangement sub-module 152 is used to sort the emotion recognition data at each moment by time to generate an emotion recognition sequence. The emotion sub-module 153 is used to input the emotion recognition sequence into the pre-trained emotion recognition neural network to obtain the emotion contained in the data group to be recognized.

The emotion sub-module 153 includes a weight unit 1531. The weight unit 1531 is used to obtain the weight of the emotion recognition data at each moment, perform a point-wise multiplication of each moment's emotion recognition data with its corresponding weight, and input the resulting emotion recognition sequence into the pre-trained emotion recognition neural network.

The weight unit 1531 is used to perform an attention operation on the emotion recognition sequence to obtain the weight of the emotion recognition data at each moment.

The unit recognition neural network includes a convolutional neural network layer and a bidirectional LSTM layer.

The emotion recognition neural network is a fully connected neural network.

The intelligent device 10 further includes a training module 16 for training the emotion recognition neural network. The training module 16 includes a preparation sub-module 161, a definition sub-module 162, and an input sub-module 163.

The preparation sub-module 161 is used to prepare multiple training multi-modal feature sequences and to label the target emotion of each training multi-modal feature sequence. The definition sub-module 162 is used to define the structure, loss function, and termination condition of the emotion recognition neural network to be trained. The input sub-module 163 is used to input the multiple multi-modal feature sequences and their corresponding target emotions into the emotion recognition neural network for training.

It can be seen from the above description that, in this embodiment, the semantic feature units at each moment are arranged in chronological order to generate a multi-modal semantic feature sequence; semantic features rather than low-level features are acquired, which can represent the emotional characteristics of the multi-modal data group to be recognized more accurately and preserve the feature alignment and fusion of the multi-modal spatio-temporal relationships, so the target emotion obtained from the multi-modal semantic feature sequence is more accurate and the accuracy of emotion recognition is effectively improved. The video semantic feature data, audio semantic feature data, and text semantic feature data at the same moment are concatenated into a semantic feature unit, the semantic feature unit at each moment is input into the unit recognition neural network to obtain the emotion recognition result at each moment, and the unit recognition neural network includes a convolutional neural network layer and a bidirectional LSTM layer, which can improve the accuracy of the emotion recognition results.
Please refer to Figure 6, which is a schematic structural diagram of the second embodiment of the intelligent device provided by the present invention. The intelligent device 20 includes a processor 21, a memory 22, and an acquisition circuit 23. The processor 21 is coupled to the memory 22 and the acquisition circuit 23. A computer program is stored in the memory 22, and the processor 21 executes the computer program in operation to implement the methods shown in Figures 2 to 4. The detailed methods are described above and are not repeated here.

It can be seen from the above description that, in this embodiment, after the intelligent device obtains the multi-modal data group to be recognized, it extracts the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic temporal sequence, and the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence are fused along the time dimension to generate a multi-modal semantic feature sequence; semantic features rather than low-level features are acquired, which can represent the emotional characteristics of the multi-modal data group more accurately and preserve the feature alignment and fusion of the multi-modal spatio-temporal relationships, so the target emotion obtained from the multi-modal semantic feature sequence is more accurate and the accuracy of emotion recognition is effectively improved.

Please refer to Figure 7, which is a schematic structural diagram of an embodiment of the computer-readable storage medium provided by the present invention. At least one computer program 31 is stored in the computer-readable storage medium 30, and the computer program 31 is executed by a processor to implement the methods shown in Figures 2 to 4; the detailed methods are described above and are not repeated here. In one embodiment, the computer-readable storage medium 30 may be a storage chip in a terminal, a hard disk, a removable hard disk, a USB flash drive, an optical disc or another readable and writable storage device, or it may be a server or the like.

It can be seen from the above description that the computer program stored in the storage medium in this embodiment can be used, after the multi-modal data group to be recognized is obtained, to extract the video semantic feature sequence of the video data, the audio semantic feature sequence of the audio data, and/or the text semantic feature sequence of the text data, align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence, and fuse the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence; semantic features rather than low-level features are acquired, which can represent the emotional characteristics of the multi-modal data group more accurately and preserve the feature alignment and fusion of the multi-modal spatio-temporal relationships, so the target emotion obtained from the multi-modal semantic feature sequence is more accurate and the accuracy of emotion recognition is effectively improved.

In contrast to the prior art, the present invention acquires semantic features rather than low-level features, which can represent the emotional characteristics of the multi-modal data group to be recognized more accurately and preserve the feature alignment and fusion of the multi-modal spatio-temporal relationships; the target emotion obtained from the multi-modal semantic feature sequence is more accurate, so the accuracy of emotion recognition is effectively improved.

The above disclosure is only the preferred embodiments of the present invention and certainly cannot be used to limit the scope of rights of the present invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (20)

  1. An emotion recognition method, characterized by comprising:
    obtaining a multi-modal data group to be recognized, the multi-modal data group to be recognized including at least two of video data, audio data, and/or text data;
    extracting a video semantic feature sequence of the video data, extracting an audio semantic feature sequence of the audio data, and/or extracting a text semantic feature sequence from the text data;
    aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence;
    fusing the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence;
    inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network, and taking the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.
  2. The emotion recognition method according to claim 1, characterized in that the step of aligning the text semantic feature sequence to the time dimension of the audio data comprises:
    obtaining at least one pronunciation phoneme of the audio data, and obtaining the text semantic feature data in the text semantic feature sequence corresponding to each pronunciation phoneme;
    obtaining the time position of each pronunciation phoneme, and aligning the text semantic feature data with the time position of the corresponding pronunciation phoneme.
  3. The emotion recognition method according to claim 2, characterized in that the step of inputting the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network comprises:
    separately obtaining the video semantic feature data, the audio semantic feature data, and/or the text semantic feature data at each moment of the video semantic feature sequence, the audio semantic feature sequence, and/or the text semantic temporal sequence;
    concatenating the video semantic feature data, the audio semantic feature data, and/or the text semantic feature data at the same moment into a semantic feature unit.
  4. The emotion recognition method according to claim 3, characterized in that the step of inputting the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network to obtain the emotion contained in the data group to be recognized comprises:
    inputting the semantic feature unit at each moment into a pre-trained unit recognition neural network, and taking the output of the unit recognition neural network as the emotion recognition result at each moment;
    sorting the emotion recognition results at each moment by time to generate an emotion recognition sequence;
    inputting the emotion recognition sequence into the pre-trained emotion recognition neural network to obtain the emotion contained in the multi-modal data group to be recognized.
  5. The emotion recognition method according to claim 4, characterized in that the step of inputting the emotion recognition sequence into the pre-trained emotion recognition neural network comprises:
    obtaining the weight of the emotion recognition result at each moment, performing a point-wise multiplication of each moment's emotion recognition result with its corresponding weight, and inputting the emotion recognition sequence after the point-wise multiplication into the pre-trained emotion recognition neural network.
  6. The emotion recognition method according to claim 5, characterized in that
    the step of obtaining the weight of the emotion recognition result at each moment comprises:
    performing an attention operation on the emotion recognition sequence to obtain the weight of the emotion recognition result at each moment.
  7. The emotion recognition method according to claim 4, characterized in that
    the unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory (LSTM) layer.
  8. The emotion recognition method according to claim 1, characterized in that
    the emotion recognition neural network is a fully connected neural network.
  9. The emotion recognition method according to claim 1, characterized in that, before the step of inputting the multi-modal semantic feature sequence into the pre-trained emotion recognition neural network, the method comprises:
    training the emotion recognition neural network;
    the step of training the emotion recognition neural network comprising:
    preparing a plurality of training multi-modal feature sequences, and labeling the target emotion of each training multi-modal feature sequence;
    defining the structure, loss function, and termination condition of the emotion recognition neural network to be trained;
    inputting the plurality of multi-modal feature sequences and their corresponding target emotions into the emotion recognition neural network for training.
  10. An intelligent device, characterized by comprising:
    an acquisition module for acquiring a data group to be recognized, the data group to be recognized including video data, audio data, and text data;
    an extraction module for extracting a video semantic feature sequence of the video data, extracting an audio semantic feature sequence of the audio data, and extracting a text semantic feature sequence from the text data;
    an alignment module for aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic temporal sequence;
    a concatenation module for concatenating the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence along the time dimension to generate a multi-modal semantic feature sequence;
    an emotion module for inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network and taking the output of the emotion recognition neural network as the target emotion corresponding to the data group to be recognized.
  11. The intelligent device according to claim 10, characterized in that the alignment module comprises:
    a first acquisition sub-module for obtaining at least one pronunciation phoneme of the audio data and obtaining the text semantic feature data corresponding to each pronunciation phoneme;
    an alignment sub-module for obtaining the time position of each pronunciation phoneme and aligning the text semantic feature data with the time position of the corresponding pronunciation phoneme.
  12. The intelligent device according to claim 10, characterized in that the concatenation module comprises:
    a second acquisition sub-module for separately obtaining the video semantic feature data, the audio semantic feature data, and the text semantic feature data at each moment of the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence;
    a concatenation sub-module for concatenating the video semantic feature data, the audio semantic feature data, and the text semantic feature data at the same moment into a semantic feature unit.
  13. The intelligent device according to claim 12, characterized in that the emotion module comprises:
    an emotion recognition sub-module for inputting the semantic feature unit at each moment into a pre-trained unit recognition neural network to obtain the emotion recognition result at each moment;
    an arrangement sub-module for sorting the emotion recognition results at each moment by time to generate an emotion recognition sequence;
    an emotion sub-module for inputting the emotion recognition sequence into the pre-trained emotion recognition neural network to obtain the emotion contained in the data group to be recognized.
  14. The intelligent device according to claim 13, characterized in that the emotion sub-module comprises:
    a weight unit for obtaining the weight of the emotion recognition result at each moment, performing a point-wise multiplication of each moment's emotion recognition result with its corresponding weight, and inputting the resulting emotion recognition sequence into the pre-trained emotion recognition neural network.
  15. The intelligent device according to claim 14, characterized in that
    the weight unit is used to perform an attention operation on the emotion recognition sequence to obtain the weight of the emotion recognition result at each moment.
  16. The intelligent device according to claim 13, characterized in that
    the unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory (LSTM) layer.
  17. The intelligent device according to claim 13, characterized in that
    the emotion recognition neural network is a fully connected neural network.
  18. The intelligent device according to claim 10, characterized in that the intelligent device further comprises a training module for training the emotion recognition neural network;
    the training module comprising:
    a preparation sub-module for preparing a plurality of training multi-modal feature sequences and labeling the target emotion of each training multi-modal feature sequence;
    a definition sub-module for defining the structure, loss function, and termination condition of the emotion recognition neural network to be trained;
    an input sub-module for inputting the plurality of multi-modal feature sequences and their corresponding target emotions into the emotion recognition neural network for training.
  19. An intelligent device, characterized by comprising an acquisition circuit, a processor, and a memory, the processor being coupled to the memory and the acquisition circuit, the memory storing a computer program, and the processor executing the computer program to implement the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, characterized in that it stores a computer program, the computer program being executable by a processor to implement the method according to any one of claims 1 to 9.
PCT/CN2019/130065 2019-12-30 2019-12-30 情感识别方法、智能装置和计算机可读存储介质 WO2021134277A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/130065 WO2021134277A1 (zh) 2019-12-30 2019-12-30 情感识别方法、智能装置和计算机可读存储介质
CN201980003314.8A CN111164601B (zh) 2019-12-30 2019-12-30 情感识别方法、智能装置和计算机可读存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130065 WO2021134277A1 (zh) 2019-12-30 2019-12-30 情感识别方法、智能装置和计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2021134277A1 true WO2021134277A1 (zh) 2021-07-08

Family

ID=70562368

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130065 WO2021134277A1 (zh) 2019-12-30 2019-12-30 情感识别方法、智能装置和计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN111164601B (zh)
WO (1) WO2021134277A1 (zh)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408503A (zh) * 2021-08-19 2021-09-17 明品云(北京)数据科技有限公司 一种情绪识别方法、装置、计算机可读存储介质及设备
CN113470787A (zh) * 2021-07-09 2021-10-01 福州大学 基于神经网络的情绪识别与脱敏训练效果评估方法
CN113688745A (zh) * 2021-08-27 2021-11-23 大连海事大学 一种基于相关节点自动挖掘及统计信息的步态识别方法
CN113704504A (zh) * 2021-08-30 2021-11-26 平安银行股份有限公司 基于聊天记录的情绪识别方法、装置、设备及存储介质
CN113704552A (zh) * 2021-08-31 2021-11-26 哈尔滨工业大学 一种基于跨模态自动对齐和预训练语言模型的情感分析方法、系统及设备
CN113743267A (zh) * 2021-08-25 2021-12-03 中国科学院软件研究所 一种基于螺旋和文本的多模态视频情感可视化方法及装置
CN113837072A (zh) * 2021-09-24 2021-12-24 厦门大学 一种融合多维信息的说话人情绪感知方法
CN114581570A (zh) * 2022-03-01 2022-06-03 浙江同花顺智能科技有限公司 一种三维脸部动作生成方法和系统
CN114821558A (zh) * 2022-03-10 2022-07-29 电子科技大学 基于文本特征对齐的多方向文本检测方法
WO2023084348A1 (en) * 2021-11-12 2023-05-19 Sony Group Corporation Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network
CN116245102A (zh) * 2023-05-11 2023-06-09 广州数说故事信息科技有限公司 一种基于多头注意力和图神经网络的多模态情感识别方法
CN117058405A (zh) * 2023-07-04 2023-11-14 首都医科大学附属北京朝阳医院 一种基于图像的情绪识别方法、系统、存储介质及终端
WO2024011818A1 (zh) * 2022-07-15 2024-01-18 山东海量信息技术研究院 一种数据的情感识别方法、装置、设备及可读存储介质
CN117058405B (zh) * 2023-07-04 2024-05-17 首都医科大学附属北京朝阳医院 一种基于图像的情绪识别方法、系统、存储介质及终端

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753549B (zh) * 2020-05-22 2023-07-21 江苏大学 一种基于注意力机制的多模态情感特征学习、识别方法
CN111832317B (zh) * 2020-07-09 2023-08-18 广州市炎华网络科技有限公司 智能信息导流方法、装置、计算机设备及可读存储介质
CN111898670B (zh) * 2020-07-24 2024-04-05 深圳市声希科技有限公司 多模态情感识别方法、装置、设备及存储介质
CN111723783B (zh) * 2020-07-29 2023-12-08 腾讯科技(深圳)有限公司 一种内容识别方法和相关装置
CN112233698B (zh) * 2020-10-09 2023-07-25 中国平安人寿保险股份有限公司 人物情绪识别方法、装置、终端设备及存储介质
CN112489635B (zh) * 2020-12-03 2022-11-11 杭州电子科技大学 一种基于增强注意力机制的多模态情感识别方法
CN112560622B (zh) * 2020-12-08 2023-07-21 中国联合网络通信集团有限公司 虚拟对象动作控制方法、装置及电子设备
CN112584062B (zh) * 2020-12-10 2023-08-08 上海幻电信息科技有限公司 背景音频构建方法及装置
CN112579745B (zh) * 2021-02-22 2021-06-08 中国科学院自动化研究所 基于图神经网络的对话情感纠错系统
CN114022668B (zh) * 2021-10-29 2023-09-22 北京有竹居网络技术有限公司 一种文本对齐语音的方法、装置、设备及介质
CN114255433B (zh) * 2022-02-24 2022-05-31 首都师范大学 一种基于面部视频的抑郁识别方法、装置及存储介质
CN115101032A (zh) * 2022-06-17 2022-09-23 北京有竹居网络技术有限公司 用于生成文本的配乐的方法、装置、电子设备和介质
CN117033637B (zh) * 2023-08-22 2024-03-22 镁佳(北京)科技有限公司 无效对话拒识模型训练方法、无效对话拒识方法及装置
CN117611845B (zh) * 2024-01-24 2024-04-26 浪潮通信信息系统有限公司 多模态数据的关联识别方法、装置、设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609572A (zh) * 2017-08-15 2018-01-19 中国科学院自动化研究所 基于神经网络和迁移学习的多模态情感识别方法、系统
CN108805089A (zh) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 基于多模态的情绪识别方法
CN109472232A (zh) * 2018-10-31 2019-03-15 山东师范大学 基于多模态融合机制的视频语义表征方法、系统及介质
CN110033029A (zh) * 2019-03-22 2019-07-19 五邑大学 一种基于多模态情感模型的情感识别方法和装置
CN110083716A (zh) * 2019-05-07 2019-08-02 青海大学 基于藏文的多模态情感计算方法及系统
CN110188343A (zh) * 2019-04-22 2019-08-30 浙江工业大学 基于融合注意力网络的多模态情感识别方法
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019132459A1 (ko) * 2017-12-28 2019-07-04 주식회사 써로마인드로보틱스 사용자 정서적 행동 인식을 위한 멀티 모달 정보 결합 방법 및 그 장치
WO2019144542A1 (en) * 2018-01-26 2019-08-01 Institute Of Software Chinese Academy Of Sciences Affective interaction systems, devices, and methods based on affective computing user interface
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN108877801B (zh) * 2018-06-14 2020-10-02 南京云思创智信息科技有限公司 基于多模态情绪识别系统的多轮对话语义理解子系统
CN109614895A (zh) * 2018-10-29 2019-04-12 山东大学 一种基于attention特征融合的多模态情感识别的方法
CN109460737A (zh) * 2018-11-13 2019-03-12 四川大学 一种基于增强式残差神经网络的多模态语音情感识别方法
CN110147548B (zh) * 2019-04-15 2023-01-31 浙江工业大学 基于双向门控循环单元网络和新型网络初始化的情感识别方法

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609572A (zh) * 2017-08-15 2018-01-19 中国科学院自动化研究所 基于神经网络和迁移学习的多模态情感识别方法、系统
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN108805089A (zh) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 基于多模态的情绪识别方法
CN109472232A (zh) * 2018-10-31 2019-03-15 山东师范大学 基于多模态融合机制的视频语义表征方法、系统及介质
CN110033029A (zh) * 2019-03-22 2019-07-19 五邑大学 一种基于多模态情感模型的情感识别方法和装置
CN110188343A (zh) * 2019-04-22 2019-08-30 浙江工业大学 基于融合注意力网络的多模态情感识别方法
CN110083716A (zh) * 2019-05-07 2019-08-02 青海大学 基于藏文的多模态情感计算方法及系统

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470787A (zh) * 2021-07-09 2021-10-01 福州大学 基于神经网络的情绪识别与脱敏训练效果评估方法
CN113470787B (zh) * 2021-07-09 2024-01-30 福州大学 基于神经网络的情绪识别与脱敏训练效果评估方法
CN113408503A (zh) * 2021-08-19 2021-09-17 明品云(北京)数据科技有限公司 一种情绪识别方法、装置、计算机可读存储介质及设备
CN113743267A (zh) * 2021-08-25 2021-12-03 中国科学院软件研究所 一种基于螺旋和文本的多模态视频情感可视化方法及装置
CN113743267B (zh) * 2021-08-25 2023-06-16 中国科学院软件研究所 一种基于螺旋和文本的多模态视频情感可视化方法及装置
CN113688745B (zh) * 2021-08-27 2024-04-05 大连海事大学 一种基于相关节点自动挖掘及统计信息的步态识别方法
CN113688745A (zh) * 2021-08-27 2021-11-23 大连海事大学 一种基于相关节点自动挖掘及统计信息的步态识别方法
CN113704504A (zh) * 2021-08-30 2021-11-26 平安银行股份有限公司 基于聊天记录的情绪识别方法、装置、设备及存储介质
CN113704504B (zh) * 2021-08-30 2023-09-19 平安银行股份有限公司 基于聊天记录的情绪识别方法、装置、设备及存储介质
CN113704552A (zh) * 2021-08-31 2021-11-26 哈尔滨工业大学 一种基于跨模态自动对齐和预训练语言模型的情感分析方法、系统及设备
CN113837072A (zh) * 2021-09-24 2021-12-24 厦门大学 一种融合多维信息的说话人情绪感知方法
WO2023084348A1 (en) * 2021-11-12 2023-05-19 Sony Group Corporation Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network
CN114581570A (zh) * 2022-03-01 2022-06-03 浙江同花顺智能科技有限公司 一种三维脸部动作生成方法和系统
CN114581570B (zh) * 2022-03-01 2024-01-26 浙江同花顺智能科技有限公司 一种三维脸部动作生成方法和系统
CN114821558A (zh) * 2022-03-10 2022-07-29 电子科技大学 基于文本特征对齐的多方向文本检测方法
WO2024011818A1 (zh) * 2022-07-15 2024-01-18 山东海量信息技术研究院 一种数据的情感识别方法、装置、设备及可读存储介质
CN116245102B (zh) * 2023-05-11 2023-07-04 广州数说故事信息科技有限公司 一种基于多头注意力和图神经网络的多模态情感识别方法
CN116245102A (zh) * 2023-05-11 2023-06-09 广州数说故事信息科技有限公司 一种基于多头注意力和图神经网络的多模态情感识别方法
CN117058405A (zh) * 2023-07-04 2023-11-14 首都医科大学附属北京朝阳医院 一种基于图像的情绪识别方法、系统、存储介质及终端
CN117058405B (zh) * 2023-07-04 2024-05-17 首都医科大学附属北京朝阳医院 一种基于图像的情绪识别方法、系统、存储介质及终端

Also Published As

Publication number Publication date
CN111164601A (zh) 2020-05-15
CN111164601B (zh) 2023-07-18

Similar Documents

Publication Publication Date Title
WO2021134277A1 (zh) 情感识别方法、智能装置和计算机可读存储介质
US20220180202A1 (en) Text processing model training method, and text processing method and apparatus
US20210012777A1 (en) Context acquiring method and device based on voice interaction
US11138903B2 (en) Method, apparatus, device and system for sign language translation
WO2020024484A1 (zh) 用于输出数据的方法和装置
WO2019047703A1 (zh) 音频事件检测方法、装置及计算机可读存储介质
WO2018000270A1 (zh) 一种基于用户画像的个性化回答生成方法及系统
WO2021134417A1 (zh) 交互行为预测方法、智能装置和计算机可读存储介质
WO2021127982A1 (zh) 语音情感识别方法、智能装置和计算机可读存储介质
WO2022227765A1 (zh) 生成图像修复模型的方法、设备、介质及程序产品
US20200387676A1 (en) Electronic device for performing translation by sharing context of utterance and operation method therefor
CN112418059A (zh) 一种情绪识别的方法、装置、计算机设备及存储介质
JP2021081713A (ja) 音声信号を処理するための方法、装置、機器、および媒体
WO2021127916A1 (zh) 脸部情感识别方法、智能装置和计算机可读存储介质
WO2022105118A1 (zh) 基于图像的健康状态识别方法、装置、设备及存储介质
CN112910761B (zh) 即时通讯方法、装置、设备、存储介质以及程序产品
CN109961152B (zh) 虚拟偶像的个性化互动方法、系统、终端设备及存储介质
WO2021114682A1 (zh) 会话任务生成方法、装置、计算机设备和存储介质
CN115292467B (zh) 信息处理与模型训练方法、装置、设备、介质及程序产品
Shrivastava et al. Puzzling out emotions: a deep-learning approach to multimodal sentiment analysis
WO2023040545A1 (zh) 一种数据处理方法、装置、设备、存储介质和程序产品
CN116312512A (zh) 面向多人场景的视听融合唤醒词识别方法及装置
CN111401069A (zh) 会话文本的意图识别方法、意图识别装置及终端
CN112559727B (zh) 用于输出信息的方法、装置、设备、存储介质和程序
CN112418254A (zh) 情感识别方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958361

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19958361

Country of ref document: EP

Kind code of ref document: A1