WO2023050708A1 - Emotion recognition method and apparatus, device, and readable storage medium - Google Patents

Emotion recognition method and apparatus, device, and readable storage medium Download PDF

Info

Publication number
WO2023050708A1
WO2023050708A1 PCT/CN2022/078284 CN2022078284W WO2023050708A1 WO 2023050708 A1 WO2023050708 A1 WO 2023050708A1 CN 2022078284 W CN2022078284 W CN 2022078284W WO 2023050708 A1 WO2023050708 A1 WO 2023050708A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
vector
audio
video
hidden state
Prior art date
Application number
PCT/CN2022/078284
Other languages
English (en)
French (fr)
Inventor
王斌强
董刚
赵雅倩
李仁刚
曹其春
刘海威
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023050708A1 publication Critical patent/WO2023050708A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of computer application technology, in particular to an emotion recognition method, device, equipment and readable storage medium.
  • the interaction ranges from early keyboard input to today's touch screen, and even voice input.
  • in applications, voice input is still mostly recognized at the level of semantic content, such as translating speech into text, but such translation completely loses the emotion-related information.
  • emotional information is added to the human-computer interaction through emotion recognition.
  • early emotion recognition was generally single-modal, recognizing the emotional information carried in text or speech. However, the natural expression of human emotion is the coordinated result of multiple senses: emotion is carried not only in the words themselves but also in cues such as intonation. Emotion recognition subsequently became mainly bimodal, focusing on text and sound, and computer vision was later added as well.
  • emotion recognition has therefore come to focus on producing the final recognition result from multimodal information, namely vision, audio and text.
  • however, when existing multimodal fusion algorithms are applied to concrete emotion recognition tasks, the extracted multimodal information is poorly discriminative, which leads to inaccurate emotion recognition results that cannot meet the needs of practical applications.
  • the purpose of this application is to provide an emotion recognition method, apparatus, device and readable storage medium that fuse the feature vectors of different modalities based on a non-uniform attention mechanism, which can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.
  • a method for emotion recognition comprising: performing feature extraction on the text, audio and video corresponding to a target object to obtain a text feature vector, an audio feature vector and a video feature vector; encoding the three feature vectors with long short-term memory networks of different weights to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector; splicing the text hidden state vector with the audio hidden state vector and with the video hidden state vector to obtain a text audio splicing vector and a text video splicing vector; acquiring a text audio attention weight and a text video attention weight; using the splicing vectors and attention weights to obtain a fusion expression vector of non-uniform attention; splicing the fusion expression vector with the three hidden state vectors to obtain a comprehensive feature; and using the comprehensive feature to obtain an emotion recognition result of the target object.
  • the acquiring text audio attention weights and text video attention weights includes:
  • the text hidden state vector and the audio hidden state vector are input to the audio attention layer to obtain the output text audio attention weight;
  • the text hidden state vector and the video hidden state vector are input to the video attention layer to obtain the output text video attention weight.
  • using the comprehensive feature to obtain the emotion recognition result of the target object including:
  • a linear mapping is performed on the comprehensive feature to obtain an emotion recognition result of the target object.
  • performing linear mapping on the integrated features to obtain the emotion recognition result of the target object including:
  • a linear mapping of the number of preset emotion recognition categories is performed on the integrated features to obtain an emotion recognition result of the target object.
  • after the emotion recognition result of the target object is obtained by using the comprehensive feature, the method further includes: outputting, to the target object, interaction information matching the emotion recognition result.
  • obtaining the fusion expression vector of non-uniform attention includes: multiplying the text audio splicing vector by the text audio attention weight to obtain a text audio weighted vector; multiplying the text video splicing vector by the text video attention weight to obtain a text video weighted vector; using a dimensionality reduction layer to reduce the dimensionality of the text audio weighted vector and the text video weighted vector to obtain a text audio dimensionality reduction vector and a text video dimensionality reduction vector; and splicing the two dimensionality reduction vectors and normalizing the result after splicing to obtain the fusion expression vector.
  • optionally, the text hidden state vector is also reduced in dimensionality to obtain a text hidden state dimensionality reduction vector; the splicing step then becomes: splicing the text audio dimensionality reduction vector, the text video dimensionality reduction vector and the text hidden state dimensionality reduction vector, and normalizing the result after splicing to obtain the fusion expression vector.
  • An emotion recognition device comprising:
  • the feature extraction module is used to perform feature extraction on the text, audio and video corresponding to the target object, to obtain text feature vectors, audio feature vectors and video feature vectors;
  • a feature encoding module configured to encode the text feature vector, the audio feature vector and the video feature vector using long short-term memory networks of different weights, to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector;
  • a feature splicing module configured to splice the text hidden state vector with the audio hidden state vector and the video hidden state vector respectively to obtain a text audio splicing vector and a text video splicing vector;
  • a weight determination module is used to obtain text audio attention weights and text video attention weights
  • a weight fusion module for utilizing the text audio stitching vector, the text video stitching vector, the text audio attention weight and the text video attention weight to obtain a fusion expression vector of non-uniform attention
  • a comprehensive feature acquisition module used to splice the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector to obtain a comprehensive feature
  • the recognition result determination module is used to obtain the emotion recognition result of the target object by using the integrated features.
  • An electronic device comprising:
  • a processor configured to implement the steps of the above emotion recognition method when executing the computer program.
  • a readable storage medium where a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the steps of the above emotion recognition method are implemented.
  • to make full use of the strong discriminative power of text features in emotion recognition, a skip connection splices the text representation across levels with the attention-weighted audio hidden state vector and video hidden state vector to obtain the fusion expression vector; the comprehensive feature is then obtained by splicing the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector.
  • finally, the comprehensive feature is used to obtain the emotion recognition result of the target object. That is, fusing the feature vectors of different modalities based on a non-uniform attention mechanism can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.
  • the embodiments of the present application also provide an emotion recognition apparatus, an electronic device and a readable storage medium corresponding to the above emotion recognition method, which have the above technical effects and are not repeated here.
  • FIG. 1 is a flowchart of an emotion recognition method in an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the backbone framework of an emotion recognition network based on a non-uniform attention mechanism in an embodiment of the present application;
  • FIG. 3 is a schematic diagram of multimodal fusion based on a non-uniform attention mechanism in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a specific implementation of an emotion recognition method in an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of an emotion recognition apparatus in an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of an electronic device in an embodiment of the present application;
  • FIG. 7 is a specific schematic structural diagram of an electronic device in an embodiment of the present application.
  • FIG. 1 is a flow chart of an emotion recognition method in an embodiment of the present application.
  • the method can be applied to the backbone framework structure of an emotion recognition network based on a non-uniform attention mechanism as shown in FIG. 2 .
  • the backbone framework structure of emotion recognition network based on non-uniform attention mechanism includes input layer, input mapping layer, feature fusion layer and output layer.
  • the input layer receives input feature data of three different modalities. Since there is a huge semantic gap between data of different modalities, an input mapping layer is designed after the input layer to perform semantic mapping on the input data of each modality, so that the data of the different modalities are projected into their respective semantic spaces.
  • the mapped features are input to the feature fusion layer to generate a fusion feature vector, and finally, the fusion feature vector is input to the output layer to obtain the final emotion recognition result.
  • to better model the temporal information across the sequence, the main framework of the feature fusion layer uses a long short-term memory network.
  • the emotion recognition method includes the following steps:
  • S101 Perform feature extraction on text, audio and video corresponding to the target object to obtain text feature vectors, audio feature vectors and video feature vectors.
  • the target object may specifically be a user of an application that needs to perform emotion recognition.
  • the text (Textual), audio (Acoustic) and video (Visual) for feature extraction may specifically be the text, audio and video input by the user.
  • feature extraction models corresponding to text, audio, and video may be used to extract corresponding features, so as to obtain text feature vectors, audio feature vectors, and video feature vectors.
  • for ease of description, the text feature vector is denoted x^t, the audio feature vector x^a, and the video feature vector (i.e., the image feature vector in the video) x^v; the superscripts t, a and v indicate the textual, acoustic and visual modalities respectively.
  • the long short-term memory network (LSTM, Long-Short Term Memory) is a special kind of recurrent neural network that models the information across time steps by cyclically feeding the data of different time steps into memory structures of identical form.
  • a single memory structure is a set of operations that receives input data and produces intermediate output variables; in an LSTM these intermediate variables are called the hidden state (Hidden States) and the cell state (Cell States).
  • the mapping vector of each modality is modeled by its own LSTM; text data is taken as an example here to explain the operation of the LSTM.
  • suppose the length of a piece of text is L, meaning that it contains L words.
  • after passing through the input mapping layer, each word is output as a mapping vector x_id^t, where id ranges from 1 to L, the superscript t indicates that the vector is the text (Text) representation, and the dimension of the mapping vector is an integer denoted D_m, where m stands for mapping (Mapping).
  • this text mapping vector is the input of the LSTM.
  • a structural feature of the LSTM is that it contains three gating units, each used to control the flow of information: the input gate, the forget gate and the output gate. The output of each gating unit is a vector of the same length as its input, and every value in that vector lies between 0 and 1: 0 means that the information at that position is blocked, 1 means that the information at that position passes through completely, and intermediate values control the information at that position to different degrees.
  • because the memory structures of the LSTM are completely identical (the structure includes not only the computation but also the weights of the computation matrices), two vectors are constructed to keep the form uniform: the hidden state vector h_t and the cell state vector c_t, both of dimension D_h.
  • the input gate controls the information coming from the input text mapping vector x_id^t and from the hidden state vector of the previous time step; the forget gate controls the information flowing from the cell state vector of the previous time step; and the output gate controls how much of the information produced by the input gate and the forget gate flows into the next hidden state.
  • specifically, the above process is described by the following formulas:
    f_id = σ(W_fx · x_id^t + W_fh · h_(id-1))   (forget gate)
    i_id = σ(W_ix · x_id^t + W_ih · h_(id-1))   (input gate)
    o_id = σ(W_ox · x_id^t + W_oh · h_(id-1))   (output gate)
    c̃_id = tanh(W_cx · x_id^t + W_ch · h_(id-1))   (intermediate cell state)
    c_id = f_id * c_(id-1) + i_id * c̃_id
    h_id = o_id * tanh(c_id)
  • where · represents matrix-vector multiplication and * represents element-wise multiplication; W_fx, W_ix, W_ox, W_cx are the matrices that map x_id^t, with dimension D_h × D_m; W_fh, W_ih, W_oh, W_ch are the matrices that map the previous hidden state, with dimension D_h × D_h; c̃_id is an intermediate variable of the cell state; σ is the sigmoid function σ(x) = 1 / (1 + e^(-x)); and tanh is the nonlinear mapping tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x)).
  • the hidden state vector and the cell state vector are continuously updated through the above formulas, and the hidden state vector of each time step is generally used to represent the output feature vector of the current LSTM memory structure.
  • the above is the process by which an LSTM encodes the information of a single modality.
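  • as an illustration, a minimal PyTorch sketch of one such memory cell is given below; it follows the gate formulas above (biases omitted to match the symbol list), and all class and variable names are illustrative rather than taken from the application. In practice the built-in torch.nn.LSTM module provides an equivalent, optimized implementation.

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """Single memory structure implementing the gate formulas above."""
    def __init__(self, d_m: int, d_h: int):
        super().__init__()
        # D_h x D_m matrices that map the input mapping vector x_id
        self.W_fx = nn.Linear(d_m, d_h, bias=False)
        self.W_ix = nn.Linear(d_m, d_h, bias=False)
        self.W_ox = nn.Linear(d_m, d_h, bias=False)
        self.W_cx = nn.Linear(d_m, d_h, bias=False)
        # D_h x D_h matrices that map the previous hidden state
        self.W_fh = nn.Linear(d_h, d_h, bias=False)
        self.W_ih = nn.Linear(d_h, d_h, bias=False)
        self.W_oh = nn.Linear(d_h, d_h, bias=False)
        self.W_ch = nn.Linear(d_h, d_h, bias=False)

    def forward(self, x_id, h_prev, c_prev):
        f = torch.sigmoid(self.W_fx(x_id) + self.W_fh(h_prev))     # forget gate
        i = torch.sigmoid(self.W_ix(x_id) + self.W_ih(h_prev))     # input gate
        o = torch.sigmoid(self.W_ox(x_id) + self.W_oh(h_prev))     # output gate
        c_tilde = torch.tanh(self.W_cx(x_id) + self.W_ch(h_prev))  # intermediate cell state
        c = f * c_prev + i * c_tilde                                # element-wise (*) update
        h = o * torch.tanh(c)                                       # new hidden state
        return h, c
```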
  • a non-uniform attention mechanism is used to fuse the output feature vectors of different modalities during the information transfer process between adjacent time steps.
  • the specific structure is shown in Figure 3: overall, three LSTMs with different weights encode the input textual (Textual) feature vector, acoustic (Acoustic) feature vector and visual (Visual, image-in-video) feature vector, and output the corresponding hidden state and cell state vectors: the text hidden state vector and text cell state vector, the audio hidden state vector and audio cell state vector, and the video hidden state vector and video cell state vector.
  • the cell state vectors are not elaborated further here; they are handled according to the standard LSTM processing described above.
  • the text hidden state vector and the audio hidden state vector are spliced on the feature dimension to obtain the spliced vector, that is, the text and audio spliced vector.
  • the text hidden state vector and the image hidden state vector are feature concatenated in the feature dimension to obtain the concatenated vector, that is, the text video concatenation vector.
  • text audio attention weights and text video attention weights may also be acquired. That is, text-audio attention weights correspond to text-audio stitching vectors, and text-video attention weights correspond to text-video stitching vectors.
  • the text audio attention weight and the text video attention weight are obtained, including:
  • Step 1 Input the text hidden state vector and the audio hidden state vector into the audio attention layer to obtain the output text audio attention weight
  • Step 2 Input the text hidden state vector and the video hidden state vector into the video attention layer to obtain the output text video attention weights.
  • An audio attention layer can be set in advance, such as the audio attention layer (Acoustic Attention Layer) shown in Figure 3.
  • the main structure of this layer is a linear mapping plus a sigmoid function, specifically Linear Layer + Dropout + Sigmoid, where the Linear Layer is the linear mapping layer, Dropout prevents over-fitting of the parameters during training, and Sigmoid normalizes the output of the layer to between 0 and 1 so that it can represent the degree of attention in the attention mechanism.
  • the input of this layer is the text hidden state vector and the audio hidden state vector, and the output is the text audio attention weight; for example, given the first time step's text hidden state vector and audio hidden state vector, the output is a weight vector of the same length as their spliced vector.
  • correspondingly, a video attention layer (also called an image attention layer) can be set, such as the Visual Attention Layer shown in Figure 3. Its main structure is likewise a linear mapping plus a sigmoid function, specifically Linear Layer + Dropout + Sigmoid, where the Linear Layer is the linear mapping layer, Dropout prevents over-fitting of the parameters during training, and Sigmoid normalizes the output of the layer to between 0 and 1 so that it can represent the degree of attention in the attention mechanism.
  • the input of this layer is the text hidden state vector and the video hidden state vector, and the output is the text video attention weight; for example, given the first time step's text hidden state vector and image hidden state vector, the output is a weight vector of the same length as their spliced vector.
  • note that the weights of the linear mapping layers in the audio attention layer and the video attention layer are not shared, i.e. the two layers are not the same.
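  • a minimal sketch of such an attention layer is shown below, assuming the weight vector has the same length as the spliced input; the module and parameter names are illustrative, and the audio and video attention layers are instantiated separately so that their weights are not shared.

```python
import torch
import torch.nn as nn

class ModalityAttentionLayer(nn.Module):
    """Linear Layer + Dropout + Sigmoid, as described for the acoustic/visual attention layers."""
    def __init__(self, d_h: int, dropout: float = 0.1):
        super().__init__()
        # input: text hidden state spliced with the audio (or video) hidden state
        self.linear = nn.Linear(2 * d_h, 2 * d_h)
        self.dropout = nn.Dropout(dropout)

    def forward(self, h_text, h_other):
        spliced = torch.cat([h_text, h_other], dim=-1)            # text audio / text video splicing vector
        return torch.sigmoid(self.dropout(self.linear(spliced)))  # attention weights in (0, 1)

# two separate instances: the linear-layer weights are NOT shared between audio and video
audio_attention = ModalityAttentionLayer(d_h=64)
visual_attention = ModalityAttentionLayer(d_h=64)
```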
  • after the feature information of text and audio and of text and video has been spliced, and the splicing vectors and attention weights have been obtained, fusion can be performed based on the non-uniform attention mechanism to finally obtain the fusion expression vector.
  • specifically, one element is added to the LSTM input: the fusion expression vector z of the non-uniform attention mechanism. z is initialized as an all-zero vector, and the LSTM computation unit also contains learnable parameter matrices related to z.
  • once the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight have been obtained, the fusion expression vector can be assigned, finally yielding the fusion expression vector matching the current text audio splicing vector, text video splicing vector, text audio attention weight and text video attention weight.
  • specifically, using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight to obtain the fusion expression vector of non-uniform attention includes:
  • Step 1 Multiply the text audio splicing vector and the text audio attention weight to obtain the text audio weighted vector
  • Step 2 multiplying the text video splicing vector and the text video attention weight to obtain the text video weighted vector
  • Step 3 using the dimensionality reduction layer to reduce the dimensionality of the text audio weighted vector and the text video weighted vector, to obtain the text audio dimensionality reduction vector and the text video dimensionality reduction vector;
  • Step 4 Concatenate the text audio dimensionality reduction vector and the text video dimensionality reduction vector, and perform normalization processing after splicing to obtain the fusion expression vector.
  • that is, the text audio splicing vector is first weighted: it is multiplied element-wise by the text audio attention weight to obtain the text audio weighted vector, which is the result of assigning weights to the text audio splicing vector.
  • the weight assignment of the text video splicing vector is performed in the same way, yielding the text video weighted vector.
  • in other words, each weighted feature vector is obtained by multiplying the corresponding splicing vector by its weight vector.
  • the dimension reduction layer (Dimension Reduction Layer) further compresses the dimension of the feature vectors containing semantic information; its structure is defined as Linear Layer + Dropout, where the Linear Layer is a linear mapping layer and Dropout prevents over-fitting of the parameters.
  • the text audio weighted vector and the text video weighted vector are passed through different dimensionality reduction layers, and the output vectors, i.e. the text audio dimensionality reduction vector and the text video dimensionality reduction vector, are concatenated (Concatenate) and then normalized with the normalized exponential function (the softmax function) to obtain the final fusion expression vector of non-uniform attention.
  • optionally, in order to make full use of the effective information in the text representation, dimensionality reduction can also be performed on the text hidden state vector to obtain a text hidden state dimensionality reduction vector.
  • in that case, splicing the dimensionality reduction vectors and normalizing after splicing to obtain the fusion expression vector includes: splicing the text audio dimensionality reduction vector, the text video dimensionality reduction vector and the text hidden state dimensionality reduction vector, and performing normalization after splicing to obtain the fusion expression vector.
  • that is to say, the text hidden state vector, the text audio weighted vector and the text video weighted vector are each passed through a different dimensionality reduction layer, the output vectors are spliced together, and the result is normalized by the softmax function to obtain the final fusion expression vector z_1 of non-uniform attention; a code sketch of this fusion step is given below.
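  • the sketch below illustrates this fusion step under assumed dimensions: the splicing vectors are weighted element-wise with their attention weights, each branch is reduced with Linear Layer + Dropout, the reduced vectors (including the reduced text hidden state in the optional variant) are concatenated, and softmax normalization yields the fusion expression vector; names and sizes are illustrative only.

```python
import torch
import torch.nn as nn

class NonUniformAttentionFusion(nn.Module):
    """Weight, reduce, concatenate and softmax-normalize, as described above."""
    def __init__(self, d_h: int, d_reduced: int, dropout: float = 0.1):
        super().__init__()
        self.reduce_ta = nn.Sequential(nn.Linear(2 * d_h, d_reduced), nn.Dropout(dropout))  # text audio branch
        self.reduce_tv = nn.Sequential(nn.Linear(2 * d_h, d_reduced), nn.Dropout(dropout))  # text video branch
        self.reduce_t  = nn.Sequential(nn.Linear(d_h, d_reduced), nn.Dropout(dropout))      # optional text branch

    def forward(self, splice_ta, w_ta, splice_tv, w_tv, h_text):
        weighted_ta = splice_ta * w_ta      # text audio weighted vector
        weighted_tv = splice_tv * w_tv      # text video weighted vector
        z = torch.cat([self.reduce_ta(weighted_ta),
                       self.reduce_tv(weighted_tv),
                       self.reduce_t(h_text)], dim=-1)
        return torch.softmax(z, dim=-1)     # fusion expression vector of non-uniform attention
```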
  • after the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector have been obtained, they can be spliced to obtain the comprehensive feature.
  • the splicing sequence there is no limitation on the splicing sequence, and it is only necessary to ensure that the sequence is consistent during training and application.
  • a linear mapping may be performed on the integrated features to obtain the emotion recognition result of the target object.
  • performing linear mapping on the comprehensive feature to obtain the emotion recognition result of the target object may specifically include: performing linear mapping on the comprehensive feature with a preset number of emotion recognition categories to obtain the emotion recognition result of the target object.
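  • for illustration, the output step can be sketched as follows (all sizes are assumptions): the fusion expression vector and the three hidden state vectors are spliced into the comprehensive feature, which is linearly mapped to the configured number of emotion recognition categories.

```python
import torch
import torch.nn as nn

d_h, d_z, num_classes = 64, 48, 2       # illustrative dimensions and category count
z       = torch.randn(1, d_z)           # fusion expression vector
h_text  = torch.randn(1, d_h)           # final hidden state of each modality
h_audio = torch.randn(1, d_h)
h_video = torch.randn(1, d_h)

comprehensive = torch.cat([z, h_text, h_audio, h_video], dim=-1)  # comprehensive feature
head = nn.Linear(d_z + 3 * d_h, num_classes)                      # linear mapping to category count
emotion_logits = head(comprehensive)
```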
  • the interaction information matching the emotion recognition result can also be output to the target object.
  • the emotion recognition result can also be saved, so as to track the emotion change of the target object.
  • to make full use of the strong discriminative power of text features in emotion recognition, a skip connection splices the text representation across levels with the attention-weighted audio hidden state vector and video hidden state vector to obtain the fusion expression vector; the comprehensive feature is then obtained by splicing the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector.
  • finally, the comprehensive feature is used to obtain the emotion recognition result of the target object. That is, fusing the feature vectors of different modalities based on a non-uniform attention mechanism can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.
  • overall, the data are divided into training and test data. Before the implementation starts, the training data are constructed and the model is defined; the training data are then used to update the model parameters. If the model convergence condition is not met, parameter updating continues; if it is met, the test phase begins: the test data are input, the model computes the output result, and the whole process ends.
  • the model convergence condition here is not limited to the number of training iterations reaching a set value or the training error decreasing to a stable range; a threshold on the error between the predicted value and the true value can also be set, and training is judged to stop when the model error falls below the given threshold.
  • the model loss function can be chosen according to the number of emotion categories contained in the input data: for two categories (generally defined as positive and negative emotion), the mean absolute error (Mean Absolute Error) can be used as the loss function, and other measures such as the mean squared error (Mean Square Error) are also possible; for more categories, a cross-entropy loss suitable for multi-class classification, or other improvements suited to multi-class models, can be selected.
  • for updating the model parameters, the RMSprop (Root Mean Square propagation) algorithm can be used, and other gradient-descent-based parameter optimization methods also apply, including but not limited to Stochastic Gradient Descent (SGD), Adagrad (Adaptive Subgradient), Adam (Adaptive Moment Estimation), Adamax (an Adam variant based on the infinity norm), ASGD (Averaged Stochastic Gradient Descent) and RMSprop.
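  • a small sketch of how the loss function and optimizer could be selected in PyTorch is given below; the learning rate and the helper names are assumptions, not values from the application.

```python
import torch.nn as nn
import torch.optim as optim

def build_loss(num_classes: int) -> nn.Module:
    # two categories: mean absolute error on the score (MSE would also work);
    # more categories: a cross-entropy loss for multi-class classification
    return nn.L1Loss() if num_classes == 2 else nn.CrossEntropyLoss()

def build_optimizer(model: nn.Module, lr: float = 1e-3) -> optim.Optimizer:
    # RMSprop as in the embodiment; SGD, Adagrad, Adam, Adamax or ASGD are alternatives
    return optim.RMSprop(model.parameters(), lr=lr)
```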
  • a multimodal emotion recognition data collection is obtained that contains three data sets: CMUMOSI, CMUMOSEI and IEMOCAP; CMUMOSI is used as the example here.
  • it should be noted that the same operations also apply to similar data sets, including but not limited to CMUMOSEI and IEMOCAP.
  • the CMUMOSI dataset contains 2199 selfie video clips, which are divided into three parts as a whole: training set, validation set and test set.
  • the training set can contain 1284 sample data
  • the verification set can contain 229 sample data
  • the test set can contain 686 sample data.
  • the data of the different modalities are as follows: the text is a sentence containing at most 50 words, padded with zeros if it has fewer than 50 words; the image data (i.e. the images in the video) are feature expressions of the video sequence aligned with each word, each video sequence being represented by a vector of dimension 20; likewise, the audio segment corresponding to each word is compressed into a feature expression, each audio segment being a vector of dimension 5.
  • each sample data corresponds to a value, and the range of values is (-3, 3), representing the most negative emotion to the most positive emotion respectively.
  • in this implementation, 0 is used as the dividing line to turn emotion recognition into a two-class task (values greater than or equal to 0 are defined as positive emotion, and values less than 0 as negative emotion).
  • regarding the loss function: according to the specific implementation, an appropriate loss function is selected to measure, during training, the difference between the model's predicted output and the label values in the data set.
  • because this implementation is a binary classification, the mean absolute error (Mean Absolute Error) is used as the loss function here.
  • following the parameter optimization methods above, a suitable optimizer is chosen according to the actual implementation; the RMSprop (Root Mean Square propagation) method is used here to update the parameters.
  • during training, the parameters are first updated on the training set; after each pass over the entire training set (one Epoch), the loss on the validation set is computed and recorded. The number of training Epochs is set to 10 here, and the model with the smallest validation loss is selected as the model output by the final training, as sketched below.
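  • a sketch of this training procedure is given below; the data loader format and helper names are assumptions.

```python
import copy
import torch

def train(model, train_loader, val_loader, loss_fn, optimizer, num_epochs: int = 10):
    """Update on the training set, record the validation loss after every epoch,
    and keep the model with the smallest validation loss as the final output."""
    best_loss, best_state = float("inf"), None
    for epoch in range(num_epochs):
        model.train()
        for text, audio, video, label in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(text, audio, video), label)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(t, a, v), y).item()
                           for t, a, v, y in val_loader) / len(val_loader)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)   # model with the smallest loss on the validation set
    return model
```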
  • the information of the three modalities in the test data is input into the trained model for forward calculation, and the final emotion recognition output is obtained.
  • it can be seen that this emotion recognition method is implemented with a model built around the non-uniform attention mechanism module.
  • the idea of the non-uniform attention mechanism is to apply attention separately according to the input of each modality.
  • concretely, the strongly discriminative text feature is used as the main feature to guide the fusion of the other two features; the module mainly comprises the feature splicing operation, two attention layers, two dimensionality reduction layers connected to the attention layers and a dimensionality reduction layer for the text feature, followed by concatenation plus softmax to obtain the fused feature expression. It is worth noting that what is protected here is the framework of the non-uniform attention mechanism; the concrete attention layers and dimensionality reduction layers can be replaced by other modules with similar functions.
  • configurable number of emotion recognition categories: for the emotion recognition task, this application divides emotion recognition into binary and multi-class classification according to how the data set labels are partitioned in the specific implementation, adapts different loss functions for error measurement to the different task types, and can likewise be adapted to a variety of model parameter optimization algorithms for updating the model parameters.
  • scalability of the multi-angle attention mechanism: in addition to the emotion recognition task described in the embodiments, the approach can also be applied to various other tasks involving multimodal feature fusion, such as multimodal video classification and multimodal video person recognition.
  • compared with existing multimodal emotion recognition methods, the emotion recognition method proposed in this application, i.e. the multimodal emotion recognition method based on the non-uniform attention mechanism, has the following significant advantages: (1) exploiting the fact that the features of different modalities contribute differently to the discriminability of the final recognition task, different attention mechanisms are used to weight the information from each modality; (2) the strong discriminability of text features in emotion recognition is fully used, and a skip connection splices, across levels, with the audio and image fusion features after the attention layers, compensating for the loss of text information in the attention computation; (3) the number of emotion recognition categories is configurable: by partitioning the labels of the data set into classes, different numbers of emotion types can be recognized, and different loss functions are selected for updating the model parameters according to the configured number.
  • it should be noted that the number of attention layers in this application is not limited to one: by extending the same structure with different weight parameters, the outputs of attention modules from different angles can be spliced together. Only the input dimension of the subsequent dimensionality reduction operation needs to change, and no other structure of the network needs to be modified, thereby realizing a multi-angle, multi-head attention mechanism.
  • an embodiment of the present application further provides an emotion recognition device, and the emotion recognition device described below and the emotion recognition method described above may be referred to in correspondence.
  • the device includes the following modules:
  • the feature extraction module 101 is used to perform feature extraction on text, audio and video corresponding to the target object, to obtain text feature vectors, audio feature vectors and video feature vectors;
  • the feature coding module 102 is used to encode text feature vectors, audio feature vectors and video feature vectors using long and short-term memory networks of different weights respectively to obtain text hidden state vectors, audio hidden state vectors and video hidden state vectors;
  • the feature splicing module 103 is used to splice the text hidden state vector with the audio hidden state vector and with the video hidden state vector respectively, to obtain the text audio splicing vector and the text video splicing vector;
  • the weight determination module 104 is used to acquire the text audio attention weight and the text video attention weight;
  • the weight fusion module 105 is used to obtain the fusion expression vector of non-uniform attention by using the text audio stitching vector, the text video stitching vector, the text audio attention weight and the text video attention weight;
  • the comprehensive feature acquisition module 106 is used for splicing fusion expression vectors, text hidden state vectors, audio hidden state vectors and video hidden state vectors to obtain comprehensive features;
  • the recognition result determination module 107 is used to obtain the emotion recognition result of the target object by using the comprehensive feature.
  • to make full use of the strong discriminative power of text features in emotion recognition, the apparatus likewise uses a skip connection to splice, across levels, with the attention-weighted audio hidden state vector and video hidden state vector to obtain the fusion expression vector, and then obtains the comprehensive feature by splicing the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector.
  • finally, the comprehensive feature is used to obtain the emotion recognition result of the target object. That is, fusing the feature vectors of different modalities based on a non-uniform attention mechanism can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.
  • in a specific embodiment of the present application, the weight determination module 104 is specifically configured to input the text hidden state vector and the audio hidden state vector into the audio attention layer to obtain the output text audio attention weight, and to input the text hidden state vector and the video hidden state vector into the video attention layer to obtain the output text video attention weight.
  • the recognition result determination module 107 is specifically configured to linearly map the integrated features to obtain the emotion recognition result of the target object.
  • the recognition result determination module 107 is specifically configured to perform linear mapping of the number of preset emotion recognition categories on the integrated features to obtain the emotion recognition result of the target object.
  • the emotion interaction module is configured to output interaction information matching the emotion recognition result to the target object after the emotion recognition result of the target object is obtained by using the integrated features.
  • the weight fusion module 105 is specifically used to multiply the text audio splicing vector by the text audio attention weight to obtain the text audio weighted vector; multiply the text video splicing vector by the text video attention weight to obtain the text video weighted vector; use dimensionality reduction layers to reduce the dimensionality of the text audio weighted vector and the text video weighted vector to obtain the text audio dimensionality reduction vector and the text video dimensionality reduction vector; and splice the text audio dimensionality reduction vector and the text video dimensionality reduction vector and normalize the result after splicing to obtain the fusion expression vector.
  • the text dimension reduction module is used to reduce the dimension of the text hidden state vector to obtain the text hidden state dimension reduction vector
  • the weight fusion module 105 is specifically used to concatenate the text audio dimensionality reduction vector, the text video dimensionality reduction vector and the text implicit state dimensionality reduction vector, and perform normalization processing after splicing to obtain the fusion expression vector.
  • the embodiment of the present application also provides an electronic device, and the electronic device described below and the emotion recognition method described above can be referred to in correspondence.
  • the electronic equipment includes:
  • memory 332 for storing computer programs
  • the processor 322 is configured to implement the steps of the emotion recognition method in the above method embodiment when executing the computer program.
  • FIG. 7 is a schematic structural diagram of an electronic device provided in this embodiment.
  • the electronic device may differ considerably depending on configuration or performance, and may include one or more central processing units (CPU) 322 (e.g., one or more processors) and a memory 332 that stores one or more computer application programs 342 or data 344.
  • the storage 332 may be a short-term storage or a persistent storage.
  • the program stored in the memory 332 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the data processing device.
  • the central processing unit 322 may be configured to communicate with the memory 332 , and execute a series of instruction operations in the memory 332 on the electronic device 301 .
  • the electronic device 301 may also include one or more power sources 326 , one or more wired or wireless network interfaces 350 , one or more input and output interfaces 358 , and/or, one or more operating systems 341 .
  • the steps in the emotion recognition method described above can be realized by the structure of the electronic device.
  • the embodiment of the present application further provides a readable storage medium, and a readable storage medium described below and an emotion recognition method described above can be referred to in correspondence.
  • a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the emotion recognition method in the foregoing method embodiments are implemented.
  • the readable storage medium can be a USB flash drive, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disk, etc., which can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses an emotion recognition method, an emotion recognition apparatus, a device and a readable storage medium. Considering that the features of different modalities contribute differently to the discriminability of the final emotion recognition task, after the feature vectors of each modality are extracted, long short-term memory networks with different weights are used to encode the feature vectors of each modality to obtain the corresponding hidden state vectors. To make full use of the strong discriminability of text features in emotion recognition, a skip connection is used to splice, across levels, with the attention-weighted audio hidden state vector and video hidden state vector to obtain a fusion expression vector, and the related vectors are then spliced to obtain a comprehensive feature. Finally, the comprehensive feature is used to obtain the emotion recognition result of the target object. That is, fusing the feature vectors of different modalities based on a non-uniform attention mechanism can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.

Description

Emotion recognition method and apparatus, device, and readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 29, 2021, with application number 202111148250.6 and the invention title "Emotion recognition method and apparatus, device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer application technology, and in particular to an emotion recognition method, apparatus, device and readable storage medium.
Background
Emotion plays an important part in people's everyday interaction. In applications, interaction has evolved from early keyboard input to today's touch screens and even voice input. For voice input, however, recognition still mostly remains at the level of semantic content, such as translating speech into text, and such translation completely loses the emotion-related information.
To give applications a better human-computer interaction experience, emotion information is added to human-computer interaction through emotion recognition. Early emotion recognition was generally single-modal, recognizing the emotion information carried in text or speech. However, the natural expression of human emotion is the coordinated result of multiple senses: emotion is carried not only in the words but also in cues such as intonation. Emotion recognition subsequently became mainly bimodal, focusing on text and sound, and computer vision was later added as well.
That is, emotion recognition has come to focus on producing the final recognition result from multimodal information, namely vision, audio and text. However, when existing multimodal fusion algorithms are applied to concrete emotion recognition, the extracted multimodal information is poorly discriminative, which leads to inaccurate emotion recognition results that cannot meet the needs of practical applications.
In summary, how to effectively solve problems such as the poor discriminability of information in emotion recognition is a technical problem that those skilled in the art urgently need to solve.
Summary
The purpose of the present application is to provide an emotion recognition method, apparatus, device and readable storage medium that fuse the feature vectors of different modalities based on a non-uniform attention mechanism, which can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.
To solve the above technical problem, the present application provides the following technical solutions:
An emotion recognition method, comprising:
performing feature extraction on text, audio and video corresponding to a target object to obtain a text feature vector, an audio feature vector and a video feature vector;
encoding the text feature vector, the audio feature vector and the video feature vector using long short-term memory networks with different weights, to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector;
splicing the text hidden state vector with the audio hidden state vector and with the video hidden state vector respectively, to obtain a text audio splicing vector and a text video splicing vector;
acquiring a text audio attention weight and a text video attention weight;
obtaining a fusion expression vector of non-uniform attention by using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight;
splicing the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector to obtain a comprehensive feature; and
obtaining an emotion recognition result of the target object by using the comprehensive feature.
Optionally, acquiring the text audio attention weight and the text video attention weight includes:
inputting the text hidden state vector and the audio hidden state vector into an audio attention layer to obtain the output text audio attention weight; and
inputting the text hidden state vector and the video hidden state vector into a video attention layer to obtain the output text video attention weight.
Optionally, obtaining the emotion recognition result of the target object by using the comprehensive feature includes:
performing linear mapping on the comprehensive feature to obtain the emotion recognition result of the target object.
Optionally, performing linear mapping on the comprehensive feature to obtain the emotion recognition result of the target object includes:
performing, on the comprehensive feature, a linear mapping to a preset number of emotion recognition categories to obtain the emotion recognition result of the target object.
Optionally, after the emotion recognition result of the target object is obtained by using the comprehensive feature, the method further includes:
outputting, to the target object, interaction information matching the emotion recognition result.
Optionally, obtaining the fusion expression vector of non-uniform attention by using the text audio splicing vector, the text audio attention weight, the text video splicing vector and the text video attention weight includes:
multiplying the text audio splicing vector by the text audio attention weight to obtain a text audio weighted vector;
multiplying the text video splicing vector by the text video attention weight to obtain a text video weighted vector;
using a dimensionality reduction layer to reduce the dimensionality of the text audio weighted vector and the text video weighted vector, to obtain a text audio dimensionality reduction vector and a text video dimensionality reduction vector; and
splicing the text audio dimensionality reduction vector and the text video dimensionality reduction vector, and performing normalization after splicing, to obtain the fusion expression vector.
Optionally, the method further includes:
reducing the dimensionality of the text hidden state vector to obtain a text hidden state dimensionality reduction vector;
correspondingly, splicing the text audio dimensionality reduction vector and the text video dimensionality reduction vector and performing normalization after splicing to obtain the fusion expression vector includes:
splicing the text audio dimensionality reduction vector, the text video dimensionality reduction vector and the text hidden state dimensionality reduction vector, and performing normalization after splicing, to obtain the fusion expression vector.
An emotion recognition apparatus, comprising:
a feature extraction module, configured to perform feature extraction on the text, audio and video corresponding to a target object to obtain a text feature vector, an audio feature vector and a video feature vector;
a feature encoding module, configured to encode the text feature vector, the audio feature vector and the video feature vector with long short-term memory networks of different weights, to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector;
a feature splicing module, configured to splice the text hidden state vector with the audio hidden state vector and with the video hidden state vector respectively, to obtain a text audio splicing vector and a text video splicing vector;
a weight determination module, configured to acquire a text audio attention weight and a text video attention weight;
a weight fusion module, configured to obtain a fusion expression vector of non-uniform attention by using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight;
a comprehensive feature acquisition module, configured to splice the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector to obtain a comprehensive feature; and
a recognition result determination module, configured to obtain an emotion recognition result of the target object by using the comprehensive feature.
An electronic device, comprising:
a memory, configured to store a computer program; and
a processor, configured to implement the steps of the above emotion recognition method when executing the computer program.
A readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above emotion recognition method.
By applying the method provided by the embodiments of the present application, feature extraction is performed on the text, audio and video corresponding to the target object to obtain a text feature vector, an audio feature vector and a video feature vector; the three feature vectors are encoded with long short-term memory networks of different weights to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector; the text hidden state vector is spliced with the audio hidden state vector and with the video hidden state vector to obtain a text audio splicing vector and a text video splicing vector; a text audio attention weight and a text video attention weight are acquired; the splicing vectors and the attention weights are used to obtain a fusion expression vector of non-uniform attention; the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector are spliced to obtain a comprehensive feature; and the comprehensive feature is used to obtain the emotion recognition result of the target object.
Considering that the features of different modalities contribute differently to the discriminability of the final emotion recognition task, different attention mechanisms are used in the present application to weight the information from each modality; that is, after the text, audio and video feature vectors are extracted, long short-term memory networks with different weights encode them to obtain the text, audio and video hidden state vectors. In addition, to make full use of the strong discriminability of text features in emotion recognition, a skip connection is used to splice, across levels, with the attention-weighted audio hidden state vector and video hidden state vector to obtain the fusion expression vector, and the comprehensive feature is then obtained by splicing the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector. Finally, the comprehensive feature is used to obtain the emotion recognition result of the target object. That is, fusing the feature vectors of different modalities based on a non-uniform attention mechanism can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.
Correspondingly, the embodiments of the present application also provide an emotion recognition apparatus, a device and a readable storage medium corresponding to the above emotion recognition method, which have the above technical effects and are not repeated here.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a flowchart of an emotion recognition method in an embodiment of the present application;
FIG. 2 is a schematic diagram of the backbone framework of an emotion recognition network based on a non-uniform attention mechanism in an embodiment of the present application;
FIG. 3 is a schematic diagram of multimodal fusion based on a non-uniform attention mechanism in an embodiment of the present application;
FIG. 4 is a schematic diagram of a specific implementation of an emotion recognition method in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an emotion recognition apparatus in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device in an embodiment of the present application;
FIG. 7 is a specific schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solution of the present application, the present application is further described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Please refer to FIG. 1, which is a flowchart of an emotion recognition method in an embodiment of the present application. The method can be applied to the backbone framework of the emotion recognition network based on a non-uniform attention mechanism shown in FIG. 2. This backbone framework includes an input layer, an input mapping layer, a feature fusion layer and an output layer. The input layer receives feature data of three different modalities. Because there is a huge semantic gap between data of different modalities, an input mapping layer is designed after the input layer to semantically map the input data of each modality, so that the data of the different modalities are projected into their respective semantic spaces. The mapped features are then fed into the feature fusion layer to produce a fused feature vector, and finally the fused feature vector is fed into the output layer to obtain the final emotion recognition result. To better model the temporal information across the video sequence, the main framework of the feature fusion layer uses a long short-term memory network.
The emotion recognition method includes the following steps.
S101. Perform feature extraction on the text, audio and video corresponding to the target object to obtain a text feature vector, an audio feature vector and a video feature vector.
The target object may specifically be a user of an application that needs emotion recognition, and the text (Textual), audio (Acoustic) and video (Visual) subjected to feature extraction may specifically be the text, audio and video input by that user.
In this embodiment, feature extraction models corresponding to text, audio and video respectively may be used to extract the corresponding features, thereby obtaining the text feature vector, the audio feature vector and the video feature vector.
For ease of description, in the embodiments of the present application the text feature vector is denoted x^t, the audio feature vector x^a, and the video feature vector, i.e. the image feature vector in the video, x^v.
S102. Encode the text feature vector, the audio feature vector and the video feature vector with long short-term memory networks of different weights to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector.
The long short-term memory network (LSTM, Long-Short Term Memory) is a special kind of recurrent neural network that models the information across time steps by cyclically feeding the data of different time steps into memory structures of identical form. A single memory structure is a set of operations that receives input data and produces intermediate output variables; in an LSTM these intermediate variables are called the hidden state (Hidden States) and the cell state (Cell States). The mapping vector of each modality is modeled by its own LSTM; text data is taken here as an example to explain the operation of the LSTM. Suppose the length of a piece of text is L, i.e. it contains L words. After the input mapping layer, each word is output as a mapping vector x_id^t, where id ranges from 1 to L, the superscript t indicates that the vector is the text (Text) representation, and the dimension of the mapping vector is an integer D_m, where m stands for mapping (Mapping). This text mapping vector is the input of the LSTM. A structural feature of the LSTM is that it contains three gating units, each used to control the flow of information: the input gate, the forget gate and the output gate. The output of each gating unit is a vector of the same length as its input, in which every value lies between 0 and 1: 0 means that the information at that position is blocked, 1 means that the information at that position passes through completely, and intermediate values control the information at that position to different degrees. Because the memory structures of the LSTM are completely identical (the structure includes not only the computation but also the weights of the computation matrices), two vectors are constructed to keep the form uniform: the hidden state vector h_t and the cell state vector c_t, both of dimension D_h. The input gate controls the information from the input text mapping vector x_id^t and the hidden state vector of the previous time step; the forget gate controls the flow of information from the cell state vector of the previous time step; and the output gate controls how much of the information output by the input gate and the forget gate flows into the next hidden state. Specifically, the above process is described by the following formulas:
f_id = σ(W_fx · x_id^t + W_fh · h_(id-1))
i_id = σ(W_ix · x_id^t + W_ih · h_(id-1))
o_id = σ(W_ox · x_id^t + W_oh · h_(id-1))
c̃_id = tanh(W_cx · x_id^t + W_ch · h_(id-1))
c_id = f_id * c_(id-1) + i_id * c̃_id
h_id = o_id * tanh(c_id)
where · denotes matrix-vector multiplication, * denotes element-wise multiplication, W_fx, W_ix, W_ox, W_cx are the matrices that map x_id^t, with dimension D_h × D_m, W_fh, W_ih, W_oh, W_ch are the matrices that map the previous hidden state, with dimension D_h × D_h, c̃_id is an intermediate variable of the cell state, σ is the sigmoid function σ(x) = 1 / (1 + e^(-x)), and tanh is the nonlinear mapping tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x)).
The hidden state vector h_t and the cell state vector c_t are continuously updated in this way, and the hidden state vector of each time step is generally used to represent the output feature vector of the current LSTM memory structure.
The above is the process by which an LSTM encodes the information of a single modality.
In the embodiments of the present application, to fully consider the fusion of discriminative features across modalities, a non-uniform attention mechanism is used to fuse the output feature vectors of the different modalities during the information transfer between adjacent time steps. The specific structure is shown in FIG. 3: overall, three LSTMs with different weights encode the input textual (Textual) feature vector x^t, acoustic (Acoustic) feature vector x^a and visual (Visual, image-in-video) feature vector x^v, and output the corresponding hidden state and cell state vectors: the text hidden state vector and text cell state vector, the audio hidden state vector and audio cell state vector, and the video hidden state vector and video cell state vector.
Note that the cell state vectors are not elaborated further in the embodiments of the present application; they are handled according to the standard LSTM processing described above.
S103. Splice the text hidden state vector with the audio hidden state vector and with the video hidden state vector respectively to obtain a text audio splicing vector and a text video splicing vector.
Because text features are strongly discriminative for emotion recognition, the text hidden state vector and the audio hidden state vector are spliced along the feature dimension to obtain the spliced vector, i.e. the text audio splicing vector.
Similarly, the text hidden state vector and the image hidden state vector are spliced along the feature dimension to obtain the spliced vector, i.e. the text video splicing vector.
Referring to FIG. 3 and taking id equal to 1 and 2 as an example to explain the splicing of the text hidden state vector with the audio hidden state vector and with the video hidden state vector in detail: the output text hidden state vector h_1^t and audio hidden state vector h_1^a are spliced along the feature dimension to obtain the spliced vector [h_1^t; h_1^a]; similarly, the output text hidden state vector h_1^t and image hidden state vector h_1^v are spliced along the feature dimension to obtain the spliced vector [h_1^t; h_1^v].
S104. Acquire a text audio attention weight and a text video attention weight.
In this embodiment, to distinguish the different attention weights, a text audio attention weight and a text video attention weight are acquired; that is, the text audio attention weight corresponds to the text audio splicing vector and the text video attention weight corresponds to the text video splicing vector.
Specifically, acquiring the text audio attention weight and the text video attention weight includes:
Step 1: inputting the text hidden state vector and the audio hidden state vector into the audio attention layer to obtain the output text audio attention weight;
Step 2: inputting the text hidden state vector and the video hidden state vector into the video attention layer to obtain the output text video attention weight.
For ease of description, the two steps are explained together below.
An audio attention layer can be set in advance, such as the Acoustic Attention Layer shown in FIG. 3. The main structure of this layer is a linear mapping plus a sigmoid function, specifically Linear Layer + Dropout + Sigmoid, where the Linear Layer is the linear mapping layer, Dropout prevents over-fitting of the parameters during training, and Sigmoid normalizes the output of the layer to between 0 and 1 so that it can represent the degree of attention in the attention mechanism. The input of this layer is the text hidden state vector and the audio hidden state vector, and the output is the text audio attention weight; for example, when the input is the text hidden state vector h_1^t and the audio hidden state vector h_1^a, the output is a weight vector of the same length as their spliced vector.
Correspondingly, a video attention layer (also called an image attention layer) can be set, such as the Visual Attention Layer shown in FIG. 3. Its main structure is likewise a linear mapping plus a sigmoid function, specifically Linear Layer + Dropout + Sigmoid, where the Linear Layer is the linear mapping layer, Dropout prevents over-fitting of the parameters during training, and Sigmoid normalizes the output of the layer to between 0 and 1 so that it can represent the degree of attention in the attention mechanism. The input of this layer is the text hidden state vector and the video hidden state vector, and the output is the text video attention weight; for example, when the input is the text hidden state vector h_1^t and the image hidden state vector h_1^v, the output is a weight vector of the same length as their spliced vector.
Note that the weights of the linear mapping layers in the audio attention layer and the video attention layer are not shared, i.e. the two layers are not the same.
S105. Obtain a fusion expression vector of non-uniform attention by using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight.
After the feature information of text and audio and of text and video has been spliced and the splicing vectors and attention weights have been obtained, fusion can be performed based on the non-uniform attention mechanism to finally obtain the fusion expression vector.
Specifically, one element is added to the input of the LSTM: the fusion expression vector z of the non-uniform attention mechanism. z is initialized as an all-zero vector, and the LSTM computation unit also contains learnable parameter matrices related to z.
After the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight are obtained, the fusion expression vector can be assigned, finally yielding the fusion expression vector matching the current text audio splicing vector, text video splicing vector, text audio attention weight and text video attention weight.
Specifically, obtaining the fusion expression vector of non-uniform attention by using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight includes:
Step 1: multiplying the text audio splicing vector by the text audio attention weight to obtain a text audio weighted vector;
Step 2: multiplying the text video splicing vector by the text video attention weight to obtain a text video weighted vector;
Step 3: using dimensionality reduction layers to reduce the dimensionality of the text audio weighted vector and the text video weighted vector, to obtain a text audio dimensionality reduction vector and a text video dimensionality reduction vector;
Step 4: splicing the text audio dimensionality reduction vector and the text video dimensionality reduction vector, and performing normalization after splicing, to obtain the fusion expression vector.
For ease of description, the four steps are explained together below.
That is, the text audio splicing vector is first weighted: it is multiplied by the text audio attention weight to obtain the text audio weighted vector, which is the result of assigning weights to the text audio splicing vector. The weight assignment of the text video splicing vector is performed in the same way, yielding the text video weighted vector.
For example, multiplying a spliced vector by its corresponding weight vector yields the weighted feature vector.
The dimension reduction layer (Dimension Reduction Layer) further compresses the dimension of the feature vectors containing semantic information; its structure is defined as Linear Layer + Dropout, where the Linear Layer is a linear mapping layer and Dropout prevents over-fitting of the parameters during training. The text audio weighted vector and the text video weighted vector are passed through different dimensionality reduction layers, the output vectors, i.e. the text audio dimensionality reduction vector and the text video dimensionality reduction vector, are concatenated (Concatenate), and the result is normalized by the normalized exponential function (the softmax function) to obtain the final fusion expression vector of non-uniform attention.
Optionally, to make full use of the effective information in the text representation, the text hidden state vector can also be reduced in dimensionality to obtain a text hidden state dimensionality reduction vector. Correspondingly, Step 4 then includes: splicing the text audio dimensionality reduction vector, the text video dimensionality reduction vector and the text hidden state dimensionality reduction vector, and performing normalization after splicing, to obtain the fusion expression vector. In other words, the text hidden state vector, the text audio weighted vector and the text video weighted vector are each passed through a different dimensionality reduction layer, the output vectors are spliced together, and the result is normalized by the softmax function to obtain the final fusion expression vector z_1 of non-uniform attention.
For example, as shown in FIG. 3, the text hidden state vector h_1^t and the two weighted feature vectors are each passed through a different dimensionality reduction layer, the output vectors are spliced together, and softmax normalization yields the final non-uniform attention fusion expression z_1.
S106. Splice the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector to obtain a comprehensive feature.
After the fusion expression vector and the text, audio and video hidden state vectors are obtained, they can be spliced to obtain the comprehensive feature. In this embodiment, the splicing order is not limited; it only needs to be kept consistent between training and application.
In other words, the computation is repeated for every id, finally obtaining the expression z_L corresponding to id = L together with the final hidden state vectors of the three modalities; these four feature vectors are then spliced (concatenated) together, and the splicing result is taken as the comprehensive feature.
S107. Obtain the emotion recognition result of the target object by using the comprehensive feature.
Specifically, linear mapping can be performed on the comprehensive feature to obtain the emotion recognition result of the target object.
Emotion recognition may be divided into different numbers of categories, e.g. two broad classes (positive and negative) or six classes (happiness, sadness, fear, disgust, anger and surprise). Therefore, performing linear mapping on the comprehensive feature to obtain the emotion recognition result of the target object may specifically include: performing, on the comprehensive feature, a linear mapping to a preset number of emotion recognition categories to obtain the emotion recognition result of the target object.
After the emotion recognition result of the target object is obtained by using the comprehensive feature, interaction information matching the emotion recognition result can also be output to the target object. Of course, the emotion recognition result can also be saved in order to track the emotion changes of the target object.
By applying the method provided by the embodiments of the present application, and considering that the features of different modalities contribute differently to the discriminability of the final emotion recognition task, different attention mechanisms are used to weight the information from each modality; that is, after the text, audio and video feature vectors are extracted, long short-term memory networks with different weights encode them to obtain the text, audio and video hidden state vectors. In addition, to make full use of the strong discriminability of text features in emotion recognition, a skip connection is used to splice, across levels, with the attention-weighted audio hidden state vector and video hidden state vector to obtain the fusion expression vector, and the comprehensive feature is then obtained by splicing the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector. Finally, the comprehensive feature is used to obtain the emotion recognition result of the target object. That is, fusing the feature vectors of different modalities based on a non-uniform attention mechanism can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.
To help those skilled in the art better understand the emotion recognition method provided by the embodiments of the present application, the method is described in detail below with reference to a specific implementation.
Referring to FIG. 4, the data are divided overall into training and test data. Before the implementation starts, the training data are constructed and the model is defined; the training data are then used to update the model parameters. If the model convergence condition is not met, parameter updating continues; if it is met, the test phase begins: the test data are input, the model computes the output result, and the whole process ends.
Note that the model convergence condition here is not limited to the number of training iterations reaching a set value or the training error decreasing to a stable range; a threshold on the error between the predicted value and the true value can also be set, and training is judged to stop when the model error falls below the given threshold. The definition of the model loss function can be adjusted according to the number of emotion categories in the input data: for two types (generally defined as positive and negative emotion), the mean absolute error (Mean Absolute Error) can be used as the loss function, and other measures such as the mean squared error (Mean Square Error) can also be used; for multiple types, a cross-entropy loss suitable for multi-class classification, or other improvements suitable for multi-class models, can be selected. For updating the model parameters, the RMSprop (Root Mean Square propagation) algorithm can be used, and other gradient-descent-based parameter optimization methods can also be chosen, including but not limited to Stochastic Gradient Descent (SGD), Adagrad (Adaptive Subgradient), Adam (Adaptive Moment Estimation), Adamax (a variant of Adam based on the infinity norm), ASGD (Averaged Stochastic Gradient Descent) and RMSprop.
To explain the technical solution of the present application more clearly, a neural network is constructed according to the contents of the present application and used for emotion recognition, so as to describe a specific implementation in detail. Note that the specific implementation described here is only intended to explain, not to limit, the present application.
A multimodal emotion recognition data collection is obtained that contains three data sets: CMUMOSI, CMUMOSEI and IEMOCAP; CMUMOSI is taken as the example here. Note that the same operations also apply to similar data sets, including but not limited to CMUMOSEI and IEMOCAP. The CMUMOSI data set contains 2199 self-recorded video clips, divided overall into three parts: a training set, a validation set and a test set.
For the feature data extracted from the video data, the training set may contain 1284 samples, the validation set 229 samples and the test set 686 samples. The data of the different modalities are as follows: the text is a sentence containing at most 50 words, padded with zeros if it has fewer than 50 words; the image data (i.e. the images in the video) are feature expressions of the video sequence aligned with each word, each video sequence being represented by a vector of dimension 20; likewise, the audio segment corresponding to each word is compressed into a feature expression, each audio segment being a vector of dimension 5. For the output labels, each sample corresponds to a value in the range (-3, 3), representing the most negative to the most positive emotion; in this implementation, 0 is used as the dividing line to turn emotion recognition into a two-class task (values greater than or equal to 0 are defined as positive emotion and values less than 0 as negative emotion).
The network structure is defined with reference to FIG. 2 and FIG. 3: three LSTMs with different parameters are used for the further feature expression of the three modalities, and the designed non-uniform attention mechanism module is inserted at the time steps to obtain the fused feature of the three modalities. The fused feature of the last time step and the final hidden state vectors of the individual LSTMs are spliced together, normalized by softmax, and finally passed through a linear mapping layer to obtain the output; a simplified end-to-end sketch is given below.
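As an illustration only, the following simplified PyTorch sketch assembles the pieces described above into one model. It deviates from the embodiment in one respect: the design described here injects the fusion expression vector z into the LSTM computation at every time step, whereas this sketch fuses only the final hidden states; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class EmotionRecognizerSketch(nn.Module):
    """Three LSTMs with separate weights encode the mapped text/audio/video
    sequences; the final hidden states are fused with the non-uniform attention
    module and linearly mapped to the configured number of emotion categories."""
    def __init__(self, d_t, d_a, d_v, d_h=64, d_r=32, num_classes=2, p=0.1):
        super().__init__()
        self.lstm_t = nn.LSTM(d_t, d_h, batch_first=True)
        self.lstm_a = nn.LSTM(d_a, d_h, batch_first=True)
        self.lstm_v = nn.LSTM(d_v, d_h, batch_first=True)
        self.att_a = nn.Sequential(nn.Linear(2 * d_h, 2 * d_h), nn.Dropout(p), nn.Sigmoid())
        self.att_v = nn.Sequential(nn.Linear(2 * d_h, 2 * d_h), nn.Dropout(p), nn.Sigmoid())
        self.red_ta = nn.Sequential(nn.Linear(2 * d_h, d_r), nn.Dropout(p))
        self.red_tv = nn.Sequential(nn.Linear(2 * d_h, d_r), nn.Dropout(p))
        self.red_t  = nn.Sequential(nn.Linear(d_h, d_r), nn.Dropout(p))
        self.head = nn.Linear(3 * d_r + 3 * d_h, num_classes)

    def forward(self, x_t, x_a, x_v):
        h_t = self.lstm_t(x_t)[0][:, -1]     # final hidden state of each modality
        h_a = self.lstm_a(x_a)[0][:, -1]
        h_v = self.lstm_v(x_v)[0][:, -1]
        s_ta = torch.cat([h_t, h_a], dim=-1)  # text audio / text video splicing vectors
        s_tv = torch.cat([h_t, h_v], dim=-1)
        z = torch.softmax(torch.cat([self.red_ta(s_ta * self.att_a(s_ta)),
                                     self.red_tv(s_tv * self.att_v(s_tv)),
                                     self.red_t(h_t)], dim=-1), dim=-1)
        return self.head(torch.cat([z, h_t, h_a, h_v], dim=-1))
```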
Regarding the loss function: according to the specific implementation, an appropriate loss function is selected to measure the model's predicted output against the label values in the data set during training. Because this implementation is a binary classification, the mean absolute error (Mean Absolute Error) is used as the loss function here.
Following the parameter optimization methods described above, a suitable optimization method is chosen according to the actual implementation to update the parameters of the model; the RMSprop (Root Mean Square propagation) method is used here.
During training, the parameters are first updated on the training set. After each pass over the entire training set (one Epoch), the loss on the validation set is computed and recorded. The number of training Epochs is set to 10 here, and the model with the smallest loss on the validation set is selected as the model output by the final training.
The information of the three modalities in the test data is input into the trained model for forward computation to obtain the final emotion recognition output.
It can be seen that, in the implementation of this emotion recognition method, the model is built around the non-uniform attention mechanism module. The idea of the non-uniform attention mechanism is to apply attention separately according to the inputs of the different modalities. In the concrete implementation, the strongly discriminative text feature in emotion recognition is used as the main feature to guide the fusion of the other two features; the module mainly comprises the feature splicing operation, two attention layers, two dimensionality reduction layers connected to the attention layers and a dimensionality reduction layer based on the text feature, followed by concatenation plus softmax to obtain the fused feature expression. It is worth noting that what is protected here is the framework of the non-uniform attention mechanism; the concrete attention layers and dimensionality reduction layers may be replaced by other modules with similar functions.
Configurable number of emotion recognition categories: for the emotion recognition task, this application divides emotion recognition into binary and multi-class classification according to how the data set labels are partitioned in the specific implementation, adapts different loss functions for error measurement to the different task types, and can likewise be adapted to a variety of model parameter optimization algorithms for updating the model parameters.
Scalability of the multi-angle attention mechanism: in addition to the emotion recognition task listed in the embodiments, the approach can also be applied to various other tasks involving multimodal feature fusion, such as multimodal video classification and multimodal video person recognition.
Compared with existing multimodal emotion recognition methods, the emotion recognition method proposed in this application, i.e. the multimodal emotion recognition method based on the non-uniform attention mechanism, has the following significant advantages:
(1) exploiting the fact that the features of different modalities contribute differently to the discriminability of the final recognition task, different attention mechanisms are used to weight the information from each modality;
(2) the strong discriminability of text features in emotion recognition is fully used, and a skip connection splices, across levels, with the audio fusion feature and the image fusion feature after the attention layers, compensating for the loss of text information in the computation of the attention layers;
(3) the number of emotion recognition categories is configurable: by partitioning the labels of the data set into classes, different numbers of emotion types can be recognized, and different loss functions are selected for updating the model parameters according to the configured number.
Note that the number of attention layers in this application is not limited to one: by extending the same structure with different weight parameters, the outputs of attention modules from different angles can be spliced together. Only the input dimension of the subsequent dimensionality reduction operation needs to change, and no other structure of the network needs to be modified, thereby realizing a multi-angle multi-head attention mechanism.
Corresponding to the above method embodiment, an embodiment of the present application further provides an emotion recognition apparatus; the emotion recognition apparatus described below and the emotion recognition method described above may be referred to in correspondence with each other.
As shown in FIG. 5, the apparatus includes the following modules:
a feature extraction module 101, configured to perform feature extraction on the text, audio and video corresponding to the target object to obtain a text feature vector, an audio feature vector and a video feature vector;
a feature encoding module 102, configured to encode the text feature vector, the audio feature vector and the video feature vector with long short-term memory networks of different weights, to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector;
a feature splicing module 103, configured to splice the text hidden state vector with the audio hidden state vector and with the video hidden state vector respectively, to obtain a text audio splicing vector and a text video splicing vector;
a weight determination module 104, configured to acquire a text audio attention weight and a text video attention weight;
a weight fusion module 105, configured to obtain a fusion expression vector of non-uniform attention by using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight;
a comprehensive feature acquisition module 106, configured to splice the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector to obtain a comprehensive feature;
a recognition result determination module 107, configured to obtain the emotion recognition result of the target object by using the comprehensive feature.
By applying the apparatus provided by the embodiments of the present application, and considering that the features of different modalities contribute differently to the discriminability of the final emotion recognition task, different attention mechanisms are used to weight the information from each modality; that is, after the text, audio and video feature vectors are extracted, long short-term memory networks with different weights encode them to obtain the text, audio and video hidden state vectors. In addition, to make full use of the strong discriminability of text features in emotion recognition, a skip connection is used to splice, across levels, with the attention-weighted audio hidden state vector and video hidden state vector to obtain the fusion expression vector, and the comprehensive feature is then obtained by splicing the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector. Finally, the comprehensive feature is used to obtain the emotion recognition result of the target object. That is, fusing the feature vectors of different modalities based on a non-uniform attention mechanism can effectively improve the discriminability of the information and ultimately makes the emotion recognition result more accurate.
In a specific embodiment of the present application, the weight determination module 104 is specifically configured to input the text hidden state vector and the audio hidden state vector into the audio attention layer to obtain the output text audio attention weight, and to input the text hidden state vector and the video hidden state vector into the video attention layer to obtain the output text video attention weight.
In a specific embodiment of the present application, the recognition result determination module 107 is specifically configured to perform linear mapping on the comprehensive feature to obtain the emotion recognition result of the target object.
In a specific embodiment of the present application, the recognition result determination module 107 is specifically configured to perform, on the comprehensive feature, a linear mapping to a preset number of emotion recognition categories to obtain the emotion recognition result of the target object.
In a specific embodiment of the present application, the apparatus further includes:
an emotion interaction module, configured to output, to the target object, interaction information matching the emotion recognition result after the emotion recognition result of the target object is obtained by using the comprehensive feature.
In a specific embodiment of the present application, the weight fusion module 105 is specifically configured to multiply the text audio splicing vector by the text audio attention weight to obtain a text audio weighted vector; multiply the text video splicing vector by the text video attention weight to obtain a text video weighted vector; use dimensionality reduction layers to reduce the dimensionality of the text audio weighted vector and the text video weighted vector to obtain a text audio dimensionality reduction vector and a text video dimensionality reduction vector; and splice the text audio dimensionality reduction vector and the text video dimensionality reduction vector and normalize after splicing to obtain the fusion expression vector.
In a specific embodiment of the present application, the apparatus further includes:
a text dimensionality reduction module, configured to reduce the dimensionality of the text hidden state vector to obtain a text hidden state dimensionality reduction vector;
correspondingly, the weight fusion module 105 is specifically configured to splice the text audio dimensionality reduction vector, the text video dimensionality reduction vector and the text hidden state dimensionality reduction vector, and normalize after splicing, to obtain the fusion expression vector.
Corresponding to the above method embodiment, an embodiment of the present application further provides an electronic device; the electronic device described below and the emotion recognition method described above may be referred to in correspondence with each other.
As shown in FIG. 6, the electronic device includes:
a memory 332, configured to store a computer program;
a processor 322, configured to implement the steps of the emotion recognition method of the above method embodiment when executing the computer program.
Specifically, referring to FIG. 7, which is a specific schematic structural diagram of an electronic device provided in this embodiment, the electronic device may differ considerably depending on configuration or performance and may include one or more central processing units (CPU) 322 (e.g., one or more processors) and a memory 332 storing one or more computer application programs 342 or data 344. The memory 332 may be transient or persistent storage. The programs stored in the memory 332 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the data processing device. Further, the central processing unit 322 may be configured to communicate with the memory 332 and to execute, on the electronic device 301, the series of instruction operations stored in the memory 332.
The electronic device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.
The steps of the emotion recognition method described above can be implemented by the structure of the electronic device.
Corresponding to the above method embodiment, an embodiment of the present application further provides a readable storage medium; the readable storage medium described below and the emotion recognition method described above may be referred to in correspondence with each other.
A readable storage medium stores a computer program which, when executed by a processor, implements the steps of the emotion recognition method of the above method embodiment.
The readable storage medium may specifically be any readable storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disc.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.

Claims (10)

  1. An emotion recognition method, characterized by comprising:
    performing feature extraction on text, audio and video corresponding to a target object to obtain a text feature vector, an audio feature vector and a video feature vector;
    encoding the text feature vector, the audio feature vector and the video feature vector using long short-term memory networks with different weights, to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector;
    splicing the text hidden state vector with the audio hidden state vector and with the video hidden state vector respectively, to obtain a text audio splicing vector and a text video splicing vector;
    acquiring a text audio attention weight and a text video attention weight;
    obtaining a fusion expression vector of non-uniform attention by using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight;
    splicing the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector to obtain a comprehensive feature; and
    obtaining an emotion recognition result of the target object by using the comprehensive feature.
  2. The emotion recognition method according to claim 1, characterized in that acquiring the text audio attention weight and the text video attention weight comprises:
    inputting the text hidden state vector and the audio hidden state vector into an audio attention layer to obtain the output text audio attention weight; and
    inputting the text hidden state vector and the video hidden state vector into a video attention layer to obtain the output text video attention weight.
  3. The emotion recognition method according to claim 1, characterized in that obtaining the emotion recognition result of the target object by using the comprehensive feature comprises:
    performing linear mapping on the comprehensive feature to obtain the emotion recognition result of the target object.
  4. The emotion recognition method according to claim 3, characterized in that performing linear mapping on the comprehensive feature to obtain the emotion recognition result of the target object comprises:
    performing, on the comprehensive feature, a linear mapping to a preset number of emotion recognition categories to obtain the emotion recognition result of the target object.
  5. The emotion recognition method according to claim 1, characterized by further comprising, after the emotion recognition result of the target object is obtained by using the comprehensive feature:
    outputting, to the target object, interaction information matching the emotion recognition result.
  6. The emotion recognition method according to any one of claims 1 to 5, characterized in that obtaining the fusion expression vector of non-uniform attention by using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight comprises:
    multiplying the text audio splicing vector by the text audio attention weight to obtain a text audio weighted vector;
    multiplying the text video splicing vector by the text video attention weight to obtain a text video weighted vector;
    using a dimensionality reduction layer to reduce the dimensionality of the text audio weighted vector and the text video weighted vector, to obtain a text audio dimensionality reduction vector and a text video dimensionality reduction vector; and
    splicing the text audio dimensionality reduction vector and the text video dimensionality reduction vector, and performing normalization after splicing, to obtain the fusion expression vector.
  7. The emotion recognition method according to claim 6, characterized by further comprising:
    reducing the dimensionality of the text hidden state vector to obtain a text hidden state dimensionality reduction vector;
    correspondingly, splicing the text audio dimensionality reduction vector and the text video dimensionality reduction vector and performing normalization after splicing to obtain the fusion expression vector comprises:
    splicing the text audio dimensionality reduction vector, the text video dimensionality reduction vector and the text hidden state dimensionality reduction vector, and performing normalization after splicing, to obtain the fusion expression vector.
  8. An emotion recognition apparatus, characterized by comprising:
    a feature extraction module, configured to perform feature extraction on text, audio and video corresponding to a target object to obtain a text feature vector, an audio feature vector and a video feature vector;
    a feature encoding module, configured to encode the text feature vector, the audio feature vector and the video feature vector using long short-term memory networks with different weights, to obtain a text hidden state vector, an audio hidden state vector and a video hidden state vector;
    a feature splicing module, configured to splice the text hidden state vector with the audio hidden state vector and with the video hidden state vector respectively, to obtain a text audio splicing vector and a text video splicing vector;
    a weight determination module, configured to acquire a text audio attention weight and a text video attention weight;
    a weight fusion module, configured to obtain a fusion expression vector of non-uniform attention by using the text audio splicing vector, the text video splicing vector, the text audio attention weight and the text video attention weight;
    a comprehensive feature acquisition module, configured to splice the fusion expression vector, the text hidden state vector, the audio hidden state vector and the video hidden state vector to obtain a comprehensive feature; and
    a recognition result determination module, configured to obtain an emotion recognition result of the target object by using the comprehensive feature.
  9. An electronic device, characterized by comprising:
    a memory, configured to store a computer program; and
    a processor, configured to implement the steps of the emotion recognition method according to any one of claims 1 to 7 when executing the computer program.
  10. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and the computer program, when executed by a processor, implements the steps of the emotion recognition method according to any one of claims 1 to 7.
PCT/CN2022/078284 2021-09-29 2022-02-28 一种情感识别方法、装置、设备及可读存储介质 WO2023050708A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111148250.6 2021-09-29
CN202111148250.6A CN114021524B (zh) 2021-09-29 2021-09-29 一种情感识别方法、装置、设备及可读存储介质

Publications (1)

Publication Number Publication Date
WO2023050708A1 true WO2023050708A1 (zh) 2023-04-06

Family

ID=80055300

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078284 WO2023050708A1 (zh) 2021-09-29 2022-02-28 一种情感识别方法、装置、设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN114021524B (zh)
WO (1) WO2023050708A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021524B (zh) * 2021-09-29 2024-02-27 苏州浪潮智能科技有限公司 一种情感识别方法、装置、设备及可读存储介质
CN114913590B (zh) * 2022-07-15 2022-12-27 山东海量信息技术研究院 一种数据的情感识别方法、装置、设备及可读存储介质
CN116039653B (zh) * 2023-03-31 2023-07-04 小米汽车科技有限公司 状态识别方法、装置、车辆及存储介质
CN117435917B (zh) * 2023-12-20 2024-03-08 苏州元脑智能科技有限公司 一种情感识别方法、系统、装置及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559835A (zh) * 2021-02-23 2021-03-26 中国科学院自动化研究所 多模态情感识别方法
CN112560830A (zh) * 2021-02-26 2021-03-26 中国科学院自动化研究所 多模态维度情感识别方法
US20210103762A1 (en) * 2019-10-02 2021-04-08 King Fahd University Of Petroleum And Minerals Multi-modal detection engine of sentiment and demographic characteristics for social media videos
US20210151034A1 (en) * 2019-11-14 2021-05-20 Comcast Cable Communications, Llc Methods and systems for multimodal content analytics
CN113095357A (zh) * 2021-03-04 2021-07-09 山东大学 基于注意力机制与gmn的多模态情感识别方法及系统
CN114021524A (zh) * 2021-09-29 2022-02-08 苏州浪潮智能科技有限公司 一种情感识别方法、装置、设备及可读存储介质

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255755B (zh) * 2021-05-18 2022-08-23 北京理工大学 一种基于异质融合网络的多模态情感分类方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210103762A1 (en) * 2019-10-02 2021-04-08 King Fahd University Of Petroleum And Minerals Multi-modal detection engine of sentiment and demographic characteristics for social media videos
US20210151034A1 (en) * 2019-11-14 2021-05-20 Comcast Cable Communications, Llc Methods and systems for multimodal content analytics
CN112559835A (zh) * 2021-02-23 2021-03-26 中国科学院自动化研究所 多模态情感识别方法
CN112560830A (zh) * 2021-02-26 2021-03-26 中国科学院自动化研究所 多模态维度情感识别方法
CN113095357A (zh) * 2021-03-04 2021-07-09 山东大学 基于注意力机制与gmn的多模态情感识别方法及系统
CN114021524A (zh) * 2021-09-29 2022-02-08 苏州浪潮智能科技有限公司 一种情感识别方法、装置、设备及可读存储介质

Also Published As

Publication number Publication date
CN114021524A (zh) 2022-02-08
CN114021524B (zh) 2024-02-27

Similar Documents

Publication Publication Date Title
CN107979764B (zh) 基于语义分割和多层注意力框架的视频字幕生成方法
WO2023050708A1 (zh) 一种情感识别方法、装置、设备及可读存储介质
JP7193252B2 (ja) 画像の領域のキャプション付加
CN111897933B (zh) 情感对话生成方法、装置及情感对话模型训练方法、装置
CN111368993B (zh) 一种数据处理方法及相关设备
CN114694076A (zh) 基于多任务学习与层叠跨模态融合的多模态情感分析方法
WO2021037113A1 (zh) 一种图像描述的方法及装置、计算设备和存储介质
CN111951805A (zh) 一种文本数据处理方法及装置
CN115329779B (zh) 一种多人对话情感识别方法
Cai et al. Multi-modal emotion recognition from speech and facial expression based on deep learning
CN114511906A (zh) 基于跨模态动态卷积的视频多模态情感识别方法、装置及计算机设备
CN116720004B (zh) 推荐理由生成方法、装置、设备及存储介质
CN113421547B (zh) 一种语音处理方法及相关设备
Zhang et al. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
CN114140885A (zh) 一种情感分析模型的生成方法、装置、电子设备以及存储介质
CN114443899A (zh) 视频分类方法、装置、设备及介质
Lin et al. PS-mixer: A polar-vector and strength-vector mixer model for multimodal sentiment analysis
CN114882862A (zh) 一种语音处理方法及相关设备
CN116975776A (zh) 一种基于张量和互信息的多模态数据融合方法和设备
CN115964638A (zh) 多模态社交数据情感分类方法、系统、终端、设备及应用
Gao A two-channel attention mechanism-based MobileNetV2 and bidirectional long short memory network for multi-modal dimension dance emotion recognition
Dai et al. Weakly-supervised multi-task learning for multimodal affect recognition
CN112541541B (zh) 基于多元素分层深度融合的轻量级多模态情感分析方法
CN116933051A (zh) 一种用于模态缺失场景的多模态情感识别方法及系统
KR102504722B1 (ko) 감정 표현 영상 생성을 위한 학습 장치 및 방법과 감정 표현 영상 생성 장치 및 방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874111

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE