CN114140885A - Emotion analysis model generation method and device, electronic equipment and storage medium - Google Patents

Info

Publication number: CN114140885A
Authority: CN (China)
Prior art keywords: emotion, feature vector, fusion, analysis model, feature
Legal status: Pending
Application number: CN202111450929.0A
Other languages: Chinese (zh)
Inventors: 邱锋, 谢程阳, 丁彧, 吕唐杰, 范长杰, 胡志鹏
Current assignee: Netease Hangzhou Network Co Ltd
Original assignee: Netease Hangzhou Network Co Ltd
Application filed by Netease Hangzhou Network Co Ltd
Priority to: CN202111450929.0A
Publication of: CN114140885A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method and an apparatus for generating an emotion analysis model, an electronic device and a computer storage medium. The method for generating the emotion analysis model comprises the following steps: acquiring multi-modal feature vectors of a sample object labeled with an emotional state; obtaining a shared feature vector among the multi-modal feature vectors and a private feature vector corresponding to each modality; fusing the modal feature vectors corresponding to the respective modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the sample object; fusing the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the sample object; and splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and training a preset initial emotion analysis model with the splicing result and the emotional state labeled on the sample object as training samples, thereby obtaining an emotion analysis model for determining the emotion of a user.

Description

Emotion analysis model generation method and device, electronic equipment and storage medium
Technical Field
The application relates to the field of artificial intelligence, and in particular to a method and an apparatus for generating an emotion analysis model, an electronic device and a storage medium.
Background
With the development of artificial intelligence, intelligent interaction plays an increasingly important role in more and more fields.
Humans excel at interpreting the emotional state of an interlocutor from various modal signals, including the speaker's tone of voice, spoken text, facial expressions and so on. Endowing machines with this ability to understand emotion has long been a research goal of those skilled in the art. Such technology can be widely applied to scenarios such as interactive games, interactive movies, virtual tour guides, virtual assistants and artificial-intelligence customer service.
At present, how to decode human emotion from a complex human-computer interaction process remains an important problem for those skilled in the art. To address it, the prior art generally learns directly from a single-modal or multi-modal signal of a sample person to perform emotion analysis, thereby obtaining an emotion recognition model for recognizing the emotion of a target person. However, such schemes cannot analyze a person's emotional characteristics sufficiently and effectively.
Therefore, how to analyze multi-modal information sufficiently and effectively has become a technical problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
The embodiments of the application provide a method and an apparatus for generating an emotion analysis model, an electronic device and a computer storage medium, so as to solve the technical problem in the prior art that a person's emotional characteristics cannot be analyzed sufficiently and effectively. The application also provides an emotion analysis method and a corresponding apparatus, electronic device and computer storage medium.
The method for generating an emotion analysis model provided by the embodiments of the application comprises the following steps:
acquiring multi-modal feature vectors of a sample object labeled with an emotional state;
obtaining a shared feature vector among the multi-modal feature vectors and a private feature vector corresponding to each modality;
performing fusion processing on the modal feature vectors corresponding to the respective modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the sample object;
performing fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the sample object;
and splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and training a preset initial emotion analysis model with the splicing result and the emotional state labeled on the sample object as training samples, to obtain an emotion analysis model for determining the emotion of a user.
Optionally, the obtaining a shared feature vector among the multi-modal feature vectors and a private feature vector corresponding to each modality includes:
obtaining the shared feature vector according to the mean value of the multi-modal feature vectors;
and decoupling each modal feature vector in the multi-modal feature vectors according to the shared feature vector to obtain the private feature vector corresponding to each modality.
Optionally, the obtaining the multi-modal feature vector of the sample object marked with the emotional state includes:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
Optionally, the performing fusion processing on the modal feature vectors corresponding to the respective modalities in the multi-modal feature vectors to obtain the first emotion fusion feature vector of the sample object includes:
performing fusion processing on the modal feature vectors of the respective modalities in the multi-modal feature vectors by means of an outer product to obtain the first emotion fusion feature vector of the sample object.
Optionally, the fusing the private feature vectors of the modalities to obtain a second emotion fusion feature vector of the sample object includes:
performing fusion processing on the private feature vectors of the respective modalities by means of an outer product to obtain the second emotion fusion feature vector of the sample object.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample to obtain an emotion analysis model for determining user emotion, including:
splicing the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector to obtain a first sample feature matrix;
and taking the first sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model to obtain the emotion analysis model.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample to obtain an emotion analysis model for determining user emotion, including:
splicing the first emotion fusion feature vector and the second emotion fusion feature vector to obtain a second sample feature matrix;
and taking the second sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample to obtain an emotion analysis model for determining user emotion, including:
splicing the first emotion fusion feature vector and the shared feature vector to obtain a third sample feature matrix;
and taking the third sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample to obtain an emotion analysis model for determining user emotion, including:
splicing the second emotion fusion feature vector and the shared feature vector to obtain a fourth sample feature matrix;
and taking the fourth sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample, include:
splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector to obtain a splicing result containing a sample feature matrix;
inputting the sample feature matrix into the initial emotion analysis model, and obtaining a predicted value aiming at the emotional state of the sample object through the initial emotion analysis model;
and performing classification task training and regression prediction task training on the initial emotion analysis model according to the emotion state predicted value and the emotion state labeled by the sample object, and taking the trained initial emotion analysis model as the emotion analysis model.
Optionally, the method further includes:
determining the probability of each matrix element in the sample characteristic matrix;
according to the probability of each matrix element, giving a weight to each matrix element;
adjusting the sample characteristic matrix based on the weight of each matrix element to obtain a sample characteristic matrix after weight adjustment;
and taking the sample feature matrix after the weight adjustment and the emotional state labeled by the sample object as the training sample.
Optionally, the multi-modality includes at least two of audio, text, and images.
The application also provides a method for generating the emotion analysis model, which comprises the following steps:
acquiring a multi-modal feature vector of a sample object marked with emotional state;
obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
performing fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object;
and taking one of the second emotion fusion feature and the shared feature, together with the emotional state labeled on the sample object, as a training sample to train a preset initial emotion analysis model, so as to obtain an emotion analysis model for determining the emotion of a user.
The application further provides an apparatus for generating an emotion analysis model, comprising:
a first obtaining module, configured to obtain multi-modal feature vectors of a sample object labeled with an emotional state;
The second acquisition module is used for acquiring shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
the first fusion module is used for performing fusion processing on modal feature vectors corresponding to all the modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the sample object;
the second fusion module is used for carrying out fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object;
and the first training module is used for splicing at least two of the first emotion fusion characteristic vector, the second emotion fusion characteristic vector and the shared characteristic vector, training a preset initial emotion analysis model by taking a splicing result and the emotion state marked by the sample object as training samples, and obtaining an emotion analysis model for determining the emotion of the user.
The application further provides another apparatus for generating an emotion analysis model, comprising:
the third acquisition module is used for acquiring the multi-modal feature vector of the sample object marked with the emotional state;
the fourth acquisition module is used for acquiring shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
the third fusion module is used for carrying out fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object;
and the second training module is used for taking one of the second emotion fusion characteristic and the shared characteristic and the emotion state marked by the sample object as a training sample to train a preset initial emotion analysis model so as to obtain an emotion analysis model for determining the emotion of the user.
The application also provides an emotion analysis method, which comprises the following steps:
acquiring a multi-modal feature vector of a target object;
obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
performing fusion processing on modal feature vectors corresponding to the modes in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the target object;
performing fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the target object;
splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and inputting splicing results into an emotion analysis model for determining user emotion to obtain emotion analysis results of the target user;
wherein, the emotion analysis model is obtained by any one of the generation methods of the emotion analysis models.
The application further provides an emotion analysis method, comprising:
acquiring a multi-modal feature vector of a target object;
obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
performing fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the target object;
inputting one of the second emotion fusion feature or the shared feature into an emotion analysis model for determining user emotion to obtain an emotion analysis result of the target user;
wherein the emotion analysis model is obtained according to any one of the above methods for generating an emotion analysis model.
The application further provides an emotion analysis apparatus, comprising:
the fifth acquisition module is used for acquiring the multi-modal feature vector of the target object;
a sixth obtaining module, configured to obtain shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to the modalities;
the fourth fusion module is used for performing fusion processing on the modal feature vectors corresponding to the modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the target object;
a fifth fusion module, configured to perform fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the target object;
the first analysis module is used for splicing at least two of the first emotion fusion characteristic vector, the second emotion fusion characteristic vector and the shared characteristic vector, inputting splicing results into an emotion analysis model for determining user emotion, and obtaining emotion analysis results of the target user;
wherein the emotion analysis model is obtained according to any one of the above methods for generating an emotion analysis model.
The application further provides another emotion analysis apparatus, comprising:
the seventh acquisition module is used for acquiring the multi-modal feature vector of the target object;
an eighth obtaining module, configured to obtain shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality;
a sixth fusion module, configured to perform fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the target object;
the second analysis module is used for inputting one of the second emotion fusion feature or the shared feature into an emotion analysis model used for determining user emotion to obtain an emotion analysis result of the target user;
wherein the emotion analysis model is obtained according to any one of the above methods for generating an emotion analysis model.
The application further provides an electronic device, comprising:
a processor;
a memory for storing a program of the method, wherein the program, when read and run by the processor, performs any one of the methods described above.
The present application also provides a computer storage medium storing a computer program that, when executed, performs any one of the methods described above.
Compared with the prior art, the method has the following advantages:
the generating method of the emotion analysis model comprises the following steps: acquiring a multi-modal feature vector of a sample object marked with emotional state; obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities; performing fusion processing on modal feature vectors corresponding to the modes in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the sample object; performing fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object; and splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and training a preset initial emotion analysis model by taking a splicing result and the emotion state marked by the sample object as training samples to obtain an emotion analysis model for determining the emotion of the user.
According to the emotion analysis model generation method, the collected multi-modal feature vectors are decoupled and fused in different modes, all multi-modal features obtained in different modes are fused, and further the modal features obtained in different modes are spliced to obtain a training sample so as to train a neural network to obtain the emotion analysis model capable of analyzing emotion. The method fully considers the connection and difference among the modal feature vectors of the sample object, enhances the emotion characterization capability of the training sample, and can fully and effectively analyze the multi-modal information by the emotion feature analysis model obtained by the method.
Drawings
FIG. 1 is a flowchart of a method for generating an emotion analysis model according to an embodiment of the present application;
FIG. 2 is a logic diagram of emotion analysis model training provided in accordance with another embodiment of the present application;
FIG. 3 is a flowchart of a method for generating an emotion analysis model according to another embodiment of the present application;
FIG. 4 is a flowchart of an emotion analysis method according to another embodiment of the present application;
FIG. 5 is a flowchart of an emotion analysis method according to another embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for generating an emotion analysis model according to another embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for generating an emotion analysis model according to another embodiment of the present application;
FIG. 8 is a schematic structural diagram of an emotion analyzing apparatus according to another embodiment of the present application;
FIG. 9 is a schematic structural diagram of an emotion analyzing apparatus according to another embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The application provides a method and an apparatus for generating an emotion analysis model, an electronic device and a computer storage medium, and also provides an emotion analysis method and apparatus together with a corresponding electronic device and computer storage medium. These are described one by one in the following embodiments.
The method for generating an emotion analysis model provided by the application can be summarized as follows: after multi-modal emotion feature data of sample objects labeled with emotional states are analyzed to obtain the multi-modal feature vectors of the sample objects, the multi-modal feature data are fused from different angles and integrated into multi-modal fusion feature vectors with strong robustness and strong emotion expression capability, and the multi-modal fusion feature vectors together with the labeled emotional states are used as training samples to train a preset initial emotion analysis model, thereby obtaining the emotion analysis model.
A first embodiment of the present application provides a method for generating an emotion analysis model, please refer to fig. 1, which is a flowchart of a method for generating an emotion analysis model according to an embodiment of the present application, the method includes: step S101 to step S105.
Step S101, obtaining a multi-modal feature vector of a sample object marked with emotional state.
In the first embodiment of the present application, the sample object labeled with an emotional state may be understood as data about a person obtained from the internet or from a database, from which the person's emotional state can be obtained intuitively. For example, the sample object can be a piece of conversation video of a person obtained from the internet, where intuitive information such as the dialogue produced in the video and the person's facial expression can directly determine the person's emotional state. As another example, the sample object can be a piece of voice conversation data between service personnel and a user collected from a manual customer-service database, where the user's emotional state during the conversation is directly reflected in the user's language text and tone of voice.
In an embodiment of the application, the multiple modalities comprise at least two of audio, text and image. A modality refers to a way in which the senses (such as vision, hearing and the like) interact with external environments (such as human beings, machines and animals). For example, assuming that the sample object is a piece of conversation video of a person obtained from the internet, several kinds of modal information, such as images, gestures, voice, expressions and dialogue text of the person, can be extracted from the video. Obtaining the multi-modal feature vectors through a convolutional neural network and the multi-modal information specifically comprises the following steps S101-1 to S101-2:
step S101-1, multi-modal information of the sample object is obtained, wherein the multi-modal information comprises at least two of audio information, text information and image information.
The multi-modal information is a set of modal information of the sample object. In a first optional embodiment of the present application, the multi-modal information includes at least two of audio information, text information and image information of the sample object. For example, assuming that audio, language text and images of a person can be extracted from an interactive video, the multi-modal information is the corresponding audio information, text information and facial-expression information of the person. It should be understood that facial-expression information is only one kind of image of the person; in other embodiments, the image information may be motion information of the person, for example gestures, posture information and the like.
S101-2, acquiring a modal feature vector corresponding to the multi-modal information through a convolutional neural network based on the multi-modal information; the convolutional neural network is a model which is established by utilizing a modal feature recognition technology in machine learning and is used for determining modal feature vectors corresponding to modal information, and the modal feature recognition technology in machine learning is a process of processing different forms of modal information when the model completes analysis and recognition tasks.
In a specific application process, the convolutional neural network is obtained by training with Machine Learning (ML). Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines; it specializes in studying how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures, thereby continuously improving the computer's performance. Machine learning generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning. Machine learning belongs to a branch of Artificial Intelligence (AI) technology.
In specific applications, modal features corresponding to different modalities may be extracted through a convolutional neural network, for example: the audio information sent by the sample object can be identified through a Speech recognition model (Speech-Bert), and a corresponding audio feature vector is obtained; determining a voice Text in the sample object interactive audio through a Text recognition model (Text-Bert), and obtaining a corresponding Text feature vector; the facial expression of the sample object can also be determined by the expression recognition model, and the corresponding image feature vector is obtained. The present application is not limited thereto.
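For illustration only, the following Python sketch shows how per-modality feature vectors might be produced in this step. The encoder class is a hypothetical stand-in for the Speech-Bert, Text-Bert and expression-recognition models mentioned above, not their real interfaces.

```python
# Illustrative sketch only: DummyEncoder is a hypothetical stand-in for the
# pretrained modality encoders (Speech-Bert, Text-Bert, expression recognition);
# it does not reproduce any real model API.
import numpy as np

class DummyEncoder:
    """Maps a raw modality input to a fixed-size feature vector."""
    def __init__(self, dim: int):
        self.dim = dim

    def __call__(self, raw_input) -> np.ndarray:
        # A real encoder would run a convolutional / transformer network here;
        # a deterministic pseudo-random vector stands in for its output.
        rng = np.random.default_rng(abs(hash(str(raw_input))) % (2 ** 32))
        return rng.standard_normal(self.dim).astype(np.float32)

# Small feature dimension chosen purely to keep the example lightweight.
audio_encoder = DummyEncoder(dim=16)   # plays the role of the audio model
text_encoder = DummyEncoder(dim=16)    # plays the role of the text model
image_encoder = DummyEncoder(dim=16)   # plays the role of the expression model

def extract_multimodal_features(audio, text, image):
    """Return the per-modality feature vectors h_a, h_t, h_v for one sample."""
    return audio_encoder(audio), text_encoder(text), image_encoder(image)
```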
Step S102, obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each mode.
The purpose of step S102 is to extract features common to different modalities using characteristics common to the modalities, and to extract features unique to each modality based on the extracted common features.
Specifically, the step S102 includes the following steps S102-1 to S102-2.
Step S102-1, the modal feature vectors corresponding to the respective modalities in the multi-modal feature vectors are processed by taking the vector mean, and the shared feature vector is determined.
Specifically, in the first embodiment of the present application, the multi-modal feature vectors are assumed to include an audio feature vector h_a, a text feature vector h_t and an image feature vector h_v. Step S102-1 is then implemented by the following formula (1):
S = (h_a + h_t + h_v) / 3    (1)
where S is the shared feature vector of the multi-modal feature vectors.
And S102-2, decoupling the feature vectors of each mode in the multi-mode feature vectors based on the shared feature vectors, and determining the private feature vectors of each mode.
The step S102-2 is to decouple the shared feature vector from the feature vectors of each modality in a parameter-free decoupling manner based on the extracted shared feature vector, and retain unique features of each modality as private features.
Specifically, the step S102-2 can be implemented by the following formula (2):
i_m = h_m - S, m ∈ {t, v, a}    (2)
where i_m is the private feature vector corresponding to modality m: i_a is the audio private feature vector, i_t is the text private feature vector, and i_v is the image private feature vector.
In the first embodiment provided by the application, the private feature vectors and the shared feature vector are obtained so that they can serve as part of the samples for training the emotion analysis model, enhancing the samples' ability to express both the differences and the commonalities among the multi-modal feature information.
Further, in order to reduce the amount of calculation for the feature vectors and reduce the similarity between the private features, in a preferred embodiment of the present application, a Pooling layer (Pooling) may be introduced to perform dimension reduction on the private feature vectors of each modality.
The pooling layer is one of the common components of current convolutional neural networks; it reduces the amount of computation by sampling data region by region, downsampling a large matrix into a small matrix, while also preventing overfitting. Pooling layers generally come in two kinds: a maximum pooling layer, which takes the maximum value of each small region as the pooling result, and an average pooling layer, which takes the average value as the pooling result. Which pooling layer is used to process the private feature vectors and the shared feature vector can be chosen according to actual needs; the present application is not limited in this respect.
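As a minimal sketch of steps S102-1 and S102-2 together with the optional pooling, the following Python code assumes equal-dimension modality vectors h_a, h_t, h_v; the helper names are illustrative, not part of the patent.

```python
import numpy as np

def decouple_shared_private(h_a, h_t, h_v):
    """Formula (1): shared feature S = mean of the modality features.
    Formula (2): private feature i_m = h_m - S for each modality m."""
    shared = (h_a + h_t + h_v) / 3.0
    private = {"a": h_a - shared, "t": h_t - shared, "v": h_v - shared}
    return shared, private

def avg_pool_1d(vec, window=2):
    """Average pooling used here to reduce the dimensionality of a feature vector."""
    trimmed = vec[: (len(vec) // window) * window]
    return trimmed.reshape(-1, window).mean(axis=1)
```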
Step S103, carrying out fusion processing on the modal feature vectors corresponding to the modes in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the sample object.
The purpose of step S103 is to perform complete multi-modal feature integration based on each single-modal feature vector, thereby avoiding the problem of modal bias.
In an optional embodiment of the present application, the feature vectors of the respective modalities in the multi-modal feature vector may be subjected to a fusion process by means of an outer product to obtain the first emotion fusion feature of the sample object. The high-order relation among the modal characteristics can be established through the calculation mode of the outer product, and meanwhile, the problem of modal bias can be avoided.
Specifically, still assuming that the modal feature vectors include an audio feature vector h_a, a text feature vector h_t and an image feature vector h_v, the first emotion fusion feature is M = h_a × h_t × h_v, where M represents the first emotion fusion feature.
Further, similarly to the step S102, in order to reduce the similarity between the first emotion fused feature vector and other feature vectors, a pooling layer may also be introduced to perform dimensionality reduction on the first emotion fused feature vector.
And step S104, carrying out fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object.
Similar to the step S103, the step S104 also fuses the private feature vectors of each modality in an outer product manner to establish a high-order relationship between the private modality features, so as to avoid the modality bias problem.
Specifically, with i_a denoting the audio private feature vector, i_t the text private feature vector and i_v the image private feature vector, the second emotion fusion feature is I = i_a × i_t × i_v, where I represents the second emotion fusion feature.
Further, similarly to the above steps S102 and S103, in order to reduce the similarity between the second emotion fused feature vector and other feature vectors, a pooling layer may also be introduced to perform dimensionality reduction on the second emotion fused feature vector.
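A sketch of the outer-product fusion used in steps S103 and S104 is given below; it simply chains numpy outer products and flattens the result, which is one straightforward reading of M = h_a × h_t × h_v and I = i_a × i_t × i_v, not the only possible one.

```python
import numpy as np

def outer_product_fusion(vectors):
    """Fuse a list of feature vectors with a chained outer product and flatten.
    Applied to (h_a, h_t, h_v) it yields the first fusion vector M;
    applied to (i_a, i_t, i_v) it yields the second fusion vector I."""
    fused = vectors[0]
    for v in vectors[1:]:
        fused = np.multiply.outer(fused, v)
    return fused.reshape(-1)
```

Because the fused dimension grows as the product of the per-modality dimensions, the pooling step described above keeps the fused vectors at a manageable size.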
And step S105, splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, taking a splicing result and the emotion state marked by the sample object as training samples to train a preset initial emotion analysis model, and obtaining an emotion analysis model for determining the emotion of the user.
In an optional implementation of the present application, in order to improve the robustness of the training samples and maximize their ability to represent the emotional state, the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector may all be spliced together to obtain the training samples.
Specifically, the step S105 includes the following steps S105-1 to S105-2.
And S105-1, splicing the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector to obtain a first sample feature matrix.
The purpose of the above step S105-1 is to integrate three multi-modal features obtained from different views into one integrated multi-modal interaction feature, i.e. the first sample feature matrix, by fusing them in a splicing manner.
For example, if the shared feature vector is S = [s1, s2, s3], the first emotion fusion feature vector is M = [m1, m2, m3] and the second emotion fusion feature vector is I = [i1, i2, i3], the first sample feature matrix is a 3 × 3 matrix composed of the three feature vectors, or a 1 × 9 matrix composed of the three feature vectors.
It can be understood that the above method for obtaining the sample feature matrix by fusing three feature vectors is only an optional implementation manner given in the embodiment of the present application, and other different implementation manners may also be adopted to obtain the sample feature matrix.
Optionally, the first emotion fusion feature vector and the second emotion fusion feature vector may be spliced to obtain a second sample feature matrix, or the first emotion fusion feature vector and the shared feature vector may be spliced to obtain a third sample feature matrix, or the second emotion fusion feature vector and the shared feature vector may be spliced to obtain a fourth sample feature matrix. The above ways of splicing the sample feature matrices are only simple modifications of step S105-1 in the first embodiment of the present application, and the present application is not limited thereto.
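The splicing variants of step S105-1 and its modifications can be sketched as a single concatenation helper; the variant names below are illustrative labels for the first to fourth sample feature matrices.

```python
import numpy as np

def build_sample_features(M, I, S, variant="all"):
    """Concatenate fusion/shared vectors into one training feature vector."""
    parts = {
        "all": (M, I, S),  # first sample feature matrix: M, I and S
        "M_I": (M, I),     # second sample feature matrix
        "M_S": (M, S),     # third sample feature matrix
        "I_S": (I, S),     # fourth sample feature matrix
    }[variant]
    return np.concatenate(parts)
```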
Further, in an optional implementation manner of the present application, importance quantization processing may be performed on each element in the sample feature matrix according to an information theory, so as to further mine the ability of each feature to characterize an emotional state.
Specifically, a Norm-gate module is introduced to quantify the importance of each element in the sample matrix; the Norm-gate module allows every element of the sample feature matrix to be adjusted adaptively. The Norm-gate module has no additional learnable parameters, so using it avoids the overfitting of the training samples that adaptive operations could otherwise cause during model training. The Norm-gate module is based on the assumption that all elements in a data set follow a normal distribution, so the farther an element deviates from the mean of that distribution, the smaller its probability of occurrence. On this basis, it is known from information theory in the field of machine learning that, if a training sample set is provided to a convolutional neural network for learning, the network learns less from the training samples that occur with smaller probability in the data set; in other words, the rarely occurring training samples are treated as less important than the others. For a convolutional neural network, however, every training sample in the set should be treated without discrimination in order to improve learning efficiency and learning accuracy.
Furthermore, the sample feature matrix serving as the training sample is formed by fusing feature vectors of multiple modalities in different forms, so the emotional state of the sample object is naturally embodied in each feature element. It can be understood that when people express emotion there may be subtle, barely perceptible ways in which the emotion is revealed; during communication, a slight action of a person may reflect that person's current emotional state. These subtle actions are all reflected in the elements of the sample feature matrix, yet only the number of occurrences of an element determines its importance, and this kind of importance should not be a deciding factor in the training of the initial emotion analysis model.
Therefore, the first embodiment of the present application adopts the Norm-gate module to determine the probability of occurrence of each element in the sample feature matrix and assigns a higher weight to elements with a lower probability of occurrence, so as to adaptively adjust the importance of the elements and enhance their ability to characterize the emotional state.
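The patent does not spell out the exact weighting rule of the Norm-gate module, so the sketch below is one plausible parameter-free realization: fit a normal distribution to the matrix elements and up-weight the low-probability ones by their inverse density.

```python
import numpy as np

def norm_gate(features, eps=1e-6):
    """Parameter-free gating (assumed form): elements that are unlikely under a
    normal distribution fitted to the data receive larger weights."""
    mu, sigma = features.mean(), features.std() + eps
    pdf = np.exp(-0.5 * ((features - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    weights = 1.0 / (pdf + eps)
    weights /= weights.max()        # keep weights in (0, 1]
    return features * weights
```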
Step S105-2, training the initial emotion analysis model by taking the first sample feature matrix and the emotion state labeled by the sample object as training samples to obtain the emotion analysis model.
The initial emotion analysis model is an initial machine learning model, such as an initial convolutional neural network model. In a specific application process, training samples corresponding to a plurality of sample objects are obtained through the sample objects to train the initial machine learning model, so that internal parameters of the initial machine learning model are adjusted, an emotion fusion feature vector capable of being input according to a target user is obtained, and the current emotion state of the target user is output.
Specifically, the step S105-2 trains the initial emotion analysis model in the following manner.
Firstly, splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector to obtain a splicing result containing a sample feature matrix.
Secondly, inputting the sample characteristic matrix into the initial emotion analysis model, and obtaining a predicted value aiming at the emotional state of the sample object through the initial emotion analysis model.
And finally, performing classification task training and regression prediction task training on the initial emotion analysis model according to the emotion state predicted value and the emotion state labeled by the sample object, and taking the trained initial emotion analysis model as the emotion analysis model.
In other words, in the process of model training, the difference between the emotional state of the sample object predicted by the model and the emotional state labeled by the sample object is continuously detected, and the internal parameters of the model are continuously adjusted through a classification training task and a regression task.
In the field of machine learning, the classification task refers to an approximation task of a mapping function of input variables to discrete output variables, where the mapping function is used to predict a class for a given observation. In the first embodiment of the present application, the input variable refers to a sample feature matrix of the sample object, the output variable refers to an emotional state of the sample object predicted by a model according to the sample feature matrix, and a category given by the mapping function is a category of a preset emotional state, for example: sad, happy, angry, etc. emotion categories.
In the process of model training, along with the expansion of training samples and the continuous learning of the models, the precision of the trained models is gradually improved until the error rate of the trained emotion recognition models for emotion classification is smaller than a preset threshold.
For example, for a simple calculation method, assuming that the predetermined threshold is 2%, if 1 of the emotional states of the sample objects output by the trained emotion recognition model for the sample feature matrices of 100 sample objects is wrong, the error rate of the classification model is 1%, and it is determined that the classification accuracy of the prediction model satisfies the predetermined threshold.
In an alternative embodiment of the present application, a cross-entropy loss function L_CE is employed to determine whether the classification task of the training model is completed, the error rate of the emotion recognition model being calculated through the cross-entropy loss function. Specifically, the cross-entropy loss function is embodied by the following formula (3):
L_CE = -(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i)    (3)
where y_i is the true value of the emotional state labeled on the i-th sample object, ŷ_i is the predicted value of the emotional state of the i-th sample object output by the emotion recognition model, and N is the number of samples used to train the emotion recognition model.
In the field of machine learning, the regression prediction task is to find, on the basis of the correlation principle of prediction, the various factors that influence the prediction target, and then to derive by mathematical methods an approximate expression of the functional relationship between these factors and the prediction target. The parameters of the model are estimated using the training samples, and the model is tested for error. Once the model is determined, it can be used to predict how the target value changes with the factors.
In an alternative embodiment of the present application, a mean-square-error loss function L_MSE is used to perform error detection on the model. Specifically, the mean-square-error loss function is represented by the following formula (4):
L_MSE = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)²    (4)
where, likewise, y_i is the true value of the emotional state labeled on the i-th sample object, ŷ_i is the predicted value of the emotional state of the i-th sample object output by the emotion recognition model, and N is the number of samples used to train the emotion recognition model.
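To illustrate the joint classification and regression training described above, the following PyTorch sketch uses a toy two-headed model; the architecture, head sizes and equal weighting of the two loss terms are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionHead(nn.Module):
    """Toy initial emotion analysis model: shared trunk, a classification head for
    the emotion category and a regression head for an emotion score."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.cls_head = nn.Linear(64, num_classes)
        self.reg_head = nn.Linear(64, 1)

    def forward(self, x):
        h = self.trunk(x)
        return self.cls_head(h), self.reg_head(h).squeeze(-1)

def train_step(model, optimizer, x, y_class, y_score):
    """One optimisation step combining the classification loss of formula (3)
    and the regression loss of formula (4), weighted equally (an assumption)."""
    optimizer.zero_grad()
    logits, score = model(x)
    loss = F.cross_entropy(logits, y_class) + F.mse_loss(score, y_score)
    loss.backward()
    optimizer.step()
    return loss.item()
```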
The emotion analysis model for determining the emotion of the user can be obtained by adopting the methods of the steps S101 to S105.
Therefore, according to the method for generating an emotion analysis model provided by the first embodiment, the collected multi-modal feature vectors are decoupled and fused in different ways, the multi-modal features obtained in these different ways are fused with one another, and the resulting modal features are further spliced to obtain training samples with which the neural network is trained to obtain an emotion analysis model capable of analyzing emotion. The method fully considers the connections and differences among the modal feature vectors of the sample object and enhances the emotion characterization capability of the training samples, so the emotion analysis model obtained in this way can analyze multi-modal information sufficiently and effectively.
In addition, during testing of the trained emotion analysis model, if the modal feature extraction is excluded, the number of network parameters generated when the model is used is only about 0.3M, and the amount of computation is about 2 MFLOPs. Therefore, the emotion analysis model obtained with the method of the first embodiment can save a large amount of computation cost and machine cost while still analyzing emotion accurately; analyzing a person's emotional state with this model shortens the distance between human and machine and gives the user a better experience.
Further, for further understanding of the method for generating an emotion analysis model provided in the first embodiment of the present application, a detailed description is provided below with reference to fig. 2, where fig. 2 is a logic diagram of emotion analysis model training provided in another embodiment of the present application.
The process of obtaining an emotion analysis model for determining the emotion of a user by acquiring a voice communication video of a person as a sample, extracting a multi-modal feature set from the sample, and training an initial emotion analysis model based on the multi-modal feature set is described in fig. 2.
Fig. 2 includes: a modality extraction module 201, a modality processing module 202, a sample acquisition module 203, a sample processing module 204, and a training module 205.
As can be seen from the person's speech text (e.g., "That's done of crazy" in FIG. 2) and from the expression information in the person's image in the modality extraction module 201, the person's emotional state should be happy. On this basis, after the modality extraction module 201 obtains the person's voice communication video, the video is first cut into consecutive video clips, modality analysis is performed on each clip to obtain the facial expression image information, audio information and speech text information of the person in the video, and a modal feature extraction model is used to obtain the feature vector corresponding to each modality.
Specifically, the modal feature extractors shown in FIG. 2 can be understood as convolutional neural networks that obtain the feature vector of each modality from the multi-modal information set, where a denotes audio information, t denotes speech text information, v denotes facial expression image information, and l denotes the serial number of the video clip processed by the convolutional neural network. According to the input information, the modal feature extraction model outputs the audio feature vector h_a, the speech text feature vector h_t and the facial expression feature vector h_v.
After obtaining the modal feature vectors of the modalities, the modality processing module 202 fuses the three feature vectors to obtain the first fusion feature vector M, obtains the shared feature vector S among the three feature vectors by averaging, and determines the private feature vector of each modality on the basis of the shared feature vector S and the three feature vectors (where i_v is the private feature vector of the facial expression feature vector h_v, i_a is the private feature vector of the audio feature vector h_a, and i_t is the private feature vector of the speech text feature vector h_t).
After the private feature vectors are determined, the three private feature vectors are fused to obtain the second fusion feature vector I. Before this fusion, the three private feature vectors first need to be reduced in dimension so as to reduce the similarity among the private features. Specifically, the dimensionality reduction can be achieved by introducing a pooling layer; the pooled vectors shown in FIG. 2 are the feature vectors obtained after dimensionality reduction of the private feature vectors i_v, i_a and i_t, respectively.
After the first fusion feature vector M, the second fusion feature vector I and the shared feature vector S are determined, the sample acquisition module 203 splices M, I and S to obtain an initial sample matrix F0 and sends the initial sample matrix F0 to the sample processing module 204. The sample processing module 204 uses the Norm-gate module to process the importance of each element in the initial sample matrix F0, the training module 205 takes the importance-processed sample matrix as a training sample to perform classification task training and regression prediction task training on the initial emotion analysis model, and the emotion analysis model for determining the emotion of the user is finally obtained.
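Tying the earlier illustrative snippets together, the sketch below mirrors the module flow of FIG. 2 (feature extraction, decoupling, fusion, splicing, Norm-gate); the pooling windows are arbitrary choices for the example.

```python
def build_training_sample(audio, text, image):
    """Produce a weight-adjusted feature vector F0 for one labeled sample,
    using the illustrative helpers defined in the sketches above."""
    h_a, h_t, h_v = extract_multimodal_features(audio, text, image)
    S, private = decouple_shared_private(h_a, h_t, h_v)
    # First fusion vector M from the raw modality features, then pooled (step S103).
    M = avg_pool_1d(outer_product_fusion([h_a, h_t, h_v]), window=16)
    # Private vectors are pooled first, then fused into the second vector I (step S104).
    I = outer_product_fusion([avg_pool_1d(private[m], window=4) for m in ("a", "t", "v")])
    F0 = build_sample_features(M, I, S, variant="all")   # splicing, as in step S105-1
    return norm_gate(F0)                                  # Norm-gate importance adjustment
```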
Similar to the first embodiment, the second embodiment of the present application provides another method for generating an emotion analysis model. Since it is basically similar to the first embodiment, the description is relatively brief, and for relevant details reference may be made to the corresponding parts of the first embodiment.
Please refer to fig. 3, which is a flowchart illustrating a method for generating an emotion analysis model according to another embodiment of the present application, the method includes steps S301 to S304.
Step S301, obtaining a multi-modal feature vector of the sample object marked with the emotional state.
Step S301 is substantially the same as step S101 in the first embodiment of the present application; for the specific explanation of this step, reference may be made to step S101 in the first embodiment, and it is only briefly described here.
Optionally, the obtaining the multi-modal feature vector of the sample object marked with the emotional state includes:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
Step S302, obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality.
Step S302 is substantially the same as step S102 in the first embodiment of the present application; for the specific explanation of this step, reference may be made to step S102 in the first embodiment, and it is only briefly described here.
Optionally, the step S302 specifically includes:
obtaining the shared feature vector according to the mean value of the multi-modal feature vectors;
and decoupling each modal feature vector in the multi-modal feature vectors according to the shared features to obtain the private feature vector corresponding to each modal.
Step S303, performing fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the sample object.
Step S303 is substantially the same as step S104 in the first embodiment of the present application, and for the specific explanation of this step, reference may be made to step S104 in the first embodiment of the present application, which is only briefly described here.
Optionally, in the step S303, the private feature vectors of the respective modalities are fused in an outer product manner, so as to obtain a second emotion fused feature vector of the sample object.
Step S304, taking one of the second emotion fusion feature and the shared feature and the emotion state marked by the sample object as a training sample to train a preset initial emotion analysis model, and obtaining an emotion analysis model for determining the emotion of the user.
The step S304 is substantially the same as the step S105 in the first embodiment of the present application, except that in the second embodiment of the present application, any one of the second emotion fusion feature or the shared feature is adopted as the training sample to train the preset initial emotion analysis model.
Optionally, the specific training process of the model is as follows:
inputting the second emotion fusion feature or the shared feature into the initial emotion analysis model, and obtaining a predicted value for the emotional state of the sample object through the initial emotion analysis model;
and performing classification task training and regression prediction task training on the initial emotion analysis model according to the emotion state predicted value and the emotion state labeled by the sample object, and taking the trained initial emotion analysis model as the emotion analysis model.
The first embodiment and the second embodiment of the present application respectively describe a method for generating an emotion analysis model, and a third embodiment of the present application provides an emotion analysis method corresponding to the first embodiment, please refer to fig. 4, which is a flowchart of an emotion analysis method provided in another embodiment of the present application.
As shown in fig. 4, the emotion analyzing method provided in the third embodiment of the present application includes steps S401 to S405.
Step S401, obtaining a multi-modal feature vector of a target object.
The emotion analysis method provided in the third embodiment of the present application is used for identifying the emotional state of a target user during human-computer interaction; specifically, the emotion of the target object is analyzed through the emotion analysis model obtained by the first method embodiment of the present application.
The target object can be understood as the user participating in the human-computer interaction process, and obtaining the multi-modal feature set of the target object means obtaining the multi-modal feature vector set of the target object based on the current interaction behavior of the target object.
Specifically, the obtaining manner of the multi-modal feature vector set is similar to the process of obtaining the multi-modal feature vector set of the sample object in step S101 of the first method embodiment of the present application, that is, the obtaining the multi-modal feature vector set of the target object includes:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
Step S402, obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality.
Optionally, the step S402 includes:
taking the vector mean of the modal feature vectors corresponding to the respective modalities in the multi-modal feature vector to determine the shared feature vector;
and decoupling the modal feature vectors corresponding to each mode in the multi-mode feature vectors based on the shared feature vectors, and determining the private feature vectors of each mode.
Step S403, performing fusion processing on the modal feature vectors corresponding to the respective modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the target object.
Optionally, the step S403 includes: performing fusion processing on the modal feature vectors of the respective modalities in the multi-modal feature vector set in an outer product manner to obtain the first emotion fusion feature vector of the target object.
Step S404, the private characteristic vectors of all the modes are subjected to fusion processing, and a second emotion fusion characteristic vector of the target object is obtained.
Optionally, the step S404 includes: performing fusion processing on the private feature vectors of the respective modalities in an outer product manner to obtain the second emotion fusion feature vector of the target object.
Step S405, splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and inputting a splicing result into an emotion analysis model for determining user emotion to obtain an emotion analysis result of the target user.
Splicing the first emotion fusion feature vector and the second emotion fusion feature vector, and inputting the spliced first emotion fusion feature vector and the second emotion fusion feature vector into an emotion analysis model for determining user emotion to obtain an emotion analysis result of the target user;
or, the first emotion fusion feature vector and the shared feature vector are spliced and input into an emotion analysis model for determining user emotion, and an emotion analysis result of the target user is obtained;
or splicing the second emotion fusion feature vector and the shared feature vector and inputting the second emotion fusion feature vector and the shared feature vector into an emotion analysis model for determining user emotion to obtain an emotion analysis result of the target user.
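For illustration, the three splicing alternatives above can be expressed with a single inference helper. It assumes a model that, like the training sketch given earlier, returns a class prediction and an intensity value; the chosen combination must match the one the model was trained on.

```python
# Illustrative inference with a spliced (concatenated) feature pair.
import torch

def predict_emotion(model, first_fusion, second_fusion, shared, combo=("first", "second")):
    """combo selects which vectors are spliced; it must match the training setup."""
    parts = {"first": first_fusion, "second": second_fusion, "shared": shared}
    spliced = torch.cat([parts[name] for name in combo], dim=1)
    with torch.no_grad():
        logits, intensity = model(spliced)
    return logits.argmax(dim=1), intensity  # predicted emotion class and intensity
```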
A fourth embodiment of the present application provides an emotion analysis method corresponding to the second embodiment, please refer to fig. 5, which is a flowchart of an emotion analysis method provided in another embodiment of the present application. Since the method is basically similar to the third embodiment, the description is simple, and reference may be made to the partial description of the third embodiment.
As shown in fig. 5, the emotion analysis method provided in another embodiment of the present application includes the following steps S501 to S504.
Step S501, multi-modal feature vectors of the target object are obtained.
Optionally, the obtaining the multi-modal feature vector of the target object includes:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
Step S502, obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality.
Optionally, step S502 includes:
taking the vector mean of the modal feature vectors corresponding to the respective modalities in the multi-modal feature vector to determine the shared feature vector;
and decoupling the modal feature vectors corresponding to each mode in the multi-mode feature vectors based on the shared feature vectors, and determining the private feature vectors of each mode.
Step S503, performing fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the target object.
Optionally, step S503 includes: performing fusion processing on the private feature vectors of the respective modalities in an outer product manner to obtain the second emotion fusion feature vector of the target object.
Step S504, inputting one of the second emotion fusion feature and the shared feature into an emotion analysis model for determining user emotion, and obtaining an emotion analysis result of the target user.
Inputting the second emotion fusion feature into an emotion analysis model for determining user emotion to obtain an emotion analysis result of the target user;
or inputting the shared feature into an emotion analysis model for determining user emotion to obtain an emotion analysis result of the target user.
Corresponding to the first method embodiment, another embodiment of the present application provides an apparatus for generating an emotion analysis model. Since the apparatus is basically similar to the first method embodiment of the present application, the description is simple, and the relevant point can be found in the partial description of the first method embodiment.
Please refer to fig. 6, which is a schematic structural diagram of a device for generating an emotion analysis model according to another embodiment of the present application.
The emotion analysis model generation device comprises:
a first obtaining module 601, configured to obtain a multi-modal feature vector of a sample object labeled with an emotional state;
A second obtaining module 602, configured to obtain shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality;
a first fusion module 603, configured to perform fusion processing on the modal feature vectors corresponding to the modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the sample object;
a second fusion module 604, configured to perform fusion processing on the private feature vectors of the modalities to obtain a second emotion fusion feature vector of the sample object;
the first training module 605 is configured to splice at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, train a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as training samples, and obtain an emotion analysis model for determining an emotion of the user.
The obtaining of the shared feature vectors among the multi-modal feature vectors and the private feature vectors corresponding to the respective modalities includes:
obtaining the shared feature vector according to the mean value of the multi-modal feature vectors;
and decoupling each modal feature vector in the multi-modal feature vectors according to the shared features to obtain the private feature vector corresponding to each modal.
Optionally, the obtaining the multi-modal feature vector of the sample object marked with the emotional state includes:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
Optionally, the performing fusion processing on the modal feature vectors corresponding to the modalities in the multi-modal feature vector to obtain the first emotion fusion feature vector of the sample object includes:
and performing fusion processing on the modal feature vectors of each mode in the multi-mode feature vector set in an outer product mode to obtain a first emotion fusion feature vector of the sample object.
Optionally, the fusing the private feature vectors of the modalities to obtain a second emotion fused feature vector of the sample object includes:
and carrying out fusion processing on the private characteristic vectors of all the modes in an outer product mode to obtain a second emotion fusion characteristic vector of the sample object.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample to obtain an emotion analysis model for determining user emotion, including:
splicing the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector to obtain a first sample feature matrix;
and taking the first sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model to obtain the emotion analysis model.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample to obtain an emotion analysis model for determining user emotion, including:
splicing the first emotion fusion feature vector and the second emotion fusion feature vector to obtain a second sample feature matrix;
and taking the second sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample to obtain an emotion analysis model for determining user emotion, including:
splicing the first emotion fusion feature vector and the shared feature vector to obtain a third sample feature matrix;
and taking the third sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample to obtain an emotion analysis model for determining user emotion, including:
splicing the second emotion fusion feature vector and the shared feature vector to obtain a fourth sample feature matrix;
and taking the fourth sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
Optionally, the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and training a preset initial emotion analysis model by using a splicing result and an emotion state labeled by the sample object as a training sample, include:
splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector to obtain a splicing result containing a sample feature matrix;
inputting the sample feature matrix into the initial emotion analysis model, and obtaining a predicted value aiming at the emotional state of the sample object through the initial emotion analysis model;
and performing classification task training and regression prediction task training on the initial emotion analysis model according to the emotion state predicted value and the emotion state labeled by the sample object, and taking the trained initial emotion analysis model as the emotion analysis model.
Optionally, the multi-modality comprises at least two of audio, text and image.
Optionally, the apparatus further comprises:
a sample weight adjusting module, configured to determine the probability of occurrence of each matrix element in the sample feature matrix, and assign a weight to each matrix element according to its probability of occurrence;
and adjust the sample feature matrix based on the weight of each matrix element to obtain a weight-adjusted sample feature matrix, the weight-adjusted sample feature matrix and the emotional state labeled for the sample object being used as the training sample.
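A minimal sketch of such element-wise re-weighting follows. How the probability of occurrence of each matrix element is determined is not specified here, so a softmax over the matrix is used purely as an illustrative stand-in.

```python
# Illustrative re-weighting of the sample feature matrix by element probability.
import torch

def reweight_sample_matrix(sample_matrix):
    probs = torch.softmax(sample_matrix.flatten(), dim=0).reshape(sample_matrix.shape)
    weights = probs / probs.max()        # map probabilities to (0, 1] weights
    return sample_matrix * weights       # element-wise weight-adjusted feature matrix
```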
A sixth embodiment of the present application provides an emotion analysis model generation device corresponding to the second embodiment. Since the device is basically similar to the second embodiment, the description is simple, and reference may be made to the partial description of the second embodiment.
Please refer to fig. 7, which is a schematic structural diagram of an apparatus for generating an emotion analysis model according to another embodiment of the present application.
The emotion analysis model generation device comprises:
a third obtaining module 701, configured to obtain a multi-modal feature vector of a sample object labeled with an emotional state;
a fourth obtaining module 702, configured to obtain shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality;
a third fusion module 703, configured to perform fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the sample object;
a second training module 704, configured to train a preset initial emotion analysis model by using one of the second emotion fusion feature and the shared feature and an emotion state labeled by the sample object as a training sample, so as to obtain an emotion analysis model for determining user emotion.
Optionally, the obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality includes:
obtaining the shared feature vector according to the mean value of the multi-modal feature vectors;
and decoupling each modal feature vector in the multi-modal feature vectors according to the shared features to obtain the private feature vector corresponding to each modal.
Optionally, the obtaining the multi-modal feature vector of the sample object marked with the emotional state includes:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
Optionally, the fusing the private feature vectors of the modalities to obtain a second emotion fused feature vector of the sample object includes:
and carrying out fusion processing on the private characteristic vectors of all the modes in an outer product mode to obtain a second emotion fusion characteristic vector of the sample object.
Optionally, the multi-modality includes at least two of audio, text, and images.
Optionally, the apparatus further comprises:
the sample weight adjusting module is used for determining the probability of each matrix element in the sample characteristic matrix; according to the probability of the occurrence of each matrix element, giving a weight to each element; and adjusting the sample feature matrix based on the weight of each element to obtain a sample feature matrix with the adjusted weight as the training sample.
A seventh embodiment of the present application provides an emotion analyzing apparatus corresponding to the third embodiment. Since the device is basically similar to the third embodiment, the description is simple, and reference may be made to the partial description of the third embodiment.
Please refer to fig. 8, which is a schematic structural diagram of an emotion analyzing apparatus according to another embodiment of the present application.
The device includes:
a fifth obtaining module 801, configured to obtain a multi-modal feature vector of a target object;
a sixth obtaining module 802, configured to obtain shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality;
a fourth fusion module 803, configured to perform fusion processing on the modal feature vectors corresponding to the respective modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the target object;
a fifth fusion module 804, configured to perform fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the target object;
a first analysis module 805, configured to splice at least two of the first emotion fusion feature vector, the second emotion fusion feature vector, and the shared feature vector, and input a splicing result into an emotion analysis model for determining user emotion, so as to obtain an emotion analysis result of the target user.
The emotion analysis model is obtained according to the emotion analysis model generation method provided by the first embodiment of the application.
Optionally, the obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality includes:
obtaining the shared feature vector according to the mean value of the multi-modal feature vectors;
and decoupling each modal feature vector in the multi-modal feature vectors according to the shared features to obtain the private feature vector corresponding to each modality.
Optionally, the obtaining the multi-modal feature vector of the target object includes:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
Optionally, the performing fusion processing on the modal feature vectors corresponding to the modalities in the multi-modal feature vector to obtain the first emotion fusion feature vector of the target object includes:
and performing fusion processing on the modal feature vectors of the respective modalities in the multi-modal feature vector set in an outer product manner to obtain the first emotion fusion feature vector of the target object.
Optionally, the fusing the private feature vectors of the modalities to obtain a second emotion fused feature vector of the target object includes:
and performing fusion processing on the private feature vectors of the respective modalities in an outer product manner to obtain the second emotion fusion feature vector of the target object.
An eighth embodiment of the present application provides an emotion analyzing apparatus corresponding to the fourth embodiment. Since the device is basically similar to the fourth embodiment, the description is simple, and reference may be made to the partial description of the fourth embodiment.
Please refer to fig. 9, which is a schematic structural diagram of an emotion analyzing apparatus according to another embodiment of the present application.
As shown in fig. 9, the apparatus includes:
a seventh obtaining module 901, configured to obtain a multi-modal feature vector of the target object;
an eighth obtaining module 902, configured to obtain shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality;
a sixth fusion module, configured to perform fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the target object;
and the second analysis module is used for inputting one of the second emotion fusion characteristics or the shared characteristics into an emotion analysis model used for determining user emotion to obtain an emotion analysis result of the target user.
The emotion analysis model is obtained by the emotion analysis model generation method provided by the second embodiment of the application.
Optionally, the obtaining the multi-modal feature vector of the target object includes:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
Optionally, the obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality includes:
obtaining the shared feature vector according to the mean value of the multi-modal feature vectors;
and decoupling each modal feature vector in the multi-modal feature vectors according to the shared features to obtain the private feature vector corresponding to each modal.
Optionally, the fusing the private feature vectors of the modalities to obtain a second emotion fused feature vector of the target object includes:
and performing fusion processing on the private feature vectors of the respective modalities in an outer product manner to obtain the second emotion fusion feature vector of the target object.
Please refer to fig. 10, which is a schematic structural diagram of an electronic device according to another embodiment of the present application.
The electronic device comprises: a processor 1001; and
a memory 1002 for storing a program of the method, which, when read and run by the processor 1001, performs any one of the methods of the above embodiments.
Another embodiment of the present application also provides a computer storage medium storing a computer program that, when executed, performs any one of the methods of the embodiments.
It should be noted that, for the detailed description of the electronic device and the computer storage medium provided in the embodiments of the present application, reference may be made to the related description of the foregoing method embodiments provided in the present application, and details are not repeated here.
Although the present application has been described with reference to preferred embodiments, they are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application should be determined by the appended claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
2. It will be apparent to those skilled in the art that embodiments of the present application may be provided as a system or an electronic device. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (21)

1. A method for generating an emotion analysis model, comprising:
acquiring a multi-modal feature vector of a sample object marked with emotional state;
obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
performing fusion processing on modal feature vectors corresponding to the modes in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the sample object;
performing fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object;
and splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and training a preset initial emotion analysis model by taking a splicing result and the emotion state marked by the sample object as training samples to obtain an emotion analysis model for determining the emotion of the user.
2. The method according to claim 1, wherein the obtaining of the shared feature vector among the multi-modal feature vectors and the private feature vector corresponding to each modality comprises:
obtaining the shared feature vector according to the mean value of the multi-modal feature vectors;
and decoupling each modal feature vector in the multi-modal feature vectors according to the shared features to obtain the private feature vector corresponding to each modal.
3. The method of claim 1, wherein obtaining multi-modal feature vectors for sample objects labeled with emotional states comprises:
and obtaining the multi-modal feature vector through a convolutional neural network and multi-modal information.
4. The method according to claim 1, wherein the fusing the modal feature vectors corresponding to the respective modalities in the multi-modal feature vector to obtain the first emotion fused feature vector of the sample object, comprises:
and performing fusion processing on the modal feature vectors of each mode in the multi-mode feature vector set in an outer product mode to obtain a first emotion fusion feature vector of the sample object.
5. The method according to claim 1, wherein the fusing the private feature vectors of the modalities to obtain a second emotion fused feature vector of the sample object comprises:
and carrying out fusion processing on the private characteristic vectors of all the modes in an outer product mode to obtain a second emotion fusion characteristic vector of the sample object.
6. The method according to claim 1, wherein the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and training a preset initial emotion analysis model by using a spliced result and an emotion state labeled by the sample object as training samples to obtain an emotion analysis model for determining user emotion, comprises:
splicing the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector to obtain a first sample feature matrix;
and taking the first sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model to obtain the emotion analysis model.
7. The method according to claim 1, wherein the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and training a preset initial emotion analysis model by using a spliced result and an emotion state labeled by the sample object as training samples to obtain an emotion analysis model for determining user emotion, comprises:
splicing the first emotion fusion feature vector and the second emotion fusion feature vector to obtain a second sample feature matrix;
and taking the second sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
8. The method according to claim 1, wherein the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and training a preset initial emotion analysis model by using a spliced result and an emotion state labeled by the sample object as training samples to obtain an emotion analysis model for determining user emotion, comprises:
splicing the first emotion fusion feature vector and the shared feature vector to obtain a third sample feature matrix;
and taking the third sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
9. The method according to claim 1, wherein the splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and training a preset initial emotion analysis model by using a spliced result and an emotion state labeled by the sample object as training samples to obtain an emotion analysis model for determining user emotion, comprises:
splicing the second emotion fusion feature vector and the shared feature vector to obtain a fourth sample feature matrix;
and taking the fourth sample feature matrix and the emotional state labeled by the sample object as training samples to train the initial emotion analysis model, so as to obtain the emotion analysis model.
10. The method according to claim 1, wherein the stitching at least two of the first emotion fused feature vector, the second emotion fused feature vector and the shared feature vector, and training a preset initial emotion analysis model by using a stitching result and an emotion state labeled by the sample object as training samples comprises:
splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector to obtain a splicing result containing a sample feature matrix;
inputting the sample feature matrix into the initial emotion analysis model, and obtaining a predicted value aiming at the emotional state of the sample object through the initial emotion analysis model;
and performing classification task training and regression prediction task training on the initial emotion analysis model according to the emotion state predicted value and the emotion state labeled by the sample object, and taking the trained initial emotion analysis model as the emotion analysis model.
11. The method of claim 10, further comprising:
determining the probability of each matrix element in the sample characteristic matrix;
according to the probability of each matrix element, giving a weight to each matrix element;
adjusting the sample characteristic matrix based on the weight of each matrix element to obtain a sample characteristic matrix after weight adjustment;
and taking the sample feature matrix after the weight adjustment and the emotional state labeled by the sample object as the training sample.
12. The method of claim 1, wherein the multiple modalities include at least two of audio, text, and images.
13. A method for generating an emotion analysis model, comprising:
acquiring a multi-modal feature vector of a sample object marked with emotional state;
obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
performing fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object;
and taking one of the second emotion fusion feature and the shared feature and the emotion state marked by the sample object as a training sample to train a preset initial emotion analysis model, so as to obtain an emotion analysis model for determining the emotion of the user.
14. An emotion analysis method, comprising:
acquiring a multi-modal feature vector of a target object;
obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
performing fusion processing on modal feature vectors corresponding to the modes in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the target object;
performing fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the target object;
splicing at least two of the first emotion fusion feature vector, the second emotion fusion feature vector and the shared feature vector, and inputting splicing results into an emotion analysis model for determining user emotion to obtain emotion analysis results of the target user;
wherein the emotion analysis model is obtained by the method of any one of claims 1 to 12.
15. An emotion analysis method, comprising:
acquiring a multi-modal feature vector of a target object;
obtaining shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
performing fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the target object;
inputting one of the second emotion fusion feature or the shared feature into an emotion analysis model for determining user emotion to obtain an emotion analysis result of the target user;
wherein the emotion analysis model is obtained by the method of claim 13.
16. An apparatus for generating an emotion analysis model, comprising:
a first obtaining module, configured to obtain a multi-modal feature vector of a sample object labeled with an emotional state;
The second acquisition module is used for acquiring shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
the first fusion module is used for performing fusion processing on modal feature vectors corresponding to all the modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the sample object;
the second fusion module is used for carrying out fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object;
and the first training module is used for splicing at least two of the first emotion fusion characteristic vector, the second emotion fusion characteristic vector and the shared characteristic vector, training a preset initial emotion analysis model by taking a splicing result and the emotion state marked by the sample object as training samples, and obtaining an emotion analysis model for determining the emotion of the user.
17. An apparatus for generating an emotion analysis model, comprising:
the third acquisition module is used for acquiring the multi-modal feature vector of the sample object marked with the emotional state;
the fourth acquisition module is used for acquiring shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to all the modalities;
the third fusion module is used for carrying out fusion processing on the private characteristic vectors of the modes to obtain a second emotion fusion characteristic vector of the sample object;
and the second training module is used for taking one of the second emotion fusion characteristic and the shared characteristic and the emotion state marked by the sample object as a training sample to train a preset initial emotion analysis model so as to obtain an emotion analysis model for determining the emotion of the user.
18. An emotion analysis device, comprising:
the fifth acquisition module is used for acquiring the multi-modal feature vector of the target object;
a sixth obtaining module, configured to obtain shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to the modalities;
the fourth fusion module is used for performing fusion processing on the modal feature vectors corresponding to the modalities in the multi-modal feature vectors to obtain a first emotion fusion feature vector of the target object;
a fifth fusion module, configured to perform fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the target object;
the first analysis module is used for splicing at least two of the first emotion fusion characteristic vector, the second emotion fusion characteristic vector and the shared characteristic vector, inputting splicing results into an emotion analysis model for determining user emotion, and obtaining emotion analysis results of the target user;
wherein the emotion analysis model is obtained by the apparatus of claim 16.
19. An emotion analysis device, comprising:
the seventh acquisition module is used for acquiring the multi-modal feature vector of the target object;
an eighth obtaining module, configured to obtain shared feature vectors among the multi-modal feature vectors and private feature vectors corresponding to each modality;
a sixth fusion module, configured to perform fusion processing on the private feature vectors of the respective modalities to obtain a second emotion fusion feature vector of the target object;
the second analysis module is used for inputting one of the second emotion fusion feature or the shared feature into an emotion analysis model used for determining user emotion to obtain an emotion analysis result of the target user;
wherein the emotion analysis model is obtained by the apparatus of claim 17.
20. An electronic device, comprising:
a processor;
memory for storing a program of the method, which program, when read and run by the processor, performs the method of any one of claims 1-15.
21. A computer storage medium, characterized in that it stores a computer program which, when executed, performs the method of any one of claims 1-15.
CN202111450929.0A 2021-11-30 2021-11-30 Emotion analysis model generation method and device, electronic equipment and storage medium Pending CN114140885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111450929.0A CN114140885A (en) 2021-11-30 2021-11-30 Emotion analysis model generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111450929.0A CN114140885A (en) 2021-11-30 2021-11-30 Emotion analysis model generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114140885A true CN114140885A (en) 2022-03-04

Family

ID=80386786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111450929.0A Pending CN114140885A (en) 2021-11-30 2021-11-30 Emotion analysis model generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114140885A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926837A (en) * 2022-05-26 2022-08-19 东南大学 Emotion recognition method based on human-object space-time interaction behavior
CN114926837B (en) * 2022-05-26 2023-08-04 东南大学 Emotion recognition method based on human-object space-time interaction behavior
CN115880698A (en) * 2023-03-08 2023-03-31 南昌航空大学 Depression emotion recognition method based on microblog posting content and social behavior characteristics
CN116758462A (en) * 2023-08-22 2023-09-15 江西师范大学 Emotion polarity analysis method and device, electronic equipment and storage medium
CN117576520A (en) * 2024-01-16 2024-02-20 中国科学技术大学 Training method of target detection model, target detection method and electronic equipment
CN117576520B (en) * 2024-01-16 2024-05-17 中国科学技术大学 Training method of target detection model, target detection method and electronic equipment

Similar Documents

Publication Publication Date Title
CN112560830B (en) Multi-mode dimension emotion recognition method
Zadeh et al. Factorized multimodal transformer for multimodal sequential learning
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN116171473A (en) Bimodal relationship network for audio-visual event localization
CN109766557A (en) A kind of sentiment analysis method, apparatus, storage medium and terminal device
WO2023050708A1 (en) Emotion recognition method and apparatus, device, and readable storage medium
Ristea et al. Emotion recognition system from speech and visual information based on convolutional neural networks
Danisman et al. Intelligent pixels of interest selection with application to facial expression recognition using multilayer perceptron
Raut Facial emotion recognition using machine learning
Phan et al. Consensus-based sequence training for video captioning
Lin et al. PS-mixer: A polar-vector and strength-vector mixer model for multimodal sentiment analysis
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN118296150B (en) Comment emotion recognition method based on multi-countermeasure network improvement
KR20190128933A (en) Emotion recognition apparatus and method based on spatiotemporal attention
Zhu et al. Multimodal deep denoise framework for affective video content analysis
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
Tang et al. Acoustic feature learning via deep variational canonical correlation analysis
Pei et al. Continuous affect recognition with weakly supervised learning
Huang et al. Learning long-term temporal contexts using skip RNN for continuous emotion recognition
Agrawal et al. Multimodal personality recognition using cross-attention transformer and behaviour encoding
Wu et al. Speaker personality recognition with multimodal explicit many2many interactions
Yang et al. SMFNM: Semi-supervised multimodal fusion network with main-modal for real-time emotion recognition in conversations
CN116975602A (en) AR interactive emotion recognition method and system based on multi-modal information double fusion
Benavent-Lledo et al. Predicting human-object interactions in egocentric videos
Agrawal et al. Fusion based emotion recognition system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination