CN108985358B - Emotion recognition method, device, equipment and storage medium - Google Patents

Emotion recognition method, device, equipment and storage medium

Info

Publication number
CN108985358B
Authority
CN
China
Prior art keywords
modal
session
information
session information
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810694899.XA
Other languages
Chinese (zh)
Other versions
CN108985358A (en)
Inventor
林英展
陈炳金
梁一川
凌光
周超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810694899.XA
Publication of CN108985358A
Application granted
Publication of CN108985358B
Legal status: Active

Classifications

    • G06F18/253
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The embodiment of the invention discloses an emotion recognition method, device, equipment and storage medium. The method comprises the following steps: determining fusion session features of multi-modal session information; and inputting the fusion session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotion features of the multi-modal session information. According to the technical scheme provided by the embodiment of the invention, the session features of each modality in the multi-modal session information are fused to obtain the fusion session features, and the fusion session features are input into a single unified multi-modal emotion recognition model for model training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.

Description

Emotion recognition method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for emotion recognition.
Background
With the development of artificial intelligence, intelligent interaction plays an increasingly important role in more and more fields. In intelligent interaction, an important direction is how to identify the user's current emotional state during the multi-modal interaction process, so that emotion-level feedback can be provided to the whole intelligent interaction system, adjustments can be made in time, users in different emotional states can be responded to appropriately, and the service quality of the whole interaction process can be improved.
At present, the mainstream emotion recognition method is shown in fig. 1, and the overall process is as follows: each modality, such as voice, text and expression images, is modeled independently; the results of the individual models are then fused together, a fusion judgment is made on the results of the multiple modalities according to rules or machine learning models, and an overall multi-modal emotion recognition result is finally output.
Because the same word has different meanings in different scenes, the emotional states it expresses also differ, so the generality of this method is poor. In addition, the method relies on manual work to collect a large amount of data, so its cost is high and its results are hard to control.
Disclosure of Invention
The embodiment of the invention provides an emotion recognition method, device, equipment and storage medium, which simplify the sample training process and improve the accuracy of emotion recognition results.
In a first aspect, an embodiment of the present invention provides an emotion recognition method, where the method includes:
determining a fusion session feature of the multi-modal session information;
and inputting the fusion session characteristics of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotion characteristics of the multi-modal session information.
In a second aspect, an embodiment of the present invention further provides an emotion recognition apparatus, where the apparatus includes:
the fusion characteristic determining module is used for determining fusion session characteristics of the multi-modal session information;
and the emotion characteristic determination module is used for inputting the fusion session characteristics of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotion characteristics of the multi-modal session information.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the emotion recognition method of any of the first aspects.
In a fourth aspect, an embodiment of the present invention further provides a storage medium on which a computer program is stored, where the program, when executed by a processor, implements the emotion recognition method according to any of the first aspects.
According to the technical scheme provided by the embodiment of the invention, the session features of each modality in the multi-modal session information are fused to obtain the fusion session features, and the fusion session features are input into a single unified multi-modal emotion recognition model for model training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.
Drawings
FIG. 1 is a schematic diagram of multi-modal emotion recognition based on independent modal training provided by the prior art;
fig. 2A is a flowchart of an emotion recognition method provided in the first embodiment of the present invention;
FIG. 2B is a schematic diagram of a learning model based on multi-modal feature fusion to which the present invention is applicable;
fig. 3 is a flowchart of an emotion recognition method provided in the second embodiment of the present invention;
fig. 4 is a block diagram showing a structure of an emotion recognition apparatus provided in the third embodiment of the present invention;
fig. 5 is a schematic structural diagram of an apparatus provided in the fourth embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the embodiments of the invention and that no limitation of the invention is intended. It should be further noted that, for convenience of description, only some structures, not all structures, relating to the embodiments of the present invention are shown in the drawings.
Example one
Fig. 2A is a flowchart of an emotion recognition method provided in an embodiment of the present invention, and fig. 2B is a schematic diagram of a learning model based on multi-modal feature fusion to which the embodiment of the present invention applies. This embodiment is suitable for accurately recognizing the user's emotion during the multi-modal interaction process. The method can be executed by the emotion recognition device provided by the embodiment of the invention; the device can be implemented in software and/or hardware and can be integrated into a computing device. Referring to fig. 2A and 2B, the method specifically includes:
s210, determining the fusion session characteristics of the multi-modal session information.
A modality is a term used in interaction; multi-modality refers to interacting by comprehensively using multiple means and symbol carriers, such as text, images, video, voice and gestures. Correspondingly, multi-modal session information is session information that simultaneously includes at least two modalities, for example the three modalities of voice, text and image.
The fusion session features are obtained by fusing the session features of the different modalities included in one piece of session information. Optionally, a deep learning model may be employed to determine the fusion session features of the multi-modal session information while jointly considering the multiple modal features contained in one piece of session information.
S220, inputting the fusion session characteristics of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotion characteristics of the multi-modal session information.
The multi-modal emotion recognition model is a model established based on technologies in artificial intelligence such as speech recognition, intelligent image recognition and text recognition; specifically, an initial machine learning model, such as a neural network model, may be trained in advance using a sample data set. The emotion features are the multi-modal emotion recognition result and are used to represent an individual's attitude toward external things; they may include the emotion type, the emotion intensity and the like. The emotion types may include happiness, anger, sadness, joy and the like; the emotion intensity characterizes how strongly a certain emotion is expressed.
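The patent leaves the concrete representation of the emotion features open. Purely as an illustration, an emotion feature can be thought of as a small record carrying an emotion type and an intensity; the label set and the intensity scale in the sketch below are assumptions, not values given by the patent.

    # Illustrative only: the emotion label set and the [0, 1] intensity scale
    # are assumptions; the patent does not fix a concrete data layout.
    from dataclasses import dataclass

    EMOTION_TYPES = ["happiness", "anger", "sadness", "joy"]  # assumed label set

    @dataclass
    class EmotionFeature:
        emotion_type: str   # one of EMOTION_TYPES
        intensity: float    # assumed scale: 0 = very weak, 1 = very strong

    example = EmotionFeature(emotion_type="anger", intensity=0.8)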
For example, before inputting the fusion session features of the multi-modal session information into the pre-constructed multi-modal emotion recognition model, the method may further include: training the initial machine learning model according to the fusion session features of multi-modal session sample information and the emotion features of the multi-modal session sample information to obtain the multi-modal emotion recognition model.
Specifically, session information from interaction processes in various scenes is continuously accumulated to obtain a large number of fusion session features of multi-modal session sample information and the emotion features of the corresponding multi-modal session sample information; these are used as the training sample set and input into a neural network for training, and after training on these samples the multi-modal emotion recognition model is obtained. When the fusion session features of multi-modal session information are input into the multi-modal emotion recognition model, the model judges the input fusion session features in combination with its existing parameters and outputs the corresponding emotion features.
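As a rough illustration of this training step, the following sketch trains a simple feed-forward classifier on batches of fusion session features and labeled emotion features. The framework (PyTorch), the network architecture, the feature dimension and the number of emotion classes are all assumptions; the patent only specifies that an initial machine learning model, such as a neural network, is trained on the fused features and their emotion labels.

    # A minimal training sketch under the assumptions stated above.
    import torch
    import torch.nn as nn

    FUSED_DIM, NUM_EMOTIONS = 192, 4                   # assumed sizes

    model = nn.Sequential(                             # assumed initial model
        nn.Linear(FUSED_DIM, 128),
        nn.ReLU(),
        nn.Linear(128, NUM_EMOTIONS),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(fused_features, emotion_labels):
        """One update on a batch of fused session features and emotion labels."""
        logits = model(fused_features)                 # (batch, NUM_EMOTIONS)
        loss = loss_fn(logits, emotion_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Dummy batch standing in for accumulated multi-modal session samples.
    features = torch.randn(8, FUSED_DIM)
    labels = torch.randint(0, NUM_EMOTIONS, (8,))
    train_step(features, labels)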
It should be noted that, in the prior art, a separate recognition model has to be established for each modality and the results of the models have to be weighted to obtain the final emotion result, so a large number of training samples are required, and the poor quality of a model learned from a single modality leads to a poor overall emotion recognition effect. In this embodiment, referring to fig. 2B, the session features of each modality in the multi-modal session information are fused directly to obtain the fusion session features, and only the fusion session features need to be input into a unified multi-modal emotion recognition model for model training to output the final emotion features, so the number of training samples is greatly reduced compared with the prior art. Moreover, because the multi-modal session features are fused, the multi-modal emotion recognition model can learn not only the feature information of each modality but also the feature relationships among different modalities, which avoids the problem in the prior art that the overall emotion recognition effect is poor because the quality of a model learned from a single modality is poor.
Bimodal session information of text and voice is taken as an example. When the user says that he or she wants to buy an Apple X and wants the ticket, it is uncertain, when the text modal information and the voice modal information are considered separately as in the prior art, whether the sentence should be labeled as a negative emotion, and the final emotion recognition result is therefore inaccurate. With the technical scheme of this embodiment, however, in addition to the text modal information, information on the user's voice modality is also taken into account, for example that the user's voice fluctuates sharply when saying this sentence; by fusing the text and voice bimodal features, the emotion can finally be accurately recognized as a negative emotion.
In addition, it should be emphasized that the emotion features of the multi-modal session sample information used in this embodiment are obtained by labeling the multi-modal session information while comprehensively considering every modality. This ensures that the labeled emotional state is unambiguous and builds a more accurate data set for the subsequent model training, making the finally obtained multi-modal emotion recognition model more accurate. In the prior art, each modality is labeled independently; because a single modality is labeled on its own, the emotion feature of a sentence may not be labeled correctly, so the recognition accuracy of the emotion model corresponding to each modality is poor, which ultimately degrades the subsequent result-fusion stage.
According to the technical scheme provided by the embodiment of the invention, the session features of each modality in the multi-modal session information are fused to obtain the fusion session features, and the fusion session features are input into a single unified multi-modal emotion recognition model for model training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.
Example two
Fig. 3 is a flowchart of an emotion recognition method provided in the second embodiment of the present invention. On the basis of the first embodiment, this embodiment further refines the step of determining the fusion session features of the multi-modal session information. Referring to fig. 3, the method specifically includes:
s310, vector representation of at least two types of modal session information in the voice session information, the text session information and the image session information is determined respectively.
Illustratively, the multimodal conversation information may include: voice session information, text session information, and image session information. The vector representation of the session information refers to a representation of the session information on a vector space, which can be obtained through modeling.
Specifically, feature parameters that can represent emotional change are extracted from the voice session information, keywords such as sentence segments and word-segmentation results are extracted from the text session information, and effective dynamic or static expression features are extracted from the image session information; the extracted features are then input into a vector extraction model to obtain the vector representation of the voice session information, the vector representation of the image session information and the vector representation of the text session information. The vector extraction model may be a comprehensive model that converts voice features, text keywords, image features and the like into corresponding vector representations, or it may be composed of a combination of sub-models.
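For illustration only, the sketch below mimics this per-modality extraction with three hypothetical placeholder functions; the actual feature parameters, keyword extraction, expression features and the vector extraction model itself are not specified at this level of detail in the patent.

    # Hypothetical placeholders standing in for the per-modality extractors.
    import numpy as np

    def speech_vector(audio: np.ndarray) -> np.ndarray:
        # Placeholder: simple statistics that loosely track emotional change.
        return np.array([audio.mean(), audio.std(), np.abs(np.diff(audio)).mean()])

    def text_vector(sentence: str) -> np.ndarray:
        # Placeholder: presence of a few assumed keywords after segmentation.
        keywords = ["buy", "ticket", "refund"]          # assumed keyword list
        return np.array([float(k in sentence.lower()) for k in keywords])

    def image_vector(frame: np.ndarray) -> np.ndarray:
        # Placeholder: a few pooled values standing in for expression features.
        return frame.reshape(-1)[:4].astype(float)

    vectors = {
        "speech": speech_vector(np.random.randn(16000)),
        "text": text_vector("I want to buy an Apple X and I want a ticket"),
        "image": image_vector(np.random.rand(8, 8)),
    }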
S320, the vector representation of the session information of at least two modalities is fused to obtain the vector representation of the fusion session characteristics of the multi-modality session information.
Specifically, the vector representations of the individual modal session information can be directly spliced, according to a certain rule, into one long unified vector representation that serves as the vector representation of the fusion session features of the multi-modal session information, thereby fusing the vector representations of the modal session information. Alternatively, the vector representation of the key information part within each modality's vector representation can be extracted and spliced to obtain the vector representation of the fusion session features of the multi-modal session information.
Illustratively, fusing the vector representation of the at least two modalities of session information may include: and sequentially splicing the vector representations of the session information of at least two modalities according to a preset modality sequence.
The preset modal sequence can be a preset modal input sequence, and can be corrected according to actual conditions. For example, a certain modality can be added, deleted or inserted, so that the input sequence of each modality can be dynamically adjusted.
Specifically, after the vector representation of each modal session information corresponding to the input multi-modal session information is determined, the vector representations of the modal session information are directly connected according to the input sequence of each modality, so that the fusion of the vector representations of the multi-modal session information is realized.
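A minimal sketch of this direct splicing follows, assuming a fixed speech/text/image order and small stand-in vectors; the preset modality order and the vector sizes are assumptions.

    import numpy as np

    MODALITY_ORDER = ["speech", "text", "image"]        # assumed preset order

    def fuse_by_concatenation(vectors: dict) -> np.ndarray:
        """Splice per-modality vectors, in the preset order, into one fused vector."""
        return np.concatenate([vectors[m] for m in MODALITY_ORDER])

    vectors = {
        "speech": np.random.randn(3),                   # stand-ins for the vector
        "text": np.random.randn(3),                     # representations obtained
        "image": np.random.randn(4),                    # in S310
    }
    fused = fuse_by_concatenation(vectors)              # shape (10,)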
Illustratively, fusing the vector representation of the at least two modalities of session information may further include: respectively extracting nonlinear characteristics represented by vectors of at least two modal session information; and fusing the nonlinear characteristics of the extracted at least two modal session information.
The nonlinear feature of a vector representation is used to characterize a specific part of the vector, and may be the part of the vector representation other than 0. Correspondingly, the nonlinear feature of the vector representation of one piece of modal session information refers to the vector representation of the words in that session information from which emotion can be recognized. For example, if the vector representation of the modal session information is [0, 1, 1, 0, 0], the nonlinear feature of that vector representation may be [1, 1].
Specifically, referring to fig. 2B, in the multi-modal feature fusion layer, the vector representation of each modal session information may be input into the deep learning model, and a fully connected layer (FCL) operation is performed to extract the nonlinear feature of each modality's vector representation, yielding a corresponding hidden-layer vector; the output hidden vectors are then spliced together, thereby fusing the vector representations of the multi-modal session information.
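The sketch below illustrates such a fusion layer under the assumption of a PyTorch implementation: one fully connected layer per modality produces a hidden (nonlinear) vector, and the hidden vectors are spliced together. The hidden size and the per-modality input dimensions are assumptions, not values from the patent.

    import torch
    import torch.nn as nn

    class FusionLayer(nn.Module):
        def __init__(self, modal_dims, hidden_dim=64):
            super().__init__()
            # One fully connected layer (FCL) per modality, in a fixed order.
            self.fcls = nn.ModuleList(nn.Linear(d, hidden_dim) for d in modal_dims)

        def forward(self, modal_vectors):
            # Nonlinear (hidden-layer) feature of each modality's vector representation.
            hidden = [torch.relu(fcl(v)) for fcl, v in zip(self.fcls, modal_vectors)]
            # Splice the hidden vectors into the fused session feature.
            return torch.cat(hidden, dim=-1)

    fusion = FusionLayer(modal_dims=[40, 300, 512])     # assumed per-modality sizes
    speech, text, image = (torch.randn(1, d) for d in (40, 300, 512))
    fused = fusion([speech, text, image])               # shape (1, 192)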
S330, inputting the fusion session characteristics of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotion characteristics of the multi-modal session information.
Specifically, vector representation of the fusion session features of the multi-modal session information is input into a pre-constructed multi-modal emotion recognition model, and the model can judge the input fusion session features by combining the existing parameters of the model and output corresponding emotion features.
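Continuing the earlier sketches, inference then reduces to a single forward pass; the softmax read-out, the label set and the use of the maximum probability as an intensity proxy are assumptions for illustration, not part of the patent.

    import torch
    import torch.nn as nn

    FUSED_DIM, NUM_EMOTIONS = 192, 4
    EMOTION_TYPES = ["happiness", "anger", "sadness", "joy"]   # assumed label set

    # Stands in for the trained multi-modal emotion recognition model.
    model = nn.Sequential(nn.Linear(FUSED_DIM, 128), nn.ReLU(),
                          nn.Linear(128, NUM_EMOTIONS))

    fused_feature = torch.randn(1, FUSED_DIM)                  # fused feature from S320
    with torch.no_grad():
        probs = torch.softmax(model(fused_feature), dim=-1)
    emotion_type = EMOTION_TYPES[int(probs.argmax())]
    intensity = float(probs.max())                             # assumed intensity proxy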
According to the technical scheme provided by the embodiment of the invention, the vector representation of the fusion session features of the multi-modal session information is obtained by fusing the vector representations of the individual modal session information in the multi-modal session information, and the vector representation of the fusion session features is input into a single unified multi-modal emotion recognition model for model training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.
EXAMPLE III
Fig. 4 is a block diagram of a structure of an emotion recognition apparatus provided in a third embodiment of the present invention, where the apparatus is capable of executing an emotion recognition method provided in any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the execution method. As shown in fig. 4, the apparatus may include:
a fused feature determining module 410 for determining a fused session feature of the multimodal session information;
and the emotion characteristic determination module 420 is configured to input the fusion session characteristics of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain emotion characteristics of the multi-modal session information.
According to the technical scheme provided by the embodiment of the invention, the session features of each modality in the multi-modal session information are fused to obtain the fusion session features, and the fusion session features are input into a single unified multi-modal emotion recognition model for model training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.
For example, the fused feature determination module 410 may include:
the multi-modal vector determining unit is used for respectively determining vector representation of at least two types of modal session information in the voice session information, the text session information and the image session information;
and the fusion vector determining unit is used for fusing the vector representation of the at least two types of modal session information to obtain the vector representation of the fusion session characteristics of the multi-modal session information.
Optionally, the fusion vector determining unit is specifically configured to:
and sequentially splicing the vector representations of the session information of at least two modalities according to a preset modality sequence.
Optionally, the fusion vector determining unit is further specifically configured to:
respectively extracting nonlinear characteristics represented by vectors of at least two modal session information; and fusing the nonlinear characteristics of the extracted at least two modal session information.
Illustratively, the apparatus may further include:
and the recognition model determining module is used for training the initial machine learning model according to the fusion session characteristics of the multi-modal session sample information and the emotion characteristics of the multi-modal session sample information to obtain a multi-modal emotion recognition model.
Example four
Fig. 5 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention, and fig. 5 shows a block diagram of an exemplary apparatus suitable for implementing the embodiment of the present invention. The device 12 shown in fig. 5 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present invention. As shown in FIG. 5, device 12 is in the form of a general purpose computing device. The components of device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments described herein.
Device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with device 12, and/or with any devices (e.g., network card, modem, etc.) that enable device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, for example, implementing an emotion recognition method provided by an embodiment of the present invention, by executing a program stored in the system memory 28.
EXAMPLE five
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program (also referred to as computer-executable instructions) is stored; when executed by a processor, the program implements the emotion recognition method according to any of the above embodiments.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the embodiments of the present invention have been described in more detail through the above embodiments, the embodiments of the present invention are not limited to the above embodiments, and many other equivalent embodiments may be included without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (8)

1. A method of emotion recognition, comprising:
extracting feature parameters capable of representing emotion change in voice session information, extracting keywords of text session information, extracting effective dynamic expression features or static expression features in image session information, and inputting the extracted features into a vector extraction model to obtain a vector representation of the voice session information, a vector representation of the image session information and a vector representation of the text session information; the vector extraction model is a comprehensive model which can convert voice features, text keywords and image features into corresponding vector representations, and the voice session information, the text session information and the image session information are each modal session information;
fusing the vector representation of the at least two types of modal session information to obtain the vector representation of the fused session features of the multimodal session information;
inputting the fusion session characteristics of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotion characteristics of the multi-modal session information;
wherein fusing the vector representations of the at least two modality session information comprises:
respectively extracting nonlinear features represented by vectors of the at least two modal session information;
and fusing the extracted nonlinear characteristics of the at least two modal session information.
2. The method according to claim 1, wherein fusing the vector representation of the at least two modality session information comprises:
and sequentially splicing the vector representations of the at least two types of modal session information according to a preset modal sequence.
3. The method of claim 1, wherein before inputting the merged session features of the multi-modal session information into the pre-constructed multi-modal emotion recognition model, further comprising:
training an initial machine learning model according to the fusion session characteristics of the multi-modal session sample information and the emotion characteristics of the multi-modal session sample information to obtain the multi-modal emotion recognition model.
4. An emotion recognition apparatus, comprising:
the fusion characteristic determining module is used for determining fusion session characteristics of the multi-modal session information;
the emotion feature determination module is used for inputting the fusion session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotion features of the multi-modal session information;
wherein the fused feature determination module comprises:
the multi-modal vector determining unit is used for extracting feature parameters capable of representing emotion changes from the voice session information, extracting keywords from the text session information, extracting effective dynamic expression features or static expression features from the image session information, and inputting the extracted features into a vector extraction model to obtain the vector representation of the voice session information, the vector representation of the image session information and the vector representation of the text session information; the vector extraction model is a comprehensive model which can convert voice features, text keywords and image features into corresponding vector representations, and the voice session information, the text session information and the image session information are each modal session information;
the fusion vector determining unit is used for fusing the vector representation of the at least two types of modal session information to obtain the vector representation of the fusion session characteristics of the multi-modal session information;
wherein the fusion vector determination unit is further specifically configured to:
respectively extracting nonlinear features represented by vectors of the at least two modal session information;
and fusing the extracted nonlinear characteristics of the at least two modal session information.
5. The apparatus according to claim 4, wherein the fused vector determining unit is specifically configured to:
and sequentially splicing the vector representations of the at least two types of modal session information according to a preset modal sequence.
6. The apparatus of claim 4, further comprising:
and the recognition model determining module is used for training an initial machine learning model according to the fusion session characteristics of the multi-modal session sample information and the emotion characteristics of the multi-modal session sample information to obtain the multi-modal emotion recognition model.
7. An apparatus, characterized in that the apparatus comprises:
one or more processors;
storage means for storing one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the emotion recognition method of any of claims 1-3.
8. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, carries out the emotion recognition method as claimed in any of claims 1-3.
CN201810694899.XA 2018-06-29 2018-06-29 Emotion recognition method, device, equipment and storage medium Active CN108985358B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810694899.XA CN108985358B (en) 2018-06-29 2018-06-29 Emotion recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810694899.XA CN108985358B (en) 2018-06-29 2018-06-29 Emotion recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108985358A CN108985358A (en) 2018-12-11
CN108985358B true CN108985358B (en) 2021-03-02

Family

ID=64538992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810694899.XA Active CN108985358B (en) 2018-06-29 2018-06-29 Emotion recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108985358B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681645A (en) * 2019-02-25 2020-09-18 北京嘀嘀无限科技发展有限公司 Emotion recognition model training method, emotion recognition device and electronic equipment
CN111816211A (en) * 2019-04-09 2020-10-23 Oppo广东移动通信有限公司 Emotion recognition method and device, storage medium and electronic equipment
CN110083716A (en) * 2019-05-07 2019-08-02 青海大学 Multi-modal affection computation method and system based on Tibetan language
CN110021308B (en) * 2019-05-16 2021-05-18 北京百度网讯科技有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN110390956A (en) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 Emotion recognition network model, method and electronic equipment
CN112183022A (en) * 2020-09-25 2021-01-05 北京优全智汇信息技术有限公司 Loss assessment method and device
CN112233698A (en) * 2020-10-09 2021-01-15 中国平安人寿保险股份有限公司 Character emotion recognition method and device, terminal device and storage medium
CN114005468A (en) * 2021-09-07 2022-02-01 华院计算技术(上海)股份有限公司 Interpretable emotion recognition method and system based on global working space

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8781989B2 (en) * 2008-01-14 2014-07-15 Aptima, Inc. Method and system to predict a data value
CN102930298B (en) * 2012-09-02 2015-04-29 北京理工大学 Audio visual emotion recognition method based on multi-layer boosted HMM
CN104835507B (en) * 2015-03-30 2018-01-16 渤海大学 A kind of fusion of multi-mode emotion information and recognition methods gone here and there and combined
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN106503805B (en) * 2016-11-14 2019-01-29 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis method
CN107705807B (en) * 2017-08-24 2019-08-27 平安科技(深圳)有限公司 Voice quality detecting method, device, equipment and storage medium based on Emotion identification

Also Published As

Publication number Publication date
CN108985358A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
CN107908635B (en) Method and device for establishing text classification model and text classification
CN109003624B (en) Emotion recognition method and device, computer equipment and storage medium
US9805718B2 (en) Clarifying natural language input using targeted questions
US10522136B2 (en) Method and device for training acoustic model, computer device and storage medium
CN108052577B (en) Universal text content mining method, device, server and storage medium
CN108922564B (en) Emotion recognition method and device, computer equipment and storage medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN109034203A (en) Training, expression recommended method, device, equipment and the medium of expression recommended models
WO2020073530A1 (en) Customer service robot session text classification method and apparatus, and electronic device and computer-readable storage medium
CN110415679B (en) Voice error correction method, device, equipment and storage medium
US20220382965A1 (en) Text sequence generating method and apparatus, device and medium
CN111428514A (en) Semantic matching method, device, equipment and storage medium
CN111191428B (en) Comment information processing method and device, computer equipment and medium
CN109408834B (en) Auxiliary machine translation method, device, equipment and storage medium
US20210004603A1 (en) Method and apparatus for determining (raw) video materials for news
CN114067790A (en) Voice information processing method, device, equipment and storage medium
CN112214595A (en) Category determination method, device, equipment and medium
CN112825114A (en) Semantic recognition method and device, electronic equipment and storage medium
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN110555207A (en) Sentence recognition method, sentence recognition device, machine equipment and computer-readable storage medium
US11100297B2 (en) Provision of natural language response to business process query
CN110276001B (en) Checking page identification method and device, computing equipment and medium
WO2023005968A1 (en) Text category recognition method and apparatus, and electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant