CN115512722A - Multi-mode emotion recognition method, equipment and storage medium - Google Patents
- Publication number
- CN115512722A (application CN202211233048.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- layer
- layers
- feature
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The application discloses a multi-modal emotion recognition method, a device and a storage medium. Multi-modal feature fusion is performed by fully exploiting both voice-modality and text-modality information, and dual-channel speech emotion recognition with a fused attention mechanism is carried out by a long short-term memory module and a convolution module arranged in parallel, making feature extraction more complete. Voice and text features are fused at the feature layer, so that the feature information required for the final decision is retained to the greatest extent and the output emotion classification probabilities are more accurate.
Description
Technical Field
The invention belongs to the technical field of emotion recognition, and in particular relates to a multi-modal emotion recognition method, a device and a storage medium.
Background
Emotion recognition is a key foundation of human-computer interaction in artificial intelligence and has long influenced its development. However, humans express emotion through multiple modes; the information content of any single-modality emotion signal is limited and has inherent deficiencies, and the collection of emotion information is easily disturbed by various external factors. Therefore, multi-modal fusion is often employed to recognize emotions.
Moreover, with the rapid growth of information, information overload makes it difficult for people to efficiently identify effective information or to support their own decisions.
Disclosure of Invention
To solve, or partially solve, the above technical problems, the invention provides a multi-modal emotion recognition method, a device and a storage medium, in which multi-modal feature fusion is performed by fully exploiting voice-modality and text-modality information, and dual-channel speech emotion recognition with a fused attention mechanism is performed by a long short-term memory module and a convolution module arranged in parallel. The specific technical scheme is as follows:
A multi-modal emotion recognition method, the method comprising:
acquiring a first original voice signal;
preprocessing the first original voice signal and then extracting speech emotion features to obtain a first voice feature vector set, wherein the first voice feature vector set comprises the zero-crossing rate, Mel-frequency cepstral coefficients, spectral rolloff, spectral centroid and chroma vector of the first original voice signal;
converting the first original voice signal into a text to obtain a first text;
performing text feature extraction on the first text by the GloVe word embedding method to obtain a first text feature vector set corresponding to the first text;
feeding the first original voice signal and all voice feature vectors in the first voice feature vector set into a voice feature learning model for operation to obtain a first voice feature; the voice feature learning model comprises a convolution module, a long short-term memory module and a local fusion module, wherein the convolution module comprises a plurality of one-dimensional convolution layers distinguished by their number of filters, a max pooling layer, a global average pooling layer and a fully connected Dense layer; the long short-term memory module comprises a plurality of Bi-LSTM layers, an activation layer, a repeat-input layer, an attention mechanism layer and a fully connected Dense layer; and the long short-term memory module and the convolution module are arranged in parallel;
inputting the text feature vectors in the first text feature vector set into a text feature learning model for operation to obtain first text features;
performing feature fusion on the first voice feature and the first text feature to obtain a first fused feature vector, inputting the first fused feature vector into a multilayer deep neural network model for training, and outputting the emotion classification corresponding to the first original voice signal;
the multilayer deep neural network model comprises a plurality of Dense layers and a Softmax layer, wherein each node of a Dense layer is connected to all nodes of the previous layer, and the Softmax layer outputs the probabilities of the emotion classifications.
Preferably, the number of one-dimensional convolution layers is set to 4, and the convolution kernels of the 4 one-dimensional convolution layers are the same.
Preferably, the Bi-LSTM layers of the long short-term memory module are set to 256 units each.
Preferably, the multilayer deep neural network model has 3 Dense layers, whose sizes are set to 1024, 512 and 4 in sequence.
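Under the preferred configuration above (Dense sizes 1024, 512 and 4), the fusion-and-classification head can be sketched in plain numpy. This is an illustration with random untrained weights, not the patent's trained model; `init_params` and its seed are assumptions added for the sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    # Numerically stable softmax over the logits.
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(sizes, seed=0):
    # Random weights for illustration only; the patent trains these layers.
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def classify(speech_feat, text_feat, params):
    # Feature-level fusion: concatenate the two modality vectors, then
    # pass through Dense(1024) -> Dense(512) -> Dense(4) -> Softmax.
    x = np.concatenate([speech_feat, text_feat])
    for w, b in params[:-1]:
        x = relu(x @ w + b)
    w, b = params[-1]
    return softmax(x @ w + b)
```

The 4-way Softmax output matches the final Dense size of 4, i.e. four emotion classes, each assigned a probability.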
In a second aspect, a computer device is provided, comprising one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of the first aspect as described above.
In a third aspect, a storage medium is provided storing a computer program which, when executed by a processor, performs the method according to the first aspect.
The invention has the advantages that:
1. Dual-channel speech emotion recognition with a fused attention mechanism is performed by arranging the long short-term memory module and the convolution module in parallel, so that both local and global speech feature information are well extracted and feature extraction is more complete;
2. The speech and text features are fused at the feature layer, so that the feature information required for the final decision is retained to the greatest extent, and the output emotion classification probabilities are more accurate.
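The attention mechanism in advantage 1 is typically soft attention pooling over the Bi-LSTM's per-timestep outputs. A minimal numpy sketch follows; the `tanh` scoring function is an assumption, since the patent does not state the exact form:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(h, w, b):
    # h: (T, d) matrix of Bi-LSTM outputs, one row per timestep.
    # Score each timestep, normalize the scores with softmax, and
    # return the weighted sum as a fixed-length utterance vector.
    scores = np.tanh(h @ w + b)   # (T,) one score per timestep
    alpha = softmax(scores)       # attention weights, sum to 1
    return alpha @ h              # (d,) pooled representation
```

With a zero scoring vector the weights are uniform and the pooling reduces to a plain mean over timesteps; training the scoring parameters lets the model emphasize emotionally salient frames instead.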
Drawings
Fig. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Those skilled in the art will be able to implement the invention based on these teachings. Moreover, the embodiments described below are only some, not all, embodiments of the present invention; all other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Example: this embodiment provides a multi-modal emotion recognition method; as shown in Fig. 1, the method includes:
acquiring a first original voice signal;
preprocessing the first original voice signal and then extracting speech emotion features to obtain a first voice feature vector set, wherein the first voice feature vector set comprises the zero-crossing rate, Mel-frequency cepstral coefficients, spectral rolloff, spectral centroid and chroma vector of the first original voice signal;
converting the first original voice signal into a text to obtain a first text;
performing text feature extraction on the first text by the GloVe word embedding method to obtain a first text feature vector set corresponding to the first text;
feeding the first original voice signal and all voice feature vectors in the first voice feature vector set into a voice feature learning model for operation to obtain a first voice feature; the voice feature learning model comprises a convolution module, a long short-term memory module and a local fusion module, wherein the convolution module comprises a plurality of one-dimensional convolution layers distinguished by their number of filters, a max pooling layer, a global average pooling layer and a fully connected Dense layer; the long short-term memory module comprises a plurality of Bi-LSTM layers, an activation layer, a repeat-input layer, an attention mechanism layer and a fully connected Dense layer; and the long short-term memory module and the convolution module are arranged in parallel;
inputting the text feature vectors in the first text feature vector set into a text feature learning model for operation to obtain first text features;
performing feature fusion on the first voice feature and the first text feature to obtain a first fused feature vector, inputting the first fused feature vector into a multilayer deep neural network model for training, and outputting the emotion classification corresponding to the first original voice signal;
the multilayer deep neural network model comprises a plurality of Dense layers and a Softmax layer, wherein each node of the Dense layer is connected with all nodes of the previous layer, and the Softmax layer is used for outputting the probability of emotion classification.
Furthermore, the number of one-dimensional convolution layers is set to 4, and the convolution kernels of the 4 one-dimensional convolution layers are the same.
Furthermore, the Bi-LSTM layers of the long short-term memory module are set to 256 units each.
Further, the multilayer deep neural network model has 3 Dense layers, whose sizes are set to 1024, 512 and 4 in sequence.
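The speech emotion features named at the start of this embodiment (zero-crossing rate, spectral centroid, spectral rolloff) can be computed from first principles. Below is a minimal numpy sketch, not the patent's own code; in practice a library such as librosa also provides the MFCC and chroma features:

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose sign differs.
    s = np.signbit(frame)
    return float(np.mean(s[:-1] != s[1:]))

def spectral_centroid(frame, sr):
    # Magnitude-weighted mean frequency of the frame's spectrum.
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))

def spectral_rolloff(frame, sr, pct=0.85):
    # Frequency below which pct of the cumulative spectral magnitude lies.
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    cum = np.cumsum(mag)
    idx = int(np.searchsorted(cum, pct * cum[-1]))
    return float(freqs[min(idx, len(freqs) - 1)])
```

For a pure 1 kHz tone sampled at 16 kHz, the zero-crossing rate is about 0.125 (two crossings per 16-sample period) and both spectral measures sit at roughly 1 kHz, which is a quick sanity check for an implementation.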
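The GloVe word-embedding step of the method amounts to a lookup table built from a pretrained vector file. An illustrative sketch follows; the file path and embedding dimension are assumptions, and unknown words are simply mapped to zero vectors:

```python
import numpy as np

def load_glove(path):
    # Parse a GloVe text file: one token per line followed by its vector.
    # (The path is hypothetical; pretrained files come from the GloVe project.)
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_text(tokens, vectors, dim=100):
    # Map each token to its GloVe vector; out-of-vocabulary words get zeros.
    return np.stack([vectors.get(t, np.zeros(dim, dtype=np.float32))
                     for t in tokens])
```

The resulting (tokens x dim) matrix is the "first text feature vector set" that feeds the text feature learning model.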
Fig. 2 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
As shown in Fig. 2, as yet another embodiment of the present invention, there is provided a computer apparatus 100 including one or more central processing units (CPUs) 101 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 102 or a program loaded from a storage section 108 into a random-access memory (RAM) 103. The RAM 103 also stores various programs and data necessary for the operation of the apparatus 100. The CPU 101, ROM 102 and RAM 103 are connected to each other via a bus 104. An input/output (I/O) interface 105 is also connected to the bus 104.
The following components are connected to the I/O interface 105: an input section 106 including a keyboard, a mouse and the like; an output section 107 including a display such as a cathode-ray tube (CRT) or liquid-crystal display (LCD) and a speaker; a storage section 108 including a hard disk and the like; and a communication section 109 including a network interface card such as a LAN card or a modem. The communication section 109 performs communication processing via a network such as the Internet. A drive 110 is also connected to the I/O interface 105 as necessary. A removable medium 111 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 110 as necessary, so that a computer program read out therefrom can be installed into the storage section 108.
In particular, according to embodiments disclosed herein, the method described in embodiment 1 above may be implemented as a computer software program. For example, embodiments disclosed herein include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method described in any of the embodiments above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 109, and/or installed from the removable medium 111.
As yet another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus of the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described herein.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor, for example, each of the described units may be a software program provided in a computer or a mobile intelligent device, or may be a separately configured hardware device. Wherein the designation of a unit or module does not in some way constitute a limitation of the unit or module itself.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the present application. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
Claims (6)
1. A multi-modal emotion recognition method, the method comprising:
acquiring a first original voice signal;
preprocessing the first original voice signal and then extracting speech emotion features to obtain a first voice feature vector set, wherein the first voice feature vector set comprises the zero-crossing rate, Mel-frequency cepstral coefficients, spectral rolloff, spectral centroid and chroma vector of the first original voice signal;
converting the first original voice signal into a text to obtain a first text;
performing text feature extraction on the first text by the GloVe word embedding method to obtain a first text feature vector set corresponding to the first text;
feeding the first original voice signal and all voice feature vectors in the first voice feature vector set into a voice feature learning model for operation to obtain a first voice feature; the voice feature learning model comprises a convolution module, a long short-term memory module and a local fusion module, wherein the convolution module comprises a plurality of one-dimensional convolution layers distinguished by their number of filters, a max pooling layer, a global average pooling layer and a fully connected Dense layer; the long short-term memory module comprises a plurality of Bi-LSTM layers, an activation layer, a repeat-input layer, an attention mechanism layer and a fully connected Dense layer; and the long short-term memory module and the convolution module are arranged in parallel;
inputting the text feature vectors in the first text feature vector set into a text feature learning model for operation to obtain first text features;
performing feature fusion on the first voice feature and the first text feature to obtain a first fused feature vector, inputting the first fused feature vector into a multilayer deep neural network model for training, and outputting the emotion classification corresponding to the first original voice signal;
the multilayer deep neural network model comprises a plurality of Dense layers and a Softmax layer, wherein each node of the Dense layer is connected with all nodes of the previous layer, and the Softmax layer is used for outputting the probability of emotion classification.
2. The method of claim 1, wherein the number of one-dimensional convolution layers is set to 4, and the convolution kernels of the 4 one-dimensional convolution layers are the same.
3. The method of claim 1, wherein the Bi-LSTM layers of the long short-term memory module are set to 256 units each.
4. The method of claim 1, wherein the multilayer deep neural network model has 3 Dense layers, whose sizes are set to 1024, 512 and 4 in sequence.
5. A computer device, characterized by one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method recited in any of claims 1-4.
6. A storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211233048.8A CN115512722A (en) | 2022-10-10 | 2022-10-10 | Multi-mode emotion recognition method, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211233048.8A CN115512722A (en) | 2022-10-10 | 2022-10-10 | Multi-mode emotion recognition method, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115512722A true CN115512722A (en) | 2022-12-23 |
Family
ID=84508043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211233048.8A Pending CN115512722A (en) | 2022-10-10 | 2022-10-10 | Multi-mode emotion recognition method, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115512722A (en) |
-
2022
- 2022-10-10 CN CN202211233048.8A patent/CN115512722A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113643695A (en) * | 2021-09-08 | 2021-11-12 | 浙江力石科技股份有限公司 | Dialect accent mandarin voice recognition optimization method and system |
CN113643695B (en) * | 2021-09-08 | 2024-03-08 | 浙江力石科技股份有限公司 | Method and system for optimizing voice recognition of dialect accent mandarin |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112668671B (en) | Method and device for acquiring pre-training model | |
CN108597541B (en) | Speech emotion recognition method and system for enhancing anger and happiness recognition | |
CN110309287B (en) | Retrieval type chatting dialogue scoring method for modeling dialogue turn information | |
CN111984766B (en) | Missing semantic completion method and device | |
WO2020261234A1 (en) | System and method for sequence labeling using hierarchical capsule based neural network | |
CN110163181B (en) | Sign language identification method and device | |
CN110633577B (en) | Text desensitization method and device | |
CN110334110A (en) | Natural language classification method, device, computer equipment and storage medium | |
CN110019758B (en) | Core element extraction method and device and electronic equipment | |
CN110377905A (en) | Semantic expressiveness processing method and processing device, computer equipment and the readable medium of sentence | |
CN109299264A (en) | File classification method, device, computer equipment and storage medium | |
CN114127849A (en) | Speech emotion recognition method and device | |
CN111597341B (en) | Document-level relation extraction method, device, equipment and storage medium | |
CN113094478B (en) | Expression reply method, device, equipment and storage medium | |
CN113553412B (en) | Question-answering processing method, question-answering processing device, electronic equipment and storage medium | |
CN114942984A (en) | Visual scene text fusion model pre-training and image-text retrieval method and device | |
CN107274903A (en) | Text handling method and device, the device for text-processing | |
CN111274412A (en) | Information extraction method, information extraction model training device and storage medium | |
CN113836268A (en) | Document understanding method and device, electronic equipment and medium | |
CN115512722A (en) | Multi-mode emotion recognition method, equipment and storage medium | |
CN112632248A (en) | Question answering method, device, computer equipment and storage medium | |
CN109829040A (en) | A kind of Intelligent dialogue method and device | |
CN112307754A (en) | Statement acquisition method and device | |
CN113095072A (en) | Text processing method and device | |
CN114969195B (en) | Dialogue content mining method and dialogue content evaluation model generation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||