CN115512722A - Multi-mode emotion recognition method, equipment and storage medium - Google Patents

Multi-mode emotion recognition method, equipment and storage medium

Info

Publication number
CN115512722A
Authority
CN
China
Prior art keywords: text, layer, layers, feature, voice
Prior art date
Legal status: Pending
Application number
CN202211233048.8A
Other languages
Chinese (zh)
Inventor
吴倩文
陈海江
张良友
Current Assignee
Zhejiang Lishi Technology Co Ltd
Original Assignee
Zhejiang Lishi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lishi Technology Co Ltd
Priority to CN202211233048.8A
Publication of CN115512722A
Legal status: Pending


Classifications

    • G  PHYSICS
    • G10  MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L  SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00  Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48  Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51  Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63  Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00  Computing arrangements based on biological models
    • G06N 3/02  Neural networks
    • G06N 3/08  Learning methods
    • G10L 15/00  Speech recognition
    • G10L 15/02  Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/26  Speech to text systems
    • G10L 25/27  Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30  Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The application discloses a multi-modal emotion recognition method, a device and a storage medium, wherein multi-modal feature fusion is performed by fully utilizing voice modal information and text modal information, and dual-channel speech emotion recognition with a fused attention mechanism is performed by a long short-term memory module and a convolution module arranged in parallel, so that feature extraction is more complete; the voice and text features are fused at the feature layer, so that the feature information required for the final decision is retained to the maximum extent and the probabilities of the output emotion classes are more accurate.

Description

Multi-mode emotion recognition method, equipment and storage medium
Technical Field
The invention belongs to the technical field of emotion recognition, and particularly relates to a multi-mode emotion recognition method, equipment and a storage medium.
Background
Emotion recognition is a further basis of human-computer interaction in artificial intelligence and has always influenced its development. However, humans express emotion through several modes: the information carried by single-mode emotion information is limited and has certain shortcomings, and, in addition, when the relevant emotion information is collected it is easily disturbed and influenced by various external factors. Therefore, multi-modal fusion is often employed to recognize emotions.
With the rapid growth of information, the great variety of information easily causes information overload, so that people cannot efficiently identify useful information, and it becomes difficult for the information to support their own decisions.
Disclosure of Invention
In order to solve, or partially solve, the above technical problems, the invention provides a multi-modal emotion recognition method, a device and a storage medium, in which multi-modal feature fusion is performed by fully utilizing voice modal information and text modal information, and dual-channel speech emotion recognition with a fused attention mechanism is performed by a long short-term memory module and a convolution module arranged in parallel. The specific technical scheme is as follows:
a method of multimodal emotion recognition, the method comprising:
acquiring a first original voice signal;
after preprocessing the first original voice signal, extracting voice emotion features to obtain a first voice feature vector set, wherein the first voice feature vector set comprises the zero-crossing rate, Mel-frequency cepstral coefficients, spectral roll-off, spectral centroid and chroma vector of the first original voice signal;
converting the first original voice signal into a text to obtain a first text;
performing text feature extraction on the first text by a GloVe word embedding method to obtain a first text feature vector set corresponding to the first text;
putting the first original voice signal and all the voice feature vectors in the first voice feature vector set into a voice feature learning model for operation to obtain a first voice feature; the voice feature learning model comprises a convolution module, a long short-term memory module and a local fusion module, wherein the convolution module comprises a plurality of one-dimensional convolution layers, a max-pooling layer, a global average pooling layer and a fully connected Dense layer, the one-dimensional convolution layers being distinguished by their numbers of filters; the long short-term memory module comprises a plurality of Bi-LSTM layers, an activation layer, a repeated input layer, an attention mechanism layer and a fully connected Dense layer; and the long short-term memory module and the convolution module are arranged in parallel;
inputting the text feature vectors in the first text feature vector set into a text feature learning model for operation to obtain first text features;
performing feature fusion on the first voice feature and the first text feature to obtain a first fusion feature vector, inputting the first fusion feature vector into a multilayer deep neural network model for training, and outputting the emotion classification corresponding to the first original voice signal;
the multilayer deep neural network model comprises a plurality of Dense layers and a Softmax layer, wherein each node of a Dense layer is connected to all nodes of the previous layer, and the Softmax layer is used for outputting the probability of each emotion class.
Preferably, the number of one-dimensional convolutional layers is set to 4, and the convolution kernels of the 4 one-dimensional convolutional layers are the same.
Preferably, the Bi-LSTM layers of the long short-term memory module are set to 256.
Preferably, the multilayer deep neural network model is provided with 3 Dense layers, whose sizes are set to 1024, 512 and 4 in sequence.
In a second aspect, a computer device is provided, comprising one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of the first aspect as described above.
In a third aspect, a storage medium is provided storing a computer program which, when executed by a processor, performs the method according to the first aspect.
The invention has the advantages that:
1. A long short-term memory module and a convolution module arranged in parallel perform dual-channel speech emotion recognition with a fused attention mechanism, so that both local and global speech feature information is extracted well and feature extraction is more complete;
2. The speech and text features are fused at the feature layer, so that the feature information required for the final decision is retained to the maximum extent, and the probabilities of the output emotion classes are more accurate.
Drawings
Fig. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to specific embodiments and the attached drawings. Those skilled in the art will be able to implement the invention based on these teachings. Moreover, the embodiments of the present invention described in the following description are generally only some embodiments of the present invention, and not all embodiments. Therefore, all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without making creative efforts shall fall within the protection scope of the present invention.
Example: This embodiment provides a multi-modal emotion recognition method which, as shown in Fig. 1, includes:
acquiring a first original voice signal;
after preprocessing the first original voice signal, extracting voice emotion features to obtain a first voice feature vector set, wherein the first voice feature vector set comprises the zero-crossing rate, Mel-frequency cepstral coefficients, spectral roll-off, spectral centroid and chroma vector of the first original voice signal;
converting the first original voice signal into a text to obtain a first text;
performing text feature extraction on the first text by a GloVe word embedding method to obtain a first text feature vector set corresponding to the first text;
putting the first original voice signal and all the voice feature vectors in the first voice feature vector set into a voice feature learning model for operation to obtain a first voice feature; the voice feature learning model comprises a convolution module, a long short-term memory module and a local fusion module, wherein the convolution module comprises a plurality of one-dimensional convolution layers, a max-pooling layer, a global average pooling layer and a fully connected Dense layer, the one-dimensional convolution layers being distinguished by their numbers of filters; the long short-term memory module comprises a plurality of Bi-LSTM layers, an activation layer, a repeated input layer, an attention mechanism layer and a fully connected Dense layer; and the long short-term memory module and the convolution module are arranged in parallel;
inputting the text feature vectors in the first text feature vector set into a text feature learning model for operation to obtain first text features;
performing feature fusion on the first voice feature and the first text feature to obtain a first fusion feature vector, inputting the first fusion feature vector into a multilayer deep neural network model for training, and outputting the emotion classification corresponding to the first original voice signal;
the multilayer deep neural network model comprises a plurality of Dense layers and a Softmax layer, wherein each node of a Dense layer is connected to all nodes of the previous layer, and the Softmax layer is used for outputting the probability of each emotion class. Illustrative, non-limiting sketches of these steps are given after the preferred configurations below.
Furthermore, the number of one-dimensional convolutional layers is set to 4, and the convolution kernels of the 4 one-dimensional convolutional layers are the same.
Furthermore, the Bi-LSTM layers of the long short-term memory module are set to 256.
Further, the multilayer deep neural network model is provided with 3 Dense layers, whose sizes are set to 1024, 512 and 4 in sequence.
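Purely as a non-limiting illustration of the feature-extraction step described above, the listed acoustic features could be computed roughly as in the following sketch. It assumes the librosa library, a 16 kHz sampling rate, silence trimming as the preprocessing, utterance-level averaging of the frame-wise features, and reads "spectrum attenuation" as spectral roll-off; none of these choices are prescribed by this application.

```python
# Non-limiting sketch: computing the acoustic features named in the embodiment
# with librosa. Frame-wise features are averaged into one utterance-level vector.
import numpy as np
import librosa

def extract_first_voice_feature_set(wav_path: str, sr: int = 16000) -> np.ndarray:
    signal, sr = librosa.load(wav_path, sr=sr)     # first original voice signal
    signal, _ = librosa.effects.trim(signal)       # assumed preprocessing: trim leading/trailing silence

    zcr = librosa.feature.zero_crossing_rate(signal)                 # zero-crossing rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)          # Mel-frequency cepstral coefficients
    rolloff = librosa.feature.spectral_rolloff(y=signal, sr=sr)      # "spectrum attenuation" read as roll-off
    centroid = librosa.feature.spectral_centroid(y=signal, sr=sr)    # spectral centroid
    chroma = librosa.feature.chroma_stft(y=signal, sr=sr)            # chroma vector

    # Average over frames so that every utterance yields a fixed-length vector.
    return np.concatenate([f.mean(axis=1) for f in (zcr, mfcc, rolloff, centroid, chroma)])
```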
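Similarly, the GloVe-based text feature extraction could be sketched as follows, assuming a locally available pre-trained vector file (hypothetically named glove.6B.100d.txt), naive whitespace tokenisation and zero-padding to a fixed length; these are illustrative assumptions only.

```python
# Non-limiting sketch: turning the first text into a first text feature vector set
# by looking up pre-trained GloVe vectors. File name and padding length are assumptions.
import numpy as np

def load_glove(path: str = "glove.6B.100d.txt") -> dict:
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def embed_first_text(text: str, glove: dict, max_len: int = 50, dim: int = 100) -> np.ndarray:
    tokens = text.lower().split()                     # naive whitespace tokenisation
    out = np.zeros((max_len, dim), dtype=np.float32)  # zero-padded sequence of word vectors
    for i, token in enumerate(tokens[:max_len]):
        if token in glove:                            # unknown words stay as zero vectors
            out[i] = glove[token]
    return out
```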
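The dual-channel voice feature learning model, with the convolution module and the long short-term memory module arranged in parallel, could be assembled roughly as in the following Keras sketch. The filter counts, kernel size, branch widths, the use of a generic Keras Attention layer for the attention mechanism layer and the reading of "256" as the number of Bi-LSTM hidden units are all assumptions, and the repeated input layer of the module is not reproduced here.

```python
# Non-limiting sketch (TensorFlow/Keras) of the voice feature learning model:
# a convolution branch and a Bi-LSTM/attention branch arranged in parallel,
# locally fused by concatenation. Sizes and filter counts are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_voice_feature_model(seq_len: int, feat_dim: int) -> tf.keras.Model:
    inputs = layers.Input(shape=(seq_len, feat_dim))

    # Convolution module: one-dimensional convolution layers distinguished by
    # filter count, max pooling, global average pooling and a Dense layer.
    x = inputs
    for filters in (64, 128, 256, 512):               # hypothetical filter counts
        x = layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.GlobalAveragePooling1D()(x)
    conv_branch = layers.Dense(128, activation="relu")(x)

    # Long short-term memory module: Bi-LSTM layers, activation layer,
    # attention mechanism layer and a Dense layer, in parallel with the above.
    y = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)
    y = layers.Activation("tanh")(y)
    y = layers.Attention()([y, y])                    # self-attention over Bi-LSTM outputs
    y = layers.GlobalAveragePooling1D()(y)
    lstm_branch = layers.Dense(128, activation="relu")(y)

    # Local fusion module: combine the two branches into the first voice feature.
    first_voice_feature = layers.Concatenate()([conv_branch, lstm_branch])
    return tf.keras.Model(inputs, first_voice_feature)
```

The intent mirrored here is that the convolution branch captures local patterns while the Bi-LSTM/attention branch captures global temporal context, which is the stated motivation for arranging the two modules in parallel.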
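Finally, the feature-level fusion and the multilayer deep neural network with Dense layers of 1024, 512 and 4 units followed by Softmax could be sketched as below; the input widths, the optimizer and the reading of the final 4-unit layer as four emotion classes are assumptions.

```python
# Non-limiting sketch: feature-level fusion of the first voice feature and the
# first text feature, followed by Dense layers of 1024, 512 and 4 units and Softmax.
import tensorflow as tf
from tensorflow.keras import layers

def build_fusion_classifier(voice_dim: int, text_dim: int) -> tf.keras.Model:
    voice_in = layers.Input(shape=(voice_dim,), name="first_voice_feature")
    text_in = layers.Input(shape=(text_dim,), name="first_text_feature")

    fused = layers.Concatenate(name="first_fusion_feature_vector")([voice_in, text_in])

    x = layers.Dense(1024, activation="relu")(fused)
    x = layers.Dense(512, activation="relu")(x)
    logits = layers.Dense(4)(x)                       # 4 units per the preferred setting
    probs = layers.Softmax(name="emotion_class_probabilities")(logits)

    model = tf.keras.Model([voice_in, text_in], probs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

A call such as build_fusion_classifier(256, 128) would return a model whose Softmax output gives the probability of each emotion class for a fused voice/text feature pair; the two input widths are placeholders for whatever the two feature learning models actually produce.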
The method also has the advantage that a user's emotional tendency can be mined by automatically mining the user's comment information on travel platforms and websites, so that an emotional portrait of the user is established and scenic spots are recommended to the user by an improved recommendation algorithm; comment viewpoint data are extracted by combining part-of-speech rules with special sentence patterns, and an ASUM topic-sentiment hybrid model is incorporated, so that the recommendations of the recommendation algorithm are more accurate.
Fig. 2 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
As shown in Fig. 2, as still another embodiment of the present invention, there is provided a computer apparatus 100 including one or more Central Processing Units (CPUs) 101 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 102 or a program loaded from a storage section 108 into a Random Access Memory (RAM) 103. In the RAM 103, various programs and data necessary for the operation of the apparatus 100 are also stored. The CPU 101, ROM 102 and RAM 103 are connected to each other via a bus 104. An input/output (I/O) interface 105 is also connected to the bus 104.
The following components are connected to the I/O interface 105: an input portion 106 including a keyboard, a mouse, and the like; an output section 107 including a display such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) and a speaker; a storage section 108 including a hard disk and the like; and a communication section 109 including a network interface card such as a LAN card or a modem. The communication section 109 performs communication processing via a network such as the Internet. A drive 110 is also connected to the I/O interface 105 as necessary. A removable medium 111 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 110 as necessary, so that a computer program read out from it is installed into the storage section 108 as needed.
In particular, according to embodiments disclosed herein, the method described in embodiment 1 above may be implemented as a computer software program. For example, embodiments disclosed herein include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method described in any of the embodiments above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 109, and/or installed from the removable medium 111.
As yet another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus of the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described herein.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor, for example, each of the described units may be a software program provided in a computer or a mobile intelligent device, or may be a separately configured hardware device. Wherein the designation of a unit or module does not in some way constitute a limitation of the unit or module itself.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the present application. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (6)

1. A multi-modal emotion recognition method, the method comprising:
acquiring a first original voice signal;
after preprocessing the first original voice signal, extracting voice emotion features to obtain a first voice feature vector set, wherein the first voice feature vector set comprises the zero-crossing rate, Mel-frequency cepstral coefficients, spectral roll-off, spectral centroid and chroma vector of the first original voice signal;
converting the first original voice signal into a text to obtain a first text;
performing text feature extraction on the first text by a GloVe word embedding method to obtain a first text feature vector set corresponding to the first text;
putting the first original voice signal and all the voice feature vectors in the first voice feature vector set into a voice feature learning model for operation to obtain a first voice feature; the voice feature learning model comprises a convolution module, a long short-term memory module and a local fusion module, wherein the convolution module comprises a plurality of one-dimensional convolution layers, a max-pooling layer, a global average pooling layer and a fully connected Dense layer, the one-dimensional convolution layers being distinguished by their numbers of filters; the long short-term memory module comprises a plurality of Bi-LSTM layers, an activation layer, a repeated input layer, an attention mechanism layer and a fully connected Dense layer; and the long short-term memory module and the convolution module are arranged in parallel;
inputting the text feature vectors in the first text feature vector set into a text feature learning model for operation to obtain first text features;
performing feature fusion on the first voice feature and the first text feature to obtain a first fusion feature vector, inputting the first fusion feature vector into a multilayer deep neural network model for training, and outputting the emotion classification corresponding to the first original voice signal;
the multilayer deep neural network model comprises a plurality of Dense layers and a Softmax layer, wherein each node of a Dense layer is connected to all nodes of the previous layer, and the Softmax layer is used for outputting the probability of each emotion class.
2. The method of claim 1, wherein the number of one-dimensional convolutional layers is set to 4, and the convolution kernels of the 4 one-dimensional convolutional layers are the same.
3. The method of claim 1, wherein the Bi-LSTM layers of the long short-term memory module are set to 256.
4. The method according to claim 1, wherein the multilayer deep neural network model is provided with 3 Dense layers, whose sizes are set to 1024, 512 and 4 in sequence.
5. A computer device, characterized by one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method recited in any of claims 1-4.
6. A storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 4.
CN202211233048.8A 2022-10-10 2022-10-10 Multi-mode emotion recognition method, equipment and storage medium Pending CN115512722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211233048.8A CN115512722A (en) 2022-10-10 2022-10-10 Multi-mode emotion recognition method, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211233048.8A CN115512722A (en) 2022-10-10 2022-10-10 Multi-mode emotion recognition method, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115512722A (en) 2022-12-23

Family

ID=84508043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211233048.8A Pending CN115512722A (en) 2022-10-10 2022-10-10 Multi-mode emotion recognition method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115512722A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643695A (en) * 2021-09-08 2021-11-12 浙江力石科技股份有限公司 Dialect accent mandarin voice recognition optimization method and system
CN113643695B (en) * 2021-09-08 2024-03-08 浙江力石科技股份有限公司 Method and system for optimizing voice recognition of dialect accent mandarin

Similar Documents

Publication Publication Date Title
CN112668671B (en) Method and device for acquiring pre-training model
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN110309287B (en) Retrieval type chatting dialogue scoring method for modeling dialogue turn information
CN111984766B (en) Missing semantic completion method and device
WO2020261234A1 (en) System and method for sequence labeling using hierarchical capsule based neural network
CN110163181B (en) Sign language identification method and device
CN110633577B (en) Text desensitization method and device
CN110334110A (en) Natural language classification method, device, computer equipment and storage medium
CN110019758B (en) Core element extraction method and device and electronic equipment
CN110377905A (en) Semantic expressiveness processing method and processing device, computer equipment and the readable medium of sentence
CN109299264A (en) File classification method, device, computer equipment and storage medium
CN114127849A (en) Speech emotion recognition method and device
CN111597341B (en) Document-level relation extraction method, device, equipment and storage medium
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN114942984A (en) Visual scene text fusion model pre-training and image-text retrieval method and device
CN107274903A (en) Text handling method and device, the device for text-processing
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
CN113836268A (en) Document understanding method and device, electronic equipment and medium
CN115512722A (en) Multi-mode emotion recognition method, equipment and storage medium
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN109829040A (en) A kind of Intelligent dialogue method and device
CN112307754A (en) Statement acquisition method and device
CN113095072A (en) Text processing method and device
CN114969195B (en) Dialogue content mining method and dialogue content evaluation model generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination