CN114595692A - Emotion recognition method, system and terminal equipment - Google Patents

Emotion recognition method, system and terminal equipment

Info

Publication number
CN114595692A
CN114595692A (application CN202011427900.6A)
Authority
CN
China
Prior art keywords
information
emotion
voice
emotion recognition
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011427900.6A
Other languages
Chinese (zh)
Inventor
曲道奎
梁亮
张悦
杜振军
王海鹏
杜威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Siasun Industrial Software Research Institute Co Ltd
Original Assignee
Shandong Siasun Industrial Software Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Siasun Industrial Software Research Institute Co Ltd filed Critical Shandong Siasun Industrial Software Research Institute Co Ltd
Priority to CN202011427900.6A
Publication of CN114595692A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of intelligent recognition and provides an emotion recognition method, an emotion recognition system and a terminal device. The emotion recognition method comprises the following steps: receiving a voice sent by a user, and selecting voice information corresponding to the voice from a pre-established emotion database; extracting text information from the voice information, and storing the text information as a text file; fusing the feature items of the voice information and the feature items of the text information to obtain fused information; and recognizing the emotion of the user according to a pre-trained emotion recognition model and the fused information. Starting from both the language text and the voice intonation, the method analyzes human emotion from multiple angles and extracts features, achieving accurate and practical recognition of human emotion.

Description

Emotion recognition method, system and terminal equipment
Technical Field
The invention relates to the field of intelligent recognition, and in particular to an emotion recognition method, an emotion recognition system, a terminal device and a computer-readable storage medium.
Background
With the development of psychology, neurology and computer science in modern society, emotion recognition technology has made remarkable progress. Emotion recognition combines the fields of speech processing and natural language processing with psychological topics such as ethology and cognition. Human emotion has three outward manifestations: subjective feeling, physiological arousal and behavioral expression. At present, emotion recognition typically draws conclusions from the separate analysis of facial images, speech or semantics. Although some results have been obtained, humans express themselves through a combination of facial expression, language, voice intonation and body movement, so taking only one expression mode as the basis for judging human emotion is highly subjective and one-sided.
Early emotion recognition studies were based on feature modeling: for different behaviors, features are extracted by different methods and then analyzed. For example, facial-expression emotion recognition extracts features from collected images using image processing and transformation techniques, while speech-language emotion recognition uses vocabulary-based methods that build models on lexical resources and mine large amounts of emotional text and keywords. These methods are time-consuming, complex and low in accuracy. With the rapid rise of deep learning, neural network models have gradually been applied in many research fields and can produce more accurate results when processing and analyzing images. However, deep learning requires a large number of features with strong generalization ability, so emotion recognition based on only a single behavioral feature is not well suited to deep learning methods.
Therefore, a new technical solution is needed to solve the above technical problems.
Disclosure of Invention
In view of this, embodiments of the present invention provide an emotion recognition method, system and terminal device, by which human emotion can be accurately recognized.
A first aspect of an embodiment of the present invention provides an emotion recognition method, where the emotion recognition method includes:
receiving voice sent by a user, and selecting voice information corresponding to the voice from a pre-established emotion database;
extracting text information from the voice information, and storing the text information as a text file;
fusing the feature items of the voice information and the feature items of the text information to obtain fused information;
and recognizing the emotion of the user according to a pre-trained emotion recognition model and the fusion information.
Optionally, in another embodiment provided by the present application, the fusing of the feature items of the voice information and the feature items of the text information to obtain fused information includes:
using the voice information and the text information as inputs of a convolutional neural network, and extracting a feature vector of the voice information and a feature vector of the text information respectively;
and fusing the feature vector of the voice information and the feature vector of the text information to obtain a fused vector, and taking the fused vector as the fusion information.
Optionally, in another embodiment provided by the present application, extracting the feature vector of the voice information and the feature vector of the text information respectively includes:
extracting energy and acoustic-wave features from the voice information, and extracting keywords and semantic features from the text information.
Optionally, in another embodiment provided by the present application, the pre-established emotion databases include a CASIA Chinese emotion database and an ACCorpus series Chinese emotion database.
Optionally, in another embodiment provided by the present application, the emotion recognition model is an emotion recognition model obtained through long short-term memory (LSTM) network training.
A second aspect of an embodiment of the present invention provides an emotion recognition system, including:
the receiving module is used for receiving voice sent by a user and selecting voice information corresponding to the voice from a pre-established emotion database;
the extraction module is used for extracting the text information of the voice information and storing the text information as a text file;
the fusion module is used for fusing the feature items of the voice information and the feature items of the text information to obtain fusion information;
and the recognition module is used for recognizing the emotion of the user according to a pre-trained emotion recognition model and the fusion information.
Optionally, in another embodiment provided by the present application, the fusion module is specifically configured to:
use the voice information and the text information as inputs of a convolutional neural network, and extract a feature vector of the voice information and a feature vector of the text information respectively;
and fuse the feature vector of the voice information and the feature vector of the text information to obtain a fused vector, and take the fused vector as the fusion information.
Optionally, in another embodiment provided by the present application, extracting the feature vector of the voice information and the feature vector of the text information respectively includes:
extracting energy and acoustic-wave features from the voice information, and extracting keywords and semantic features from the text information.
A third aspect of embodiments of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the method of any one of the first aspect when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method mentioned in any one of the above first aspects.
Compared with the prior art, the embodiments of the invention have the following beneficial effects: the invention provides a semantics-based emotion recognition method that fuses the semantics of human emotion expression modes. Since human speech is an important behavioral signal reflecting human emotion, emotion recognition based on spoken language best matches human emotion-expression habits. Therefore, starting from both the language text and the voice intonation, the invention analyzes human emotion from multiple angles and extracts features, achieving accurate and practical recognition of human emotion.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained by those skilled in the art based on these drawings without inventive effort.
Fig. 1 is a schematic flowchart of an emotion recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an emotion recognition system according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of an emotion recognition system provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a terminal device according to a third embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In order to illustrate the technical means of the present invention, the following description is given by way of specific examples.
Example one
Fig. 1 is a schematic flow chart of an emotion recognition method provided in an embodiment of the present invention, where the method may include the following steps:
s101: receiving voice sent by a user, and selecting voice information corresponding to the voice from a pre-established emotion database.
The pre-established emotion databases comprise a CASIA Chinese emotion database and an ACCorpus series Chinese emotion database.
In this step, a speech emotion database is collected in advance; the CASIA Chinese emotion database and the ACCorpus series Chinese emotion database are currently popular in China. The CASIA Chinese emotion database consists of 500 different texts recorded by two men and two women and covers six types of emotion: happy, sad, angry, surprised, neutral and frightened. Compared with the CASIA database, the ACCorpus series Chinese emotion database is richer and more representative: its speech sub-database was performed by 25 men and 25 women over 5 types of emotion (neutral, happy, angry, fear and sadness), so the emotions are fuller and easier to distinguish.
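Purely as an illustration of how such a corpus might be organized for the later steps, the following Python sketch builds a list of (wav file, emotion label) pairs from an assumed directory layout in which each emotion has its own sub-folder; the layout, label names and function name are assumptions for illustration, not the actual structure of the CASIA or ACCorpus releases.

    from pathlib import Path
    from typing import List, Tuple

    EMOTIONS = ["happy", "sad", "angry", "surprised", "neutral", "frightened"]

    def load_corpus(root: str) -> List[Tuple[str, int]]:
        # Collect (wav_path, label_index) pairs from <root>/<emotion>/*.wav (assumed layout).
        samples = []
        for label, emotion in enumerate(EMOTIONS):
            for wav in sorted(Path(root, emotion).glob("*.wav")):
                samples.append((str(wav), label))
        return samples

    # Hypothetical usage:
    # train_samples = load_corpus("casia_corpus")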
S102: and extracting the character information of the voice information, and storing the character information in a text file form.
S103: and fusing the characteristic items of the voice information and the characteristic items of the character information to obtain fused information.
Extracting the feature vector of the voice information and the feature vector of the text information respectively includes:
extracting energy and acoustic-wave features from the voice information, and extracting keywords and semantic features from the text information.
Fusing the feature items of the voice information and the feature items of the text information to obtain fused information includes:
using the voice information and the text information as inputs of a convolutional neural network, and extracting a feature vector of the voice information and a feature vector of the text information respectively;
and fusing the feature vector of the voice information and the feature vector of the text information to obtain a fused vector, and taking the fused vector as the fusion information.
Taking one voice sample as an example, in order to fuse the language text with the voice intonation and obtain a final fused semantic feature, the voice is first converted into text and stored as a text file. Several speech-to-text tools can be used for the conversion, for example iFlytek, WeChat, Google Cloud Speech-to-Text or Watson Speech-to-Text, whichever gives the most accurate conversion result. To realize feature fusion, the voice file and the text file are respectively used as inputs of a convolutional neural network (CNN): audio features such as energy and acoustic waveform are extracted from the voice, and semantic features such as keywords and contextual dependencies are extracted from the text. The two groups of features are then fused, providing more complete speech-semantic features to represent the different types of emotion, so that emotion recognition and classification are more accurate and better grounded. The feature fusion method is Canonical Correlation Analysis (CCA). CCA takes the correlated components between two groups of feature vectors as an effective discriminative form, and has the advantage of fusing information while eliminating redundant information in the features. The method establishes a correlation function between the two groups of feature vectors and extracts their correlated features as effective discrimination vectors, which serve as the fused vector.
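As an illustration of the CCA fusion step described above, the following Python sketch fuses two groups of feature vectors with scikit-learn's CCA implementation. The array shapes, the number of retained components and the final concatenation are assumptions for illustration; the patent does not specify them.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Assumed placeholder features: one row per utterance.
    audio_feats = np.random.randn(200, 128)   # CNN features extracted from the voice file
    text_feats = np.random.randn(200, 128)    # CNN features extracted from the text file

    cca = CCA(n_components=32)                # keep 32 maximally correlated components (assumed)
    cca.fit(audio_feats, text_feats)
    audio_c, text_c = cca.transform(audio_feats, text_feats)

    # One possible fused vector: the correlated projections of both modalities side by side.
    fused = np.concatenate([audio_c, text_c], axis=1)   # shape (200, 64)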
S104: recognizing the emotion of the user according to a pre-trained emotion recognition model and the fusion information. The emotion recognition model is obtained through long short-term memory (LSTM) network training.
In this step, after CCA extracts the fused features, a Long Short-Term Memory (LSTM) network is used to train the emotion recognition model. LSTM networks handle long-term dependencies across features very well: by default the network remembers longer history information, which is implemented with "memory gates". Meanwhile, a "forget gate" in the network determines which information to discard from each memory unit (also called the "cell state"), so that only key information is remembered, preventing parameter redundancy and overfitting caused by excessive information. Finally, the emotion is classified using the softmax function.
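A minimal sketch of such an LSTM-plus-softmax classifier is shown below using Keras; the layer sizes, dropout rate and six-class output are assumptions chosen to match the CASIA emotion categories, not values taken from the patent.

    import tensorflow as tf

    n_classes = 6                      # happy, sad, angry, surprised, neutral, frightened
    timesteps, feat_dim = 50, 64       # assumed shape of the fused feature sequences

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, feat_dim)),
        tf.keras.layers.LSTM(128, dropout=0.2),   # input/forget/output gates are built into the layer
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(fused_sequences, labels, epochs=20, validation_split=0.2)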
The following description is given with reference to specific examples:
collecting a speech emotion database: the CASIA Chinese emotion database and the ACCorpus series Chinese emotion database are used as training and verification data.
Voice to text. Using speech-to-text software such as iFlytek, WeChat, Google Cloud Speech-to-Text or Watson Speech-to-Text, convert the speech data into text data and save it.
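As one example of this step, the sketch below transcribes a Mandarin WAV file with the Google Cloud Speech-to-Text client library (one of the tools named above) and saves the result; the file names, sample rate and language code are assumptions for illustration.

    from google.cloud import speech

    def transcribe(wav_path: str, language: str = "zh-CN") -> str:
        # Transcribe a mono 16 kHz LINEAR16 WAV file (assumed format).
        client = speech.SpeechClient()
        with open(wav_path, "rb") as f:
            audio = speech.RecognitionAudio(content=f.read())
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code=language,
        )
        response = client.recognize(config=config, audio=audio)
        return "".join(result.alternatives[0].transcript for result in response.results)

    # Hypothetical usage: save the transcript next to the recording.
    # with open("utterance.txt", "w", encoding="utf-8") as f:
    #     f.write(transcribe("utterance.wav"))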
Data preprocessing. Preprocessing of the text data: removing stop words, special symbols and numbers, stemming, and vectorizing the words. Preprocessing of the voice data: denoising, pre-emphasis, framing, windowing and endpoint detection, followed by extracting MFCC (Mel-frequency cepstral coefficient) feature parameters as the input of a convolutional neural network.
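The voice-side preprocessing could look like the following librosa sketch, which applies rough endpoint trimming, pre-emphasis and MFCC extraction; the sample rate, frame sizes and 13 coefficients are assumed values, and the text-side cleaning (stop words, stemming, vectorization) is not shown.

    import numpy as np
    import librosa

    def mfcc_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
        y, sr = librosa.load(wav_path, sr=16000)        # resample to an assumed 16 kHz
        y, _ = librosa.effects.trim(y, top_db=25)       # rough endpoint detection (silence trimming)
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])      # pre-emphasis filter
        # Framing and windowing happen inside the STFT used by librosa's MFCC routine.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160)   # 25 ms frames, 10 ms hop
        return mfcc.T                                   # (frames, n_mfcc), ready as CNN input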
Feature extraction and fusion. A convolutional neural network (CNN) model is used to extract audio features from the voice data and semantic features from the text data respectively, and canonical correlation analysis is applied to the two groups of extracted features to obtain strongly correlated fused features.
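A possible shape for the two CNN feature extractors is sketched below as a 1-D convolutional branch applied to the MFCC sequence and to a word-embedding sequence; the layer counts, kernel sizes and output dimension are assumptions, and the resulting vectors would then be fused with CCA as sketched earlier.

    import tensorflow as tf

    def build_branch(timesteps: int, feat_dim: int, out_dim: int = 128) -> tf.keras.Model:
        # 1-D CNN mapping a (timesteps, feat_dim) sequence to a fixed-length feature vector.
        inp = tf.keras.layers.Input(shape=(timesteps, feat_dim))
        x = tf.keras.layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inp)
        x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)
        x = tf.keras.layers.Conv1D(128, kernel_size=3, padding="same", activation="relu")(x)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
        out = tf.keras.layers.Dense(out_dim, activation="relu")(x)
        return tf.keras.Model(inp, out)

    audio_branch = build_branch(timesteps=300, feat_dim=13)    # MFCC frames (assumed length)
    text_branch = build_branch(timesteps=50, feat_dim=300)     # word-embedding sequence (assumed)
    # audio_feats = audio_branch.predict(mfcc_batch)
    # text_feats = text_branch.predict(embedding_batch)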
Modeling and recognition. A long short-term memory (LSTM) network model is trained for emotion recognition; the input of the network is the fused feature, and the model classifies with a softmax function to recognize the different emotions.
The invention provides a semantics-based emotion recognition method that fuses the semantics of human emotion expression modes. Since human speech is an important behavioral signal reflecting human emotion, emotion recognition based on spoken language best matches human emotion-expression habits. Therefore, starting from both the language text and the voice intonation, the invention analyzes human emotion from multiple angles and extracts features, achieving accurate and practical recognition of human emotion.
Example two
Fig. 2 is a schematic structural diagram of an emotion recognition system according to a second embodiment of the present invention; for convenience of description, only the parts related to this embodiment are shown. Fig. 3 shows a flow diagram of the emotion recognition system provided by the present application.
The emotion recognition system may be a software unit, a hardware unit or a combined software/hardware unit built into a robot, or may be integrated into a computer or other terminal as an independent add-on.
The emotion recognition system includes:
the receiving module 21 is configured to receive a voice sent by a user, and select voice information corresponding to the voice from a pre-established emotion database;
the extraction module 22 is configured to extract the text information of the voice information, and store the text information as a text file;
the fusion module 23 is configured to fuse the feature items of the voice information and the feature items of the text information to obtain fusion information;
and the recognition module 24 is used for recognizing the emotion of the user according to the pre-trained emotion recognition model and the fusion information.
Optionally, in another embodiment provided by the present application, the fusing of the feature items of the voice information and the feature items of the text information to obtain fused information includes:
using the voice information and the text information as inputs of a convolutional neural network, and extracting a feature vector of the voice information and a feature vector of the text information respectively;
and fusing the feature vector of the voice information and the feature vector of the text information to obtain a fused vector, and taking the fused vector as the fusion information.
Optionally, in another embodiment provided by the present application, extracting the feature vector of the voice information and the feature vector of the text information respectively includes:
extracting energy and acoustic-wave features from the voice information, and extracting keywords and semantic features from the text information.
Optionally, in another embodiment provided by the present application, the pre-established emotion databases include a CASIA Chinese emotion database and an ACCorpus series Chinese emotion database.
Optionally, in another embodiment provided by the present application, the emotion recognition model is an emotion recognition model obtained through long short-term memory (LSTM) network training.
For the working process of the emotion recognition system, reference is made to the implementation process of the emotion recognition method, which is not described here again.
EXAMPLE III
Fig. 4 is a schematic structural diagram of a terminal device according to a third embodiment of the present invention. As shown in fig. 4, the terminal device 4 of this embodiment includes: a processor 40, a memory 41 and a computer program 42, such as an emotion recognition program, stored in the memory 41 and executable on the processor 40. The processor 40, when executing the computer program 42, implements the steps of the first method embodiment, such as steps S101 to S104 shown in fig. 1. Alternatively, the processor 40, when executing the computer program 42, implements the functions of the modules/units in the above-described system embodiment, such as the functions of the modules 21 to 24 shown in fig. 2.
Illustratively, the computer program 42 may be partitioned into one or more modules/units that are stored in the memory 41 and executed by the processor 40 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing certain functions, which are used to describe the execution of the computer program 42 in the terminal device 4. For example, the computer program 42 may be divided into different modules, and the specific functions of each module are as follows:
the system comprises a setting module, a fault detection module and a fault detection module, wherein the setting module is used for setting fault detection contents of the robot, and the fault detection contents comprise an object to be detected, a detection period and a fault condition;
the detection module is used for detecting whether the object to be detected reaches a fault condition according to the detection period to obtain a detection result;
and the recording module is used for recording the running state of the object to be detected in an xml form according to the detection result.
The terminal device 4 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 40, a memory 41. Those skilled in the art will appreciate that fig. 4 is merely an example of a terminal device 4 and does not constitute a limitation of terminal device 4 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.
The processor 40 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 41 may be an internal storage unit of the terminal device 4, such as a hard disk or a memory of the terminal device 4. The memory 41 may also be an external storage device of the terminal device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 4. Further, the memory 41 may also include both an internal storage unit and an external storage device of the terminal device 4. The memory 41 is used for storing the computer program and other programs and data required by the terminal device. The memory 41 may also be used to temporarily store data that has been output or is to be output.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described or recited in any embodiment.
Those of ordinary skill in the art would appreciate that the modules, elements, and/or method steps of the various embodiments described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An emotion recognition method, characterized in that the emotion recognition method includes:
receiving voice sent by a user, and selecting voice information corresponding to the voice from a pre-established emotion database;
extracting text information from the voice information, and storing the text information as a text file;
fusing the feature items of the voice information and the feature items of the text information to obtain fused information;
and recognizing the emotion of the user according to a pre-trained emotion recognition model and the fusion information.
2. The emotion recognition method of claim 1, wherein the fusing of the feature items of the voice information and the feature items of the text information to obtain fused information includes:
using the voice information and the text information as inputs of a convolutional neural network, and extracting a feature vector of the voice information and a feature vector of the text information respectively;
and fusing the feature vector of the voice information and the feature vector of the text information to obtain a fused vector, and taking the fused vector as the fusion information.
3. The emotion recognition method of claim 2, wherein the extracting of the feature vector of the voice information and the feature vector of the text information respectively comprises:
extracting energy and acoustic-wave features from the voice information, and extracting keywords and semantic features from the text information.
4. The emotion recognition method of claim 1, wherein the pre-established emotion databases comprise a CASIA Chinese emotion database and an ACCorpus series Chinese emotion database.
5. The emotion recognition method according to any one of claims 1 to 4, wherein the emotion recognition model is an emotion recognition model obtained through long short-term memory (LSTM) network training.
6. An emotion recognition system, characterized in that the emotion recognition system includes:
the receiving module is used for receiving voice sent by a user and selecting voice information corresponding to the voice from a pre-established emotion database;
the extraction module is used for extracting the text information of the voice information and storing the text information as a text file;
the fusion module is used for fusing the feature items of the voice information and the feature items of the text information to obtain fusion information;
and the recognition module is used for recognizing the emotion of the user according to a pre-trained emotion recognition model and the fusion information.
7. The emotion recognition system of claim 6, wherein the fusion module is specifically configured to:
the voice information and the text information are used as inputs of a convolutional neural network, and a feature vector of the voice information and a feature vector of the text information are respectively extracted;
and fusing the feature vector of the voice information and the feature vector of the text information to obtain a fused vector, and taking the fused vector as the fusion information.
8. The emotion recognition system of claim 7, wherein the extracting of the feature vector of the voice information and the feature vector of the text information respectively comprises:
extracting energy and acoustic-wave features from the voice information, and extracting keywords and semantic features from the text information.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202011427900.6A 2020-12-07 2020-12-07 Emotion recognition method, system and terminal equipment Pending CN114595692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011427900.6A CN114595692A (en) 2020-12-07 2020-12-07 Emotion recognition method, system and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011427900.6A CN114595692A (en) 2020-12-07 2020-12-07 Emotion recognition method, system and terminal equipment

Publications (1)

Publication Number Publication Date
CN114595692A true CN114595692A (en) 2022-06-07

Family

ID=81802563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011427900.6A Pending CN114595692A (en) 2020-12-07 2020-12-07 Emotion recognition method, system and terminal equipment

Country Status (1)

Country Link
CN (1) CN114595692A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083434A (en) * 2022-07-22 2022-09-20 平安银行股份有限公司 Emotion recognition method and device, computer equipment and storage medium
CN115083434B (en) * 2022-07-22 2022-11-25 平安银行股份有限公司 Emotion recognition method and device, computer equipment and storage medium
CN117371338A (en) * 2023-12-07 2024-01-09 浙江宇宙奇点科技有限公司 AI digital person modeling method and system based on user portrait
CN117371338B (en) * 2023-12-07 2024-03-22 浙江宇宙奇点科技有限公司 AI digital person modeling method and system based on user portrait

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination