CN112016367A - Emotion recognition system and method and electronic equipment - Google Patents

Emotion recognition system and method and electronic equipment

Info

Publication number
CN112016367A
CN112016367A (application CN201910468800.9A)
Authority
CN
China
Prior art keywords
emotion
voice
expression
user
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910468800.9A
Other languages
Chinese (zh)
Inventor
王晓东
杜威
王宏玉
王海鹏
邹风山
张悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Siasun Robot and Automation Co Ltd
Original Assignee
Shenyang Siasun Robot and Automation Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Siasun Robot and Automation Co Ltd filed Critical Shenyang Siasun Robot and Automation Co Ltd
Priority to CN201910468800.9A priority Critical patent/CN112016367A/en
Publication of CN112016367A publication Critical patent/CN112016367A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The application relates to an emotion recognition system, an emotion recognition method and electronic equipment. The system comprises a robot and a cloud server. The robot collects image or video data and voice signals of a user, recognizes the image or video data and the voice signals, obtains expression-based and voice-based emotion components of the user respectively, and uploads the expression-based and voice-based emotion components to the cloud server. The cloud server obtains a text-based emotion component of the user from the voice signal, and fuses the expression-based, voice-based and text-based emotion components with a weight calculation method to obtain the final emotion recognition result of the user. Compared with the prior art, the application can analyze the emotion of the user from multiple angles, so that the real emotion of the user is described more accurately.

Description

Emotion recognition system and method and electronic equipment
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to an emotion recognition system, method and electronic equipment.
Background
Emotion is a state that integrates a person's feelings, thoughts and behavior; it covers a person's psychological response to external or internal stimulation, together with the physiological response that accompanies it. Emotion plays a ubiquitous role in people's daily work and life. For example, in medical care, if the emotional state of a patient, particularly a patient with an expression disorder, can be known, different care measures can be taken according to the patient's emotion and the quality of care can be improved. In product development, if the emotional state of the user while using the product can be identified and the user experience understood, the product's functions can be improved and products better suited to user needs can be designed. In various human-machine interaction systems, interaction becomes more friendly and natural if the system can recognize the emotional state of a human. Emotion analysis and recognition is therefore an important interdisciplinary research subject in neuroscience, psychology, cognitive science, computer science, artificial intelligence and related fields.
Currently, general emotion recognition technology usually requires the user to wear additional auxiliary devices, such as glasses or a heart rate sensor, to acquire physiological data for emotion recognition. In human-computer interaction the emotion of a person needs to be recognized, but if it can only be judged with additional auxiliary equipment, the application of such a system or method is greatly limited and practical requirements cannot be met. For example, acquiring physiological signals requires a signal capture device, which may strongly affect the expression of the user's true mood, so the user's current true emotional state may not be captured. Meanwhile, analysis of a single-modality signal cannot reliably recognize the user's real emotional state, owing to the limitations of technical conditions and processing methods. For example, a person's facial expression may change subtly within a few seconds; if the auxiliary device cannot capture those few seconds of change, if processing delays cause the change to be missed, if the algorithm misrecognizes it, or if the user disguises his or her facial expression, emotion recognition becomes inaccurate.
Disclosure of Invention
The application provides an emotion recognition system, an emotion recognition method and an electronic device, which aim to solve, at least to a certain extent, one of the above technical problems in the prior art.
In order to solve the above problems, the present application provides the following technical solutions:
an emotion recognition system comprises a robot and a cloud server;
the robot is used for collecting images or video data and voice signals of a user, identifying the images or video data and the voice signals, respectively acquiring emotion components of the user based on expressions and voices, and uploading the emotion components based on the expressions and the voices to a cloud server;
the cloud server is used for acquiring emotion components of the user based on the text according to the voice signals, and fusing the emotion components based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
The technical scheme adopted by the embodiment of the application further comprises the following: the robot comprises a data acquisition module and an emotion recognition module, and the data acquisition module comprises:
an image acquisition unit: used for acquiring image or video data of a user and transmitting the acquired image or video data to the emotion recognition module;
a voice acquisition unit: used for acquiring voice signals of the user and transmitting the acquired voice signals to the emotion recognition module;
the emotion recognition module includes:
an expression recognition unit: used for extracting effective static expression characteristics or dynamic expression characteristics from the collected image or video data, training an emotion recognition model based on expression by adopting the static expression characteristics or the dynamic expression characteristics, and performing emotion type judgment and emotion intensity calculation through the emotion recognition model based on the expression to obtain an emotion component based on the expression;
a voice recognition unit: used for analyzing and extracting voice characteristic parameters capable of representing emotion change from the collected voice signals, training a voice-based emotion recognition model by adopting the voice characteristic parameters, and performing emotion type judgment and emotion intensity calculation through the voice-based emotion recognition model to obtain an emotion component based on voice.
The technical scheme adopted by the embodiment of the application further comprises the following steps: the emotion component acquisition mode based on the expression specifically comprises the following steps: analyzing the video sequence, and analyzing and retrieving the key frames in the video sequence; intercepting a plurality of sequence frames containing the same or similar expressions, performing related preprocessing operation on the intercepted sequence frames, extracting facial features in the sequence frames, and extracting dynamic expression features and static expression features based on the facial features; when the model is trained, all the dynamic expression features and the static expression features are combined, and then the emotion classification is carried out by using a feature correlation analysis method.
The technical scheme adopted by the embodiment of the application further comprises the following: the emotion component acquisition mode based on voice specifically comprises the following steps: after the voice signal is preprocessed, extracting voice characteristic parameters capable of expressing current sound from the voice signal, analyzing and processing the voice characteristic parameters based on statistics, and then training an emotion recognition model based on voice by using a classification method based on the voice characteristic parameters; and by utilizing the emotion recognition model, selecting a classifier to perform emotion type judgment and emotion intensity calculation by adopting a classification recognition algorithm, and performing combined judgment by using a specific weight to obtain an emotion component based on voice.
The technical scheme adopted by the embodiment of the application further comprises the following steps: the cloud server comprises:
a text recognition module: used for converting a voice signal of the user into text information by calling a voice recognition engine, preprocessing the text information, extracting text characteristic parameters capable of representing emotion change from the preprocessed text information, and judging the text characteristic parameters through a classifier to obtain an emotion component of the text;
a data fusion module: used for fusing the emotion components based on expressions, voice and text by adopting a weight calculation method, calculating a final emotion recognition result and feeding it back to the robot; the fusion method comprises weight-based fusion, statistical-data-based fusion and machine-learning-based fusion, and the weight calculation method comprises static weight setting and dynamic weight setting.
Another technical scheme adopted by the embodiment of the application is as follows: a method of emotion recognition, comprising:
step a: collecting image or video data and voice signals of a user;
step b: recognizing the image or video data and the voice signal, and respectively acquiring emotion components of the user based on expressions and voices;
step c: and acquiring a text-based emotion component of the user according to the voice signal, and fusing the emotion component based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
The technical scheme adopted by the embodiment of the application further comprises the following steps: in the step b, the recognizing the image or video data and the voice signal, and respectively acquiring emotion components of the user based on expressions and voices specifically includes:
step b 1: extracting effective static expression characteristics or dynamic expression characteristics through the collected image or video data, training an emotion recognition model based on the expression by adopting the static expression characteristics or the dynamic expression characteristics, and performing emotion type judgment and emotion intensity calculation through the emotion recognition model based on the expression to obtain an emotion component based on the expression;
step b 2: analyzing and extracting voice characteristic parameters capable of representing emotion change from the acquired voice signals, training a voice-based emotion recognition model by adopting the voice characteristic parameters, and performing emotion type judgment and emotion intensity calculation through the voice-based emotion recognition model to obtain an emotion component based on voice.
The technical scheme adopted by the embodiment of the application further comprises the following steps: in step b1, the expression-based emotion component acquisition mode specifically includes: analyzing the video sequence, and analyzing and retrieving the key frames in the video sequence; intercepting a plurality of sequence frames containing the same or similar expressions, performing related preprocessing operation on the intercepted sequence frames, extracting facial features in the sequence frames, and extracting dynamic expression features and static expression features based on the facial features; when the model is trained, all the dynamic expression features and the static expression features are combined, and then the emotion classification is carried out by using a feature correlation analysis method.
The technical scheme adopted by the embodiment of the application further comprises the following: in step b2, the speech-based emotion component acquisition method specifically includes: after the voice signal is preprocessed, extracting voice characteristic parameters capable of expressing current sound from the voice signal, analyzing and processing the voice characteristic parameters based on statistics, and then training an emotion recognition model based on voice by using a classification method based on the voice characteristic parameters; and by utilizing the emotion recognition model, selecting a classifier to perform emotion type judgment and emotion intensity calculation by adopting a classification recognition algorithm, and performing combined judgment by using a specific weight to obtain an emotion component based on voice.
The technical scheme adopted by the embodiment of the application further comprises the following steps: in the step c, the acquiring of the emotion component based on the text of the user according to the voice signal, and the fusing of the emotion component based on the expression, the voice and the text based on the weight calculation method specifically include:
step c 1: converting a voice signal of a user into text information by calling a voice recognition engine, preprocessing the text information, extracting text characteristic parameters capable of representing emotion change from the preprocessed text information, and judging the text characteristic parameters by a classifier to obtain emotion components of the text;
step c 2: fusing emotion components based on expressions, voice and text by adopting a weight calculation method, calculating a final emotion recognition result, and feeding back the final emotion recognition result to the robot; the fusion method comprises weight-based fusion, statistical data-based fusion and machine learning method-based fusion, and the weight calculation method comprises static weight setting and dynamic weight setting.
The embodiment of the application adopts another technical scheme that: an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the following operations of the emotion recognition method described above:
step a: collecting image or video data and voice signals of a user;
step b: recognizing the image or video data and the voice signal, and respectively acquiring emotion components of the user based on expressions and voices;
step c: and acquiring a text-based emotion component of the user according to the voice signal, and fusing the emotion component based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
Compared with the prior art, the embodiment of the application has the following advantages: the emotion recognition system, the emotion recognition method and the electronic equipment capture potential, subtle emotional fluctuations of the user during a man-machine conversation, obtain emotion recognition results based on multi-modal information such as expressions, voice and text respectively by combining relevant data processing technologies and classification algorithms, and fuse these results by means of a fusion algorithm and weight calculation to obtain the final emotion recognition result of the user. Compared with the prior art, the application can analyze the emotion of the user from multiple angles, so that the real emotion of the user is described more accurately.
Drawings
Fig. 1 is a schematic structural diagram of an emotion recognition system according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of emotion recognition in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a hardware device of an emotion recognition method provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Please refer to fig. 1, which is a schematic structural diagram of an emotion recognition system according to an embodiment of the present application. The emotion recognition system comprises a robot and a cloud server. The robot is used for collecting multi-modal information such as images or video data and voice signals of a user, recognizing the images or video data and the voice signals respectively, acquiring expression-based and voice-based emotion components of the user with the corresponding algorithms, and uploading the expression-based and voice-based emotion components to the cloud server. The cloud server is used for acquiring a text-based emotion component of the user from the voice signal, fusing the expression-based, voice-based and text-based emotion components with a weight calculation method, calculating the final emotion recognition result of the user, and returning the emotion recognition result to the robot.
Specifically, the robot comprises a data acquisition module and an emotion recognition module;
the data acquisition module comprises an image acquisition unit and a voice acquisition unit;
the image acquisition unit is used for acquiring image or video data of a user and transmitting the acquired image or video data to the emotion recognition module; in the embodiment of the application, the image acquisition unit is a camera; when a user approaches the robot, a camera mounted on the robot can detect the state of the user in real time and collect image or video data including the facial expression of the user.
The voice acquisition unit is used for acquiring voice signals of a user and transmitting the acquired voice signals to the emotion recognition module; in the embodiment of the application, the voice acquisition unit is a microphone, and when a user talks with the robot, the microphone on the robot acquires voice signals of the user.
The emotion recognition module is a PAD and comprises an expression recognition unit and a voice recognition unit;
the expression recognition unit is used for extracting effective static expression characteristics or dynamic expression characteristics through the collected image or video data, training an emotion recognition model based on the expression by adopting the static expression characteristics or the dynamic expression characteristics, selecting a classifier to perform emotion type judgment and emotion intensity calculation on the basis of the trained model to obtain emotion components based on the expression, and uploading the emotion components based on the expression to a cloud server in a wireless or wired mode; the emotion component comprises emotion types and emotion intensities corresponding to various emotions; the emotion component identification mode based on the expression specifically comprises the following steps: firstly, analyzing a video sequence, and analyzing and searching key frames in the video sequence; and then intercepting a plurality of sequence frames containing the same or similar expressions, carrying out related preprocessing operation on the intercepted plurality of sequence frames, extracting facial features in the sequence frames, and extracting dynamic expression features and static expression features based on the facial features. When the mapping model is trained, all dynamic expression features and static expression features are combined, and then emotion classification is carried out by using a feature correlation analysis method including principal component analysis and the like, so that correlation among the features is reduced, feature dimensionality is reduced, and high classification accuracy is guaranteed.
The voice recognition unit is used for analyzing and extracting, from the collected voice signals, voice feature parameters capable of representing emotion change, training a voice-based emotion recognition model with the voice feature parameters, judging the voice feature parameters with a classifier to obtain the voice-based emotion component, and uploading the voice-based emotion component to the cloud server; meanwhile, the voice recognition unit also transmits the collected voice signals to the cloud server, where text-based emotion recognition is completed. The voice-based emotion component is recognized as follows: first, the voice signal is preprocessed to remove background noise and other interference; then voice feature parameters capable of expressing the current sound are extracted from the voice signal and analyzed statistically, for example by computing their mean and variance; next, a voice-based emotion recognition model is trained with a classification method based on the extracted voice feature parameters; finally, with the trained emotion recognition model, a classifier is selected to perform emotion type judgment and emotion intensity calculation using a classification recognition algorithm, and a joint judgment is made with specific weights to obtain the voice-based emotion component.
In the embodiment of the present application, the classifiers include, but are not limited to, support vector machines, random forests, hidden Markov models, neural network algorithms, and the like.
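As one hedged illustration of the speech branch, the sketch below computes per-utterance statistics (mean, variance, extremes) of frame-level voice feature parameters and then makes a joint judgment with two of the classifiers listed above, combined with a fixed weight. The frame-level features, the 0.6/0.4 weights and the choice of SVM plus random forest are assumptions for illustration, not the patented configuration.

```python
# Hedged sketch of the voice branch: statistical descriptors plus a weighted
# joint judgment of two classifiers (assumed weights and classifier choice).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def speech_feature_vector(frame_params):
    """frame_params: (n_frames, n_params) per-frame parameters such as pitch,
    energy or MFCCs. Returns per-utterance statistics (mean, variance, max, min)."""
    return np.hstack([frame_params.mean(axis=0), frame_params.var(axis=0),
                      frame_params.max(axis=0), frame_params.min(axis=0)])

def speech_emotion_component(x, svm: SVC, forest: RandomForestClassifier, w_svm=0.6):
    """Joint judgment with a specific weight; svm must be trained with
    probability=True, and both classifiers on the same label set."""
    probs = (w_svm * svm.predict_proba([x])[0]
             + (1.0 - w_svm) * forest.predict_proba([x])[0])
    idx = int(np.argmax(probs))
    return svm.classes_[idx], float(probs[idx])   # emotion type and its intensity
```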
The cloud server comprises a text recognition module and a data fusion module;
the text recognition module is used for converting a voice signal of a user into text information by calling a related service (a voice recognition engine), analyzing and extracting text characteristic parameters capable of representing emotion change from the text information, and distinguishing the text characteristic parameters by a classifier to obtain emotion components of the text. Specifically, the method for recognizing the emotion component of the text specifically comprises the following steps: firstly, a voice signal of a user is converted into corresponding text information by calling related services, the text information is preprocessed, and related vocabularies such as spoken language or unrelated emotions are removed; and then, sending the preprocessed text information into a classifier, finishing classification of different emotions by using a related classification algorithm, and obtaining the emotion intensity corresponding to each emotion.
The data fusion module is used for receiving the expression-based and voice-based emotion components, combining them with the text-based emotion component obtained by the text recognition module, fusing the three emotion components on the basis of a weight calculation method, calculating the final emotion recognition result and feeding it back to the robot. The fusion methods of the data fusion module include weight-based fusion, statistics-based fusion, machine-learning-based fusion and other fusion methods. The weight calculation method includes static weight setting and dynamic weight setting; when all three emotion components have been obtained, the final emotion recognition result is determined with a static-weight-based method. The static weights are derived from statistical analysis of existing data, that is, historical emotion data are compared with the real emotion of the user, and the weights of the three emotion components are then analyzed. The dynamic weight setting is mainly based on real-time emotional feedback from facial expressions, so that the weights change dynamically.
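A static-weight fusion of the three emotion components could look like the sketch below, where each branch reports an intensity per emotion label and the 0.5/0.3/0.2 weights stand in for values that would in practice come from the statistical analysis of historical data described above.

```python
# Sketch of static-weight fusion of the three emotion components (assumed weights).
from typing import Dict, Tuple

STATIC_WEIGHTS = {"expression": 0.5, "speech": 0.3, "text": 0.2}

def fuse_components(components: Dict[str, Dict[str, float]]) -> Tuple[str, float]:
    """components maps modality -> {emotion label: intensity}."""
    fused: Dict[str, float] = {}
    for modality, scores in components.items():
        weight = STATIC_WEIGHTS[modality]
        for emotion, intensity in scores.items():
            fused[emotion] = fused.get(emotion, 0.0) + weight * intensity
    best = max(fused, key=fused.get)
    return best, fused[best]                       # final emotion and fused score

# Example: expression strongly indicates "happy", speech and text are ambiguous.
result = fuse_components({
    "expression": {"happy": 0.7, "neutral": 0.3},
    "speech":     {"happy": 0.4, "neutral": 0.6},
    "text":       {"happy": 0.5, "neutral": 0.5},
})
```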
Please refer to fig. 2, which is a flowchart of an emotion recognition method according to an embodiment of the present application. The emotion recognition method in the embodiment of the application comprises the following steps:
step 100: the method comprises the following steps of collecting multi-mode information such as image or video data, voice signals and the like of a user in real time through a robot;
in step 100, acquiring image or video data of a user specifically includes: collecting through a camera arranged on the robot; when a user approaches the robot, a camera mounted on the robot can detect the state of the user in real time and collect image or video data including the facial expression of the user.
The collecting of the voice signal of the user specifically comprises: collecting through a microphone arranged on the robot; when the user has a conversation with the robot, the microphone on the robot collects the voice signal of the user.
Step 200: the multi-modal information such as the image or video data and the voice signals is recognized respectively, expression-based and voice-based emotion components of the user are acquired respectively with the corresponding algorithms, the expression-based and voice-based emotion components are uploaded to the cloud server, and the voice signals are uploaded to the cloud server at the same time;
in step 200, obtaining emotion components of the user based on expressions and voices specifically includes: identifying through PAD on the robot; the acquisition mode comprises the following steps:
step 201: extracting effective static expression characteristics or dynamic expression characteristics through collected image or video data, training an emotion recognition model based on expressions by adopting the static expression characteristics or the dynamic expression characteristics, selecting a classifier to perform emotion type judgment and emotion intensity calculation on the basis of the trained model to obtain emotion components based on the expressions, and uploading the emotion components based on the expressions to a cloud server in a wireless or wired mode;
in step 201, the emotion component includes emotion types and emotion intensities corresponding to various emotions; the emotion component identification mode based on the expression specifically comprises the following steps: firstly, analyzing a video sequence, and analyzing and searching key frames in the video sequence; and then intercepting a plurality of sequence frames containing the same or similar expressions, carrying out related preprocessing operation on the intercepted plurality of sequence frames, extracting facial features in the sequence frames, and extracting dynamic expression features and static expression features based on the facial features. When the mapping model is trained, all dynamic expression features and static expression features are combined, and then emotion classification is carried out by using a feature correlation analysis method including principal component analysis and the like, so that correlation among the features is reduced, feature dimensionality is reduced, and high classification accuracy is guaranteed.
Step 202: analyzing and extracting voice characteristic parameters capable of representing emotion change from the acquired voice signals, training a voice-based emotion recognition model by adopting the voice characteristic parameters, judging the voice characteristic parameters by a classifier to obtain a voice-based emotion component, and uploading the voice-based emotion component to a cloud server;
in step 202, the emotion component recognition method based on voice specifically includes: firstly, preprocessing a voice signal to remove background noise, noise and the like; then extracting voice characteristic parameters capable of expressing the current sound from the voice signals, and carrying out analysis processing on the voice characteristic parameters based on statistics, including obtaining the mean value, variance and the like of the voice characteristic parameters; then, training a voice-based emotion recognition model by using a plurality of classification methods based on the extracted voice characteristic parameters; and finally, selecting a classifier by using the trained emotion recognition model, performing emotion type judgment and emotion intensity calculation by adopting different types of classification recognition algorithms, and performing combined judgment by using specific weight to obtain an emotion component based on voice.
Step 300: the cloud server acquires emotion components of the user based on the text according to the voice signals, fuses the emotion components based on the expression, the voice and the text based on a weight calculation method, calculates a final emotion recognition result of the user, and returns the emotion recognition result to the robot;
in step 300, the emotion component fusion method specifically includes:
step 301: converting a voice signal of a user into text information by calling a related service (a voice recognition engine), analyzing and extracting text characteristic parameters capable of representing emotion change from the text information, and judging the text characteristic parameters by a classifier to obtain emotion components based on the text;
in step 301, the method for recognizing emotion components of a text specifically includes: firstly, a voice signal of a user is converted into corresponding text information by calling related services, the text information is preprocessed, and related vocabularies such as spoken language or unrelated emotions are removed; and then, sending the preprocessed text information into a classifier, finishing classification of different emotions by using a related classification algorithm, and obtaining the emotion intensity corresponding to each emotion.
Step 302: combining the emotion component based on the expression, the emotion component based on the voice and the emotion component based on the text, fusing the three emotion components by adopting different weight calculation methods, calculating a final emotion recognition result, and feeding the final emotion recognition result back to the robot;
in step 302, the fusion methods include various fusion methods such as weight-based fusion, statistical data-based fusion, and machine learning-based fusion. The weight calculation method comprises static weight setting and dynamic weight setting, and when the three emotion components are respectively included, final emotion recognition result judgment is completed by using a static weight-based method. The static weight is derived from statistical analysis based on the existing data, namely, the historical emotion data is compared with the real emotion of the user, and then the weights of three emotion components are analyzed. The dynamic emotion weight settings are based primarily on real-time emotional feedback of facial expressions such that their weights change dynamically.
Fig. 3 is a schematic structural diagram of a hardware device of an emotion recognition method provided in an embodiment of the present application. As shown in fig. 3, the device includes one or more processors and memory. Taking a processor as an example, the apparatus may further include: an input system and an output system.
The processor, memory, input system, and output system may be connected by a bus or other means, as exemplified by the bus connection in fig. 3.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor executes various functional applications and data processing of the electronic device, i.e., implements the processing method of the above-described method embodiment, by executing the non-transitory software program, instructions and modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processing system over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input system may receive input numeric or character information and generate a signal input. The output system may include a display device such as a display screen.
The one or more modules are stored in the memory and, when executed by the one or more processors, perform the following for any of the above method embodiments:
step a: collecting image or video data and voice signals of a user;
step b: recognizing the image or video data and the voice signal, and respectively acquiring emotion components of the user based on expressions and voices;
step c: and acquiring a text-based emotion component of the user according to the voice signal, and fusing the emotion component based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
The above product can execute the method provided by the embodiments of the application, and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
Embodiments of the present application provide a non-transitory (non-volatile) computer storage medium having stored thereon computer-executable instructions that may perform the following operations:
step a: collecting image or video data and voice signals of a user;
step b: recognizing the image or video data and the voice signal, and respectively acquiring emotion components of the user based on expressions and voices;
step c: and acquiring a text-based emotion component of the user according to the voice signal, and fusing the emotion component based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
Embodiments of the present application provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the following:
step a: collecting image or video data and voice signals of a user;
step b: recognizing the image or video data and the voice signal, and respectively acquiring emotion components of the user based on expressions and voices;
step c: and acquiring a text-based emotion component of the user according to the voice signal, and fusing the emotion component based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
According to the emotion recognition system, the emotion recognition method and the electronic equipment provided above, potential, subtle emotional fluctuations of the user during a man-machine conversation are captured; emotion recognition results based on multi-modal information such as expressions, voice and text are obtained respectively by combining relevant data processing technologies and classification algorithms; and these results are fused by means of a fusion algorithm and weight calculation to obtain the final emotion recognition result of the user. Compared with the prior art, the application can analyze the emotion of the user from multiple angles, so that the real emotion of the user is described more accurately.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. An emotion recognition system is characterized by comprising a robot and a cloud server;
the robot is used for collecting images or video data and voice signals of a user, identifying the images or video data and the voice signals, respectively acquiring emotion components of the user based on expressions and voices, and uploading the emotion components based on the expressions and the voices to a cloud server;
the cloud server is used for acquiring emotion components of the user based on the text according to the voice signals, and fusing the emotion components based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
2. The emotion recognition system of claim 1, wherein the robot includes a data acquisition module and an emotion recognition module, the data acquisition module including:
an image acquisition unit: used for acquiring image or video data of a user and transmitting the acquired image or video data to the emotion recognition module;
a voice acquisition unit: used for acquiring voice signals of the user and transmitting the acquired voice signals to the emotion recognition module;
the emotion recognition module includes:
an expression recognition unit: used for extracting effective static expression characteristics or dynamic expression characteristics from the collected image or video data, training an emotion recognition model based on expression by adopting the static expression characteristics or the dynamic expression characteristics, and performing emotion type judgment and emotion intensity calculation through the emotion recognition model based on the expression to obtain an emotion component based on the expression;
a voice recognition unit: used for analyzing and extracting voice characteristic parameters capable of representing emotion change from the collected voice signals, training a voice-based emotion recognition model by adopting the voice characteristic parameters, and performing emotion type judgment and emotion intensity calculation through the voice-based emotion recognition model to obtain an emotion component based on voice.
3. The emotion recognition system of claim 2, wherein the expression-based emotion component acquisition manner is specifically: analyzing the video sequence, and analyzing and retrieving the key frames in the video sequence; intercepting a plurality of sequence frames containing the same or similar expressions, performing related preprocessing operation on the intercepted sequence frames, extracting facial features in the sequence frames, and extracting dynamic expression features and static expression features based on the facial features; when the model is trained, all the dynamic expression features and the static expression features are combined, and then the emotion classification is carried out by using a feature correlation analysis method.
4. The emotion recognition system of claim 2, wherein the speech-based emotion component acquisition mode is specifically: after the voice signal is preprocessed, extracting voice characteristic parameters capable of expressing current sound from the voice signal, analyzing and processing the voice characteristic parameters based on statistics, and then training an emotion recognition model based on voice by using a classification method based on the voice characteristic parameters; and by utilizing the emotion recognition model, selecting a classifier to perform emotion type judgment and emotion intensity calculation by adopting a classification recognition algorithm, and performing combined judgment by using a specific weight to obtain an emotion component based on voice.
5. The emotion recognition system of any one of claims 1 to 4, wherein the cloud server comprises:
a text recognition module: used for converting a voice signal of the user into text information by calling a voice recognition engine, preprocessing the text information, extracting text characteristic parameters capable of representing emotion change from the preprocessed text information, and judging the text characteristic parameters through the classifier to obtain an emotion component of the text;
a data fusion module: used for fusing the emotion components based on expressions, voice and text by adopting a weight calculation method, calculating a final emotion recognition result and feeding the final emotion recognition result back to the robot; the fusion method comprises weight-based fusion, statistical-data-based fusion and machine-learning-method-based fusion, and the weight calculation method comprises static weight setting and dynamic weight setting.
6. A method of emotion recognition, comprising:
step a: collecting image or video data and voice signals of a user;
step b: recognizing the image or video data and the voice signal, and respectively acquiring emotion components of the user based on expressions and voices;
step c: and acquiring a text-based emotion component of the user according to the voice signal, and fusing the emotion component based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
7. The emotion recognition method of claim 6, wherein in step b, the recognizing the image or video data and the voice signal and respectively obtaining emotion components of the user based on expressions and voices specifically comprises:
step b 1: extracting effective static expression characteristics or dynamic expression characteristics through the collected image or video data, training an emotion recognition model based on the expression by adopting the static expression characteristics or the dynamic expression characteristics, and performing emotion type judgment and emotion intensity calculation through the emotion recognition model based on the expression to obtain an emotion component based on the expression;
step b 2: analyzing and extracting voice characteristic parameters capable of representing emotion change from the acquired voice signals, training a voice-based emotion recognition model by adopting the voice characteristic parameters, and performing emotion type judgment and emotion intensity calculation through the voice-based emotion recognition model to obtain an emotion component based on voice.
8. The emotion recognition method of claim 7, wherein in step b1, the expression-based emotion component acquisition mode is specifically: analyzing the video sequence, and analyzing and retrieving the key frames in the video sequence; intercepting a plurality of sequence frames containing the same or similar expressions, performing related preprocessing operation on the intercepted sequence frames, extracting facial features in the sequence frames, and extracting dynamic expression features and static expression features based on the facial features; when the model is trained, all the dynamic expression features and the static expression features are combined, and then the emotion classification is carried out by using a feature correlation analysis method.
9. The emotion recognition method of claim 7, wherein in step b2, the speech-based emotion component acquisition mode is specifically: after the voice signal is preprocessed, extracting voice characteristic parameters capable of expressing current sound from the voice signal, analyzing and processing the voice characteristic parameters based on statistics, and then training an emotion recognition model based on voice by using a classification method based on the voice characteristic parameters; and by utilizing the emotion recognition model, selecting a classifier to perform emotion type judgment and emotion intensity calculation by adopting a classification recognition algorithm, and performing combined judgment by using a specific weight to obtain an emotion component based on voice.
10. The emotion recognition method according to any one of claims 6 to 9, wherein in the step c, the obtaining of the text-based emotion component of the user from the voice signal and the fusing of the emotion components based on expression, voice and text based on the weight calculation method specifically comprises:
step c 1: converting a voice signal of a user into text information by calling a voice recognition engine, preprocessing the text information, extracting text characteristic parameters capable of representing emotion change from the preprocessed text information, and judging the text characteristic parameters by a classifier to obtain emotion components of the text;
step c 2: fusing emotion components based on expressions, voice and text by adopting a weight calculation method, calculating a final emotion recognition result, and feeding back the final emotion recognition result to the robot; the fusion method comprises weight-based fusion, statistical data-based fusion and machine learning method-based fusion, and the weight calculation method comprises static weight setting and dynamic weight setting.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the following operations of the emotion recognition method of any one of claims 6 to 10:
step a: collecting image or video data and voice signals of a user;
step b: recognizing the image or video data and the voice signal, and respectively acquiring emotion components of the user based on expressions and voices;
step c: and acquiring a text-based emotion component of the user according to the voice signal, and fusing the emotion component based on the expression, the voice and the text based on a weight calculation method to obtain a final emotion recognition result of the user.
CN201910468800.9A 2019-05-31 2019-05-31 Emotion recognition system and method and electronic equipment Pending CN112016367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910468800.9A CN112016367A (en) 2019-05-31 2019-05-31 Emotion recognition system and method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910468800.9A CN112016367A (en) 2019-05-31 2019-05-31 Emotion recognition system and method and electronic equipment

Publications (1)

Publication Number Publication Date
CN112016367A (en) 2020-12-01

Family

ID=73501936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910468800.9A Pending CN112016367A (en) 2019-05-31 2019-05-31 Emotion recognition system and method and electronic equipment

Country Status (1)

Country Link
CN (1) CN112016367A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686048A (en) * 2020-12-23 2021-04-20 沈阳新松机器人自动化股份有限公司 Emotion recognition method and device based on fusion of voice, semantics and facial expressions
CN112700255A (en) * 2020-12-28 2021-04-23 科讯嘉联信息技术有限公司 Multi-mode monitoring service system and method
CN112910761A (en) * 2021-01-29 2021-06-04 北京百度网讯科技有限公司 Instant messaging method, device, equipment, storage medium and program product
CN112927681A (en) * 2021-02-10 2021-06-08 华南师范大学 Artificial intelligence psychological robot and method for recognizing voice from person to person
CN112990301A (en) * 2021-03-10 2021-06-18 深圳市声扬科技有限公司 Emotion data annotation method and device, computer equipment and storage medium
CN113538810A (en) * 2021-07-16 2021-10-22 中国工商银行股份有限公司 Security method, security system and automatic teller machine equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101480668B1 (en) * 2014-03-21 2015-01-26 충남대학교산학협력단 Mobile Terminal Having Emotion Recognition Application using Voice and Method for Controlling thereof
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
US20180077095A1 (en) * 2015-09-14 2018-03-15 X Development Llc Augmentation of Communications with Emotional Data
KR20180054407A (en) * 2016-11-15 2018-05-24 주식회사 로보러스 Apparatus for recognizing user emotion and method thereof, and robot system using the same

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101480668B1 (en) * 2014-03-21 2015-01-26 충남대학교산학협력단 Mobile Terminal Having Emotion Recognition Application using Voice and Method for Controlling thereof
US20180077095A1 (en) * 2015-09-14 2018-03-15 X Development Llc Augmentation of Communications with Emotional Data
KR20180054407A (en) * 2016-11-15 2018-05-24 주식회사 로보러스 Apparatus for recognizing user emotion and method thereof, and robot system using the same
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686048A (en) * 2020-12-23 2021-04-20 沈阳新松机器人自动化股份有限公司 Emotion recognition method and device based on fusion of voice, semantics and facial expressions
CN112700255A (en) * 2020-12-28 2021-04-23 科讯嘉联信息技术有限公司 Multi-mode monitoring service system and method
CN112910761A (en) * 2021-01-29 2021-06-04 北京百度网讯科技有限公司 Instant messaging method, device, equipment, storage medium and program product
CN112910761B (en) * 2021-01-29 2023-04-21 北京百度网讯科技有限公司 Instant messaging method, device, equipment, storage medium and program product
CN112927681A (en) * 2021-02-10 2021-06-08 华南师范大学 Artificial intelligence psychological robot and method for recognizing voice from person to person
CN112990301A (en) * 2021-03-10 2021-06-18 深圳市声扬科技有限公司 Emotion data annotation method and device, computer equipment and storage medium
CN113538810A (en) * 2021-07-16 2021-10-22 中国工商银行股份有限公司 Security method, security system and automatic teller machine equipment

Similar Documents

Publication Publication Date Title
US11226673B2 (en) Affective interaction systems, devices, and methods based on affective computing user interface
CN112016367A (en) Emotion recognition system and method and electronic equipment
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US20190188903A1 (en) Method and apparatus for providing virtual companion to a user
CN106997243B (en) Speech scene monitoring method and device based on intelligent robot
JP2017156854A (en) Speech semantic analysis program, apparatus and method for improving comprehension accuracy of context semantic through emotion classification
WO2008069519A1 (en) Gesture/speech integrated recognition system and method
CN106157956A (en) The method and device of speech recognition
KR20100001928A (en) Service apparatus and method based on emotional recognition
CN102298694A (en) Man-machine interaction identification system applied to remote information service
CN109101663A (en) A kind of robot conversational system Internet-based
CN112766173B (en) Multi-mode emotion analysis method and system based on AI deep learning
CN112581015B (en) Consultant quality assessment system and assessment method based on AI (advanced technology attachment) test
JP2018032164A (en) Interview system
CN112768070A (en) Mental health evaluation method and system based on dialogue communication
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN111149172B (en) Emotion management method, device and computer-readable storage medium
KR102285482B1 (en) Method and apparatus for providing content based on machine learning analysis of biometric information
CN111339878B (en) Correction type real-time emotion recognition method and system based on eye movement data
CN108628454B (en) Visual interaction method and system based on virtual human
CN111383138A (en) Catering data processing method and device, computer equipment and storage medium
JP2018060374A (en) Information processing device, evaluation system and program
CN106997449A (en) Robot and face identification method with face identification functions
JP2017182261A (en) Information processing apparatus, information processing method, and program
JP2020067562A (en) Device, program and method for determining action taking timing based on video of user's face

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination