CN113707185B - Emotion recognition method and device and electronic equipment - Google Patents

Emotion recognition method and device and electronic equipment

Info

Publication number
CN113707185B
CN113707185B
Authority
CN
China
Prior art keywords
emotion
user
voice information
feature
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111093158.4A
Other languages
Chinese (zh)
Other versions
CN113707185A (en)
Inventor
石奕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Original Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuo Erzhi Lian Wuhan Research Institute Co Ltd filed Critical Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority to CN202111093158.4A priority Critical patent/CN113707185B/en
Publication of CN113707185A publication Critical patent/CN113707185A/en
Application granted granted Critical
Publication of CN113707185B publication Critical patent/CN113707185B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90Pitch determination of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides an emotion recognition method and device and electronic equipment. The method comprises the following steps: extracting a feature map from the voice information to obtain the feature map of the voice information; performing a Fourier transform on the voice information and preprocessing the Fourier-transformed voice information to obtain preprocessed voice information; calculating the pitch of the preprocessed voice information, adding up the pitches that belong to a pitch set and normalizing them to obtain a chromaticity feature matrix of the voice information; inputting the feature map and the chromaticity feature matrix into a deep full convolution network algorithm model to obtain the correspondence between different emotions and emotion probabilities; and determining the emotion with the largest emotion probability in that correspondence as the emotion of the user, and feeding back reply information corresponding to that emotion to the user.

Description

Emotion recognition method and device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for emotion recognition, and an electronic device.
Background
At present, intelligent robots are widely used across industries; artificial intelligence (AI) scenarios that simulate human agents, such as intelligent voice navigation, online robots and gold-medal sales scripts, are important applications of AI in the customer-service field. However, in the related art such robots can only feed back emotionless reply information to the user, regardless of the user's emotional state, which degrades the user experience.
Disclosure of Invention
In order to solve the above problems, an embodiment of the present application aims to provide a method, an apparatus and an electronic device for emotion recognition.
In a first aspect, an embodiment of the present application provides a method for identifying emotion, including:
when voice information sent by a user is obtained, extracting a feature map of the voice information to obtain the feature map of the voice information;
performing Fourier transform on the voice information to obtain voice information after Fourier transform, and preprocessing the voice information after Fourier transform to obtain preprocessed voice information;
calculating the pitch of the preprocessed voice information, adding up the pitches that belong to the pitch set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, and normalizing them to obtain a chromaticity feature matrix of the voice information;
inputting the feature map and the chromaticity feature matrix of the voice information into a deep full convolution network algorithm model to obtain corresponding relations between different emotions and emotion probabilities;
and determining the emotion corresponding to the largest emotion probability in the corresponding relation between the different emotions and the emotion probabilities as the emotion of the user.
In a second aspect, an embodiment of the present application further provides an emotion recognition device, including:
the extraction module is used for extracting a characteristic diagram of the voice information when the voice information sent by the user is obtained, so as to obtain the characteristic diagram of the voice information;
the processing module is used for carrying out Fourier transform on the voice information to obtain voice information after Fourier transform, and preprocessing the voice information after Fourier transform to obtain preprocessed voice information;
the first calculation module is used for calculating the pitch of the preprocessed voice information, adding up the pitches that belong to the pitch set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, and normalizing them to obtain a chromaticity feature matrix of the voice information;
the second calculation module is used for inputting the feature map and the chromaticity feature matrix of the voice information into a deep full convolution network algorithm model to obtain corresponding relations between different emotions and emotion probabilities;
and the determining module is used for determining the emotion corresponding to the largest emotion probability in the corresponding relation between the different emotions and the emotion probabilities as the emotion of the user.
In a third aspect, embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect described above.
In a fourth aspect, an embodiment of the present application further provides an electronic device, which includes a memory, a processor and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor to perform the steps of the method described in the first aspect.
In the solutions provided in the first to fourth aspects of the embodiments of the present application, the emotion of the user is determined from the feature map extracted from the voice information sent by the user and from the chromaticity feature matrix obtained by processing that voice information. Compared with the related art, in which only emotionless reply information can be fed back to the user, reply information corresponding to the user's emotion can be fed back according to the determined emotion, thereby improving the user experience.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the application or of the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the application, and that other drawings can be derived from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart showing an emotion recognition method provided in embodiment 1 of the present application;
fig. 2 is a schematic diagram showing the structure of an emotion recognition device according to embodiment 2 of the present application;
fig. 3 shows a schematic structural diagram of an electronic device according to embodiment 3 of the present application.
Detailed Description
In the description of the present application, it should be understood that terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise" and "counterclockwise" indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience and simplicity of description and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present application.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the present application, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.
In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description.
Example 1
In this embodiment, the execution subject of the emotion recognition method is a robot that can interact with a user.
Referring to a schematic diagram of an emotion recognition method shown in fig. 1, this embodiment proposes an emotion recognition method, which includes the following specific steps:
and 100, when the voice information sent by the user is obtained, extracting the characteristic diagram of the voice information to obtain the characteristic diagram of the voice information.
In the step 100, the process of extracting the feature map of the voice information to obtain the feature map of the voice information is the prior art, and will not be described herein.
Step 102: perform a Fourier transform on the voice information to obtain Fourier-transformed voice information, and preprocess the Fourier-transformed voice information to obtain preprocessed voice information.
In step 102, the preprocessing includes, but is not limited to: discarding the part of the voice information whose length exceeds 1000 frames; zero-padding voice information whose length is less than 1000 frames so that it reaches 1000 frames; and smoothing the voice information.
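As an illustration only, the following Python sketch implements the preprocessing just described under stated assumptions: the 1000-frame limit comes from the text, while the 5-frame moving-average smoothing window and the (frames × frequency bins) array layout are assumptions.

```python
import numpy as np

def preprocess(frames: np.ndarray, max_len: int = 1000, smooth_win: int = 5) -> np.ndarray:
    """Truncate/zero-pad the Fourier-transformed speech to max_len frames, then smooth.

    frames: array of shape (num_frames, num_bins) after the Fourier transform.
    The 5-frame moving-average smoothing is an assumption; the text only states
    that the voice information is smoothed.
    """
    num_frames, num_bins = frames.shape
    if num_frames > max_len:                      # discard the part beyond 1000 frames
        frames = frames[:max_len]
    elif num_frames < max_len:                    # zero-pad up to 1000 frames
        pad = np.zeros((max_len - num_frames, num_bins), dtype=frames.dtype)
        frames = np.vstack([frames, pad])
    # simple moving-average smoothing along the time axis
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, frames)
    return smoothed
```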
Step 104: calculate the pitch of the preprocessed voice information, add up the pitches that belong to the pitch set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, and normalize them to obtain the chromaticity feature matrix of the voice information.
In step 104, the specific process of calculating the pitch of the preprocessed voice information is the prior art, and will not be described herein.
The process of adding up the pitches of the preprocessed voice information that belong to the pitch set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B} and normalizing them to obtain the chromaticity feature matrix of the voice information is prior art and is not repeated here.
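For illustration, a minimal sketch of building the chromaticity feature matrix from per-frame pitch estimates is shown below. Using per-frame energy as the accumulated quantity and normalizing per frame are assumptions; the text only states that the pitches in the twelve-pitch set are added up and normalized.

```python
import numpy as np

# row order of the chroma matrix
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_matrix(f0_per_frame: np.ndarray, energy_per_frame: np.ndarray) -> np.ndarray:
    """Fold per-frame pitch estimates into a 12 x num_frames chromaticity matrix.

    f0_per_frame: fundamental frequency in Hz per frame, 0 for unvoiced frames.
    energy_per_frame: quantity added into the matching pitch-class bin (an assumption).
    """
    num_frames = len(f0_per_frame)
    chroma = np.zeros((12, num_frames))
    for t, (f0, e) in enumerate(zip(f0_per_frame, energy_per_frame)):
        if f0 <= 0:
            continue                                   # skip unvoiced frames
        midi = 69 + 12 * np.log2(f0 / 440.0)           # map Hz to a MIDI note number
        chroma[int(round(midi)) % 12, t] += e          # accumulate into its pitch class
    norm = np.linalg.norm(chroma, axis=0, keepdims=True)
    return chroma / np.maximum(norm, 1e-8)             # per-frame normalization
```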
Step 106: input the feature map and the chromaticity feature matrix of the voice information into a deep full convolution network algorithm model to obtain the correspondence between different emotions and emotion probabilities.
In step 106, the deep full convolution network algorithm model is obtained by training on voice information with different emotions, with a Bi-LSTM used to learn the temporal context information present in the data; it processes the voice information to obtain the correspondence between emotions and emotion probabilities.
The pooling mode of the deep full convolution network algorithm model is global k-max pooling (GKMP). Unlike traditional max pooling, for each row of the feature map GKMP selects the k largest values and takes them as the feature representation of that row. Thus, for a feature map of size F×M, GKMP produces an output of size F×k, where F is the feature dimension and M is the frame length.
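A small NumPy sketch of global k-max pooling as characterized above; whether the k maxima keep their original temporal order is not stated in the text, so preserving it here is an assumption.

```python
import numpy as np

def global_k_max_pooling(feature_map: np.ndarray, k: int) -> np.ndarray:
    """GKMP: map a feature map of shape (F, M) to (F, k), the k largest values per row."""
    idx = np.argsort(-feature_map, axis=1)[:, :k]   # column indices of the k maxima per row
    idx.sort(axis=1)                                # keep the original temporal order (assumption)
    return np.take_along_axis(feature_map, idx, axis=1)
```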
In one embodiment, the deep full convolution network algorithm model may use a multiple convolutional neural network (Multiple CNN, MCNN), which is an improvement of the convolutional neural network (CNN). The feature map and the chromaticity feature matrix of the voice information first pass through the MCNN, then through a Bi-LSTM layer, and finally through a fully connected layer (FC), which outputs the probabilities of 6 emotions (the neutral, happy, surprise, sad, angry and disgusted emotions).
The deep full convolution network algorithm first uses the MCNN to learn a feature representation of the spectrogram, then uses the Bi-LSTM to learn the temporal context information present in the data, and applies multi-task learning at the last fully connected layer (FC), using the 6-emotion classification task to reinforce the model's learning of emotion-specific features.
By using the MCNN, the deep full convolution network algorithm handles training with inputs of variable length; at the same time the feature extraction is not overly simple, the model complexity remains relatively low, and the accuracy of emotion recognition is improved.
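The following PyTorch sketch illustrates one possible arrangement of the MCNN → Bi-LSTM → GKMP → FC pipeline. The layer counts, channel sizes, hidden width, and the way the feature map and the chromaticity matrix are combined are all assumptions; the text fixes only the order of the blocks and the six-emotion output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEmotionNet(nn.Module):
    """Sketch: feature map + chroma -> MCNN -> Bi-LSTM -> GKMP -> FC -> 6 emotion probabilities."""

    def __init__(self, hidden: int = 128, num_emotions: int = 6, k: int = 4):
        super().__init__()
        self.k = k
        # "multiple CNN": a small stack of 2-D convolutions over (feature, time); sizes are assumptions
        self.mcnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(input_size=64, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_emotions)

    def forward(self, spec: torch.Tensor, chroma: torch.Tensor) -> torch.Tensor:
        # spec: (batch, F, M) feature map; chroma: (batch, 12, M) chromaticity matrix.
        # Concatenating them along the feature axis is an assumed way of combining the two inputs.
        x = torch.cat([spec, chroma], dim=1).unsqueeze(1)    # (batch, 1, F+12, M)
        h = self.mcnn(x)                                     # (batch, 64, F+12, M)
        h = h.mean(dim=2).transpose(1, 2)                    # collapse feature axis -> (batch, M, 64)
        h, _ = self.bilstm(h)                                # temporal context -> (batch, M, 2*hidden)
        h = h.topk(self.k, dim=1).values.mean(dim=1)         # global k-max pooling over time
        return F.softmax(self.fc(h), dim=-1)                 # probabilities of the 6 emotions
```

For instance, `SpeechEmotionNet()(torch.randn(1, 128, 1000), torch.randn(1, 12, 1000))` returns a (1, 6) probability vector for a single 1000-frame utterance.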
The different emotions include, but are not limited to: the neutral emotion, the happy emotion, the surprise emotion, the sad emotion, the angry emotion and the disgusted emotion.
The neutral emotion indicates that the user is calm.
The correspondence between different emotions and emotion probabilities maps each of the neutral, happy, surprise, sad, angry and disgusted emotions to its respective emotion probability.
Step 108: determine the emotion corresponding to the largest emotion probability in the correspondence between the different emotions and the emotion probabilities as the emotion of the user.
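As a trivial illustration of step 108, the emotion with the largest probability is selected from the correspondence; representing the correspondence as a dictionary is an assumption.

```python
def pick_emotion(emotion_probs):
    """Return the emotion whose probability is the largest (step 108)."""
    return max(emotion_probs, key=emotion_probs.get)

# e.g. pick_emotion({"neutral": 0.1, "happy": 0.6, "surprise": 0.05,
#                    "sad": 0.1, "angry": 0.1, "disgusted": 0.05}) returns "happy"
```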
Optionally, the emotion recognition method provided in this embodiment may further perform the following steps (1) to (5), and determine an emotion of the user:
(1) When text information sent by the user is obtained, determining that the user sent the text information by typing;
(2) Acquiring input characteristic information of a user recorded by a text input method installed on a mobile terminal used by the user;
(3) Processing the input characteristic information to obtain a characteristic vector of the input characteristic information;
(4) Inputting the feature vector of the input feature information into a support vector machine, and inputting the processing result of the support vector machine into a softmax algorithm model to obtain a corresponding relation between emotion and emotion probability;
(5) And determining the emotion corresponding to the maximum emotion probability as the emotion of the user.
In step (2), the server may send an input feature information acquisition request to the mobile terminal used by the user. If the mobile terminal has granted permission to access the user's input feature information, the input feature information can be obtained from the text input method installed on the mobile terminal.
The input feature information includes, but is not limited to: the text input by the user within a preset period; the pause time when the user inputs text; the number of pauses when the user inputs text; the longest pause time when the user inputs text; the pause positions when the user inputs text; the number of content modifications; the maximum, minimum, mean, median and variance of the user's key-press time during input within the preset period; the time from releasing a first key to pressing a second key within the preset period; the time from pressing a first key to releasing a second key within the preset period; the sentence length of each sentence input within the preset period; and the meaning of each emoji (expression package) input within the preset period. A sketch that computes several of these statistics from raw key timestamps follows the definitions below.
The pause time of the text input by the user refers to the time interval between two adjacent characters input by the user in a preset period.
The number of pauses for the user to input the text refers to the number of times that the pause time is greater than a time threshold when the user inputs the text in a preset period.
The time threshold may be set to any length of time between 1 and 3 seconds.
The maximum pause time of the text input by the user refers to the maximum time interval between two adjacent characters input by the user in a preset period.
The pause position of the text input by the user is the cursor position with pause time exceeding a certain time in the text input process of the user.
In one embodiment, the certain time may be 3 seconds.
The number of content modifications refers to the number of times the user presses the delete (backspace) key or the undo key of the input method while inputting text within the preset period.
The maximum value of the key-pressing time of the user in the input process in the preset time period refers to the maximum time length from the time when the user presses a key in the input method to the time when the user inputs characters in the preset time period.
The minimum value of the key-pressing time of the user in the input process in the preset time period refers to the minimum time length from the time when the user presses a key in the input method to the time when the user inputs characters in the preset time period.
The average value of the key-pressing time of the user in the input process in the preset period, the median of the key-pressing time of the user in the input process in the preset period and the variance of the key-pressing time of the user in the input process in the preset period are all calculated according to the key-pressing time of the user in the input process in the preset period, and the specific process is the prior art and is not repeated here.
The time from releasing a first key to pressing a second key within the preset period refers to the duration from when one key of the input method is released to when the next key is pressed.
The time from pressing a first key to releasing a second key within the preset period refers to the duration from when one key of the input method is pressed to when the next key is released.
The preset period may be set to any length of time between 1 hour and 24 hours.
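Purely as an illustration of how several of the statistics defined above could be computed from raw key-press and key-release timestamps: the 2-second pause threshold is chosen from the 1-3 second range given above, and the function and field names are hypothetical.

```python
import numpy as np

def pause_features(key_down_times, key_up_times, pause_threshold=2.0):
    """Compute some of the keystroke statistics listed above from timestamps in seconds.

    key_down_times / key_up_times: press and release time of each keystroke, in order.
    pause_threshold: 2 s, picked from the 1-3 s range stated in the description.
    """
    downs = np.asarray(key_down_times, dtype=float)
    ups = np.asarray(key_up_times, dtype=float)
    hold = ups - downs                          # key-press time of each keystroke
    gaps = downs[1:] - downs[:-1]               # interval between adjacent characters
    return {
        "num_pauses": int(np.sum(gaps > pause_threshold)),
        "max_pause": float(gaps.max()) if gaps.size else 0.0,
        "hold_max": float(hold.max()),
        "hold_min": float(hold.min()),
        "hold_mean": float(hold.mean()),
        "hold_median": float(np.median(hold)),
        "hold_var": float(hold.var()),
        "release_to_next_press": (downs[1:] - ups[:-1]).tolist(),
    }
```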
In the step (3), the process of processing the input feature information to obtain the feature vector of the input feature information is a prior art, and will not be described herein.
In the step (4), the specific process of inputting the feature vector of the input feature information into the support vector machine and inputting the processing result of the support vector machine into the softmax algorithm model to obtain the corresponding relationship between emotion and emotion probability is the prior art, and will not be described herein.
Based on the features generated during text input, the emotion of the user is deduced with a support vector machine (with global k-max pooling, GKMP, selected as the pooling mode), and a softmax algorithm model finally performs the output classification to decide which of the six emotions (neutral, happy, surprise, sad, angry, disgusted) the input belongs to.
The main flow using the Support Vector Machine (SVM) algorithm is as follows:
1) Given a training data set of 7 points, D = {(x_1, y_1), (x_2, y_2), ..., (x_7, y_7)}, where x_i ∈ R^N, y_i ∈ R, i = 1, 2, ..., 7, and y is the corresponding emotion class;
2) Selecting an appropriate kernel function K (x, y) and a parameter item C;
3) Construct the corresponding optimal convex quadratic programming problem and solve it to obtain the optimal solution.
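A hedged scikit-learn sketch of the SVM-plus-softmax step is shown below. The RBF kernel, C = 1.0, the 16-dimensional feature vectors and the toy data are assumptions standing in for the "appropriate kernel function K(x, y) and parameter C" and for the real typing features.

```python
import numpy as np
from sklearn.svm import SVC

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# X: feature vectors built from the typing features above; y: emotion labels 0..5
# (neutral, happy, surprise, sad, angry, disgusted). Kernel and C values are assumptions.
X = np.random.randn(7, 16)          # 7 training samples, as in the flow above (toy data)
y = np.array([0, 1, 2, 3, 4, 5, 0])
svm = SVC(kernel="rbf", C=1.0, decision_function_shape="ovr").fit(X, y)

scores = svm.decision_function(np.random.randn(1, 16))    # one-vs-rest decision scores
emotion_probs = softmax(scores)                            # softmax over the SVM output
print(emotion_probs.argmax())                              # index of the most likely emotion
```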
optionally, the emotion recognition method provided in this embodiment may further perform the following steps (1) to (4), and determine an emotion of the user:
(1) When a facial image of the user is acquired, processing the facial image with an Xception network to obtain a feature vector of the facial image, and processing the facial image with a space-time neural network (STNN) to obtain a spatial dimension feature, a short-time dimension feature and a long-time dimension feature of the facial image;
(2) Inputting the feature vector of the facial image of the user into a softmax algorithm model to obtain a corresponding relation between a first emotion and emotion probability;
(3) Inputting the space dimension feature, the short-time dimension feature and the long-time dimension feature of the facial image of the user into a softmax algorithm model to obtain a corresponding relation between the second emotion and the emotion probability;
(4) And processing the corresponding relation between the first emotion and the emotion probability and the corresponding relation between the second emotion and the emotion probability by using a DS fusion algorithm, and determining the emotion of the user.
In the step (1), the process of processing the facial image of the user by using the Xception network to obtain the feature vector of the facial image of the user is a prior art, and will not be described herein.
The STNN first passes the input through two Conv3D layers, then one MaxPooling layer, two ConvLSTM layers, one flatten layer and two fully connected layers, and finally through a softmax function.
Each Conv3D layer in the STNN has 32 convolution kernels of size 3×3×15, where 3×3 is the spatial receptive field and 15 is the temporal depth.
The MaxPooling layer uses a 3×3 pooling kernel with a stride of 1×1 and 'valid' padding.
Finally, the predicted values of the six micro-expressions are obtained from the output of the second fully connected layer through the softmax function.
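The following Keras sketch assembles the STNN branch as described (two Conv3D layers with 32 kernels of size 3×3×15, a MaxPooling layer, two ConvLSTM layers, a flatten layer, two fully connected layers and a softmax). The input size of 30 frames of 64×64 grayscale faces, the 'same' convolution padding, the ConvLSTM filter count and the width of the first dense layer are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_stnn(frames=30, height=64, width=64, num_emotions=6):
    """Sketch of the STNN: Conv3D x2 -> MaxPooling3D -> ConvLSTM2D x2 -> Flatten -> Dense x2 -> softmax."""
    inputs = keras.Input(shape=(frames, height, width, 1))        # (time, H, W, channels)
    # 32 kernels, 3x3 spatial receptive field, temporal depth 15 (time axis first in this layout)
    x = layers.Conv3D(32, kernel_size=(15, 3, 3), padding="same", activation="relu")(inputs)
    x = layers.Conv3D(32, kernel_size=(15, 3, 3), padding="same", activation="relu")(x)
    # 3x3 spatial pooling with stride 1 and 'valid' padding, as quoted above;
    # leaving the time axis unpooled is an interpretation.
    x = layers.MaxPooling3D(pool_size=(1, 3, 3), strides=(1, 1, 1), padding="valid")(x)
    x = layers.ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=False)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(num_emotions, activation="softmax")(x)  # six micro-expression classes
    return keras.Model(inputs, outputs)
```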
The processing of the facial image of the user with the space-time neural network STNN to obtain its spatial dimension feature, short-time dimension feature and long-time dimension feature is prior art and is not described here.
In the step (2), the specific process of inputting the feature vector of the facial image of the user into the softmax algorithm model to obtain the corresponding relationship between the first emotion and the emotion probability is a prior art, and will not be described herein.
In the step (3), the specific process of inputting the spatial dimension feature, the short-time dimension feature, and the long-time dimension feature of the facial image of the user into the softmax algorithm model to obtain the corresponding relationship between the second emotion and the emotion probability is the prior art, and will not be described herein.
In the step (4), the specific process of determining the emotion of the user by using the DS fusion algorithm to process the correspondence between the first emotion and the emotion probability and the correspondence between the second emotion and the emotion probability is a prior art and will not be described herein.
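For illustration, a minimal Dempster-Shafer (DS) combination of the two emotion probability distributions is sketched below, treating each distribution as a mass function over the six singleton emotion hypotheses. This is a simplifying assumption; the text does not detail its DS formulation.

```python
import numpy as np

def ds_fuse(probs_a: np.ndarray, probs_b: np.ndarray) -> np.ndarray:
    """Dempster's rule of combination for two emotion probability vectors.

    Each vector is treated as a mass function over the six singleton emotion
    hypotheses; masses on composite sets are not modelled (an assumption).
    """
    joint = probs_a * probs_b                 # agreement on each emotion
    conflict = 1.0 - joint.sum()              # total conflicting mass K
    return joint / (1.0 - conflict)           # normalize by 1 - K

# Example: fuse the Xception-branch and STNN-branch outputs, then take the argmax.
p1 = np.array([0.10, 0.55, 0.05, 0.10, 0.15, 0.05])   # first emotion-probability mapping
p2 = np.array([0.20, 0.40, 0.10, 0.10, 0.10, 0.10])   # second emotion-probability mapping
fused = ds_fuse(p1, p2)
print(fused.argmax())                                  # index of the user's emotion after fusion
```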
After determining the emotion of the user using the above procedure, the method may further continue with the following steps (1) to (2):
(1) When the emotion of the user is the neutral emotion, the happy emotion or the surprise emotion, obtaining a reply text set, matching each reply text in the reply text set with the emotion of the user by means of a Smooth Inverse Frequency (SIF) algorithm, and determining, from the reply texts in the set, the reply text whose matching degree with the user's emotion is the highest and feeding it back to the user;
(2) When the emotion of the user is the sad emotion, the angry emotion or the disgusted emotion, notifying an online human customer-service agent to answer the user.
In the step (1), the reply text set is preset in the robot.
And the reply text set is used for recording reply texts replying to the questions raised by the user. The reply text may match the neutral emotion, the happy emotion, or the surprise emotion.
The process of matching each reply text in the reply text set with the emotion of the user by using the SIF algorithm is a prior art and will not be described herein.
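Since the text only names the Smooth Inverse Frequency (SIF) algorithm without detailing the matching, the sketch below follows the common SIF recipe (frequency-weighted average of word vectors with the first principal component removed) and matches by cosine similarity. Representing the user's emotion or utterance as a sentence embedding, and all names, are assumptions.

```python
import numpy as np

def sif_embedding(sentences, word_vecs, word_freq, a=1e-3):
    """Smooth Inverse Frequency sentence embeddings (standard SIF recipe, sketched).

    sentences: list of token lists; word_vecs: dict word -> vector;
    word_freq: dict word -> relative frequency; a: SIF smoothing constant.
    """
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        vecs = [a / (a + word_freq.get(w, 1e-5)) * word_vecs[w] for w in tokens if w in word_vecs]
        if vecs:
            emb[i] = np.mean(vecs, axis=0)
    # remove the first principal component (the "common discourse" direction)
    u, _, _ = np.linalg.svd(emb.T @ emb)
    pc = u[:, :1]
    return emb - emb @ pc @ pc.T

def best_reply(reply_embs: np.ndarray, target_emb: np.ndarray) -> int:
    """Index of the reply text whose embedding best matches the target, by cosine similarity."""
    sims = reply_embs @ target_emb / (
        np.linalg.norm(reply_embs, axis=1) * np.linalg.norm(target_emb) + 1e-8)
    return int(sims.argmax())
```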
In this emotion recognition method, the user's emotion is recognized mainly through facial expression recognition, speech emotion recognition and text recognition; the algorithms are improved so that the recognition effect is better, the accuracy is higher, and the recognition modes are comprehensive and reasonable.
In summary, this embodiment provides an emotion recognition method that determines the emotion of the user from the feature map extracted from the voice information sent by the user and from the chromaticity feature matrix obtained by processing that voice information. Compared with the related art, in which only emotionless reply information can be fed back to the user, reply information matching the determined emotion can be fed back, thereby improving the user experience.
Example 2
The present embodiment proposes an emotion recognition device for executing the emotion recognition method proposed in embodiment 1 above.
Referring to a schematic diagram of an emotion recognition device shown in fig. 2, this embodiment proposes an emotion recognition device, including:
the extracting module 200 is configured to extract a feature map of the voice information when voice information sent by a user is obtained, so as to obtain the feature map of the voice information;
the processing module 202 is configured to perform fourier transform on the voice information to obtain voice information after fourier transform, and perform preprocessing on the voice information after fourier transform to obtain preprocessed voice information;
the first calculation module 204 is configured to calculate the pitch of the preprocessed voice information, add up the pitches that belong to the pitch set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, and normalize them to obtain a chromaticity feature matrix of the voice information;
the second calculation module 206 is configured to input the feature map and the chromaticity feature matrix of the voice information into a deep full convolution network algorithm model to obtain corresponding relations between different emotions and emotion probabilities;
a determining module 208, configured to determine, as the emotion of the user, the emotion corresponding to the largest emotion probability in the correspondence between different emotions and emotion probabilities.
Optionally, the emotion recognition device provided in this embodiment further includes:
the first determining unit is used for determining that the text information sent by the user in a typing mode is used when the text information sent by the user is obtained;
the acquisition unit is used for acquiring the input characteristic information of the user recorded by a text input method installed on the mobile terminal used by the user;
the first processing unit is used for processing the input characteristic information to obtain a characteristic vector of the input characteristic information;
the second processing unit is used for inputting the feature vector of the input feature information into a support vector machine, and inputting the processing result of the support vector machine into a softmax algorithm model to obtain the corresponding relation between emotion and emotion probability;
and the second determining unit is used for determining the emotion corresponding to the maximum emotion probability as the emotion of the user.
Optionally, the emotion recognition device provided in this embodiment further includes:
the third processing unit is used for processing the facial image of the user by utilizing an Xreception network when the facial image of the user is acquired, obtaining a feature vector of the facial image of the user, and processing the facial image of the user by utilizing a space-time neural network STNN, obtaining a space dimension feature, a short time dimension feature and a long time dimension feature of the facial image of the user;
the first computing unit is used for inputting the feature vector of the facial image of the user into a softmax algorithm model to obtain a corresponding relation between a first emotion and emotion probability;
the second computing unit is used for inputting the space dimension feature, the short-time dimension feature and the long-time dimension feature of the facial image of the user into the softmax algorithm model to obtain a corresponding relation between a second emotion and emotion probability;
and the third determining unit is used for processing the corresponding relation between the first emotion and the emotion probability and the corresponding relation between the second emotion and the emotion probability by using a DS fusion algorithm to determine the emotion of the user.
The emotions include: the neutral emotion, the happy emotion, the surprise emotion, the sad emotion, the angry emotion and the disgusted emotion.
Optionally, the emotion recognition device provided in this embodiment further includes:
the feedback unit is used for acquiring a reply text set when the emotion of the user belongs to the neutral emotion, the happy emotion or the surprise emotion, matching each reply text in the reply text set with the emotion of the user by using a Smooth Inverse Frequency algorithm, and determining a reply text with the highest emotion matching degree with the user from each reply text in the reply text set to feed back to the user;
and the notification unit is used for notifying an online human customer-service agent to answer the user when the emotion of the user is the sad emotion, the angry emotion or the disgusted emotion.
In summary, this embodiment provides an emotion recognition device that determines the emotion of the user from the feature map extracted from the voice information sent by the user and from the chromaticity feature matrix obtained by processing that voice information, so that reply information matching the determined emotion can be fed back to the user.
Example 3
The present embodiment proposes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the emotion recognition method described in embodiment 1 above. The specific implementation can be referred to method embodiment 1, and will not be described herein.
In addition, referring to the schematic structural diagram of the electronic device shown in fig. 3, this embodiment also proposes an electronic device, which includes a bus 51, a processor 52, a transceiver 53, a bus interface 54, a memory 55 and a user interface 56.
In this embodiment, the electronic device further includes: one or more programs stored on memory 55 and executable on processor 52, configured to be executed by the processor for performing steps (1) through (5) below:
(1) When voice information sent by a user is obtained, extracting a feature map of the voice information to obtain the feature map of the voice information;
(2) Performing Fourier transform on the voice information to obtain voice information after Fourier transform, and preprocessing the voice information after Fourier transform to obtain preprocessed voice information;
(3) Calculating the pitch of the preprocessed voice information, adding up the pitches that belong to the pitch set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, and normalizing them to obtain a chromaticity feature matrix of the voice information;
(4) Inputting the feature map and the chromaticity feature matrix of the voice information into a deep full convolution network algorithm model to obtain corresponding relations between different emotions and emotion probabilities;
(5) And determining the emotion corresponding to the largest emotion probability in the corresponding relation between the different emotions and the emotion probabilities as the emotion of the user.
A transceiver 53 for receiving and transmitting data under the control of the processor 52.
The bus architecture is represented by bus 51; bus 51 may comprise any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by processor 52, and memory, represented by memory 55. The bus 51 may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and are therefore not described further in this embodiment. Bus interface 54 provides an interface between bus 51 and transceiver 53. The transceiver 53 may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. For example, the transceiver 53 receives external data from other devices and transmits the data processed by the processor 52 to other devices. Depending on the nature of the computing system, a user interface 56 may also be provided, such as a keypad, display, speaker, microphone or joystick.
The processor 52 is responsible for managing the bus 51 and for general processing, and runs a general-purpose operating system as described above. The memory 55 may be used to store data used by the processor 52 in performing operations.
Alternatively, processor 52 may be, but is not limited to: a central processing unit, a single chip microcomputer, a microprocessor or a programmable logic device.
It will be appreciated that the memory 55 in the embodiments of the application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM) and Direct Rambus RAM (DRRAM). The memory 55 of the system and method described in this embodiment is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 55 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: operating system 551 and application programs 552.
The operating system 551 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 552 include various application programs such as a Media Player (Media Player), a Browser (Browser), and the like for implementing various application services. A program for implementing the method of the embodiment of the present application may be included in the application program 552.
In summary, this embodiment provides a computer-readable storage medium. The emotion of the user is determined from the feature map extracted from the voice information sent by the user and from the chromaticity feature matrix obtained by processing that voice information; compared with the related art, in which only emotionless reply information can be fed back to the user, reply information corresponding to the user's emotion can be fed back according to the determined emotion, thereby improving the user experience.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of emotion recognition, comprising:
when voice information sent by a user is obtained, extracting a feature map of the voice information to obtain the feature map of the voice information;
performing Fourier transform on the voice information to obtain voice information after Fourier transform, and preprocessing the voice information after Fourier transform to obtain preprocessed voice information;
calculating the pitch of the preprocessed voice information, adding up the pitches that belong to the pitch set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, and normalizing them to obtain a chromaticity feature matrix of the voice information;
inputting the feature map and the chromaticity feature matrix of the voice information into a deep full convolution network algorithm model to obtain corresponding relations between different emotions and emotion probabilities;
and determining the emotion corresponding to the largest emotion probability in the corresponding relation between the different emotions and the emotion probabilities as the emotion of the user.
2. The method as recited in claim 1, further comprising:
when text information sent by the user is obtained, determining that the user sent the text information by typing;
acquiring input characteristic information of a user recorded by a text input method installed on a mobile terminal used by the user;
processing the input characteristic information to obtain a characteristic vector of the input characteristic information;
inputting the feature vector of the input feature information into a support vector machine, and inputting the processing result of the support vector machine into a softmax algorithm model to obtain a corresponding relation between emotion and emotion probability;
and determining the emotion corresponding to the maximum emotion probability as the emotion of the user.
3. The method as recited in claim 1, further comprising:
when a facial image of the user is acquired, processing the facial image with an Xception network to obtain a feature vector of the facial image, and processing the facial image with a space-time neural network (STNN) to obtain a spatial dimension feature, a short-time dimension feature and a long-time dimension feature of the facial image;
inputting the feature vector of the facial image of the user into a softmax algorithm model to obtain a corresponding relation between a first emotion and emotion probability;
inputting the space dimension feature, the short-time dimension feature and the long-time dimension feature of the facial image of the user into a softmax algorithm model to obtain a corresponding relation between the second emotion and the emotion probability;
and processing the corresponding relation between the first emotion and the emotion probability and the corresponding relation between the second emotion and the emotion probability by using a DS fusion algorithm, and determining the emotion of the user.
4. A method according to any one of claims 1-3, wherein the emotion comprises: a neutral emotion, a happy emotion, a surprise emotion, a sad emotion, an angry emotion and a disgusted emotion;
the method further comprises the steps of:
when the emotion of the user belongs to the neutral emotion, the happy emotion or the surprise emotion, a reply text set is obtained, each reply text in the reply text set is matched with the emotion of the user by utilizing a Smooth Inverse Frequency algorithm, and the reply text with the highest emotion matching degree with the user is determined from each reply text in the reply text set and fed back to the user;
and when the emotion of the user is the sad emotion, the angry emotion or the disgusted emotion, notifying an online human customer-service agent to answer the user.
5. An emotion recognition device, characterized by comprising:
the extraction module is used for extracting a characteristic diagram of the voice information when the voice information sent by the user is obtained, so as to obtain the characteristic diagram of the voice information;
the processing module is used for carrying out Fourier transform on the voice information to obtain voice information after Fourier transform, and preprocessing the voice information after Fourier transform to obtain preprocessed voice information;
the first calculation module is used for calculating the pitch of the preprocessed voice information, adding up the pitches that belong to the pitch set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, and normalizing them to obtain a chromaticity feature matrix of the voice information;
the second calculation module is used for inputting the feature map and the chromaticity feature matrix of the voice information into a deep full convolution network algorithm model to obtain corresponding relations between different emotions and emotion probabilities;
and the determining module is used for determining the emotion corresponding to the largest emotion probability in the corresponding relation between the different emotions and the emotion probabilities as the emotion of the user.
6. The apparatus as recited in claim 5, further comprising:
the first determining unit is used for determining, when text information sent by the user is obtained, that the user sent the text information by typing;
the acquisition unit is used for acquiring the input characteristic information of the user recorded by a text input method installed on the mobile terminal used by the user;
the first processing unit is used for processing the input characteristic information to obtain a characteristic vector of the input characteristic information;
the second processing unit is used for inputting the feature vector of the input feature information into a support vector machine, and inputting the processing result of the support vector machine into a softmax algorithm model to obtain the corresponding relation between emotion and emotion probability;
and the second determining unit is used for determining the emotion corresponding to the maximum emotion probability as the emotion of the user.
7. The apparatus as recited in claim 5, further comprising:
the third processing unit is used for, when a facial image of the user is acquired, processing the facial image with an Xception network to obtain a feature vector of the facial image, and processing the facial image with a space-time neural network STNN to obtain a spatial dimension feature, a short-time dimension feature and a long-time dimension feature of the facial image;
the first computing unit is used for inputting the feature vector of the facial image of the user into a softmax algorithm model to obtain a corresponding relation between a first emotion and emotion probability;
the second computing unit is used for inputting the space dimension feature, the short-time dimension feature and the long-time dimension feature of the facial image of the user into the softmax algorithm model to obtain a corresponding relation between a second emotion and emotion probability;
and the third determining unit is used for processing the corresponding relation between the first emotion and the emotion probability and the corresponding relation between the second emotion and the emotion probability by using a DS fusion algorithm to determine the emotion of the user.
8. The apparatus of any one of claims 5-7, wherein the emotion comprises: a neutral emotion, a happy emotion, a surprise emotion, a sad emotion, an angry emotion and a disgusted emotion;
the apparatus further comprises:
the feedback unit is used for acquiring a reply text set when the emotion of the user is the neutral emotion, the happy emotion or the surprise emotion, matching each reply text in the reply text set with the emotion of the user by using a Smooth Inverse Frequency algorithm, and determining, from the reply texts in the set, the reply text whose matching degree with the user's emotion is the highest and feeding it back to the user;
and the notification unit is used for notifying an online human customer-service agent to answer the user when the emotion of the user is the sad emotion, the angry emotion or the disgusted emotion.
9. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor performs the steps of the method of any of the preceding claims 1-4.
10. An electronic device comprising a memory, a processor, wherein one or more programs are stored in the memory and configured to perform the steps of the method of any of claims 1-4 by the processor.
CN202111093158.4A 2021-09-17 2021-09-17 Emotion recognition method and device and electronic equipment Active CN113707185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111093158.4A CN113707185B (en) 2021-09-17 2021-09-17 Emotion recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111093158.4A CN113707185B (en) 2021-09-17 2021-09-17 Emotion recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113707185A CN113707185A (en) 2021-11-26
CN113707185B true CN113707185B (en) 2023-09-12

Family

ID=78661036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111093158.4A Active CN113707185B (en) 2021-09-17 2021-09-17 Emotion recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113707185B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102133728B1 (en) * 2017-11-24 2020-07-21 주식회사 제네시스랩 Device, method and readable media for multimodal recognizing emotion based on artificial intelligence
CN111368609B (en) * 2018-12-26 2023-10-17 深圳Tcl新技术有限公司 Speech interaction method based on emotion engine technology, intelligent terminal and storage medium
CN110110653A (en) * 2019-04-30 2019-08-09 上海迥灵信息技术有限公司 The Emotion identification method, apparatus and storage medium of multiple features fusion
CN110909131A (en) * 2019-11-26 2020-03-24 携程计算机技术(上海)有限公司 Model generation method, emotion recognition method, system, device and storage medium
CN111798873A (en) * 2020-05-15 2020-10-20 厦门快商通科技股份有限公司 Voice emotion recognition method and device based on 3-d convolutional neural network
CN113143270A (en) * 2020-12-02 2021-07-23 长春理工大学 Bimodal fusion emotion recognition method based on biological radar and voice information
CN113380271B (en) * 2021-08-12 2021-12-21 明品云(北京)数据科技有限公司 Emotion recognition method, system, device and medium

Also Published As

Publication number Publication date
CN113707185A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
EP4113507A1 (en) Speech recognition method and apparatus, device, and storage medium
US10789942B2 (en) Word embedding system
CN109923558A (en) Mixture of expert neural network
CN112732911A (en) Semantic recognition-based conversational recommendation method, device, equipment and storage medium
US20240029436A1 (en) Action classification in video clips using attention-based neural networks
CN112418059A (en) Emotion recognition method and device, computer equipment and storage medium
CN111581623B (en) Intelligent data interaction method and device, electronic equipment and storage medium
CN113240510A (en) Abnormal user prediction method, device, equipment and storage medium
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN110955765A (en) Corpus construction method and apparatus of intelligent assistant, computer device and storage medium
CN112995414A (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN116186295B (en) Attention-based knowledge graph link prediction method, attention-based knowledge graph link prediction device, attention-based knowledge graph link prediction equipment and attention-based knowledge graph link prediction medium
CN113707185B (en) Emotion recognition method and device and electronic equipment
CN116580704A (en) Training method of voice recognition model, voice recognition method, equipment and medium
CN116127049A (en) Model training method, text generation method, terminal device and computer medium
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN110931002A (en) Human-computer interaction method and device, computer equipment and storage medium
CN114519094A (en) Method and device for conversational recommendation based on random state and electronic equipment
CN114218356A (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN112699213A (en) Speech intention recognition method and device, computer equipment and storage medium
CN113886556B (en) Question answering method and device and electronic equipment
CN113096649B (en) Voice prediction method, device, electronic equipment and storage medium
CN113934825B (en) Question answering method and device and electronic equipment
CN111340218B (en) Method and system for training problem recognition model
CN116230023A (en) Speech emotion recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant