CN116564281B - Emotion recognition method and device based on AI

Info

Publication number: CN116564281B (application CN202310825516.9A)
Authority: CN (China)
Prior art keywords: emotion, intensity, text content, probability, feature vector
Legal status: Active (granted)
Other versions: CN116564281A
Inventors: 王英 (Wang Ying), 李伟 (Li Wei)
Applicant / Assignee: 4u Beijing Technology Co., Ltd.

Classifications

    • G10L 15/02 - Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/183 - Speech classification or search using natural language modelling with context dependencies, e.g. language models
    • G10L 15/26 - Speech-to-text systems
    • G10L 25/63 - Speech or voice analysis specially adapted for estimating an emotional state
    • G10L 2015/088 - Word spotting
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides an AI-based emotion recognition method and device. The method includes: in response to receiving voice data of a user, extracting audio features from the voice data and converting the voice data into text content; identifying emotion words that characterize emotion in the text content, and determining intensity qualifiers that characterize emotion intensity based on the position of the emotion words in the text content; determining the emotion type corresponding to the text content based on the emotion words, and determining the emotion intensity corresponding to the text content based on the intensity qualifiers and the audio features; and identifying the emotion of the user based on the emotion type and the emotion intensity. The application addresses the technical problem in the related art that a user's emotion cannot be recognized at a fine-grained level.

Description

Emotion recognition method and device based on AI
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an emotion recognition method and device based on AI.
Background
AI digital persons are virtual characters created with artificial intelligence techniques that are highly realistic in appearance, motion, and speech. Through AI algorithms, an AI digital person can simulate human appearance, behavior, and ways of communicating, making it difficult to distinguish from a real person both visually and audibly.
An AI digital person can serve as a digital employee in an enterprise, for example as a professional customer-service agent, an administrative receptionist, or a sales host, providing services such as content distribution, brand marketing, and sales conversion. It can be deployed in a variety of terminal scenarios, such as PCs, apps, mini-programs, and VR/MR, to meet the diverse needs of different industries, improve data interaction capabilities, and help enterprises grow through marketing.
However, although current AI digital person interaction technology uses machine learning algorithms and natural language processing so that the digital person can understand and respond to a user's questions or interactions, it responds only on the basis of the voice or text data the user inputs and cannot take the user's emotion into account. This means that the user's emotional state cannot be accurately recognized and handled when interacting with an AI digital person.
Recognizing the user's emotion is critical to providing personalized and emotionally aware services. By accurately perceiving and understanding the user's emotion, an AI digital person can better respond to the user's needs and provide corresponding support and solutions. Therefore, developing an emotion recognition technology that enables AI digital persons to accurately capture and analyze users' emotional changes is a technical problem to be solved.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the application provide an AI-based emotion recognition method and device, which at least solve the technical problem in the related art that a user's emotion cannot be recognized at a fine-grained level.
According to one aspect of the embodiments of the application, an AI-based emotion recognition method is provided, including: in response to receiving voice data of a user, extracting audio features from the voice data and converting the voice data into text content; identifying emotion words that characterize emotion in the text content, and determining intensity qualifiers that characterize emotion intensity based on the position of the emotion words in the text content; determining the emotion type corresponding to the text content based on the emotion words, and determining the emotion intensity corresponding to the text content based on the intensity qualifiers and the audio features; and identifying the emotion of the user based on the emotion type and the emotion intensity.
According to another aspect of the embodiment of the present application, there is also provided an AI-based emotion recognition device including: a preprocessing module configured to extract audio features from voice data of a user in response to receiving the voice data, and to convert the voice data into text content; a word recognition module configured to recognize emotion words in the text content for representing emotion and determine intensity qualifiers for representing emotion intensities based on the position of the emotion words in the text content; an intensity determination module configured to determine a type of emotion corresponding to the text content based on the emotion word and determine an intensity of emotion corresponding to the text content based on the intensity qualifier and the audio feature; an emotion recognition module configured to recognize an emotion of the user based on the emotion type and the emotion intensity.
In the embodiments of the application, in response to receiving voice data of a user, audio features are extracted from the voice data and the voice data is converted into text content; emotion words that characterize emotion are identified in the text content, and intensity qualifiers that characterize emotion intensity are determined based on the position of the emotion words in the text content; the emotion type corresponding to the text content is determined based on the emotion words, and the emotion intensity corresponding to the text content is determined based on the intensity qualifiers and the audio features; and the emotion of the user is identified based on the emotion type and the emotion intensity. This scheme solves the technical problem in the related art that a user's emotion cannot be recognized at a fine-grained level.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of an AI-based emotion recognition method in accordance with an embodiment of the present application;
FIG. 2 is a flow chart of another AI-based emotion recognition method in accordance with an embodiment of the present application;
FIG. 3 is a flow chart of a method of processing speech data according to an embodiment of the application;
FIG. 4 is a flow chart of a method of determining an application scenario of text content according to an embodiment of the present application;
FIG. 5 is a flow chart of a method of obtaining emotion words and intensity qualifiers in accordance with an embodiment of the present application;
FIG. 6 is a flow chart of a method of calculating emotion intensity according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an AI-based emotion recognition device according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present application. As used herein, the singular forms are intended to include the plural forms as well unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description. Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate. In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Example 1
The embodiment of the application provides an AI-based emotion recognition method, as shown in FIG. 1, which comprises the following steps:
step S102, in response to receiving voice data of a user, extracting audio features from the voice data and converting the voice data into text content.
When the AI digital person receives voice data from a user, it first performs audio feature extraction and speech-to-text conversion. Specifically, the received voice data is preprocessed using signal processing techniques, including removing noise and adjusting volume, to ensure that the extracted audio features are of high quality.
Next, audio features are extracted from the preprocessed speech data by a feature extraction algorithm. The audio features may include spectral information, pitch, intensity, and similar data. These features describe different aspects of the speech signal, such as the frequency distribution of the sound, the pitch level, and the volume level. By extracting these features, the system can better understand the content and characteristics of the speech signal.
The preprocessed speech data is then converted into text content using speech recognition techniques, which convert speech into a corresponding textual representation by pattern matching and language model analysis of the audio features. In this way, the user's speech input is turned into a text form that can be processed and understood.
By converting the voice data into text content, the AI digital person can more conveniently perform subsequent processing such as emotion recognition and emotion analysis, so as to recognize and respond to the user's emotion.
Step S104, identifying emotion words that characterize emotion in the text content, and determining intensity qualifiers that characterize emotion intensity based on the position of the emotion words in the text content.
First, features are extracted from the words within the context window of the emotion word, based on the position of the emotion word in the text content, to obtain a feature vector.
The feature vector is then classified using a classification algorithm to determine the intensity qualifier. For example, the feature vector is discretized with an equal-frequency discretization method to obtain a discrete feature vector; the posterior probability of the discrete feature vector under each intensity-defining class is then calculated, and the intensity-defining class with the highest posterior probability is selected as the classification result of the discrete feature vector, where the posterior probability is the probability that a feature vector appears under a given intensity-defining class.
In this embodiment, applying equal-frequency discretization to the continuous feature values converts the feature vector into a discrete feature vector, which simplifies the emotion intensity classification problem. Such discretization helps to reduce the complexity of the feature space and improves classification efficiency and accuracy. Further, by calculating the posterior probability of the discrete feature vector under each intensity-defining class, the feature vector can be classified more precisely. The posterior probability represents the probability that the feature vector appears under a certain intensity-defining class, and the degree to which the feature vector belongs to each class can be determined by computing it for every class. Selecting the intensity-defining class with the highest posterior probability as the classification result makes it easier to determine the intensity qualifier expressed by the feature vector.
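As an illustration of the equal-frequency discretization step described above, the following sketch bins each continuous feature dimension into quantile-based intervals so that each interval holds roughly the same number of samples. It is a minimal example assuming the feature vectors are already collected in a NumPy array; the bin count and the toy data are arbitrary illustrative choices, not values from the patent.

```python
import numpy as np

def equal_frequency_discretize(features: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Convert continuous feature vectors (n_samples, n_dims) into discrete
    bin indices using equal-frequency (quantile) binning per dimension."""
    discrete = np.empty_like(features, dtype=int)
    for d in range(features.shape[1]):
        # Quantile edges split the column so each bin gets roughly equal sample counts.
        edges = np.quantile(features[:, d], np.linspace(0, 1, n_bins + 1)[1:-1])
        discrete[:, d] = np.digitize(features[:, d], edges)
    return discrete

# Toy usage: four 2-dimensional context feature vectors.
X = np.array([[0.1, 3.0], [0.4, 1.0], [0.9, 2.0], [0.7, 5.0]])
print(equal_frequency_discretize(X, n_bins=2))
```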
In some embodiments, the posterior probability of a discrete feature vector under each intensity-defining class may be calculated in the following manner: calculating a conditional probability of a discrete feature vector under each intensity-defining class, wherein the conditional probability represents a probability of occurrence of a feature vector given the intensity-defining class; calculating the prior probability of each intensity-defining class, wherein the prior probability represents the probability that all data in the training data set are classified into one intensity-defining class; based on the conditional probability and the prior probability, the posterior probability of the discrete feature vector under each intensity-defining class is calculated.
The present embodiment can evaluate the probability of the feature vector appearing under each intensity-defining class given the intensity-defining class by calculating the conditional probability of the discrete feature vector under that intensity-defining class. This helps to understand the degree of association between the feature vector and the different intensity-defining classes, further revealing the intensity information expressed by the feature vector. Further, by calculating the prior probability for each intensity-defining class, the probability of the training dataset being classified into data for each intensity-defining class can be measured. The prior probability provides knowledge of the distribution of intensity-defining classes throughout the dataset, providing an important reference for the calculation of posterior probabilities. By comprehensively considering the conditional probability of the feature vector and the prior probability of each intensity-limited class, the posterior probability of the discrete feature vector under each intensity-limited class can be calculated more accurately. In summary, computing the posterior probability of the discrete feature vector under each intensity-defining class in the above manner may provide a more comprehensive and accurate intensity classification result.
Step S106, determining the emotion type corresponding to the text content based on the emotion words, and determining the emotion intensity corresponding to the text content based on the intensity qualifier and the audio features.
First, the emotion type is determined. For example, based on a string lookup, the emotion words are found from the text content, and then the emotion type is determined based on the emotion words.
Then, an intensity weight is determined. For example, scene features are extracted from the text content and matched for similarity against each application scene in an application scene library to determine the application scene type corresponding to the text content; the intensity weight is then determined based on the application scene type and the intensity qualifier, where the same intensity qualifier has different intensity weights under different application scene types.
In this embodiment, by extracting scene features from text content and performing similarity matching with each application scene in the application scene library, the application scene type corresponding to the text content may be determined. This helps to understand the specific context and context in which the text content is located, further providing background information about the emotional expression. Furthermore, under different application scenario types, the same intensity qualifier may have different intensity weights, because the intensity of emotional expressions may differ under different circumstances. By considering the application scene type, the emotion intensity can be more accurately adjusted and quantified. Finally, by determining the intensity weight, the intensity of the emotion expression can be matched with a specific application scene and context, so that emotion classification is more accurate and reliable.
Then, the emotion intensity is determined. For example, a first emotion intensity value of the text content is calculated based on the intensity qualifier and the intensity weight; a second emotion intensity value corresponding to the text content is identified based on the audio features, where the audio features include a pitch feature and a speed feature; and the emotion intensity corresponding to the text content is determined based on the first emotion intensity value and the second emotion intensity value.
The present embodiment can quantify and associate emotion intensity with emotion words and context by calculating a first emotion intensity value of text content based on the intensity qualifier and the intensity weight. This helps to determine the overall level and strength of emotion. Furthermore, by identifying a second emotion intensity value corresponding to the text content, in particular a pitch feature and a speed feature, based on the audio features, the emotion can be assessed complementarily from a sound perspective. The pitch characteristics may reflect the frequency characteristics of the sound, while the speed characteristics may reflect the speed of speech and the variation of intonation. In combination with these audio features, details of the emotional expressions and sound attributes may be analyzed more fully. Finally, based on the first emotion intensity value and the second emotion intensity value, emotion information of the text content and emotion indexes of audio features can be comprehensively considered, so that emotion intensity corresponding to the text content can be determined. By comprehensively considering the emotion expressions of the text and the sound, more comprehensive and accurate emotion assessment and analysis results can be provided.
Step S108, based on the emotion type and the emotion intensity, identifying the emotion of the user.
The embodiment realizes the identification of the emotion of the user by matching and combining the emotion type and the emotion intensity. For example, if the emotion type is determined to be anger and the emotion intensity is determined to be high, it may be inferred that the user is experiencing a strong anger emotion. Similarly, from the different emotional types and corresponding emotional intensity values, a particular emotional state that the user may be experiencing may be determined.
Example 2
The embodiment of the application provides another emotion recognition method based on AI, as shown in FIG. 2, which comprises the following steps:
step S202, voice data of a user is acquired.
A user may provide voice data through a voice input device, such as a microphone or a speech recognition application, through which speech can be entered directly into the terminal device.
Step S204, processing the voice data to obtain the audio characteristics and text content of the voice data.
The method for processing voice data, as shown in FIG. 3, includes the following steps:
step S2042, preprocessing the voice data.
A preprocessing step is required to improve the audio quality before any processing of the speech data is performed. For example, operations such as removing noise, reducing echo, adjusting volume, etc., are performed to ensure that the audio data obtained by subsequent processing has a higher quality.
And step S2044, extracting audio features.
Audio features are extracted from the preprocessed speech data. The audio features may include spectral information, pitch, intensity, etc. These features can describe different aspects of the speech signal, such as frequency distribution of sound, pitch level, volume level, etc.
Step S2046 converts the voice data into text content.
The preprocessed speech data, together with the extracted features, is converted into text content. Speech recognition techniques are used to convert the voice data into a corresponding text representation; they process the audio features with pattern matching and language model analysis to recognize the language content contained in the speech. Common speech recognition methods include those based on hidden Markov models (Hidden Markov Model, HMM) and deep learning methods (e.g., recurrent neural networks and long short-term memory networks).
The text content in the present embodiment may be a phrase, a sentence, or a fragment composed of a plurality of sentences, and the present embodiment does not limit the length and form of the text content.
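As a rough illustration of steps S2042 to S2046, the sketch below uses librosa for preprocessing and for the pitch and intensity features, and a pretrained speech-recognition pipeline from the transformers library for the speech-to-text step. These particular libraries, the 16 kHz sampling rate, and the model name are illustrative assumptions; the patent does not prescribe a specific toolkit.

```python
# A sketch of steps S2042-S2046, assuming librosa and transformers are installed.
import librosa
import numpy as np

def extract_audio_features(wav_path: str) -> dict:
    """Rough stand-ins for the audio features named above: pitch (fundamental
    frequency), intensity (RMS energy), and a voicing ratio used here as a
    crude proxy for speaking tempo."""
    y, sr = librosa.load(wav_path, sr=16000)      # assumed sampling rate
    y, _ = librosa.effects.trim(y)                # crude preprocessing: trim silence
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    return {
        "pitch_mean": float(np.nanmean(f0)),        # average fundamental frequency
        "pitch_std": float(np.nanstd(f0)),          # pitch variability
        "intensity_mean": float(rms.mean()),        # average volume level
        "voiced_ratio": float(np.mean(voiced_flag)),
    }

def speech_to_text(wav_path: str) -> str:
    """One possible speech-to-text backend; the model name is an assumption."""
    from transformers import pipeline
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    return asr(wav_path)["text"]
```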
Step S206, determining an application scenario of the text content.
As shown in FIG. 4, the method for determining the application scene of text content includes:
step S2062, establishing an application scene library.
First, an application scenario library containing various application scenario types needs to be built. This library may include a plurality of application scenarios, each having a corresponding feature description, such as keywords, topics, domain knowledge, and so on. These feature descriptions may be used to represent the features and characteristics of each application scenario.
And step S2064, extracting text features.
For a given text content, information representative of its characteristics needs to be extracted from it. For example, the TF-IDF vector representation method is used to convert text into a vector representation, capturing key information and features in the text.
Step S2066, scene feature extraction.
From the text features, information representing scene features is extracted therefrom. This may be accomplished by analyzing keywords in the text, contextual information, semantic representations, and the like. For example, specific words, parts of speech, emotion words, behavioural verbs, etc. in the text may be extracted as scene features.
In step S2068, the similarities are matched.
The extracted scene features are matched for similarity against each application scene in the application scene library. Similarity matching may use various distance metrics or similarity calculation methods, such as cosine similarity, Euclidean distance, or Jaccard similarity. By calculating the similarity between the text features and the application scene features, the degree of matching between the text and each application scene can be evaluated.
In step S2069, the application scene type is determined.
Based on the result of the similarity matching, the application scene type with the highest similarity to the text content is selected as the final application scene type. A similarity threshold may be set, and a match is considered successful only if the similarity exceeds the threshold.
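A compact sketch of steps S2062 to S2069 follows, using scikit-learn's TfidfVectorizer and cosine similarity. The toy scene library, its feature descriptions, and the 0.2 threshold are invented for illustration; for Chinese text the input would normally be segmented first (as in step S2082 below) before TF-IDF is applied.

```python
# A sketch of steps S2062-S2069; scene library, descriptions and threshold are invented.
from typing import Optional
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SCENE_LIBRARY = {  # S2062: application scene type -> feature description
    "customer_service": "refund order complaint delivery after-sales support",
    "sales": "price discount promotion buy purchase recommendation",
    "front_desk": "appointment visit registration meeting reception",
}

def match_scene(text: str, threshold: float = 0.2) -> Optional[str]:
    """S2064-S2069: TF-IDF vectors for the text and every scene description,
    cosine similarity against each scene, highest score wins if above threshold."""
    names = list(SCENE_LIBRARY)
    matrix = TfidfVectorizer().fit_transform([text] + [SCENE_LIBRARY[n] for n in names])
    sims = cosine_similarity(matrix[0], matrix[1:])[0]
    best = int(sims.argmax())
    return names[best] if sims[best] >= threshold else None

print(match_scene("Please refund my order, the delivery was late"))  # -> customer_service
```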
Step S208, based on the text content, the emotion words and intensity qualifiers in the text content are obtained.
As shown in FIG. 5, the method for obtaining the emotion words and the intensity qualifiers includes the following steps:
and step S2082, word segmentation is performed on the text content.
The text content is subjected to word segmentation processing, and the text is split into individual words or phrases. Text may be partitioned into sequences of words using chinese word segmentation tools, such as jieba, etc.
Step S2084, emotion word extraction.
Emotion words in the text are identified using an emotion dictionary or corpus. An emotion dictionary contains emotion words and their emotion polarities (e.g., positive, negative, neutral); emotion words can be determined by checking whether the words in the text appear in the dictionary. An existing Chinese emotion lexicon, such as a HowNet-based sentiment dictionary, may be used.
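The sketch below illustrates steps S2082 and S2084 with jieba segmentation and a tiny handcrafted emotion dictionary; the dictionary entries and their polarity values are stand-ins for a real lexicon such as a HowNet-based sentiment dictionary.

```python
import jieba

# Toy emotion dictionary: word -> (polarity, base emotion value).
EMOTION_DICT = {
    "高兴": ("positive", 2.0),   # happy
    "生气": ("negative", -2.0),  # angry
    "难过": ("negative", -1.5),  # sad
}

def extract_emotion_words(text: str):
    """Segment the text and return (position, word, polarity, value) for every
    token found in the emotion dictionary."""
    tokens = list(jieba.cut(text))
    hits = []
    for i, tok in enumerate(tokens):
        if tok in EMOTION_DICT:
            polarity, value = EMOTION_DICT[tok]
            hits.append((i, tok, polarity, value))
    return tokens, hits

tokens, hits = extract_emotion_words("我今天非常生气")
print(tokens, hits)
```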
Step S2086, intensity qualifier recognition.
First, features are extracted from the words within the context window of the emotion word, based on the position of the emotion word in the text content, to obtain a feature vector.
The context window of the emotion word is determined from its position in the text content; a fixed number of preceding and following words may be selected as the window, for example the 5 words before and after the emotion word. Features related to the emotion word are then extracted within this window. For example, the following features may be considered: the words themselves within the window; the parts of speech of those words; whether a word in the window is itself an emotion word; and the word frequencies of the words in the window. The extracted features are then converted into a feature vector, for example using a TF-IDF representation. Each feature corresponds to one dimension of the feature vector, and the values in the vector may represent the importance or frequency of occurrence of the feature within the context.
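Continuing the toy example, the following sketch extracts a window of up to 5 tokens on each side of an emotion word and turns the window into a bag-of-words feature vector; the window size, the candidate vocabulary, and the use of plain counts instead of TF-IDF weights are illustrative simplifications.

```python
from collections import Counter
from typing import List

def context_window(tokens: List[str], position: int, size: int = 5) -> List[str]:
    """Tokens within `size` positions before and after the emotion word."""
    lo, hi = max(0, position - size), min(len(tokens), position + size + 1)
    return [t for i, t in enumerate(tokens[lo:hi], start=lo) if i != position]

def window_to_vector(window: List[str], vocabulary: List[str]) -> List[int]:
    """Bag-of-words counts over a fixed vocabulary; each vocabulary entry is
    one dimension of the feature vector."""
    counts = Counter(window)
    return [counts.get(word, 0) for word in vocabulary]

vocab = ["非常", "有点", "一点", "真的", "太"]          # candidate degree adverbs
tokens = ["我", "今天", "非常", "生气"]
vec = window_to_vector(context_window(tokens, position=3), vocab)
print(vec)   # -> [1, 0, 0, 0, 0]
```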
Then, the continuous values of the feature vector are discretized with an equal-frequency discretization method to obtain the discrete feature vector. Equal-frequency discretization divides the range of a continuous value into intervals that each contain the same number of samples. In this way, continuous-valued features are converted into discrete features, which is convenient for subsequent processing and calculation.
Subsequently, intensity qualifiers are identified.
1) Conditional probabilities are calculated separately.
For a discretized feature vector, its conditional probability under each intensity-defining class must be calculated. The conditional probability represents the probability that the feature vector appears given a particular intensity-defining class. The specific steps are as follows: for each intensity-defining class, count the number of times the feature vector occurs under that class; the conditional probability is then obtained by dividing this count by the total number of all feature vectors under that class.
In some embodiments, the conditional probability may be calculated based on the count of feature vectors in the intensity-defining class, the sum of the counts of feature vectors, a parameter controlling the intensity of the smoothing, a parameter controlling the degree of smoothing, and the relative frequency of feature vectors in the intensity-defining class. For example, the conditional probability can be calculated using the following formula:
where count(v, c) represents the count of feature vector v in intensity-defining class c; total_count represents the sum of the counts of all feature vectors; k is a non-negative integer controlling the strength of the smoothing; λ is a smoothing parameter controlling the degree of smoothing; and p(v|c) is the relative frequency of feature vector v in intensity-defining class c.
In this embodiment, by adjusting the value of k, the count of feature vectors can be smoothed when calculating the conditional probability. Smoothing can alleviate sparsity problems in feature vector counts and reduce the risk of overfitting. Adjusting the smoothing intensity may balance the fitting ability and generalization ability of the model according to the specific situation. By adjusting the value of λ, the degree of smoothing can be controlled. Controlling the lambda value to be greater than the preset threshold increases the smoothing effect, reduces the influence of the count of feature vectors on the conditional probability, and thereby smoothes the estimate of the conditional probability. The control of the smoothness degree can be adjusted according to the characteristics and the requirements of the data set so as to achieve better model performance. Furthermore, the relative frequency p (v|c) takes into account the relative frequency of the feature vector in the intensity-defining class c. This allows a more accurate estimation of the probability of a feature vector under a given intensity-defining class. The relative frequency reflects the relative importance of the feature vector in the intensity-limited category, and can better reflect the association relationship between the feature and emotion. In summary, by comprehensively considering the count, the total count, the smoothing parameter and the relative frequency of the feature vector, the embodiment can calculate the conditional probability more accurately, thereby improving the performance and generalization capability of the emotion classification model. It can handle sparsity problems of feature vector counts and balance the relationship between fitting ability and generalization ability.
In some embodiments, the relative frequency may be obtained by:
p(v|c) = count(v, c) / Σ_{v'∈V} count(v', c)
where count(v, c) represents the count of the particular feature vector v in the intensity-defining class c, V represents the set of all feature vectors, v' ranges over the feature vectors in V, and count(v', c) represents the count of feature vector v' in the intensity-defining class c.
This embodiment calculates the relative frequency of a feature vector under a given intensity-defining class, i.e., the ratio of the frequency of occurrence of feature vector v in the intensity-defining class c to the frequency of occurrence of all feature vectors in the intensity-defining class c. It reflects the importance or significance of the feature vector in that class. By calculating the relative frequencies, the contribution of the feature vectors to emotion classification can be better understood, and more accurate information is provided for the conditional probability calculation.
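The exact smoothed formula referred to above appears only as an image in the source and is not reproduced here, so the sketch below merely assumes one plausible combination of the quantities that are described: the raw count is augmented by k·λ pseudo-counts distributed according to the relative frequency p(v|c). Treat the smoothing formula as an assumption rather than the patent's own equation; only the relative-frequency part follows directly from the definition above.

```python
# ASSUMPTION: the smoothing formula below is one plausible reading of the description,
# not the patent's exact equation (which is only given as an image in the source).
from collections import Counter
from typing import Dict, Tuple

Vec = Tuple[int, ...]   # a discretized context feature vector

def relative_frequency(class_counts: Counter, v: Vec) -> float:
    """p(v|c): count of v in class c divided by the count of all vectors in c."""
    total_in_class = sum(class_counts.values())
    return class_counts[v] / total_in_class if total_in_class else 0.0

def smoothed_conditional(counts_by_class: Dict[str, Counter], v: Vec, c: str,
                         k: int = 1, lam: float = 0.5) -> float:
    """Assumed form: (count(v,c) + k*lam*p(v|c)) / (total_count + k*lam)."""
    total_count = sum(sum(cnt.values()) for cnt in counts_by_class.values())
    p_vc = relative_frequency(counts_by_class[c], v)
    return (counts_by_class[c][v] + k * lam * p_vc) / (total_count + k * lam)

# Toy training counts of discretized context vectors per intensity-defining class.
counts_by_class = {
    "high": Counter({(1, 0): 3, (0, 1): 1}),
    "low":  Counter({(0, 1): 4}),
}
print(smoothed_conditional(counts_by_class, (1, 0), "high"))
```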
2) The prior probability is calculated.
The prior probability represents the probability that all feature vectors fall into the respective intensity-defining class. The step of calculating the prior probability for each intensity-defining class is as follows: the number of feature vectors under each intensity-defining class is counted. The prior probability for each intensity-defining class is calculated. The prior probability may be calculated by dividing the number of feature vectors under the intensity-defining class by the total number of all feature vectors.
For example, the prior probability may be calculated by:
prior = count(emotion) / total_count
where count(emotion) represents the total number of feature vectors under the given intensity-defining class, and total_count represents the total number of all feature vectors.
3) The posterior probability is calculated based on the conditional probability and the prior probability.
The posterior probability represents the probability of a discrete feature vector under each intensity-defining class. The step of calculating the posterior probability of the discrete feature vector under each intensity-defining class is as follows: for each intensity-limited category, multiplying the conditional probability of the feature vector by the prior probability of the intensity-limited category to obtain the posterior probability of the feature vector under the intensity-limited category.
For example, the posterior probability may be calculated by multiplying the two quantities:
posterior(c|v) = P(v|c) × P(c)
the embodiment can quantify the probability distribution of the discrete feature vector under each intensity-limited category by calculating the conditional probability, the prior probability and the posterior probability. These probability values reflect the degree of association between the feature vector and the different intensity-defining classes. In selecting the classification result of the feature vector, the intensity-limited class having the highest probability value may be selected as the classification result of the feature vector according to the magnitude of the posterior probability. In this way, the intensity-defining class to which the discrete feature vector corresponds can be determined.
Step S210, calculating emotion intensity based on the intensity qualifier and the audio feature.
As shown in FIG. 6, the method of calculating the emotion intensity includes the following steps:
step S2102, determining an intensity weight corresponding to the intensity qualifier based on the intensity qualifier, and calculating a first emotion intensity value of the text content based on the intensity qualifier and the intensity weight.
The intensity weight is determined based on the determined application scene type and the intensity qualifier, where the same intensity qualifier has different intensity weights under different application scene types. For each intensity qualifier under each application scene type, the corresponding intensity weight is determined according to the characteristics and semantic meaning of that scene type. The intensity weight reflects the importance or influence of the same intensity qualifier on emotional expression under different application scene types.
For each intensity qualifier, a first emotional intensity value is calculated based on its occurrence in the text content and the corresponding intensity weight. The calculation can be performed using the following formula:
first emotional intensity value = intensity weight 1 x number of occurrences of intensity qualifier 1 + intensity weight 2 x number of occurrences of intensity qualifier 2 +.+ -. Intensity weight n x number of occurrences of intensity qualifier n
Where n represents the number of intensity qualifiers in the text, the intensity qualifier i represents the i-th intensity qualifier, the intensity weight i represents the intensity weight corresponding to the intensity qualifier i, and the number of occurrences of the intensity qualifier i represents the number of occurrences of the intensity qualifier in the text.
Through the above steps, a first emotional intensity value of the text content may be calculated based on the intensity qualifier and the intensity weight. This allows for quantification and assessment of emotion expressions to better understand and analyze the emotion expressed in the text.
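The computation above amounts to a weighted sum of qualifier occurrence counts, where the weight is looked up per (application scene, qualifier) pair. The scene names, qualifiers, and weights in the sketch are invented examples; a missing pair falls back to a default weight of 1.0, which is also an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Invented example table: the same qualifier carries different weights per scene.
INTENSITY_WEIGHTS: Dict[Tuple[str, str], float] = {
    ("customer_service", "非常"): 1.5,   # "very" weighs more in a complaint context
    ("sales", "非常"): 1.2,
    ("customer_service", "有点"): 0.5,   # "a bit"
    ("sales", "有点"): 0.7,
}

def first_intensity_value(qualifiers: List[str], scene: str) -> float:
    """Sum over qualifiers of scene-dependent weight times occurrence count."""
    counts = Counter(qualifiers)
    return sum(INTENSITY_WEIGHTS.get((scene, q), 1.0) * n for q, n in counts.items())

print(first_intensity_value(["非常", "非常", "有点"], scene="customer_service"))  # 3.5
```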
And step S2104, identifying a second emotion intensity value corresponding to the text content based on the audio features, wherein the audio features comprise a pitch feature and a speed feature.
The pitch characteristics in the audio are extracted by audio signal processing techniques. Common pitch extraction algorithms include fundamental frequency estimation, autocorrelation functions, spectral analysis, and the like. After extracting the pitch feature, a pitch sequence can be obtained, representing the pitch information of the audio signal at different time points.
The speed (speaking-rate) characteristics in the audio are extracted through audio signal processing techniques. Common methods for estimating speaking rate include acoustic models, time-delay estimation, and the like. After the speed features are extracted, a speed sequence can be obtained, representing the speaking-rate information of the audio signal at different time points.
The pitch and speed features are then preprocessed. Preprocessing of the pitch and speed features is often required before an emotion recognition model is applied. This may include processing steps such as feature normalization, dimensionality reduction, and smoothing, to extract a more useful feature representation.
And constructing an emotion recognition model. Based on the labeled audio samples and their corresponding emotion intensity values, a machine learning algorithm (e.g., support vector machine, random forest, deep neural network, etc.) may be used to construct an emotion recognition model. The input of the model is audio features (including pitch and speed features) and the output is a corresponding emotion intensity value.
Finally, a second emotional intensity value is predicted. And predicting the audio characteristics by using the constructed emotion recognition model to obtain a corresponding second emotion intensity value. Based on the input pitch and speed characteristics, the model outputs a value representing the intensity level of the second emotion expressed by the audio.
Through the above steps, based on the pitch feature and the speed feature, a second emotion intensity value corresponding to the text content can be identified. Thus, the characteristics of emotion expression can be further understood from the audio, and more comprehensive emotion analysis and understanding are provided.
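As one possible realization of the model described above, the sketch trains a small random-forest regressor from scikit-learn on labelled pitch and speed features and uses it to predict the second emotion intensity value; the training samples are synthetic, and the random forest is just one of the algorithm families the text mentions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic labelled samples: [mean pitch (Hz), speaking rate (syllables/s)]
# paired with an annotated emotion intensity value in [0, 1].
X_train = np.array([[180.0, 3.0], [220.0, 4.5], [260.0, 6.0], [150.0, 2.5]])
y_train = np.array([0.3, 0.6, 0.9, 0.2])

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

def second_intensity_value(pitch_mean: float, speech_rate: float) -> float:
    """Predict the audio-based emotion intensity from pitch and speed features."""
    return float(model.predict(np.array([[pitch_mean, speech_rate]]))[0])

print(second_intensity_value(240.0, 5.0))
```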
Step S2106, determining an emotional intensity corresponding to the text content based on the first emotional intensity value and the second emotional intensity value.
The product of the first emotion intensity value and the second emotion intensity value is taken as the emotion intensity corresponding to the text content.
Step S212, calculating emotion values based on the emotion intensities and the emotion words.
1) Obtain the emotion value of each emotion word.
For each emotion word contained in the text, the corresponding emotion value is obtained from the emotion word library or emotion dictionary.
2) An emotion value and an accumulated value of emotion intensity are calculated.
For each emotion word in the text, multiplying the emotion value of each emotion word by the corresponding emotion intensity to obtain the product of the emotion value and the emotion intensity of the emotion word. And then, adding the products of the emotion values and the emotion intensities of all emotion words to obtain an accumulated value. This cumulative value represents the combined impact of all emotion words in the text on emotion value and emotion intensity.
3) Take the sum of the accumulated values corresponding to all emotion words as the emotion value of the text content.
The products of emotion value and emotion intensity for all emotion words contained in the text are added together, and the resulting sum is taken as the emotion value of the text content; it reflects the emotional tendency or emotional state of the text as a whole.
4) Based on the emotion type and the emotion value, an emotion of the user is determined.
According to the magnitude of the emotion value and the definition of the emotion type, the emotion value of the text content is compared with the emotion type. The emotion types may include positive, negative, neutral, or more specifically such as happiness, sadness, anger, etc. And determining the emotion of the user according to the range of the emotion value or the matching degree of the emotion value and the emotion type. For example, if the emotion value is high and matches the positive emotion type, the user emotion may be judged as positive; if the emotion value is low and matches the negative emotion type, the user emotion can be judged as negative.
Through the above steps, the emotion value of the emotion word can be obtained, the accumulated value of the emotion value and the emotion intensity is calculated, the sum of the accumulated values of the emotion word is used as the emotion value of the text content, and the emotion of the user is determined based on the emotion type and the emotion value. This allows for a more accurate analysis of the emotional tendency of the text and the emotional state of the user.
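Step S212 reduces to accumulating per-word products of emotion value and emotion intensity and then comparing the result with the emotion type; the sketch below follows that recipe with invented thresholds for the positive and negative decisions.

```python
from typing import List, Tuple

def text_emotion_value(emotion_hits: List[Tuple[str, float]], intensity: float) -> float:
    """Sum of (dictionary emotion value * emotion intensity) over all emotion words."""
    return sum(value * intensity for _, value in emotion_hits)

def decide_emotion(emotion_type: str, emotion_value: float,
                   positive_threshold: float = 1.0,
                   negative_threshold: float = -1.0) -> str:
    """Map the accumulated emotion value onto a coarse user-emotion label."""
    if emotion_type == "positive" and emotion_value >= positive_threshold:
        return "positive"
    if emotion_type == "negative" and emotion_value <= negative_threshold:
        return "negative"
    return "neutral"

hits = [("生气", -2.0)]                                   # emotion word and its value
intensity = 3.5 * 0.8                                     # first value * second value
value = text_emotion_value(hits, intensity)
print(value, decide_emotion("negative", value))           # approximately -5.6, negative
```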
Example 3
The embodiment of the application provides an AI-based emotion recognition device, as shown in FIG. 7, comprising: a preprocessing module 72, a word recognition module 74, an intensity determination module 76, and an emotion recognition module 78.
The preprocessing module 72 is configured to extract audio features from voice data of a user in response to receiving the voice data and to convert the voice data into text content; word recognition module 74 is configured to recognize emotion words in the text content that are used to characterize emotion and determine intensity qualifiers for characterizing emotion intensity based on the position of the emotion words in the text content; intensity determination module 76 is configured to determine a type of emotion corresponding to the text content based on the emotion words and determine an intensity of emotion corresponding to the text content based on the intensity qualifier and the audio feature; emotion recognition module 78 is configured to recognize an emotion of the user based on the emotion type and the emotion intensity.
It should be noted that: the AI-based emotion recognition device provided in the above embodiment is only exemplified by the above-described division of each functional module, and in practical application, the above-described function allocation may be performed by different functional modules, i.e., the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the AI-based emotion recognition device provided in the above embodiment and the AI-based emotion recognition method embodiment belong to the same concept, and detailed implementation processes thereof are shown in the method embodiment and are not described herein.
Example 4
Fig. 8 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. It should be noted that the electronic device shown in fig. 8 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 8, the electronic device includes a Central Processing Unit (CPU) 1001 that can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for system operation are also stored. The CPU1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc.; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processing unit (CPU) 1001, it performs the various functions defined in the method and apparatus of the present application. In some embodiments, the electronic device may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device.
The computer-readable medium carries one or more programs which, when executed by one of the electronic devices, cause the electronic device to implement the methods described in the embodiments below. For example, the electronic device may implement the steps of the method embodiments described above, and so on.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal device may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division into units is merely a logical functional division, and other divisions are possible in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application. It should be noted that several modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of protection of the present application.

Claims (8)

1. An AI-based emotion recognition method, comprising:
in response to receiving voice data of a user, extracting audio features from the voice data and converting the voice data into text content;
identifying emotion words used for representing emotion in the text content, and determining intensity qualifiers used for representing emotion intensity based on the positions of the emotion words in the text content;
determining an emotion type corresponding to the text content based on the emotion words, and determining an emotion intensity corresponding to the text content based on the intensity qualifiers and the audio features;
identifying an emotion of the user based on the emotion type and the emotion intensity;
wherein determining an intensity qualifier used for representing emotion intensity based on the position of the emotion word in the text content comprises: extracting, based on the position of the emotion word in the text content, features of the words within the context range of the emotion word to obtain a feature vector; and classifying the feature vector with a classification algorithm to determine the intensity qualifier;
wherein classifying the feature vector with a classification algorithm comprises: performing continuous-value discretization on the feature vector by an equal-frequency discretization method to obtain a discrete feature vector; and calculating the posterior probability of the discrete feature vector under each intensity qualifier category, and selecting the intensity qualifier category with the highest posterior probability as the classification result of the discrete feature vector, wherein the posterior probability is the probability that a feature vector appears under a given intensity qualifier category.
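By way of a non-limiting illustration of the equal-frequency discretization step recited in claim 1, the following sketch bins each continuous feature dimension at quantile cut points so that each bin receives roughly the same number of training samples. The function names, the number of bins, and the toy data are assumptions introduced purely for illustration.

```python
import numpy as np

def equal_freq_cut_points(train_features: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Per-dimension quantile cut points so that each bin receives roughly equal counts."""
    inner_quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]       # e.g. [0.25, 0.5, 0.75]
    return np.quantile(train_features, inner_quantiles, axis=0).T   # shape: (n_dims, n_bins - 1)

def discretize(feature_vector: np.ndarray, cut_points: np.ndarray) -> tuple:
    """Map each continuous feature value to the index of the bin it falls into (0 .. n_bins-1)."""
    return tuple(int(np.searchsorted(cuts, value)) for value, cuts in zip(feature_vector, cut_points))

# Toy usage: 200 context feature vectors with 3 continuous dimensions.
rng = np.random.default_rng(0)
train = rng.random((200, 3))
cuts = equal_freq_cut_points(train, n_bins=4)
discrete_vec = discretize(np.array([0.2, 0.9, 0.5]), cuts)
```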
2. The method of claim 1, wherein calculating the posterior probability of the discrete feature vector under each intensity qualifier category comprises:
calculating the conditional probability of the discrete feature vector under each intensity qualifier category, wherein the conditional probability represents the probability that a feature vector appears given that intensity qualifier category;
calculating the prior probability of each intensity qualifier category, wherein the prior probability represents the proportion of all feature vectors that fall into that intensity qualifier category;
and calculating, based on the conditional probability and the prior probability, the posterior probability of the discrete feature vector under each intensity qualifier category.
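The posterior computation of claim 2 follows the familiar Bayes pattern: posterior ∝ prior × conditional, with the category of highest posterior selected. The sketch below assumes the conditional probabilities are estimated as Laplace-smoothed frequencies over the discretized feature values; all identifiers and the toy data are hypothetical.

```python
from collections import Counter
import math

def train_categories(samples, labels, n_bins=4, alpha=1.0):
    """Estimate prior P(c) and smoothed conditionals P(x_i = b | c) from discretized samples."""
    classes = sorted(set(labels))
    n_dims = len(samples[0])
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: [Counter() for _ in range(n_dims)] for c in classes}
    for x, y in zip(samples, labels):
        for i, b in enumerate(x):
            counts[y][i][b] += 1

    def conditional(c, i, b):
        total = sum(counts[c][i].values())
        return (counts[c][i][b] + alpha) / (total + alpha * n_bins)   # Laplace smoothing

    return classes, prior, conditional

def classify(x, classes, prior, conditional):
    """Select the intensity qualifier category with the highest posterior (computed in log space)."""
    def log_posterior(c):
        return math.log(prior[c]) + sum(math.log(conditional(c, i, b)) for i, b in enumerate(x))
    return max(classes, key=log_posterior)

# Toy usage with three hypothetical qualifier categories.
samples = [(0, 3, 1), (1, 3, 2), (3, 0, 0), (2, 1, 0), (0, 2, 3)]
labels  = ["strong", "strong", "weak", "weak", "moderate"]
classes, prior, cond = train_categories(samples, labels)
print(classify((0, 3, 2), classes, prior, cond))   # -> "strong"
```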
3. The method of claim 1, wherein determining the emotion intensity corresponding to the text content based on the intensity qualifier and the audio features comprises:
determining an intensity weight corresponding to the intensity qualifier based on the intensity qualifier, and calculating a first emotional intensity value of the text content based on the intensity qualifier and the intensity weight;
identifying a second emotional intensity value corresponding to the text content based on the audio features, wherein the audio features include a pitch feature and a speed feature;
and determining the emotion intensity corresponding to the text content based on the first emotion intensity value and the second emotion intensity value.
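Claim 3 does not fix how the qualifier-based and audio-based intensity values are fused; the minimal sketch below assumes a convex combination with an illustrative text weight.

```python
def fuse_emotion_intensity(first_intensity: float, second_intensity: float,
                           text_weight: float = 0.6) -> float:
    """Combine the qualifier-based and the audio-based intensity values.

    The fusion rule is left open by the claim; a convex combination with an
    assumed text_weight is used here purely for illustration.
    """
    return text_weight * first_intensity + (1.0 - text_weight) * second_intensity

# e.g. a strong qualifier score of 0.9 and a flatter prosody score of 0.4 -> 0.7
fused = fuse_emotion_intensity(0.9, 0.4)
```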
4. The method of claim 3, wherein determining the intensity weight corresponding to the intensity qualifier based on the intensity qualifier comprises:
determining an application scene type corresponding to the text content based on the text content;
and determining the intensity weight based on the application scene type and the intensity qualifier, wherein the same intensity qualifier has different intensity weights under different application scene types.
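Claim 4 only requires that the same intensity qualifier map to different weights in different application scenes; a lookup table keyed on (scene type, qualifier) is one minimal way to realize this. The scene names, qualifiers, and weight values below are purely hypothetical.

```python
# Hypothetical weight table: the same qualifier carries a different weight per scene type.
INTENSITY_WEIGHTS = {
    ("customer_service", "extremely"): 1.5,
    ("casual_chat",      "extremely"): 1.2,
    ("customer_service", "slightly"):  0.6,
    ("casual_chat",      "slightly"):  0.8,
}

def intensity_weight(scene_type: str, qualifier: str, default: float = 1.0) -> float:
    """Look up the scene-dependent weight of an intensity qualifier."""
    return INTENSITY_WEIGHTS.get((scene_type, qualifier), default)

print(intensity_weight("customer_service", "extremely"))   # 1.5
print(intensity_weight("casual_chat", "extremely"))         # 1.2 -- same qualifier, different scene
```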
5. The method of claim 4, wherein determining the application scene type corresponding to the text content based on the text content comprises: extracting scene features from the text content, performing similarity matching between the scene features and each application scene in an application scene library, and determining the application scene type corresponding to the text content.
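The similarity matching of claim 5 is not tied to a particular metric; the sketch below assumes cosine similarity between a scene feature vector and hand-made vectors in a hypothetical scene library.

```python
import numpy as np

def match_scene_type(scene_vec: np.ndarray, scene_library: dict) -> str:
    """Return the application scene type whose library vector is most similar (cosine)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(scene_library, key=lambda name: cosine(scene_vec, scene_library[name]))

# Toy scene library with two hand-made feature vectors.
library = {
    "customer_service": np.array([0.9, 0.1, 0.3]),
    "casual_chat":      np.array([0.2, 0.8, 0.5]),
}
print(match_scene_type(np.array([0.8, 0.2, 0.2]), library))   # -> customer_service
```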
6. The method of claim 1, wherein identifying the emotion of the user based on the emotion type and the emotion intensity comprises:
acquiring an emotion value of each emotion word, and calculating an accumulated value of the emotion value and the emotion intensity;
taking the sum of the accumulated values corresponding to all the emotion words contained in the text content as the emotion value of the text content;
and determining the emotion of the user based on the emotion type and the emotion value.
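Claim 6 does not specify how the per-word "accumulated value" is formed from the emotion value and the emotion intensity; the sketch below assumes a simple product, summed over all emotion words, with a hypothetical lexicon.

```python
def text_emotion_value(emotion_words, lexicon, emotion_intensity):
    """Sum the per-word accumulated values over all emotion words in the text.

    The per-word "accumulated value" of the lexicon emotion value and the emotion
    intensity is assumed here to be their product; the claim leaves the operation open.
    """
    return sum(lexicon.get(word, 0.0) * emotion_intensity for word in emotion_words)

# Toy lexicon (hypothetical values) and a sentence containing two emotion words.
lexicon = {"delighted": 0.8, "annoyed": -0.6}
print(text_emotion_value(["delighted", "annoyed"], lexicon, emotion_intensity=1.2))   # 0.24
```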
7. An AI-based emotion recognition device, comprising:
a preprocessing module configured to extract audio features from voice data of a user in response to receiving the voice data, and to convert the voice data into text content;
a word recognition module configured to identify emotion words used for representing emotion in the text content, and to determine intensity qualifiers used for representing emotion intensity based on the positions of the emotion words in the text content;
an intensity determination module configured to determine a type of emotion corresponding to the text content based on the emotion word and determine an intensity of emotion corresponding to the text content based on the intensity qualifier and the audio feature;
an emotion recognition module configured to recognize an emotion of the user based on the emotion type and the emotion intensity;
wherein the word recognition module is further configured to: extract, based on the position of the emotion word in the text content, features of the words within the context range of the emotion word to obtain a feature vector; perform continuous-value discretization on the feature vector by an equal-frequency discretization method to obtain a discrete feature vector; and calculate the posterior probability of the discrete feature vector under each intensity qualifier category, and select the intensity qualifier category with the highest posterior probability as the classification result of the discrete feature vector so as to determine the intensity qualifier, wherein the posterior probability is the probability that a feature vector appears under a given intensity qualifier category.
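For orientation only, the following sketch wires the four modules of claim 7 into a single callable pipeline; the module implementations are placeholders to be supplied, and the interface shapes are assumptions rather than part of the claimed device.

```python
class EmotionRecognizer:
    """Minimal wiring of the four claimed modules; every callable here is a placeholder."""

    def __init__(self, preprocess, recognize_words, determine_intensity, recognize_emotion):
        self.preprocess = preprocess                    # voice data -> (audio features, text)
        self.recognize_words = recognize_words          # text -> (emotion words, intensity qualifiers)
        self.determine_intensity = determine_intensity  # (words, qualifiers, audio) -> (type, intensity)
        self.recognize_emotion = recognize_emotion      # (type, intensity) -> emotion label

    def __call__(self, voice_data):
        audio_features, text = self.preprocess(voice_data)
        emotion_words, qualifiers = self.recognize_words(text)
        emotion_type, intensity = self.determine_intensity(emotion_words, qualifiers, audio_features)
        return self.recognize_emotion(emotion_type, intensity)
```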
8. A computer-readable storage medium, on which a program is stored, characterized in that the program, when run, causes a computer to perform the method of any one of claims 1 to 6.
CN202310825516.9A 2023-07-06 2023-07-06 Emotion recognition method and device based on AI Active CN116564281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310825516.9A CN116564281B (en) 2023-07-06 2023-07-06 Emotion recognition method and device based on AI

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310825516.9A CN116564281B (en) 2023-07-06 2023-07-06 Emotion recognition method and device based on AI

Publications (2)

Publication Number Publication Date
CN116564281A (en) 2023-08-08
CN116564281B (en) 2023-09-05

Family

ID=87502214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310825516.9A Active CN116564281B (en) 2023-07-06 2023-07-06 Emotion recognition method and device based on AI

Country Status (1)

Country Link
CN (1) CN116564281B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016197292A (en) * 2015-04-02 2016-11-24 日本電信電話株式会社 Feeling identifying method, feeling identifying apparatus, and program
CN110609996A (en) * 2018-06-15 2019-12-24 阿里巴巴集团控股有限公司 Text emotion recognition method and device and electronic equipment
WO2021068843A1 (en) * 2019-10-08 2021-04-15 平安科技(深圳)有限公司 Emotion recognition method and apparatus, electronic device, and readable storage medium
WO2021174757A1 (en) * 2020-03-03 2021-09-10 深圳壹账通智能科技有限公司 Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN112163419A (en) * 2020-09-23 2021-01-01 南方电网数字电网研究院有限公司 Text emotion recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN116564281A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN110377911B (en) Method and device for identifying intention under dialog framework
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN116560513B (en) AI digital human interaction method, device and system based on emotion recognition
WO2021114841A1 (en) User report generating method and terminal device
CN113420556B (en) Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN108628868B (en) Text classification method and device
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN113807103B (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN108287848B (en) Method and system for semantic parsing
CN112885336A (en) Training and recognition method and device of voice recognition system, and electronic equipment
CN112329433A (en) Text smoothness detection method, device and equipment and computer readable storage medium
CN106710588B (en) Speech data sentence recognition method, device and system
CN112489623A (en) Language identification model training method, language identification method and related equipment
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN113435208B (en) Training method and device for student model and electronic equipment
CN114299920A (en) Method and device for training language model for speech recognition and speech recognition method and device
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN116564281B (en) Emotion recognition method and device based on AI
CN115547299A (en) Quantitative evaluation and classification method and device for controlled voice quality division
CN112735432B (en) Audio identification method, device, electronic equipment and storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN112562736A (en) Voice data set quality evaluation method and device
CN116485587B (en) Community service acquisition method, community service providing method, electronic device and storage medium
US20230377560A1 (en) Speech tendency classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant