CN113707184B - Method and device for determining emotion characteristics, electronic equipment and storage medium - Google Patents

Method and device for determining emotion characteristics, electronic equipment and storage medium Download PDF

Info

Publication number
CN113707184B
CN113707184B CN202111007019.5A
Authority
CN
China
Prior art keywords: feature, characteristic, emotion, dimension, voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111007019.5A
Other languages
Chinese (zh)
Other versions
CN113707184A (en)
Inventor
李森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202111007019.5A priority Critical patent/CN113707184B/en
Publication of CN113707184A publication Critical patent/CN113707184A/en
Application granted granted Critical
Publication of CN113707184B publication Critical patent/CN113707184B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Abstract

The application relates to a method and a device for determining emotion features, an electronic device and a storage medium. The method includes: acquiring multiple segments of user speech with different emotion types; extracting feature values of each segment of user speech in a plurality of feature dimensions; obtaining, according to the feature values of each segment of user speech in the plurality of feature dimensions, a feature variation value of each feature dimension between different emotion types and a feature threshold; and selecting, according to the feature variation values and the feature threshold, feature dimensions for speech emotion recognition from the plurality of feature dimensions, wherein the feature variation value of a feature dimension selected for speech emotion recognition is not smaller than the feature threshold. The method solves the technical problem that the feature selection approaches used for speech emotion recognition in the related art cannot pick out, from a large number of features, the features that best distinguish different emotion types.

Description

Method and device for determining emotion characteristics, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech processing, and in particular, to a method and apparatus for determining emotional characteristics, an electronic device, and a storage medium.
Background
Smart home (home automation) is a modern living system built on the traditional house with the aid of Internet-of-Things communication technology, automatic control technology and artificial intelligence technology. The gradual maturing of smart homes brings great convenience to modern fast-paced life and can provide users with a good, comfortable and intelligent living environment. In fields such as smart homes, AI dialogue systems, customer service systems, education and medical care, the emotion of a user needs to be recognized from the user's voice so that a corresponding strategy can be adopted for the recognized emotion. Before classification and recognition, feature selection is required. The main purpose of feature selection is to obtain the attributes that classify best from the feature data set, which is important for improving the accuracy of emotion recognition. Through feature selection, features that are effective and closely related to emotion can be extracted from the original speech and the number of features in the feature data set can be reduced, thereby improving classification performance and accuracy.
At present, the feature selection methods used in the related art rely mainly on manual selection. They are limited by personal factors, are unfavorable for improving the success rate of emotion recognition, increase the overall workload, and cannot pick out, from a large number of features, the features that best distinguish different emotion types. As a result, emotion recognition from speech has low recognition accuracy and complex calculation.
Disclosure of Invention
The application provides a method and a device for determining emotion features, an electronic device and a storage medium, which at least solve the technical problem that the feature selection methods in the related art cannot pick out, from a large number of features, the features that best distinguish different emotion types.
According to an aspect of the embodiments of the present application, a method for determining emotion features is provided, including: acquiring multiple segments of user speech with different emotion types; extracting feature values of each segment of user speech in a plurality of feature dimensions; obtaining, according to the feature values of each segment of user speech in the plurality of feature dimensions, a feature variation value of each feature dimension between different emotion types and a feature threshold; and selecting, according to the feature variation values and the feature threshold, feature dimensions for speech emotion recognition from the plurality of feature dimensions, wherein the feature variation value of a feature dimension selected for speech emotion recognition is not smaller than the feature threshold.
According to another aspect of the embodiments of the present application, a device for determining emotion features is also provided, including: a speech acquisition module, configured to acquire multiple segments of user speech with different emotion types; a feature extraction module, configured to extract feature values of each segment of user speech in a plurality of feature dimensions; a parameter acquisition module, configured to obtain, according to the feature values of each segment of user speech in the plurality of feature dimensions, a feature variation value of each feature dimension between different emotion types and a feature threshold; and a feature selection module, configured to select, according to the feature variation values and the feature threshold, feature dimensions for speech emotion recognition from the plurality of feature dimensions, wherein the feature variation value of a feature dimension selected for speech emotion recognition is not smaller than the feature threshold.
According to another aspect of the embodiments of the present application, there is also provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the method described above by the computer program.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program that when executed performs the above-described method.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of any of the embodiments of the method described above.
In the embodiments of the application, the feature variation values of the user speech in different feature dimensions are compared, and feature dimensions whose feature variation value is not smaller than the feature threshold are selected from the plurality of candidate feature dimensions, so that the feature dimensions that best distinguish different emotion types are selected. This solves the technical problem that the feature selection methods in the related art cannot pick out, from a large number of features, the features that best distinguish different emotion types. The feature data set over the feature dimensions selected for speech emotion recognition by this method is then used for classification and recognition during speech emotion recognition, thereby achieving the technical effect of improving the accuracy of speech emotion recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment of a method of determining emotional characteristics according to embodiments of the application;
FIG. 2 is a flow chart of an alternative method of determining emotional characteristics according to embodiments of the application;
FIG. 3 is a schematic diagram of an alternative speech emotion recognition method overall framework in accordance with an embodiment of the present application;
FIG. 4 is a schematic illustration of a process flow of an alternative method of determining emotional characteristics according to embodiments of the application;
FIG. 5 is a schematic diagram of an alternative emotion feature determination device according to an embodiment of the present application; and
fig. 6 is a block diagram of a terminal according to an embodiment of the present application.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The scheme can be applied to fields such as smart homes, intelligent customer service, online education and intelligent medical care, i.e. scenarios in which the emotion of a user needs to be recognized from the user's voice so that a corresponding strategy can be adopted.
First, some of the terms or terminology appearing in the description of the embodiments of the present application are explained as follows:
Feature selection: feature selection refers to the process of selecting N features from the existing M features to optimize a specific index of a system, and is a process of selecting some most effective features from original features to reduce the dimension of a data set, and is also a key data preprocessing step in pattern recognition.
Pre-emphasis: the pre-emphasis is a signal processing mode for compensating high-frequency components of an input signal at a transmitting end, and aims to emphasize a high-frequency part of voice, remove the influence of lip radiation and increase the high-frequency resolution of the voice.
Framing: framing refers to dividing a speech signal into shorter frames, where the speech signal is a quasi-stationary signal, and after dividing it into shorter frames, each frame can be regarded as a stationary signal, and the frames are processed by a method of processing the stationary signal, so that parameters between one frame and another frame can be smoothly transitioned, and the frames should be partially overlapped with each other.
Windowing: after the speech signal is framed, each frame of signal needs to be analyzed and processed, the window function generally has low-pass characteristic, the purpose of the windowing function is to reduce leakage in the frequency domain, so that the purposes of feature enhancement and interference information removal are realized, the window functions commonly used in speech signal analysis comprise rectangular windows, hamming windows and hanning windows, and different window functions can be selected according to different conditions.
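As a concrete illustration of these pre-processing steps, the following is a minimal NumPy sketch of pre-emphasis, framing and Hamming windowing; the pre-emphasis coefficient 0.97 and the 25 ms / 10 ms frame length and shift are illustrative assumptions rather than values prescribed by this application.

```python
import numpy as np

def preprocess(signal, sr, alpha=0.97, frame_ms=25, shift_ms=10):
    """Pre-emphasize, frame and window a speech signal (illustrative parameter values)."""
    # Pre-emphasis: boost the high-frequency part of the speech.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing: split the quasi-stationary signal into short, overlapping frames.
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    # Assumes the signal is at least one frame long.
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len]
        for i in range(n_frames)
    ])

    # Windowing: taper each frame with a Hamming window to reduce spectral leakage.
    return frames * np.hamming(frame_len)
```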
According to an aspect of the embodiments of the present application, an embodiment of a method for determining emotional characteristics is provided.
Alternatively, in this embodiment, the above method of determining emotion features may be applied to a hardware environment constituted by the terminal 101 and the server 103 shown in FIG. 1. As shown in FIG. 1, the server 103 is connected to the terminal 101 through a network and may be used to provide emotion recognition services for the terminal or for a client installed on the terminal. A database 105 may be provided on the server, or independently of the server, to provide data storage services for the server 103. The network includes, but is not limited to, a wired or wireless network, and the terminal 101 is not limited to a PC, a mobile phone, a tablet computer or the like. The method for determining emotion features according to the embodiments of the present application may be performed by the server 103, by the terminal 101, or jointly by the server 103 and the terminal 101; when performed by the terminal 101, it may also be performed by a client installed on the terminal. An example in which the method of determining emotion features is performed on the server is described below.
FIG. 2 is a flow chart of an alternative method of determining emotion features according to an embodiment of the application. As shown in FIG. 2, the method may include the following steps:
In step S202, the server acquires multiple segments of user speech with different emotion types, where each segment of user speech carries an emotion label representing the one emotion type expressed by that segment.
Each segment in the multiple segments of user speech has an emotion label for exactly one emotion. Each segment may be an independent sentence of user speech; the multiple segments may also be obtained by splitting one or more sentences of user speech by emotion type (if one sentence of user speech contains several emotions, it can be split into the same number of segments); of course, the multiple segments may also combine both forms, for example some segments are independent sentences while other segments are split from the same sentence of user speech.
For example, emotions can be classified into six types: "angry", "happy", "fear", "sad", "surprised" and "neutral", and the emotion labels of the emotion types are respectively: angry, with label "01"; happy, with label "02"; fear, with label "03"; sad, with label "04"; surprised, with label "05"; and neutral, with label "06".
In step S204, the server extracts feature values of each segment of user speech in a plurality of feature dimensions, where the plurality of feature dimensions are candidate feature dimensions for expressing sound characteristics. The plurality of feature dimensions may be feature dimensions of various types such as the short-time energy, Mel-frequency cepstral coefficients (MFCC), short-time zero-crossing rate, fundamental frequency and voicing probability of the user speech.
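For illustration, two of these candidate dimensions, the short-time energy and the short-time zero-crossing rate, can be computed per frame as sketched below over windowed frames such as those produced by the pre-processing sketch above; MFCC, fundamental frequency and voicing probability are usually obtained from a dedicated toolkit instead.

```python
import numpy as np

def short_time_energy(frames):
    """Per-frame short-time energy of windowed speech frames (shape: n_frames x frame_len)."""
    return np.sum(frames ** 2, axis=1)

def short_time_zero_crossing_rate(frames):
    """Per-frame fraction of sign changes between adjacent samples."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)
```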
In step S206, the server obtains, according to the feature values of each segment of user speech in the plurality of feature dimensions, a feature variation value of each feature dimension between different emotion types and a feature threshold.
In step S208, the server selects, according to the feature variation values and the feature threshold, feature dimensions for speech emotion recognition from the plurality of feature dimensions, wherein the feature variation value of a feature dimension selected for speech emotion recognition is not smaller than the feature threshold.
Each feature variation value represents how well user speech of different emotions is distinguished in one feature dimension. Since the emotions are indeed different, the larger the feature variation value (i.e. the greater the degree of dispersion), the better the dimension can characterize emotion. Because the feature variation value of every feature dimension selected for speech emotion recognition is not smaller than the feature threshold, the selected feature dimensions are the ones better suited to distinguishing emotions.
Through the above steps S202 to S208, the feature variation value of this scheme represents the degree of dispersion of the feature values of different emotion types in one feature dimension: the larger the feature variation value, the greater the dispersion of the feature values and the better different emotion types can be distinguished, so whether a feature dimension is helpful for distinguishing different emotion types is judged by this degree of dispersion. By comparing the feature variation values of the user speech in different feature dimensions and selecting the feature dimensions with larger feature variation values from the plurality of candidate feature dimensions, the feature dimensions that best distinguish different emotion types are selected. This solves the technical problem that the feature selection methods in the related art cannot pick out, from a large number of features, the features that best distinguish different emotion types, and using the feature data set over the selected feature dimensions for classification and recognition during speech emotion recognition achieves the technical effect of improving the accuracy of speech emotion recognition.
In the technical solution provided in step S202, the server acquires multiple segments of user speech with different emotion types. Each segment of user speech expresses one emotion and carries an emotion label representing the emotion type it expresses. To improve the accuracy of speech emotion recognition, the multiple segments of user speech used for feature selection should contain as many emotion types as possible.
Alternatively, in the present embodiment, the server may acquire user speech meeting the above conditions in several ways, such as: (1) selecting a segment of user speech under each emotion label in an emotion corpus; (2) obtaining manually annotated user speech data, where the speech data include multiple segments of user speech and the emotion labels of the user speech.
In the technical solution provided in step S204, the server extracts feature values of each segment of user speech in a plurality of feature dimensions, where the plurality of feature dimensions are candidate feature dimensions.
As an alternative embodiment, the server extracting feature values of each segment of user speech in a plurality of feature dimensions includes: the server performs pre-emphasis processing on the multiple segments of user speech to obtain first user speech, where the pre-emphasis processing is used to compensate damaged signal components in the multiple segments of user speech; the server performs framing processing on the first user speech to obtain second user speech divided into a plurality of frames at a specified sampling frequency; the server performs windowing processing on each frame of the second user speech to obtain third user speech, where the windowing processing truncates the second user speech so as to reduce spectral leakage; and the server uses a plurality of extraction schemes to extract, from the third user speech, the feature values of each segment of user speech in the plurality of feature dimensions, where each extraction scheme extracts the feature value of the third user speech in one feature dimension.
Optionally, in this embodiment, the open-source toolkit openSMILE may be used to extract features of the user speech. The feature values of the user speech in the plurality of feature dimensions, i.e. the feature data set, are obtained by applying statistical functionals to a number of low-level descriptors (LLDs); one extraction scheme corresponds to applying one statistical functional to one low-level descriptor. The applied statistical functionals may include the following: (1) stddev: the standard deviation of the values in the contour; (2) skewness: the skewness (3rd-order moment); (3) kurtosis: the kurtosis (4th-order moment); (4) quartile1: the first quartile (25th percentile); (5) quartile2: the second quartile (50th percentile); (6) quartile3: the third quartile (75th percentile); (7) iqr1-2: the interquartile range quartile2 - quartile1; (8) iqr2-3: the interquartile range quartile3 - quartile2; (9) iqr1-3: the interquartile range quartile3 - quartile1; (10) percentile1.0: the outlier-robust minimum of the contour, expressed as the 1st percentile; (11) percentile99.0: the outlier-robust maximum of the contour, expressed as the 99th percentile; (12) pctlrange0-1: the outlier-robust signal range "max - min", represented by the range between the 1st and 99th percentiles; (13) upleveltime75: the percentage of time the signal is above (75% x range + min); (14) upleveltime90: the percentage of time the signal is above (90% x range + min). The commonly used feature set IS09_emotion.conf contains 384 feature dimensions, obtained by applying 12 statistical functionals to 32 low-level descriptors; the LLDs of IS09_emotion.conf cover 5 types of features: short-time energy, 12th-order MFCC, short-time zero-crossing rate, fundamental frequency and voicing probability.
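To illustrate how such statistical functionals turn a frame-level LLD contour into utterance-level feature values, the sketch below re-computes a few of the functionals listed above with NumPy/SciPy. In practice the functional set and configuration (e.g. IS09_emotion.conf) are supplied by the openSMILE toolkit; this stand-alone re-implementation is only an approximation for explanation.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def functionals(contour):
    """Apply utterance-level statistical functionals to one LLD contour
    (e.g. the frame-wise short-time energy or one MFCC coefficient)."""
    c = np.asarray(contour, dtype=float)
    q1, q2, q3 = np.percentile(c, [25, 50, 75])
    p1, p99 = np.percentile(c, [1, 99])
    rng = c.max() - c.min()
    return {
        "stddev": c.std(),
        "skewness": skew(c),
        "kurtosis": kurtosis(c),
        "quartile1": q1, "quartile2": q2, "quartile3": q3,
        "iqr1-2": q2 - q1, "iqr2-3": q3 - q2, "iqr1-3": q3 - q1,
        "percentile1.0": p1, "percentile99.0": p99,
        "pctlrange0-1": p99 - p1,
        # upleveltime75/90: fraction of frames above (x% * range + min).
        "upleveltime75": np.mean(c > 0.75 * rng + c.min()),
        "upleveltime90": np.mean(c > 0.90 * rng + c.min()),
    }
```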
In the technical solution provided in step S206, the server obtains, according to the feature values of each segment of user speech in the plurality of feature dimensions, a feature variation value of each feature dimension between different emotion types and a feature threshold.
As an optional embodiment, the server obtaining the feature variation value and the feature threshold of each feature dimension between different emotion types according to the feature values of each segment of user speech in the plurality of feature dimensions includes: obtaining, according to the feature values of each segment of user speech in the plurality of feature dimensions, the feature variation value of each feature dimension between different emotion types; and determining the feature threshold according to the feature variation values of the feature dimensions between different emotion types.
Optionally, in this embodiment, the server obtaining the feature variation value of each feature dimension between different emotion types according to the feature values of each segment of user speech in the plurality of feature dimensions includes determining the feature variation value of a target feature dimension among the plurality of emotion types as follows:

f_{c\text{-}mean}(i) = \frac{1}{n^{2}} \sum_{j=1}^{n} \sum_{k=1}^{n} \left| f_{mean}(j,i) - f_{mean}(k,i) \right|

where f_{c-mean}(i) denotes the feature variation value of the target feature dimension i among the emotion types, n denotes the number of emotion types contained in the multiple segments of user speech, f_{mean}(j,i) denotes the feature average, in the target feature dimension, of the user speech whose emotion label is the j-th emotion, and f_{mean}(k,i) denotes the feature average, in the target feature dimension, of the user speech whose emotion label is the k-th emotion. The term f_{mean}(j,i) - f_{mean}(k,i) is the difference between the feature averages of two different emotion types in the target feature dimension, so the feature variation value amounts to the average of the accumulated differences between the feature averages of all different emotion types in the target feature dimension. It represents the degree of dispersion of the feature averages of different emotion types in the target feature dimension: the greater the dispersion, the better different emotion types can be distinguished by that feature dimension.

For example, if there are 4 emotion types in total and f_{mean}(1,i)=3, f_{mean}(2,i)=2, f_{mean}(3,i)=5 and f_{mean}(4,i)=6, the accumulated absolute differences over all ordered emotion pairs equal 28, so the feature variation value of the target feature dimension is 28/4^{2} = 1.75.

Since the feature variation value only needs to represent the accumulated difference of the feature averages of any two emotions in a feature dimension, the normalization factor in the formula could in principle be omitted; because two summation layers are accumulated, the sum is multiplied by 1/n^{2} to unify the dimension and make the calculation convenient.
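A minimal sketch of this computation, assuming the 1/n² normalization and absolute pairwise differences reconstructed above; `feature_means` is a hypothetical array of shape (number of emotions, number of feature dimensions) holding f_mean(j, i).

```python
import numpy as np

def feature_variation(feature_means):
    """feature_means[j, i] = feature average of emotion j in dimension i.
    Returns f_c_mean[i]: mean absolute pairwise difference across emotions."""
    n = feature_means.shape[0]
    # |f_mean(j, i) - f_mean(k, i)| accumulated over all ordered emotion pairs.
    diffs = np.abs(feature_means[:, None, :] - feature_means[None, :, :])
    return diffs.sum(axis=(0, 1)) / n ** 2

# For the example above: means 3, 2, 5, 6 in one dimension -> 28 / 16 = 1.75.
print(feature_variation(np.array([[3.0], [2.0], [5.0], [6.0]])))  # [1.75]
```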
Optionally, in this embodiment, the server determining the feature threshold according to the feature variation value of each feature dimension between different emotion types includes: obtaining the average of the feature variation values of the feature dimensions between different emotion types; and using this average of the feature variation values as the feature threshold.
Optionally, in this embodiment, the server determines the feature threshold according to the following formula:

th_{MN} = \frac{1}{m} \sum_{i=1}^{m} f_{c\text{-}mean}(i)

where th_{MN} denotes the feature threshold, f_{c-mean}(i) denotes the feature variation value of the i-th feature dimension, and m denotes the number of feature dimensions.

For example, if there are 3 feature dimensions in total and the feature variation value of the 1st feature dimension is 2, that of the 2nd feature dimension is 4 and that of the 3rd feature dimension is 3, then the feature threshold is (2 + 4 + 3)/3 = 3. The feature dimensions whose feature variation value is not smaller than the feature threshold are the 2nd and 3rd feature dimensions.
Optionally, in this embodiment, the server may also determine the feature threshold as follows: the values are arranged in order, and the value of the N-th feature dimension (or of a proportionally selected feature dimension) at the head (or tail) of the queue is taken as the feature threshold. N, or the proportion, can be determined according to actual requirements; for example, if at least 5 features are actually needed, N can be set to 5 or a value larger than 5.
For example, if the feature values (feature averages) of the plurality of feature dimensions are arranged from small to large as 1, 2, 3, 4, 5, 6, 7, 8, 9, the value of the 5th feature dimension from the head of the queue is taken as the feature threshold, so the feature threshold is 5.
In the technical solution provided in step S208, the server selects, according to the feature variation values and the feature threshold, feature dimensions for speech emotion recognition from the plurality of feature dimensions, where the feature variation value of a feature dimension selected for speech emotion recognition is not smaller than the feature threshold. Each feature variation value represents how well user speech of different emotions is distinguished in one feature dimension: the larger the feature variation value, the stronger the relevance between the feature dimension and emotion, which is more favorable for accurately recognizing the emotion type expressed by the user speech. From a speech sequence, feature extraction can extract many feature dimensions, some closely related to emotion and some barely related; these features need to be processed again and the feature dimensions closely related to emotion screened out. First, this improves the accuracy of emotion recognition; second, it reduces the feature data set and thus improves computational efficiency.
Alternatively, in this embodiment, the process of selecting the feature dimensions for speech emotion recognition can be formulated as

f'(i) = \begin{cases} f(i), & f_{c\text{-}mean}(i) \geq th_{MN} \\ \text{discarded}, & \text{otherwise} \end{cases}

where f(i) denotes the feature value of the user speech in the i-th feature dimension: if the feature variation value f_{c-mean}(i) calculated for a feature dimension is not smaller than the set threshold th_{MN}, the feature dimension is retained; otherwise it is discarded.
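Continuing the sketch, the thresholding and screening could look as follows; `f_c_mean` is the assumed array of feature variation values per dimension, and the comparison uses the "not smaller than" criterion of this application.

```python
import numpy as np

def select_dimensions(f_c_mean):
    """Keep the dimensions whose variation value is not smaller than the threshold."""
    th_mn = f_c_mean.mean()      # feature threshold: mean of all variation values
    return f_c_mean >= th_mn     # boolean mask of retained feature dimensions

# Example from the text: variation values 2, 4, 3 -> threshold 3,
# so the 2nd and 3rd dimensions are retained.
print(select_dimensions(np.array([2.0, 4.0, 3.0])))  # [False  True  True]
```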
Optionally, in this embodiment, the server obtaining the feature average, in one feature dimension, of all the user speech of the same emotion type among the multiple segments of user speech includes: extracting, by emotion type, the feature values of the user speech of one emotion type in that feature dimension from the feature values of the multiple segments of user speech in that feature dimension; and obtaining the feature average of the user speech of that emotion type in the feature dimension from those feature values.

Alternatively, in this embodiment, the server may determine the feature average according to the following formula:

f_{mean}(l,i) = \frac{1}{c} \sum_{t=1}^{c} f_{t}(l,i)

where f_{mean}(l,i) denotes the feature average of the l-th emotion in the i-th feature dimension, f_{t}(l,i) denotes the t-th feature value of the l-th emotion in the i-th feature dimension, and c denotes the number of such feature values.

For example, if the feature values of emotion 1 in feature dimension 1 (i.e. the feature values, in feature dimension 1, of the user speech with emotion label "01") are 1, 4, 2, 3, then the number of feature values c is 4 and the feature average is 2.5.
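A sketch of computing the per-emotion feature averages from labelled utterance-level features; `features` is a hypothetical array of shape (number of utterances, number of feature dimensions) and `labels` holds the corresponding emotion tags.

```python
import numpy as np

def per_emotion_means(features, labels):
    """Return f_mean[l, i]: average of dimension i over all utterances with emotion l."""
    emotions = sorted(set(labels))
    labels = np.asarray(labels)
    return np.stack([features[labels == e].mean(axis=0) for e in emotions])

# Example from the text: emotion "01" has values 1, 4, 2, 3 in one dimension -> mean 2.5.
feats = np.array([[1.0], [4.0], [2.0], [3.0]])
print(per_emotion_means(feats, ["01", "01", "01", "01"]))  # [[2.5]]
```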
The present application also provides an alternative embodiment that determines the feature averages, feature variation values and feature threshold in a similar manner to the above embodiments, i.e. all averaging operations are replaced with taking a median.
The present application also provides an alternative embodiment that determines the feature variation value in a similar manner to the above embodiment, i.e. the average of the accumulated differences is replaced with operations such as taking the difference, the standard deviation, the range or the mean deviation.
The application also provides an optional embodiment in which, after the server selects the feature dimensions for speech emotion recognition from the plurality of feature dimensions according to the feature variation values and the feature threshold, speech emotion recognition is performed as follows: the server collects a speech segment whose emotion type is to be recognized; the server performs feature extraction on the speech segment in the feature dimensions for speech emotion recognition to obtain the feature values of the speech segment in those feature dimensions; and the server performs emotion recognition using the feature values of the speech segment in the feature dimensions for speech emotion recognition to obtain a recognition result, where the recognition result represents the emotion type expressed by the speech segment.
The feature variation value represents the degree of dispersion of the feature values of different emotion types in the target feature dimension, and whether the target feature dimension is helpful for distinguishing different emotion types is judged by this degree of dispersion. Selecting, from the plurality of candidate feature dimensions, the feature dimensions with larger feature variation values for speech emotion recognition picks out the feature dimensions that best distinguish different emotion types, keeps the feature values closely related to emotion and reduces the feature values barely related to emotion. This solves the technical problem that the feature selection methods in the related art cannot pick out, from a large number of features, the features that best distinguish different emotion types, and using the feature data set over the selected feature dimensions for classification and recognition during speech emotion recognition achieves the technical effect of improving the accuracy of speech emotion recognition.
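At recognition time, only the retained dimensions are extracted from the new speech segment and fed to a classifier. The application does not prescribe a particular classifier; the sketch below uses a scikit-learn SVM purely as an illustrative stand-in, with `selected` taken from the feature-selection stage sketched above (all inputs assumed to be NumPy arrays).

```python
from sklearn.svm import SVC

def train_and_recognize(train_feats, train_labels, selected, segment_feats):
    """Train on the retained dimensions only and recognize new speech segments."""
    clf = SVC()                                   # illustrative classifier choice
    clf.fit(train_feats[:, selected], train_labels)
    # Keep only the feature dimensions chosen for speech emotion recognition.
    return clf.predict(segment_feats[:, selected])
```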
As an alternative example, the technical solutions of the present application are schematically described below in connection with specific embodiments:
In the field of intelligent voice customer service, it is necessary not only to recognize the content of the user's speech but also to recognize the user's emotion, and to adopt a soothing strategy in time for negative emotions of the user such as "anger" and "fear". This improves the user's overall satisfaction with the intelligent voice customer service, allows the user's potential business needs to be evaluated better and facilitates precise marketing.
During the dialogue between the intelligent voice customer service and the user, the server's emotion recognition process for the user's speech is generally divided into five parts: input, pre-processing, feature extraction, feature selection and classification recognition. A schematic diagram of an overall framework of an alternative speech emotion recognition method according to an embodiment of the present application is shown in FIG. 3.
Step 1, input: collecting voice fragments, and taking a voice signal as input;
step 2, pretreatment: the method comprises the steps of sequentially carrying out pre-emphasis, framing and windowing on an input voice signal through a preprocessing module;
step 3, extracting the characteristic, namely extracting the characteristic values of the voice fragments in a plurality of characteristic dimensions by utilizing a characteristic extraction tool kit;
Step 4, selecting the features, in a feature selecting part, performing screening operation on the various features extracted by the feature extracting part, so as to select some features most relevant to emotion features (selecting feature data sets of feature dimensions for speech emotion recognition in the embodiment of the application);
and 5, classifying and identifying, namely inputting the screened characteristic data set into a classifying and identifying part to perform classifying and identifying operation, and obtaining an identifying result.
If the content expressed by the user's speech is "why is this product unusable at night?", then without emotion recognition the intelligent voice customer service answers the question based on the speech content alone: "Because the night mode is not turned on. I hope this answers your question." With emotion recognition, recognizing that the emotion type of the user's speech is "angry", the intelligent voice customer service adopts a soothing strategy: "We are very sorry for the bad experience. I will do my best to solve the problem for you. It is probably because the night mode is not turned on; you could try turning it on first."
Using the feature dimensions selected by the method of determining emotion features in the embodiments of the application for emotion recognition improves the accuracy of speech emotion recognition and thereby improves the service quality of the intelligent voice customer service. From a speech sequence containing multiple emotions, feature extraction can extract many features, giving the feature value f(i) corresponding to each feature; the feature value of the l-th emotion in the i-th feature dimension is denoted f(l,i). A schematic diagram of the processing flow of an alternative method of determining emotion features according to an embodiment of the present application is shown in FIG. 4, and the processing flow is as follows:
Step 1: calculate the feature average of each emotion:

f_{mean}(l,i) = \frac{1}{c} \sum_{t=1}^{c} f_{t}(l,i)

where c denotes the number of feature values of the l-th emotion in the i-th feature dimension;
Step 2: calculate the feature variation value of each feature dimension in the feature sequence:

f_{c\text{-}mean}(i) = \frac{1}{n^{2}} \sum_{j=1}^{n} \sum_{k=1}^{n} \left| f_{mean}(j,i) - f_{mean}(k,i) \right|

where n denotes the number of emotion types contained in the speech sequence, j denotes the j-th emotion and k denotes the k-th emotion;
Step 3: calculate the emotion feature threshold:

th_{MN} = \frac{1}{m} \sum_{i=1}^{m} f_{c\text{-}mean}(i)

where m denotes the number of feature dimensions;
Step 4: screen all the feature values of the feature sequence against the threshold:

f'(i) = \begin{cases} f(i), & f_{c\text{-}mean}(i) \geq th_{MN} \\ \text{discarded}, & \text{otherwise} \end{cases}

if the emotion feature variation value calculated for a feature is not smaller than the set threshold, the feature is retained; otherwise it is discarded;
Step 5: obtain the screened feature values.
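Putting steps 1 to 5 together, a compact sketch of the whole screening flow (under the same assumed 1/n² normalization and "not smaller than" comparison as above) could be:

```python
import numpy as np

def screen_features(features, labels):
    """Steps 1-5: per-emotion means, variation values, threshold, screening."""
    labels = np.asarray(labels)
    emotions = sorted(set(labels.tolist()))
    # Step 1: feature average of each emotion in each dimension.
    f_mean = np.stack([features[labels == e].mean(axis=0) for e in emotions])
    # Step 2: feature variation value of each dimension.
    n = len(emotions)
    f_c_mean = np.abs(f_mean[:, None, :] - f_mean[None, :, :]).sum(axis=(0, 1)) / n ** 2
    # Step 3: feature threshold.
    th_mn = f_c_mean.mean()
    # Steps 4-5: keep the dimensions whose variation value reaches the threshold.
    selected = f_c_mean >= th_mn
    return features[:, selected], selected
```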
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by software plus the necessary general hardware platform, or of course by hardware, but in many cases the former is the preferred implementation. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present application.
According to another aspect of the embodiments of the present application, there is also provided a determining apparatus for implementing the above-mentioned determining method of emotional characteristics. Fig. 5 is a schematic diagram of an alternative emotion feature determination device according to an embodiment of the present application, as shown in fig. 5, the device may include: the voice acquisition module 52 is configured to acquire multiple segments of user voices with different emotion types; a feature extraction module 54, configured to extract feature values of each segment of user speech in a plurality of feature dimensions; the parameter obtaining module 56 is configured to obtain, according to the feature values of each segment of user speech in the multiple feature dimensions, feature variation values and feature thresholds of each feature dimension between different emotion types; the feature selection module 58 is configured to select a feature dimension for speech emotion recognition from a plurality of feature dimensions according to the feature variation value and the feature threshold, where the feature variation value of the feature dimension for speech emotion recognition is not less than the feature threshold.
It should be noted that, the voice obtaining module 52 in this embodiment may be used to perform step S202 in the embodiment of the present application, the feature extracting module 54 in this embodiment may be used to perform step S204 in the embodiment of the present application, the parameter obtaining module 56 in this embodiment may be used to perform step S206 in the embodiment of the present application, and the feature selecting module 58 in this embodiment may be used to perform step S208 in the embodiment of the present application.
It should be noted that the above modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should be noted that the above modules may be implemented in software or hardware as a part of the apparatus in the hardware environment shown in fig. 1.
By the aid of the module, the technical problem that the feature selection method in the related technology cannot select the features which can distinguish different emotion types from a large number of features can be solved, and the feature data set of feature dimensions, selected by the emotion feature determination method, for voice emotion recognition is used for classifying and recognizing during voice emotion recognition, so that the technical effect of improving accuracy of voice emotion recognition is achieved.
As an alternative embodiment, the parameter acquisition module 56 includes: the acquisition unit is used for acquiring characteristic change values of each characteristic dimension among different emotion types according to the characteristic values of each section of user voice on a plurality of characteristic dimensions; and the determining unit is used for determining the characteristic threshold value according to the characteristic change value of each characteristic dimension among different emotion types.
Optionally, the determining unit is further configured to: acquiring an average value of feature variation values of each feature dimension among different emotion types; the average value of the feature variation values is used as a feature threshold value.
Optionally, the obtaining unit is further configured to: acquiring a characteristic average value of all user voices with the same emotion type in a plurality of sections of user voices in a characteristic dimension; and obtaining characteristic variation values of the characteristic dimension among different emotion types according to the characteristic average value of the user voice under each emotion type in the characteristic dimension.
Optionally, the obtaining unit is further configured to determine the feature variation value of a feature dimension between different emotion types as follows:

f_{c\text{-}mean}(i) = \frac{1}{n^{2}} \sum_{j=1}^{n} \sum_{k=1}^{n} \left| f_{mean}(j,i) - f_{mean}(k,i) \right|

where f_{c-mean}(i) denotes the feature variation value of the feature dimension among the plurality of emotion types, n denotes the number of emotion types contained in the multiple segments of user speech, f_{mean}(j,i) denotes the feature average of the user speech of the j-th emotion type in the feature dimension, and f_{mean}(k,i) denotes the feature average of the user speech of the k-th emotion type in the feature dimension.
Optionally, the obtaining unit is further configured to: extract, by emotion type, the feature values of the user speech of one emotion type in a feature dimension from the feature values of the multiple segments of user speech in that feature dimension; and obtain the feature average of the user speech of that emotion type in the feature dimension from those feature values.
As an alternative embodiment, the feature extraction module 54 includes: the pre-emphasis unit is used for carrying out pre-emphasis processing on the voice of each user; the framing unit is used for framing the pre-emphasized user voice; the windowing unit is used for windowing each frame after framing; and the extraction unit is used for extracting the characteristic values of the user voice in a plurality of characteristic dimensions from the windowed voice.
As an alternative embodiment, the determining means of emotion characteristics further comprises a recognition unit for performing speech emotion recognition in the following way: collecting voice fragments, wherein the voice fragments are voice fragments of emotion types to be recognized; extracting characteristic values of the voice fragments in characteristic dimensions for voice emotion recognition; and carrying out emotion recognition by utilizing the characteristic value of the voice fragment in the characteristic dimension for voice emotion recognition to obtain a recognition result, wherein the recognition result is used for representing the emotion type expressed by the voice fragment.
It should be noted that the above modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should be noted that the above modules may be implemented in software or in hardware as part of the apparatus shown in fig. 1, where the hardware environment includes a network environment.
According to another aspect of the embodiments of the present application, there is also provided a server or a terminal for implementing the above-mentioned method for determining emotional characteristics.
Fig. 6 is a block diagram of a terminal according to an embodiment of the present application. As shown in fig. 6, the terminal may include: one or more processors 601 (only one is shown in fig. 6), a memory 603 and a transmission device 605; as shown in fig. 6, the terminal may further include an input/output device 607.
The memory 603 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the method and device for determining emotion features in the embodiments of the present application. The processor 601 executes the software programs and modules stored in the memory 603, thereby performing various functional applications and data processing, i.e. implementing the method for determining emotion features described above. The memory 603 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some examples, the memory 603 may further include memory located remotely from the processor 601, which may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.
The transmission device 605 is used to receive or transmit data via a network, and may also be used for data transmission between the processor and the memory. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 605 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 605 is a Radio Frequency (RF) module that is configured to communicate wirelessly with the internet.
In particular, the memory 603 is used to store applications.
The processor 601 may call an application program stored in the memory 603 through the transmission means 605 to perform the steps of: acquiring multiple sections of user voices with different emotion types; extracting characteristic values of each section of user voice in a plurality of characteristic dimensions; acquiring characteristic change values and characteristic thresholds of each characteristic dimension among different emotion types according to the characteristic values of each section of user voice on a plurality of characteristic dimensions; and selecting a characteristic dimension for speech emotion recognition from a plurality of characteristic dimensions according to the characteristic change value and the characteristic threshold, wherein the characteristic change value of the characteristic dimension for speech emotion recognition is not smaller than the characteristic threshold.
The embodiments of the application thus provide a scheme for determining emotion features. By comparing the feature variation values of the user speech in different feature dimensions, the feature dimensions with larger feature variation values are selected from the plurality of candidate feature dimensions for speech emotion recognition, so that the feature dimensions that best distinguish different emotion types are selected. This solves the technical problem that the feature selection methods in the related art cannot pick out, from a large number of features, the features that best distinguish different emotion types, and using the feature data set over the selected feature dimensions for classification and recognition during speech emotion recognition achieves the technical effect of improving the accuracy of speech emotion recognition.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is only illustrative, and the terminal may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD or the like. Fig. 6 does not limit the structure of the electronic device described above. For example, the terminal may include more or fewer components (such as a network interface or a display device) than shown in fig. 6, or have a different configuration from that shown in fig. 6.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing a terminal device to execute in association with hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
Embodiments of the present application also provide a storage medium. Alternatively, in the present embodiment, the above-described storage medium may be used for program code for executing the method of determining emotional characteristics.
Alternatively, in this embodiment, the storage medium may be located on at least one network device of the plurality of network devices in the network shown in the above embodiment.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of:
acquiring multiple sections of user voices with different emotion types; extracting characteristic values of each section of user voice in a plurality of characteristic dimensions; acquiring characteristic change values and characteristic thresholds of each characteristic dimension among different emotion types according to the characteristic values of each section of user voice on a plurality of characteristic dimensions; and selecting a characteristic dimension for speech emotion recognition from a plurality of characteristic dimensions according to the characteristic change value and the characteristic threshold, wherein the characteristic change value of the characteristic dimension for speech emotion recognition is not smaller than the characteristic threshold.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments, which are not repeated here.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application. It should be noted that several improvements and modifications may be made by those of ordinary skill in the art without departing from the principles of the present application, and such improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A method of determining emotional characteristics, comprising:
acquiring multiple sections of user voices with different emotion types;
extracting characteristic values of each section of user voice in a plurality of characteristic dimensions;
acquiring, according to the characteristic values of each section of user voice on the plurality of characteristic dimensions, characteristic change values and characteristic thresholds of each characteristic dimension among different emotion types, which comprises: acquiring a feature average value of all user voices with the same emotion type in the plurality of sections of user voices in a feature dimension, acquiring the feature change value of the feature dimension between different emotion types according to the feature average value of the user voices under each emotion type in the feature dimension, and determining the feature threshold according to the feature change value of each feature dimension between different emotion types;
and selecting a characteristic dimension for speech emotion recognition from the plurality of characteristic dimensions according to the characteristic change value and the characteristic threshold value.
2. The method of claim 1, wherein selecting a feature dimension for speech emotion recognition from a plurality of feature dimensions based on the feature variation value and the feature threshold value comprises:
and selecting a characteristic dimension with the characteristic change value not smaller than the characteristic threshold from a plurality of characteristic dimensions as the characteristic dimension of speech emotion recognition.
3. The method of claim 1, wherein determining the feature threshold from feature variation values between different emotion types for each feature dimension comprises:
acquiring an average value of feature variation values of each feature dimension among different emotion types;
and taking the average value of the characteristic change values as the characteristic threshold value.
4. The method according to claim 1, wherein the obtaining the feature variation value of the feature dimension between different emotion types according to the feature average value of the user voice under each emotion type in the feature dimension includes:
determining the feature variation value of the feature dimension between different emotion types according to the following formula:
(the formula is given in the original as figure FDA0004041357380000011)
wherein f_c-mean(i) represents the feature variation value of the i-th feature dimension among the multiple emotion types, n represents the number of emotion types contained in the multiple segments of user speech, f_mean(j, i) represents the feature average value of the user speech of the j-th emotion type in the feature dimension, and f_mean(k, i) represents the feature average value of the user speech of the k-th emotion type in the feature dimension.
5. The method of claim 1, wherein the obtaining a feature average value of all user voices with the same emotion type in a feature dimension in the plurality of segments of user voices comprises:
extracting the characteristic value of the user voice of one emotion type in the characteristic dimension from the characteristic values of the multiple sections of user voices in the characteristic dimension according to the emotion type;
and acquiring a characteristic average value of the emotion type user voice in the characteristic dimension according to the characteristic value of the emotion type user voice in the characteristic dimension.
6. The method of claim 1, wherein extracting feature values for each segment of user speech over a plurality of feature dimensions comprises:
performing pre-emphasis processing on each segment of user voice;
performing framing processing on the pre-emphasized user voice;
windowing each frame after framing;
extracting feature values of the user voice in a plurality of feature dimensions from the windowed voice.
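For illustration, a minimal sketch of the pre-emphasis, framing, and windowing chain recited in this claim is given below; the sampling rate, frame length, hop size, and pre-emphasis coefficient are commonly used values assumed here rather than values prescribed by the claim, and the subsequent feature extraction (for example, energy or spectral statistics per frame) would operate on the windowed frames returned.

```python
import numpy as np

def preprocess_segment(speech, sample_rate=16000, pre_emphasis=0.97,
                       frame_ms=25, hop_ms=10):
    """Pre-emphasize, frame, and window one segment of user voice.

    speech: 1-D array of samples; assumed to be at least one frame long.
    Returns an array of shape (num_frames, frame_len) of windowed frames.
    """
    speech = np.asarray(speech, dtype=float)

    # Pre-emphasis: y[t] = x[t] - a * x[t-1], boosting the high-frequency part.
    emphasized = np.append(speech[0], speech[1:] - pre_emphasis * speech[:-1])

    # Framing: split the signal into overlapping frames.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len:i * hop_len + frame_len]
                       for i in range(num_frames)])

    # Windowing: apply a Hamming window to each frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)
```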
7. The method of claim 1, wherein after selecting a feature dimension for speech emotion recognition from a plurality of feature dimensions based on the feature variation value and the feature threshold value, the method further comprises:
collecting a voice fragment, wherein the voice fragment is a voice fragment whose emotion type is to be recognized;
extracting characteristic values of the voice fragments in the characteristic dimension for voice emotion recognition;
and carrying out emotion recognition by utilizing the characteristic value of the voice fragment in the characteristic dimension for voice emotion recognition to obtain a recognition result, wherein the recognition result is used for representing the emotion type expressed by the voice fragment.
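As a hypothetical usage sketch of this recognition step (the claim does not prescribe a particular classifier; the k-nearest-neighbour classifier and the helper names are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # classifier choice is an assumption

def recognize_emotion(fragment_features, train_features, train_labels, selected_dims):
    """Classify one collected voice fragment using only the selected feature dimensions.

    fragment_features: 1-D array of the fragment's values on all feature dimensions.
    train_features:    (num_segments, num_dims) array of labelled training segments.
    train_labels:      emotion type of each training segment.
    selected_dims:     indices returned by the feature-selection step.
    """
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(np.asarray(train_features)[:, selected_dims], train_labels)
    fragment = np.asarray(fragment_features)[selected_dims].reshape(1, -1)
    return clf.predict(fragment)[0]  # the recognition result: the predicted emotion type
```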
8. A device for determining emotional characteristics, comprising:
the voice acquisition module is used for acquiring multiple sections of user voices with different emotion types;
the feature extraction module is used for extracting feature values of each section of user voice in a plurality of feature dimensions;
the parameter acquisition module is used for acquiring characteristic change values and characteristic thresholds of each characteristic dimension among different emotion types according to the characteristic values of each section of user voice on a plurality of characteristic dimensions;
the feature selection module is used for selecting a feature dimension for voice emotion recognition from a plurality of feature dimensions according to the feature variation value and the feature threshold, wherein the feature variation value of the feature dimension for voice emotion recognition is not smaller than the feature threshold;
the parameter acquisition module is further used for: obtaining a feature average value of all user voices with the same emotion type in the plurality of sections of user voices in a feature dimension, obtaining the feature change value of the feature dimension between different emotion types according to the feature average value of the user voices under each emotion type in the feature dimension, and determining the feature threshold according to the feature change value of each feature dimension between different emotion types.
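Purely for illustration, a hypothetical skeleton showing how the four modules recited in this claim might map onto code (all names are invented; the parameter-acquisition and selection logic would reuse the sketch given earlier):

```python
class EmotionFeatureDevice:
    """Hypothetical skeleton mirroring the modules of the device claim."""

    def acquire_voices(self):
        """Voice acquisition module: return voice segments and their emotion types."""
        raise NotImplementedError  # e.g. read labelled recordings from a corpus

    def extract_features(self, segments):
        """Feature extraction module: return one feature vector per segment."""
        raise NotImplementedError  # e.g. statistics over preprocessed frames

    def acquire_parameters(self, features, labels):
        """Parameter acquisition module: per-dimension variation values and the threshold."""
        raise NotImplementedError  # see select_emotion_feature_dimensions above

    def select_features(self, variation, threshold):
        """Feature selection module: keep dimensions whose variation is not below the threshold."""
        return [i for i, v in enumerate(variation) if v >= threshold]
```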
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor performs the method of any of the preceding claims 1 to 7 by means of the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method as claimed in any one of claims 1 to 7.
CN202111007019.5A 2021-08-30 2021-08-30 Method and device for determining emotion characteristics, electronic equipment and storage medium Active CN113707184B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111007019.5A CN113707184B (en) 2021-08-30 2021-08-30 Method and device for determining emotion characteristics, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111007019.5A CN113707184B (en) 2021-08-30 2021-08-30 Method and device for determining emotion characteristics, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113707184A CN113707184A (en) 2021-11-26
CN113707184B true CN113707184B (en) 2023-05-05

Family

ID=78657024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007019.5A Active CN113707184B (en) 2021-08-30 2021-08-30 Method and device for determining emotion characteristics, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113707184B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036776A (en) * 2014-05-22 2014-09-10 毛峡 Speech emotion identification method applied to mobile terminal
JP6766675B2 (en) * 2017-02-15 2020-10-14 トヨタ自動車株式会社 Voice dialogue device
CN109961803A (en) * 2017-12-18 2019-07-02 上海智臻智能网络科技股份有限公司 Voice mood identifying system
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN113129926A (en) * 2019-12-30 2021-07-16 中移(上海)信息通信科技有限公司 Voice emotion recognition model training method, voice emotion recognition method and device
CN112489625A (en) * 2020-10-19 2021-03-12 厦门快商通科技股份有限公司 Voice emotion recognition method, system, mobile terminal and storage medium

Also Published As

Publication number Publication date
CN113707184A (en) 2021-11-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant