CN115410602A - Voice emotion recognition method and device and electronic equipment - Google Patents
- Publication number
- CN115410602A (application CN202211014757.7A)
- Authority
- CN
- China
- Prior art keywords
- voice
- emotion
- speech
- speaker
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Signal Processing (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
The invention provides a speech emotion recognition method and apparatus, and an electronic device. The method comprises the following steps: dividing a target speech segment into a plurality of speech slices of a set duration, and extracting a speech feature value for each speech slice; calculating, based on the speech feature values of the plurality of speech slices, the similarity between each speech slice and each of a plurality of speech samples of known emotion type; summing the similarities between the plurality of speech slices and the same speech sample to obtain the similarity between the target speech segment and that speech sample; and determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample. The method and apparatus can improve the accuracy of speech emotion recognition.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition method and apparatus, and an electronic device.
Background
Speech is the main medium of communication in daily life; it conveys not only ideas but also the speaker's emotion. Speech emotion recognition is an important branch of human-computer interaction technology, and its aim is to recognize human emotional states from speech.
Current speech emotion recognition schemes mainly analyze a speech segment as a whole to obtain the speaker's emotion information. However, the same speaker may express different emotions as the application scenario and time change. For example, a speaker may be happy at one moment and angry the next, in which case the single emotion type produced by current schemes cannot accurately represent the speaker's emotion information.
Disclosure of Invention
The invention provides a voice emotion recognition method and device and electronic equipment, which can improve the accuracy of voice emotion recognition.
In a first aspect, the present invention provides a speech emotion recognition method, including: dividing a target speech segment into a plurality of speech slices of a set duration, and extracting a speech feature value for each speech slice; calculating, based on the speech feature values of the plurality of speech slices, the similarity between each speech slice and each of a plurality of speech samples of known emotion type; summing the similarities between the plurality of speech slices and the same speech sample to obtain the similarity between the target speech segment and that speech sample; and determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample.
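The steps of the first aspect can be sketched minimally as follows, assuming cosine similarity over flattened feature vectors; the feature extraction itself is abstracted away, and all names here are illustrative rather than from the patent:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two flattened feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_segment(slice_feats, samples):
    """slice_feats: one feature vector per slice of the target segment.
    samples: (emotion_label, feature_vector) pairs of known emotion type.
    Sums slice-to-sample similarities per sample, then returns the
    emotion type of the best-scoring sample."""
    best_emotion, best_score = None, -np.inf
    for emotion, sample_vec in samples:
        score = sum(cosine_sim(s, sample_vec) for s in slice_feats)
        if score > best_score:
            best_emotion, best_score = emotion, score
    return best_emotion
```

Because every slice contributes to every sample's score, a segment whose emotion shifts mid-way still ends up with the label that dominates across slices.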
The speech emotion recognition method provided by the invention slices a target speech segment into a plurality of speech slices, calculates the similarity between each speech slice and each speech sample, sums the slice similarities to obtain the similarity between the target speech segment and each speech sample, and finally determines the emotion type of the target speech segment in combination with the emotion types of the speech samples. Because the similarity between every speech slice of the target segment and the speech samples is considered comprehensively, the finally determined emotion type is more accurate, improving the accuracy of speech emotion recognition.
In a possible implementation manner, after determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample, the method further includes: determining the emotion type of each speech slice based on the similarity between each speech slice and each of the plurality of speech samples of known emotion type; determining the emotion change trend of the target speech segment based on the emotion types of the speech slices; and determining, based on the emotion change trend of the target speech segment, the service level of the customer service staff serving the speaker of the target speech segment.
In one possible implementation manner, determining the service level of the customer service staff based on the emotion change trend of the target speech segment includes: if the emotion change trend goes from a positive state to a negative state, or from a neutral state to a negative state, determining that the service level of the customer service staff is low; if the emotion change trend goes from a negative state to a positive state, or from a neutral state to a positive state, determining that the service level of the customer service staff is high. The positive state includes happiness, and the negative state includes anger, fear, sadness, or surprise.
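The rule above maps a start/end valence pair to a service rating. A toy version, where the valence grouping follows the text but the function names, labels, and the "unrated" fallback are my own:

```python
POSITIVE = {"happy"}
NEGATIVE = {"angry", "fear", "sad", "surprised"}

def valence(emotion):
    # Map a concrete emotion type onto a positive/neutral/negative scale.
    if emotion in POSITIVE:
        return "positive"
    if emotion in NEGATIVE:
        return "negative"
    return "neutral"

def rate_service(trend_start, trend_end):
    """Rate the agent from the customer's emotion change trend."""
    start, end = valence(trend_start), valence(trend_end)
    if start in ("positive", "neutral") and end == "negative":
        return "low"
    if start in ("negative", "neutral") and end == "positive":
        return "high"
    return "unrated"  # no clear change; the text leaves this case open
```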
In a possible implementation manner, after determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample, the method further includes: recording, for a conversation between a first speaker and a second speaker within a set time period, the emotion types of a plurality of speech segments of the first speaker and of a plurality of speech segments of the second speaker, where the speech segments of the first speaker correspond one-to-one to those of the second speaker; and, if the emotion type of the first speaker undergoes a set change while the emotion type of the second speaker remains neutral and unchanged, determining that the second speaker is a robot. The set change is one of: from a positive state to a negative state, from neutral to a negative state, from a negative state to a positive state, or from neutral to a positive state.
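The robot check can be sketched as follows. This is a minimal interpretation: the valence mapping and the reading of "set change" as a difference between the first and last segments are assumptions, not spelled out in the patent text:

```python
def valence(emotion):
    if emotion == "happy":
        return "positive"
    if emotion == "neutral":
        return "neutral"
    return "negative"  # angry, fear, sad, surprised

def second_speaker_is_robot(first_emotions, second_emotions):
    """first_emotions/second_emotions: per-segment emotion labels from a
    conversation, aligned one-to-one within the set time period."""
    # Set change: the first speaker's valence differs between the first
    # and last segments (covers all four transitions listed in the text).
    set_change = valence(first_emotions[0]) != valence(first_emotions[-1])
    # The second speaker stays neutral and unchanged throughout.
    stays_neutral = all(e == "neutral" for e in second_emotions)
    return set_change and stays_neutral
```

The intuition is that a human agent's emotion tends to co-vary with the other party's, whereas a synthesized voice stays flat.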
In a possible implementation manner, after determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample, the method further includes: recording a plurality of speech segments of the same speaker within a set time period, together with their emotion types; determining the speaker's emotion change trend based on those emotion types; and determining the speaker's speech characteristic parameters from the speech segments, the parameters including speech rate, intonation, speech loudness, short-time zero-crossing rate, and volume. If the emotion change trend of the speaker remains neutral, the absolute value of the short-time zero-crossing rate is smaller than a set value, and the volume is lower than a set volume, the speaker is determined to be at risk of Parkinson's syndrome; if the emotion change trend remains neutral, the speech rate is lower than a set rate, the intonation is lower than a set intonation, and the loudness is lower than a set loudness, the speaker is determined to be at risk of depression.
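These screening rules are simple threshold checks. A schematic version follows; every `*_limit` value is an illustrative placeholder (the patent specifies none of them), and this is a sketch of the claimed logic, not a diagnostic tool:

```python
def screen_health_risks(trend, speech_rate, intonation, loudness, zcr, volume,
                        zcr_limit=0.05, volume_limit=40.0,
                        rate_limit=2.0, intonation_limit=120.0,
                        loudness_limit=45.0):
    """Return a list of risk flags; all *_limit thresholds are placeholders."""
    risks = []
    # Neutral trend + low zero-crossing rate + low volume -> Parkinson's risk.
    if trend == "neutral" and abs(zcr) < zcr_limit and volume < volume_limit:
        risks.append("parkinson_syndrome_risk")
    # Neutral trend + slow, flat, quiet speech -> depression risk.
    if (trend == "neutral" and speech_rate < rate_limit
            and intonation < intonation_limit and loudness < loudness_limit):
        risks.append("depression_risk")
    return risks
```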
In one possible implementation, the speech feature values include Mel-frequency cepstral coefficients, short-time zero-crossing rate, short-time Fourier transform spectrum, decibel volume, and constant-Q transform spectrum; the similarity calculation methods include the cosine similarity algorithm, the adjusted cosine similarity algorithm, the Euclidean distance algorithm, and the Jaccard similarity algorithm. After determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample, the method further includes the following steps. Step one: recording M speech segments of the same speaker in a target scene, where M ≥ 2. Step two: determining the emotion types of the M speech segments using one combined speech emotion recognition algorithm, that is, one class of speech feature value paired with one similarity calculation method. Step three: determining the number K (0 < K ≤ M) of segments whose recognized emotion type agrees with the actually observed type, and taking R = K/M as the consistency score of that combined algorithm. Step four: changing the speech feature value class or the similarity calculation method in the combined algorithm, and repeating steps two and three until every combined algorithm has been evaluated. Step five: determining the combination with the highest consistency score as the speech emotion recognition algorithm best suited to the target scene and speaker.
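Steps one through five amount to a grid search over (feature, similarity) pairs scored by the agreement ratio R = K/M. A sketch, assuming a caller-supplied `recognize` function (the patent does not define one):

```python
from itertools import product

FEATURES = ["mfcc", "zcr", "stft", "db_volume", "cqt"]
SIMILARITIES = ["cosine", "adjusted_cosine", "euclidean", "jaccard"]

def best_combination(segments, observed, recognize):
    """segments: M >= 2 speech segments of one speaker in the target scene.
    observed: the M actually observed emotion types.
    recognize(segment, feature, similarity) -> predicted emotion label.
    Returns the (feature, similarity) pair with the highest R = K/M."""
    best, best_r = None, -1.0
    for feature, similarity in product(FEATURES, SIMILARITIES):
        # K = number of segments where prediction matches observation.
        k = sum(1 for seg, obs in zip(segments, observed)
                if recognize(seg, feature, similarity) == obs)
        r = k / len(segments)  # consistency score R = K/M
        if r > best_r:
            best, best_r = (feature, similarity), r
    return best, best_r
```

With five feature classes and four similarity methods this evaluates 20 combinations, which is cheap enough to rerun per scene and speaker.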
In one possible implementation manner, determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample includes: determining the emotion type corresponding to the speech sample with the greatest similarity as the emotion type of the target speech segment.
In a second aspect, an embodiment of the present invention provides a speech emotion recognition apparatus, including: a communication module for acquiring a target speech segment; and a processing module for dividing the target speech segment into a plurality of speech slices of a set duration and extracting a speech feature value for each speech slice; calculating, based on the speech feature values of the plurality of speech slices, the similarity between each speech slice and each of a plurality of speech samples of known emotion type; summing the similarities between the plurality of speech slices and the same speech sample to obtain the similarity between the target speech segment and that speech sample; and determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample.
In a possible implementation manner, the processing module is further configured to determine the emotion type of each speech slice based on the similarity between each speech slice and each of the plurality of speech samples of known emotion type; determine the emotion change trend of the target speech segment based on the emotion types of the speech slices; and determine, based on that emotion change trend, the service level of the customer service staff serving the speaker of the target speech segment.
In one possible implementation, the processing module is specifically configured to determine that the service level of the customer service staff is low if the emotion change trend goes from a positive state to a negative state, or from a neutral state to a negative state; and that the service level is high if the trend goes from a negative state to a positive state, or from a neutral state to a positive state. The positive state includes happiness, and the negative state includes anger, fear, sadness, or surprise.
In a possible implementation manner, the processing module is further configured to record, for a conversation between a first speaker and a second speaker within a set time period, the emotion types of a plurality of speech segments of the first speaker and of a plurality of speech segments of the second speaker, the speech segments of the two speakers corresponding one to one; and, if the emotion type of the first speaker undergoes a set change while the emotion type of the second speaker remains neutral and unchanged, determine that the second speaker is a robot. The set change is one of: from a positive state to a negative state, from neutral to a negative state, from a negative state to a positive state, or from neutral to a positive state.
In a possible implementation manner, the processing module is further configured to record a plurality of speech segments of the same speaker within a set time period together with their emotion types; determine the speaker's emotion change trend based on those emotion types; and determine the speaker's speech characteristic parameters, including speech rate, intonation, speech loudness, short-time zero-crossing rate, and volume, from the speech segments. If the emotion change trend remains neutral, the absolute value of the short-time zero-crossing rate is smaller than a set value, and the volume is lower than a set volume, the speaker is determined to be at risk of Parkinson's syndrome; if the trend remains neutral, the speech rate is lower than a set rate, the intonation is lower than a set intonation, and the loudness is lower than a set loudness, the speaker is determined to be at risk of depression.
In one possible implementation, the speech feature values include Mel-frequency cepstral coefficients, short-time zero-crossing rate, short-time Fourier transform spectrum, decibel volume, and constant-Q transform spectrum, and the similarity calculation methods include the cosine similarity algorithm, the adjusted cosine similarity algorithm, the Euclidean distance algorithm, and the Jaccard similarity algorithm. The processing module is further configured to execute the following steps. Step one: record M speech segments of the same speaker in a target scene, where M ≥ 2. Step two: determine the emotion types of the M speech segments using one combined speech emotion recognition algorithm, that is, one class of speech feature value paired with one similarity calculation method. Step three: determine the number K (0 < K ≤ M) of segments whose recognized emotion type agrees with the actually observed type, and take R = K/M as the consistency score of that combined algorithm. Step four: change the speech feature value class or the similarity calculation method and repeat steps two and three until every combined algorithm has been evaluated. Step five: determine the combination with the highest consistency score as the speech emotion recognition algorithm best suited to the target scene and speaker.
In a possible implementation manner, the processing module is specifically configured to determine the emotion type corresponding to the speech sample with the greatest similarity as the emotion type of the target speech segment.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores a computer program, and the processor is configured to call and execute the computer program stored in the memory to perform the steps of the method according to any one of the foregoing first aspect and possible implementation manners of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium, where a computer program is stored, where the computer program is configured to, when executed by a processor, implement the steps of the method according to the first aspect and any possible implementation manner of the first aspect.
For the technical effects of any implementation of the second to fourth aspects, refer to the technical effects of the corresponding implementations of the first aspect; they are not repeated here.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a speech emotion recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech emotion recognition method provided by the embodiment of the invention;
FIG. 3 is a flow chart of another speech emotion recognition method provided by the embodiment of the invention;
FIG. 4 is a flow chart of another speech emotion recognition method provided by the embodiment of the invention;
FIG. 5 is a schematic structural diagram of an apparatus for speech emotion recognition according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In the description of the present invention, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" herein merely describes an association between objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. Further, "at least one" means one or more, and "a plurality" means two or more. The terms "first", "second", and the like are used to distinguish between objects and do not limit their number or execution order.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion for ease of understanding.
Furthermore, the terms "including" and "having," and any variations thereof, as referred to in the description of the present application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, article, or apparatus.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following description is made by way of specific embodiments with reference to the accompanying drawings.
As described in the background, the accuracy of speech emotion recognition is low. Speech emotion recognition mainly comprises two steps: feature extraction and classifier construction. In practice, the definition of speech emotion is fuzzy and not unified. Emotion types are diverse, complex, and changeable: the emotion expressed by speech can differ across application scenarios and conversation environments, and even when different speakers utter the same words, the emotion type can differ. Moreover, speech segments are of limited and varying duration, so an emotion recognition result may not fully express the speaker's emotion information.
To solve the above technical problem, an embodiment of the present invention provides a speech emotion recognition method, as shown in fig. 1. The execution subject is a speech emotion recognition apparatus, and the method comprises steps S101 to S104.
S101, dividing the target voice segment into a plurality of voice slices, and extracting a voice characteristic value of each voice slice.
In some embodiments, the target speech segment is a segment of a speaker's speech. Its duration may be, for example, 5 minutes or 3 minutes; this application does not limit it.
In some embodiments, a speech slice is a speech sub-segment of a set duration. The duration of a speech slice may be, for example, 2 seconds or 3 seconds; this application does not limit it.
As a possible implementation manner, the speech emotion recognition apparatus may segment the target speech segment into a plurality of speech slices of a set duration in time order.
As another possible implementation, the speech emotion recognition apparatus may divide the target speech segment into a plurality of speech slices in time order based on discontinuities (pauses) in the speaker's speech.
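The fixed-duration policy can be sketched as follows; dropping a trailing remainder shorter than one slice is my choice of policy, as the patent does not specify how a partial slice is handled:

```python
import numpy as np

def slice_waveform(y, sr, slice_seconds=2.0):
    """Split waveform y (sampled at sr Hz) into consecutive slices of
    slice_seconds each, in time order; a trailing remainder shorter
    than one slice is dropped."""
    n = int(sr * slice_seconds)
    return [y[i:i + n] for i in range(0, len(y) - n + 1, n)]
```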
In some embodiments, a speech feature value is a two-dimensional matrix. The speech feature values include Mel-frequency cepstral coefficients, short-time zero-crossing rate, short-time Fourier transform spectrum, decibel volume, and constant-Q transform spectrum.
Illustratively, Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up a Mel-frequency cepstrum. The frequency bands of the Mel-frequency cepstrum are equally spaced on the Mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands of the normal log cepstrum. This nonlinear representation allows MFCCs to represent a sound signal better in many domains. For example, the speech emotion recognition apparatus may calculate this feature for each speech slice using an MFCC function such as librosa.feature.mfcc.
For another example, the short-time zero-crossing rate is a characteristic parameter in time-domain analysis of a speech signal; it is the number of times the signal crosses zero within each frame. For a continuous speech signal plotted against time, one can observe where the time-domain waveform crosses the horizontal axis, so the short-time average zero-crossing rate can be used to separate the speech signal from background noise and to distinguish unvoiced from voiced speech. For example, the speech emotion recognition apparatus may calculate the short-time zero-crossing rate for each speech slice using a function such as librosa.feature.zero_crossing_rate.
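The frame-wise zero-crossing count described here can be written directly in NumPy; the frame and hop lengths below are typical values for 16 kHz audio, not taken from the patent (librosa.feature.zero_crossing_rate computes essentially this quantity):

```python
import numpy as np

def short_time_zcr(y, frame_len=400, hop=160):
    """Fraction of adjacent-sample sign changes per frame."""
    frames = [y[i:i + frame_len]
              for i in range(0, len(y) - frame_len + 1, hop)]
    # A sign change between neighbours counts as one zero crossing.
    return np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0)
                     for f in frames])
```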
As yet another example, the short-time Fourier transform (STFT) spectrum extends the discrete Fourier transform (DFT). Any periodic signal can be represented as a linear combination of sine and cosine signals, and the DFT finds these components, giving each its frequency, amplitude, and phase. However, when the signal does not contain an integer number of cycles within the N acquired sampling points, the DFT suffers from frequency leakage, which affects reconstruction of the original signal from its spectrum; to reduce this effect, the original signal can be multiplied by a window function (tapering to zero at both ends) before the DFT. On this basis, the STFT reflects how the signal changes over time by taking small windowed segments of the signal and applying the DFT to each. For example, the speech emotion recognition apparatus may calculate the STFT spectrum for each speech slice using a function such as librosa.stft.
As yet another example, the decibel volume refers to converting the power spectrum to decibels. For example, the speech emotion recognition apparatus may calculate the decibel volume for each speech slice using a power-to-decibel function such as librosa.power_to_db.
As another example, the constant-Q transform (CQT) uses a filter bank whose center frequencies are exponentially distributed and whose filter bandwidths differ, but in which the ratio of center frequency to bandwidth is a constant Q. It differs from the Fourier transform in that the frequency axis of its spectrum is not linear but log2-based, and the filter window length can vary with the spectral-line frequency to obtain better performance. Since the CQT bins and musical scale frequencies share the same distribution, the amplitude of a music signal at each note frequency can be obtained directly by calculating its CQT spectrum, making the CQT well suited to music signal processing. For example, the speech emotion recognition apparatus may calculate the constant-Q transform spectrum for each speech slice using a call such as librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=16000)), ref=np.max).
And S102, calculating the similarity between each voice slice in the voice slices and each voice sample in the voice samples with known emotion types based on the voice characteristic values of the voice slices.
In some embodiments, the speech sample is a segment of speech of which the emotion type is known in advance.
In some embodiments, emotion types may include anger, happy, afraid, sad, surprised, and neutral.
Illustratively, the emotion types may include a positive state, a neutral state, and a negative state, wherein the positive state includes happy, and the negative state includes anger, fear, sadness, or surprise.
It should be noted that the voice samples may be samples of the 6 emotions of the CASIA Chinese emotion corpus, constituting 6 types of voice samples with known emotion types.
As a possible implementation manner, the speech emotion recognition device may extract a speech feature value of each speech sample, extract a speech feature value of each speech slice, and then calculate a similarity between each speech slice in the plurality of speech slices and each speech sample in the plurality of speech samples with known emotion types based on the speech feature values of the plurality of speech slices.
In some embodiments, the similarity calculation methods include a cosine similarity algorithm, an adjusted cosine similarity algorithm, a Euclidean distance algorithm, and a Jaccard similarity algorithm.
Illustratively, in the cosine similarity algorithm, the cosine of the angle between two vectors in a vector space is used as a measure of the difference between two individuals; cosine similarity is concerned with the difference in direction between the two vectors. The more similar the two vectors, the smaller the angle between them and the larger the cosine value; a negative value indicates that the two vectors are inversely related.
As a possible implementation, the speech emotion recognition device may determine the similarity between each speech slice and each speech sample based on the following formula.
cos θ = ΣAiBi / (√(ΣAi²) · √(ΣBi²))

wherein cos θ is the cosine similarity, Ai is the i-th element of the vector of voice feature values of the voice slice, and Bi is the i-th element of the vector of voice feature values of the voice sample.
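A minimal pure-Python sketch of this cosine similarity calculation (the function name is illustrative; any two equal-length numeric feature vectors may be substituted):

```python
import math

def cosine_similarity(a, b):
    # cos θ = Σ a_i·b_i / (√Σ a_i² · √Σ b_i²)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical vectors give 1, orthogonal vectors give 0, and opposite vectors give −1, matching the description above.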
For example, although cosine similarity can correct, to some extent, the bias existing between individuals, it can only distinguish differences in direction between individuals and cannot measure the difference in the value of each dimension. The adjusted cosine similarity subtracts the mean from the value of each dimension and then computes the cosine similarity, and the calculation result accords better with reality.
As a possible implementation, the speech emotion recognition apparatus may determine the similarity between each speech slice and each speech sample based on the following formula.
cos θ' = Σ(Ai − Ā)(Bi − B̄) / (√(Σ(Ai − Ā)²) · √(Σ(Bi − B̄)²))

wherein cos θ' is the adjusted cosine similarity, Ai is the i-th element of the vector of voice feature values of the voice slice, Bi is the i-th element of the vector of voice feature values of the voice sample, Ā is the average of the elements of the vector of voice feature values of the voice slice, and B̄ is the average of the elements of the vector of voice feature values of the voice sample.
For example, the euclidean distance algorithm measures the absolute distance between points in a multidimensional space.
As a possible implementation, the speech emotion recognition apparatus may determine the similarity between each speech slice and each speech sample based on the following formula.
R = √(Σ(Ai − Bi)²)

wherein R is the Euclidean distance, Ai is the i-th element of the vector of voice feature values of the voice slice, and Bi is the i-th element of the vector of voice feature values of the voice sample.
Illustratively, the Jaccard similarity is generally used to measure the difference between two sets.
As a possible implementation, the speech emotion recognition apparatus may determine the similarity between each speech slice and each speech sample based on the following formula.
J = |C ∩ D| / |C ∪ D|

wherein J is the Jaccard similarity, C is the set formed by the elements of the vector of voice feature values of the voice slice, and D is the set formed by the elements of the vector of voice feature values of the voice sample.
S103, summing the similarities between the plurality of voice slices and the same voice sample to obtain the similarity between the target voice segment and that voice sample.
It should be noted that the emotion types of the multiple voice slices may be different types. The speech emotion recognition device can sum the similarity between the multiple speech slices and the same speech sample to obtain the similarity between the target speech segment and the speech sample, so as to comprehensively consider the emotion types of different types in the multiple speech slices.
And S104, determining the emotion type of the target voice fragment based on the similarity between the target voice fragment and each voice sample and the emotion type of each voice sample.
As a possible implementation manner, the speech emotion recognition apparatus may determine the emotion type corresponding to the speech sample with the largest similarity as the emotion type of the target speech segment.
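Steps S103 and S104 together can be sketched as follows. This is a simplified illustration: `similarity` stands for any similarity function in which larger means more similar, and the `(label, vector)` layout of `samples` is an assumption of ours, not specified by the disclosure.

```python
def classify_segment(slice_features, samples, similarity):
    # samples: list of (emotion_label, feature_vector) with known emotion types.
    # S103: sum each slice's similarity to a sample to score the whole segment.
    # S104: pick the emotion of the best-scoring sample.
    best_label, best_score = None, float("-inf")
    for label, sample_vec in samples:
        score = sum(similarity(sl, sample_vec) for sl in slice_features)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```

Summing over all slices before taking the maximum is what lets the segment-level decision weigh every slice, including slices whose individual emotion differs from the majority.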
The invention provides a voice emotion recognition method, which comprises the steps of slicing a target voice fragment to obtain a plurality of voice slices, calculating the similarity between each voice slice and a voice sample, summing the similarities of the voice slices in the target voice fragment to obtain the similarity between the target voice fragment and the voice sample, and finally determining the emotion type of the target voice fragment by combining the emotion types of the voice samples. The similarity between each voice slice in the target voice segment and the voice sample is comprehensively considered, so that the finally determined emotion type is more accurate, and the accuracy of voice emotion recognition is improved.
Illustratively, as shown in table 1, the speech emotion recognition device segments the target speech segment into 27 speech slices, and calculates the cosine similarity between the Mel frequency cepstrum coefficients of each speech slice and those of the angry sample. The speech emotion recognition device then sums the calculated cosine similarities to obtain the cosine similarity between the Mel frequency cepstrum coefficients of the target speech segment and the angry sample. The cosine similarity calculated by the speech emotion recognition device is -312.30.
TABLE 1
For another example, as shown in table 2, the speech emotion recognition apparatus may calculate the similarity between the target speech segment and each speech sample one by one. As can be seen from table 2, the similarity between the target speech segment and the neutral sample is the greatest, and the speech emotion recognition apparatus can determine that the emotion type of the target speech segment is neutral.
TABLE 2
Optionally, as shown in fig. 2, the speech emotion recognition method provided in the embodiment of the present invention further includes steps S201 to S203 after step S104.
S201, determining the emotion type of each voice slice based on the similarity between each voice slice in the voice slices and each voice sample in the voice samples with known emotion types.
For example, for any voice slice, if the similarity between the voice slice and a voice sample is the greatest, the voice emotion recognition apparatus may determine that the emotion type of the voice slice is the emotion type corresponding to the voice sample.
S202, determining the emotion change trend of the target voice segment based on the emotion types of the voice slices.
Illustratively, an emotion change trend may include a transition from an active state to neutral, a transition from an active state to a passive state, a transition from neutral to a passive state, a transition from a neutral to an active state, a transition from a passive state to an active state, or a transition from a passive state to neutral.
As a possible implementation manner, the speech emotion recognition device can sequence the emotion types of the speech slices in time sequence and then determine the emotion change trend of the target speech segment.
S203, determining the service level of the speaker of the target voice segment corresponding to the customer service staff based on the emotion change trend of the target voice segment.
As a possible implementation, if the emotion change tendency changes from an active state to a passive state or from a neutral state to a passive state, the speech emotion recognition apparatus may determine that the service level of the customer service person is low.
As another possible implementation, if the emotion change tendency changes from a negative state to a positive state or from a neutral state to a positive state, the speech emotion recognition apparatus may determine that the service level of the customer service person is high.
Wherein the positive state comprises happy, and the negative state comprises angry, fear, sadness or surprise.
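A sketch of this service-level judgment follows. The emotion labels and the simplification of comparing only the first and last slice polarities are our assumptions; the embodiment itself specifies only the state transitions and the resulting service levels.

```python
POSITIVE = {"happy"}
NEGATIVE = {"angry", "afraid", "sad", "surprised"}

def polarity(emotion):
    # Map a fine-grained emotion label to a positive / neutral / negative state
    if emotion in POSITIVE:
        return "positive"
    if emotion in NEGATIVE:
        return "negative"
    return "neutral"

def service_level(slice_emotions):
    # slice_emotions: per-slice emotion types, already in chronological order.
    # Compare the first and last polarity (S202), then grade the customer
    # service interaction (S203); other transitions are left ungraded.
    start, end = polarity(slice_emotions[0]), polarity(slice_emotions[-1])
    if (start, end) in {("positive", "negative"), ("neutral", "negative")}:
        return "low"
    if (start, end) in {("negative", "positive"), ("neutral", "positive")}:
        return "high"
    return "unrated"
```

A slide toward a negative state grades the service low; a recovery toward a positive state grades it high, mirroring the two implementations above.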
Therefore, the embodiment of the invention can determine the service level of the customer service staff based on the emotion change trend of the target voice fragment and provide data support for the customer service staff to change the service strategy.
As a possible implementation manner, the voice emotion recognition device can also send out first reminding information when the emotion change trend changes from a positive state to a negative state or from a neutral state to the negative state so as to prompt customer service personnel to change the service strategy and avoid further deterioration of customer satisfaction.
As another possible implementation manner, the voice emotion recognition device can also send out a second reminding message when the emotion change trend changes from a negative state to a positive state or from a neutral state to a positive state so as to prompt the customer service personnel to maintain the current service strategy.
Optionally, as shown in fig. 3, the speech emotion recognition method provided in the embodiment of the present invention further includes steps S301 to S302 after step S104.
S301, recording emotion types of a plurality of voice segments of a first speaker and emotion types of a plurality of voice segments of a second speaker when the first speaker and the second speaker have a conversation in a set time period.
The plurality of voice segments of the first speaker correspond to the plurality of voice segments of the second speaker one by one.
And S302, if the emotion type of the first speaker undergoes a set change and the emotion type of the second speaker remains neutral and unchanged, determining that the second speaker is a robot.
Wherein the setting change comprises one of: from an active state to a passive state, from neutral to a passive state, from a passive state to an active state, and from neutral to an active state.
Therefore, the embodiment of the invention can judge whether a speaker is a robot based on whether the speaker's emotion type undergoes a set change, thereby distinguishing robots from real persons.
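The robot-versus-human judgment of steps S301 to S302 can be sketched as follows (a simplified illustration assuming the emotion types of both speakers have already been reduced to active/neutral/passive states, and that a set change is detected by comparing the first and last states):

```python
# The four "set changes" enumerated by the embodiment
SET_CHANGES = {("active", "passive"), ("neutral", "passive"),
               ("passive", "active"), ("neutral", "active")}

def detect_robot(first_emotions, second_emotions):
    # A set change in the first speaker's emotion while the second speaker
    # stays neutral throughout suggests the second speaker is a robot
    first_changed = (first_emotions[0], first_emotions[-1]) in SET_CHANGES
    second_neutral = all(e == "neutral" for e in second_emotions)
    return first_changed and second_neutral
```

The intuition is that a human interlocutor would normally show some emotional response when the other party's emotion shifts, whereas a synthesized voice stays flat.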
Optionally, as shown in fig. 4, the speech emotion recognition method provided in the embodiment of the present invention further includes steps S401 to S405 after step S104.
S401, recording a plurality of voice segments of the same speaker in a set time period and emotion types of the voice segments.
S402, determining the emotion change trend of the speaker based on the emotion types of the voice segments.
S403, determining the voice characteristic parameters of the speaker based on the plurality of voice segments of the speaker.
In the embodiment of the present application, the speech characteristic parameters include speech rate, intonation, speech volume, short-time zero-crossing rate, and volume.
S404, if the emotion change trend of the speaker is kept neutral, the absolute value of the short-time zero-crossing rate is smaller than a set value, and the volume is smaller than the set volume, determining that the speaker is at risk of having Parkinson' S syndrome.
S405, if the emotion change trend of the speaker is kept neutral, the voice speed is less than the set voice speed, the voice tone is less than the set voice tone, and the voice volume is less than the set voice volume, determining that the speaker is at risk of suffering from depression.
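The rules of steps S404 and S405 can be sketched as a simple threshold check. The dictionary keys and the interface are illustrative assumptions; the set values themselves are not specified in this disclosure, and this sketch is a screening heuristic, not a diagnostic tool.

```python
def health_risk(trend_neutral, features, thresholds):
    # features / thresholds: dicts keyed by the parameters named in S403.
    # Returns the list of risk flags implied by the S404 / S405 rules.
    risks = []
    if not trend_neutral:
        return risks
    # S404: neutral trend + low |short-time zero-crossing rate| + low volume
    if (abs(features["short_time_zcr"]) < thresholds["zcr"]
            and features["volume"] < thresholds["volume"]):
        risks.append("parkinsonism")
    # S405: neutral trend + low speech rate + low intonation + low speech volume
    if (features["speech_rate"] < thresholds["speech_rate"]
            and features["intonation"] < thresholds["intonation"]
            and features["speech_volume"] < thresholds["speech_volume"]):
        risks.append("depression")
    return risks
```

Both rules require the neutral emotion trend as a precondition; the voice feature parameters then discriminate between the two risk profiles.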
Therefore, the embodiment of the invention can determine whether the speaker is at risk of suffering from the specific disease or not based on the emotion change trend and the voice characteristic parameters, and prompt in time so as to prompt the speaker to seek medical advice in time and avoid missing the optimal treatment opportunity.
Optionally, after step S104, the speech emotion recognition method provided in the embodiment of the present invention further includes steps one to five.
The method comprises the following steps: recording M voice segments of the same speaker in a target scene; wherein M is more than or equal to 2.
Step two: and determining the emotion types of the M voice segments based on a combined algorithm of voice emotion recognition.
In some embodiments, the combination algorithm includes a class of speech feature values and a class of similarity calculation methods.
Step three: determining the number K of the M voice segments whose determined emotion types are consistent with the actually observed emotion types, and determining R = K/M as the consistency judgment result corresponding to the combined algorithm of speech emotion recognition; wherein K is more than 0 and less than or equal to M.
Step four: changing the type of voice feature value or the similarity calculation method in the combined algorithm of speech emotion recognition, and repeating steps two and three until all combined algorithms of speech emotion recognition have been calculated.
Step five: and determining the combined algorithm with the highest consistency judgment result as the speech emotion recognition algorithm suitable for the target scene and the speaker.
Therefore, the embodiment of the invention can determine the emotion types of the voice segments based on different combination algorithms of the voice characteristic value and the similarity algorithm, and determine the consistency of each combination algorithm and the actual emotion types by combining the actual emotion types of the voice segments. And then determining the combined algorithm with the highest consistency judgment result as the speech emotion recognition algorithm suitable for the target scene and the speaker, thereby realizing the optimized matching of the speech emotion recognition algorithm.
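Steps one to five above can be sketched as a grid search over feature/similarity combinations. The `evaluate` callback, which returns K (segments whose predicted emotion matches the observation) and M for one combination, is an assumed interface standing in for steps two and three.

```python
def best_combination(feature_kinds, similarity_kinds, evaluate):
    # Try every (feature type, similarity method) pair, score each by the
    # consistency judgment result R = K / M, and keep the best one (step five)
    best, best_r = None, -1.0
    for f in feature_kinds:
        for s in similarity_kinds:
            k, m = evaluate(f, s)
            r = k / m
            if r > best_r:
                best, best_r = (f, s), r
    return best, best_r
```

The winning pair is then used as the speech emotion recognition algorithm for that target scene and speaker.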
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
The following are embodiments of the apparatus of the invention, reference being made to the corresponding method embodiments described above for details which are not described in detail therein.
Fig. 5 shows a schematic structural diagram of a speech emotion recognition apparatus according to an embodiment of the present invention. The speech emotion recognition apparatus 500 includes a communication module 501 and a processing module 502.
A communication module 501, configured to obtain a target voice segment.
The processing module 502 is configured to divide the target voice segment into a plurality of voice slices with set duration, and extract a voice feature value of each voice slice; calculating the similarity between each voice slice in the plurality of voice slices and each voice sample in the plurality of voice samples with known emotion types based on the voice characteristic values of the plurality of voice slices; summing the similarity between a plurality of voice slices and the same voice sample to obtain the similarity between a target voice segment and the voice sample; and determining the emotion type of the target voice fragment based on the similarity between the target voice fragment and each voice sample and the emotion type of each voice sample.
In a possible implementation manner, the processing module 502 is further configured to determine an emotion type of each voice slice based on a similarity between each voice slice in the plurality of voice slices and each voice sample in the plurality of voice samples with known emotion types; determining the emotion change trend of the target voice segment based on the emotion types of the voice slices; and determining the service level of the speaker of the target voice segment corresponding to the customer service staff based on the emotion change trend of the target voice segment.
In one possible implementation, the processing module 502 is specifically configured to determine that the service level of the customer service person is low if the emotion change tendency changes from an active state to a passive state or from neutral to a passive state; if the emotion change trend is changed from a negative state to an active state or from a neutral state to the active state, determining that the service level of the customer service staff is higher; wherein the positive state comprises happy, the negative state comprises angry, fear, sadness or surprise.
In a possible implementation manner, the processing module 502 is further configured to record emotion types of a plurality of voice segments of a first speaker and emotion types of a plurality of voice segments of a second speaker when the first speaker and the second speaker have a conversation within a set time period; the method comprises the steps that a plurality of voice segments of a first speaker correspond to a plurality of voice segments of a second speaker one by one; if the emotion type of the first speaker is changed in a set mode and the emotion type of the second speaker is kept neutral and unchanged, determining that the second speaker is the robot; wherein the setting change comprises one of: from an active state to a passive state, from neutral to a passive state, from a passive state to an active state, and from neutral to an active state.
In a possible implementation manner, the processing module 502 is further configured to record a plurality of voice segments of the same speaker in a set time period and emotion types of the plurality of voice segments; determining the emotion change trend of the speaker based on the emotion types of the voice segments; determining voice characteristic parameters of the speaker based on a plurality of voice segments of the speaker, wherein the voice characteristic parameters comprise a voice speed, a tone, a voice volume, a short-time zero-crossing rate and a volume; if the emotion change trend of the speaker is kept neutral, the absolute value of the short-time zero-crossing rate is smaller than a set value, and the volume is smaller than the set volume, determining that the speaker is at risk of having Parkinson's syndrome; and if the emotion change trend of the speaker is kept neutral, the speech speed is less than the set speech speed, the intonation is less than the set intonation and the speech volume is less than the set speech volume, determining that the speaker is at risk of suffering from depression.
In one possible implementation, the speech feature values include Mel frequency cepstral coefficients, short-time zero-crossing rate, short-time Fourier transform spectrum, decibel volume, and constant Q transform spectrum; the similarity calculation methods include a cosine similarity algorithm, an adjusted cosine similarity algorithm, a Euclidean distance algorithm, and a Jaccard similarity algorithm; and the processing module 502 is further configured to perform the following steps. Step one: recording M voice segments of the same speaker in a target scene, wherein M is more than or equal to 2. Step two: determining the emotion types of the M voice segments based on a combined algorithm of speech emotion recognition, wherein the combined algorithm includes one type of voice feature value and one similarity calculation method. Step three: determining the number K of the M voice segments whose determined emotion types are consistent with the actually observed emotion types, and determining R = K/M as the consistency judgment result corresponding to the combined algorithm; wherein K is more than 0 and less than or equal to M. Step four: changing the type of voice feature value or the similarity calculation method in the combined algorithm, and repeating steps two and three until all combined algorithms have been calculated. Step five: determining the combined algorithm with the highest consistency judgment result as the speech emotion recognition algorithm suitable for the target scene and the speaker.
In a possible implementation manner, the processing module 502 is specifically configured to determine an emotion type corresponding to the voice sample with the largest similarity as an emotion type of the target voice segment.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic apparatus 600 of this embodiment includes: a processor 601, a memory 602, and a computer program 603 stored in said memory 602 and executable on said processor 601. The processor 601, when executing the computer program 603, implements the steps in the above-described method embodiments, such as the steps 101 to 104 shown in fig. 1. Alternatively, the processor 601, when executing the computer program 603, implements the functions of each module/unit in each device embodiment described above, for example, the functions of the communication module 501 and the processing module 502 shown in fig. 5.
Illustratively, the computer program 603 may be partitioned into one or more modules/units, which are stored in the memory 602 and executed by the processor 601 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing certain functions, which are used to describe the execution of the computer program 603 in the electronic device 600. For example, the computer program 603 may be divided into a communication module 501 and a processing module 502 shown in fig. 5.
The Processor 601 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 602 may be an internal storage unit of the electronic device 600, such as a hard disk or a memory of the electronic device 600. The memory 602 may also be an external storage device of the electronic device 600, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) provided on the electronic device 600. Further, the memory 602 may include both an internal storage unit and an external storage device of the electronic device 600. The memory 602 is used for storing the computer program and other programs and data required by the terminal. The memory 602 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal and method may be implemented in other ways. For example, the above-described apparatus/terminal embodiments are merely illustrative, and for example, the division of the modules or units is only one type of logical function division, and other division manners may exist in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier wave signal, a telecommunications signal, a software distribution medium, and the like.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.
Claims (10)
1. A speech emotion recognition method is characterized by comprising the following steps:
dividing the target voice segment into a plurality of voice slices, and extracting a voice characteristic value of each voice slice;
calculating the similarity between each voice slice in the plurality of voice slices and each voice sample in the plurality of voice samples with known emotion types based on the voice characteristic values of the plurality of voice slices;
summing the similarities between the plurality of voice slices and the same voice sample to obtain the similarity between the target voice segment and the voice sample;
and determining the emotion type of the target voice fragment based on the similarity between the target voice fragment and each voice sample and the emotion type of each voice sample.
2. The method of claim 1, wherein after determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample, the method further comprises:
determining the emotion types of the voice slices based on the similarity between each voice slice in the voice slices and each voice sample in the voice samples with known emotion types;
determining the emotion change trend of the target voice segment based on the emotion type of each voice slice;
and determining the service level of the speaker of the target voice segment corresponding to the customer service staff based on the emotion change trend of the target voice segment.
3. The method for recognizing speech emotion according to claim 2, wherein the determining the service level of the speaker of the target speech segment corresponding to the customer service person based on the emotion change tendency of the target speech segment comprises:
if the emotion change trend changes from an active state to a passive state or changes from a neutral state to the passive state, determining that the service level of the customer service personnel is low;
if the emotion change trend is changed from a negative state to an active state or from a neutral state to an active state, determining that the service level of the customer service staff is high;
wherein the positive state comprises happy, and the negative state comprises angry, fear, sadness or surprise.
4. The speech emotion recognition method of claim 1, wherein after determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample, the method further comprises:
recording, for a conversation between a first speaker and a second speaker within a set time period, the emotion types of a plurality of speech segments of the first speaker and the emotion types of a plurality of speech segments of the second speaker; wherein the speech segments of the first speaker correspond one-to-one to the speech segments of the second speaker;
if the emotion type of the first speaker undergoes a set change while the emotion type of the second speaker remains neutral, determining that the second speaker is a robot; wherein the set change comprises one of: from a positive state to a negative state, from neutral to a negative state, from a negative state to a positive state, and from neutral to a positive state.
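The robot-detection rule of claim 4 can be sketched as a predicate over the two speakers' emotion sequences; the valence mapping and function name are illustrative assumptions.

```python
VALENCE = {"happy": "positive", "neutral": "neutral",
           "angry": "negative", "fear": "negative",
           "sadness": "negative", "surprise": "negative"}

# The four "set changes" enumerated in claim 4.
SET_CHANGES = {("positive", "negative"), ("neutral", "negative"),
               ("negative", "positive"), ("neutral", "positive")}

def is_robot(first_emotions, second_emotions):
    """Flag the second speaker as a robot when the first speaker's emotion
    undergoes a set change while the second speaker stays neutral throughout."""
    change = (VALENCE[first_emotions[0]], VALENCE[first_emotions[-1]])
    return change in SET_CHANGES and all(e == "neutral" for e in second_emotions)

print(is_robot(["neutral", "angry"], ["neutral", "neutral"]))  # True
```

The intuition is that a human agent's affect normally tracks the other party's, so a speaker whose emotion stays perfectly flat through the other party's emotional swings is likely synthetic.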
5. The speech emotion recognition method of claim 1, wherein after determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample, the method further comprises:
recording a plurality of speech segments of a same speaker within a set time period and the emotion types of the plurality of speech segments;
determining an emotion change trend of the speaker based on the emotion types of the speech segments;
determining speech feature parameters of the speaker based on the plurality of speech segments, wherein the speech feature parameters comprise a speech rate, a pitch, a speech volume, a short-time zero-crossing rate and a volume;
if the emotion change trend of the speaker remains neutral, the absolute value of the short-time zero-crossing rate is smaller than a set value, and the volume is smaller than a set volume, determining that the speaker is at risk of Parkinsonism;
and if the emotion change trend of the speaker remains neutral, the speech rate is lower than a set speech rate, the pitch is lower than a set pitch, and the speech volume is lower than a set speech volume, determining that the speaker is at risk of depression.
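The two screening rules of claim 5 are threshold comparisons over the speech feature parameters. A sketch, where every threshold default is an illustrative placeholder (the patent only speaks of "set" values and does not disclose them):

```python
def health_risk(trend_is_neutral, speech_rate, pitch, speech_volume,
                zero_crossing_rate, volume,
                zcr_limit=0.05, volume_limit=40.0,       # hypothetical thresholds
                rate_limit=2.5, pitch_limit=120.0,
                speech_volume_limit=45.0):
    """Apply claim 5's two risk rules; returns a risk label or None."""
    if not trend_is_neutral:
        return None  # both rules require a persistently neutral trend
    if abs(zero_crossing_rate) < zcr_limit and volume < volume_limit:
        return "possible Parkinsonism risk"
    if (speech_rate < rate_limit and pitch < pitch_limit
            and speech_volume < speech_volume_limit):
        return "possible depression risk"
    return None
```

Persistently neutral affect combined with abnormally low vocal energy or slowed, flattened speech is what the claim treats as the screening signal; the rules are screening heuristics, not a diagnosis.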
6. The speech emotion recognition method of claim 1, wherein the speech feature values comprise Mel-frequency cepstral coefficients, a short-time zero-crossing rate, a short-time Fourier transform spectrum, a decibel volume and a constant-Q transform spectrum;
the similarity calculation methods comprise a cosine similarity algorithm, an adjusted cosine similarity algorithm, a Euclidean distance algorithm and a Jaccard similarity algorithm;
after determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample, the method further comprises:
step one: recording M speech segments of a same speaker in a target scene, wherein M ≥ 2;
step two: determining the emotion types of the M speech segments with a combined speech emotion recognition algorithm, the combined algorithm comprising one type of speech feature value and one similarity calculation method;
step three: determining that the recognized emotion types of K of the M speech segments agree with the actually observed emotion types, and taking R = K/M as the consistency score of the combined algorithm, wherein 0 < K ≤ M;
step four: changing the type of speech feature value or the similarity calculation method of the combined algorithm, and repeating steps two and three until every combined speech emotion recognition algorithm has been evaluated;
step five: determining the combined algorithm with the highest consistency score as the speech emotion recognition algorithm best suited to the target scene and the speaker.
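Steps one to five amount to a grid search over (feature type, similarity method) pairs scored by agreement R = K/M with observed emotions. A toy sketch with an identity feature extractor and a nearest-prototype classifier; the prototypes, extractors, and names are illustrative assumptions, not the patent's feature set.

```python
import itertools
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def neg_euclidean(a, b):
    # Negated distance so that "larger is more similar" for every measure.
    return -float(np.linalg.norm(a - b))

PROTOTYPES = {"happy": np.array([1.0, 0.0]), "angry": np.array([0.0, 1.0])}

def classify(vec, sim):
    # Nearest-prototype classifier: pick the emotion whose prototype is most similar.
    return max(PROTOTYPES, key=lambda e: sim(vec, PROTOTYPES[e]))

def best_combination(segments, observed, extractors, similarities):
    """Steps one to five of claim 6: score every (feature, similarity)
    combination by its agreement R = K / M and keep the best one."""
    best, best_r = None, -1.0
    for feat, sim in itertools.product(extractors, similarities):
        predicted = [classify(extractors[feat](s), similarities[sim])
                     for s in segments]
        k = sum(p == o for p, o in zip(predicted, observed))  # step three
        r = k / len(segments)
        if r > best_r:
            best, best_r = (feat, sim), r
    return best, best_r

segments = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]  # M = 2 toy "segments"
observed = ["happy", "angry"]                            # actually observed types
extractors = {"identity": lambda s: s}
similarities = {"cosine": cosine, "euclidean": neg_euclidean}
combo, r = best_combination(segments, observed, extractors, similarities)
print(combo, r)
```

Because the score is a per-speaker, per-scene agreement rate, the search personalizes the recognizer: different speakers or acoustic scenes can end up with different winning (feature, similarity) pairs.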
7. The speech emotion recognition method according to any one of claims 1 to 6, wherein the determining the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample comprises:
determining the emotion type corresponding to the speech sample with the greatest similarity as the emotion type of the target speech segment.
8. A speech emotion recognition apparatus, comprising:
a communication module configured to acquire a target speech segment;
a processing module configured to divide the target speech segment into a plurality of speech slices and extract a speech feature value of each speech slice; calculate, based on the speech feature values of the plurality of speech slices, a similarity between each of the plurality of speech slices and each of a plurality of speech samples of known emotion type; sum the similarities between the plurality of speech slices and a same speech sample to obtain a similarity between the target speech segment and that speech sample; and determine the emotion type of the target speech segment based on the similarity between the target speech segment and each speech sample and the emotion type of each speech sample.
9. An electronic device, comprising a memory storing a computer program and a processor configured to invoke and execute the computer program stored in the memory to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211014757.7A CN115410602A (en) | 2022-08-23 | 2022-08-23 | Voice emotion recognition method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115410602A true CN115410602A (en) | 2022-11-29 |
Family
ID=84161666
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211014757.7A Pending CN115410602A (en) | 2022-08-23 | 2022-08-23 | Voice emotion recognition method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115410602A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116713892A (en) * | 2023-08-10 | 2023-09-08 | 北京特思迪半导体设备有限公司 | Endpoint detection method and apparatus for wafer film grinding |
CN116713892B (en) * | 2023-08-10 | 2023-11-10 | 北京特思迪半导体设备有限公司 | Endpoint detection method and apparatus for wafer film grinding |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107610715B (en) | Similarity calculation method based on multiple sound characteristics | |
CN109256138B (en) | Identity verification method, terminal device and computer readable storage medium | |
CN109147796B (en) | Speech recognition method, device, computer equipment and computer readable storage medium | |
CN110459241B (en) | Method and system for extracting voice features | |
US20210193149A1 (en) | Method, apparatus and device for voiceprint recognition, and medium | |
JP2018521366A (en) | Method and system for decomposing acoustic signal into sound object, sound object and use thereof | |
Jeevan et al. | Robust speaker verification using GFCC based i-vectors | |
Müller et al. | Contextual invariant-integration features for improved speaker-independent speech recognition | |
CN112382300A (en) | Voiceprint identification method, model training method, device, equipment and storage medium | |
Alam et al. | Low-variance multitaper mel-frequency cepstral coefficient features for speech and speaker recognition systems | |
CN115410602A (en) | Voice emotion recognition method and device and electronic equipment | |
CN108369803A (en) | The method for being used to form the pumping signal of the parameter speech synthesis system based on glottal model | |
CN112863517B (en) | Speech recognition method based on perceptual spectrum convergence rate | |
Nirjon et al. | sMFCC: exploiting sparseness in speech for fast acoustic feature extraction on mobile devices--a feasibility study | |
Khanna et al. | Application of vector quantization in emotion recognition from human speech | |
WO2019229738A1 (en) | System for decomposition of digital sound samples into sound objects | |
AU2018102038A4 (en) | A Speaker Identification Method Based on DTW Algorithm | |
Renisha et al. | Cascaded Feedforward Neural Networks for speaker identification using Perceptual Wavelet based Cepstral Coefficients | |
CN112233693B (en) | Sound quality evaluation method, device and equipment | |
Průša et al. | Non-iterative filter bank phase (re) construction | |
Upadhyay et al. | Robust recognition of English speech in noisy environments using frequency warped signal processing | |
Alkhatib et al. | Voice identification using MFCC and vector quantization | |
CN110767238B (en) | Blacklist identification method, device, equipment and storage medium based on address information | |
Chougule et al. | Filter bank based cepstral features for speaker recognition | |
Meriem et al. | New front end based on multitaper and gammatone filters for robust speaker verification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||