CN113241095A - Conversation emotion real-time recognition method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113241095A
Authority
CN
China
Prior art keywords
emotion
voice
speech
output quantity
call
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110706524.2A
Other languages
Chinese (zh)
Other versions
CN113241095B (en)
Inventor
曹磊 (Cao Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202110706524.2A
Publication of CN113241095A
Application granted
Publication of CN113241095B
Legal status: Active (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L 25/45 - Speech or voice analysis techniques characterised by the type of analysis window
    • G10L 25/90 - Pitch determination of speech signals

Abstract

The invention relates to the field of artificial intelligence and provides a method, an apparatus, computer equipment and a storage medium for real-time recognition of call emotion. The method is suited to analyzing dynamic emotion fluctuation in a turn-based conversation scene rather than simply counting a static emotion value for the user.

Description

Conversation emotion real-time recognition method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for recognizing conversation emotion in real time, computer equipment and a storage medium.
Background
In many industries, such as banking and telecommunications, recognizing the user's emotion during a telephone call between customer service personnel and the user is very important: the user's intention can be inferred from the emotion, so that the customer service agent can adjust the communication approach in time. In addition, a clearly expressed emotion can itself be treated as an intention and handled in a targeted manner.
In the prior art, the speech rate features of a user are extracted based on a call recording, and the emotion of the user is identified according to the speech rate features. However, when the speech rate characteristics of the user are not obvious, the emotion of the user cannot be accurately recognized.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device and a storage medium for real-time call emotion recognition, which can dynamically and accurately analyze the emotion of a user in real time.
The first aspect of the invention provides a method for identifying conversation emotion in real time, which comprises the following steps:
setting a plurality of sliding window lengths, and dividing a plurality of training voices by using each sliding window length to obtain a plurality of training voice fragments of each training voice;
for each sliding window length, recognizing the training voice segments by using an emotion recognition model to obtain emotion recognition results, calculating the emotion recognition accuracy according to the emotion recognition results, and determining the sliding window length corresponding to the highest emotion recognition accuracy as a target sliding window length;
sampling user call voice according to the target sliding window length to obtain a plurality of call voice fragments, and identifying the call text of each call voice fragment;
determining the number of syllables and the tone value of the pronunciation of each text character in the call text, calculating a sound sample value according to the number of syllables and the tone value of each text character, and calculating the voice output quantity of each call voice segment based on the sound sample value;
segmenting the voice output quantity into a first voice output quantity and a second voice output quantity according to the voice output quantity and the voice speed of each conversation voice fragment;
calculating a reference emotion value based on the first voice output quantity, and calculating a real-time emotion value based on the second voice output quantity;
and analyzing the emotion of the user according to the reference emotion value and the real-time emotion value.
In an optional embodiment, the segmenting the speech output quantity into a first speech output quantity and a second speech output quantity according to the speech output quantity and the speech speed of each call speech segment includes:
calculating a first voice output quantity amplitude difference value of the voice output quantity of the second call voice segment and the voice output quantity of the first call voice segment;
calculating a first speech speed amplitude difference value of the speech speed of the second call speech segment and the speech speed of the first call speech segment;
when a first mean value of the first voice output quantity amplitude difference value and the first speech speed amplitude difference value is smaller than a preset threshold value, calculating a second voice output quantity amplitude difference value of the voice output quantity of a third conversation voice fragment and the voice output quantity of the first conversation voice fragment, and calculating a second speech speed amplitude difference value of the speech speed of the third conversation voice fragment and the speech speed of the first conversation voice fragment;
and when a second average value of the second speech output quantity amplitude difference value and the second speech speed amplitude difference value is greater than the preset threshold value, segmenting the speech output quantity into a first speech output quantity and a second speech output quantity by taking the third conversation speech segment as a segmentation point.
In an alternative embodiment, the sound sample value of each text character is calculated from the number of syllables and the tone value using the following formula:
D = (X1² + X2²)^(1/2), where D is the sound sample value, X1 represents the number of syllables, and X2 represents the tone value.
In an alternative embodiment, said calculating a real-time emotion value based on said second speech output quantity comprises:
taking the current second voice output quantity as the center of a preset window, and acquiring the remaining second voice output quantities corresponding to the preset window;
calculating the variance of the current second voice output quantity and the remaining second voice output quantities;
and calculating a real-time emotion value according to the variance and the current second voice output quantity.
In an alternative embodiment, the real-time emotion value is calculated from the variance and the current second voice output quantity using the following formula:
Cn = Bn × Sn, where Bn is the second voice output quantity corresponding to the nth call voice segment, and Sn is the variance of Bn and the two adjacent second voice output quantities before and after it.
In an optional embodiment, the method further comprises:
mapping the user emotion corresponding to each conversation voice segment into an emotion score;
calculating the weighted sum of the emotion scores according to the weights corresponding to the call content categories to obtain a first evaluation score;
acquiring customer service conversation accuracy corresponding to each conversation voice fragment in the conversation process;
calculating the weighted sum of the customer service conversation accuracy according to the weight corresponding to the conversation content category to obtain a second evaluation score;
and calculating to obtain a service quality evaluation result of the customer service according to the first evaluation score and the second evaluation score.
In an optional embodiment, the method further comprises:
creating a user portrait;
classifying the user emotions, and counting the number of each kind of user emotion;
using the user emotion and the corresponding number as emotion labels of the user portrait;
and matching the target customer service according to the emotion label of the user portrait when the incoming call of the user is detected again.
A second aspect of the present invention provides a conversation emotion real-time recognition apparatus, including:
the voice segmentation module is used for setting a plurality of sliding window lengths, and segmenting a plurality of training voices by using each sliding window length to obtain a plurality of training voice segments of each training voice;
the window length determining module is used for recognizing the training voice fragments by using an emotion recognition model according to each sliding window length to obtain emotion recognition results, calculating emotion recognition accuracy according to the emotion recognition results, and determining the sliding window length corresponding to the highest value in the emotion recognition accuracy as a target sliding window length;
the call sampling module is used for sampling the call voice of the user according to the target sliding window length to obtain a plurality of call voice fragments and identifying the call text of each call voice fragment;
the speech volume calculation module is used for determining the number of syllables and the tone value of the pronunciation of each text character in the call text, calculating a sound sample value according to the number of syllables and the tone value of each text character, and calculating the voice output quantity of each call voice segment based on the sound sample value;
the voice volume segmentation module is used for segmenting the voice output volume into a first voice output volume and a second voice output volume according to the voice output volume and the voice speed of each conversation voice segment;
the emotion calculation module is used for calculating a reference emotion value based on the first voice output quantity and calculating a real-time emotion value based on the second voice output quantity;
and the emotion analysis module is used for analyzing the emotion of the user according to the reference emotion value and the real-time emotion value.
A third aspect of the invention provides a computer device comprising a processor for implementing the call emotion real-time recognition method when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the call emotion real-time recognition method.
In summary, the method, apparatus, computer device and storage medium for real-time call emotion recognition are suited to a call conversation scene: emotion fluctuation is analyzed through changes in the speech output quantity during the call, so that the dynamic change of the user's emotion in the call is obtained, rather than a simple count of the user's emotion value. A plurality of sliding window lengths are verified with an emotion recognition model to determine the target sliding window length most suitable for segmenting speech, and the user's call voice is sampled with the target sliding window length to obtain call voice segments. The number of syllables and the tone value of the pronunciation of each text character in the call text are then determined, a sound sample value is calculated from the number of syllables and the tone value of each character, and the speech output quantity of each call voice segment is calculated from the sound sample values. The speech output quantity is segmented into a first speech output quantity and a second speech output quantity according to the speech output quantity and the speech speed of each call voice segment, and a reference emotion value of the user is calculated from the first speech output quantity. Analyzing the user's emotion in real time against this reference standard is therefore more objective and accurate.
Drawings
Fig. 1 is a flowchart of a method for recognizing call emotion in real time according to an embodiment of the present invention.
FIG. 2 is a diagram of a syllable table according to a second embodiment of the present invention.
Fig. 3 is a structural diagram of a real-time call emotion recognition apparatus according to a second embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The method for recognizing the conversation emotion in real time provided by the embodiment of the invention is executed by the computer equipment, and correspondingly, the conversation emotion real-time recognition device runs in the computer equipment.
Fig. 1 is a flowchart of a method for recognizing call emotion in real time according to an embodiment of the present invention. The real-time call emotion recognition method specifically comprises the following steps, and the sequence of the steps in the flowchart can be changed and some steps can be omitted according to different requirements.
S11, a plurality of sliding window lengths are set, and a plurality of training voices are segmented using each sliding window length to obtain a plurality of training voice segments for each training voice.
The training speech is pre-collected speech used to determine a suitable sliding window length. The plurality of sliding window lengths may form an arithmetic progression. Each sliding window length is slid over each training speech without overlap, thereby segmenting each training speech into multiple training speech segments.
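As an illustration of this step, a minimal sketch in Python follows. It assumes each training speech is available as a raw sample array with a known sample rate and that window lengths are given in seconds; all function and variable names are hypothetical.

```python
import numpy as np

def split_into_segments(samples: np.ndarray, sample_rate: int, window_seconds: float):
    """Split one training speech into non-overlapping segments of window_seconds each.

    The trailing remainder shorter than one window is dropped here; the text does
    not fix this detail, so dropping it is an assumption.
    """
    window_len = int(window_seconds * sample_rate)
    n_segments = len(samples) // window_len
    return [samples[i * window_len:(i + 1) * window_len] for i in range(n_segments)]

# Candidate sliding window lengths forming an arithmetic progression (seconds).
candidate_windows = [1.0, 1.5, 2.0, 2.5, 3.0]

def segment_corpus(training_speeches, windows):
    """training_speeches: list of (samples, sample_rate) tuples."""
    return {
        w: [split_into_segments(s, sr, w) for s, sr in training_speeches]
        for w in windows
    }
```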
And S12, recognizing the training voice fragments by using an emotion recognition model according to each sliding window length to obtain emotion recognition results, calculating emotion recognition accuracy according to the emotion recognition results, and determining the sliding window length corresponding to the highest value in the emotion recognition accuracy as a target sliding window length.
The emotion recognition model is a model for recognizing the emotion of a speech segment, which is trained in advance by using a neural network (e.g., a convolutional neural network) as a network framework, and the training process is the prior art and is not described in detail. And for any sliding window length, recognizing the training voice segments of each training voice by using an emotion recognition model to obtain a plurality of emotion recognition results of each training voice, and calculating the emotion recognition accuracy of each training voice according to the emotion recognition results and the actual emotion result corresponding to each emotion recognition result.
And for any sliding window length, carrying out average calculation on the emotion recognition accuracy of all the training voices to obtain the average emotion recognition accuracy. And one sliding window length corresponds to one average emotion recognition accuracy, and the sliding window length corresponding to the maximum average emotion recognition accuracy is taken as the target sliding window length.
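Continuing the segmentation sketch above, selecting the target sliding window length could look roughly as follows. The `emotion_model.predict` interface and the layout of the ground-truth labels are assumptions, not part of the description.

```python
def accuracy_for_window(segmented_speeches, true_labels, emotion_model):
    """Average emotion recognition accuracy over all training speeches for one window length.

    segmented_speeches: one list of segments per training speech.
    true_labels: ground-truth emotion labels aligned with those segments.
    """
    per_speech_acc = []
    for segments, labels in zip(segmented_speeches, true_labels):
        predictions = [emotion_model.predict(seg) for seg in segments]
        correct = sum(p == t for p, t in zip(predictions, labels))
        per_speech_acc.append(correct / len(segments))
    return sum(per_speech_acc) / len(per_speech_acc)

def pick_target_window(segmented_by_window, labels_by_window, emotion_model):
    # The window length with the highest average accuracy becomes the target window.
    return max(
        segmented_by_window,
        key=lambda w: accuracy_for_window(
            segmented_by_window[w], labels_by_window[w], emotion_model
        ),
    )
```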
S13, sampling the user call voice according to the target sliding window length to obtain a plurality of call voice segments, and identifying the call text of each call voice segment.
In the prior art, the window used to sample a user's call voice is a preset fixed value that cannot be changed: if it is set too large, the accuracy of subsequent emotion recognition suffers; if it is set too small, subsequent emotion recognition becomes too time-consuming to run in real time. In this embodiment, a plurality of sliding window lengths are set and verified with the emotion recognition model to determine the most suitable target sliding window length. Sampling the user's call voice with the target sliding window length yields call voice segments that can be regarded as having the minimum granularity best suited to analyzing the user's emotion, which ensures both the accuracy and the real-time performance of user emotion recognition.
And S14, determining the number of syllables and the tone value of the pronunciation of each text character in the call text, calculating a sound sample value according to the number of syllables and the tone value of each text character, and calculating the speech output quantity of each call speech segment based on the sound sample value.
The computer device may perform voice separation on the complete call, for example, the voice separation technology may be used to perform voice separation on the complete call between the customer service and the user to obtain a customer service call voice segment and a user call voice segment, and then the voice recognition technology is used to recognize a call text of each call voice segment, where each call text includes a plurality of text characters. Sampling is carried out from the initial position of the user call voice, the user call voice is collected once every target sliding window length, and the call voice segment collected each time is marked as A.
Since the invention recognizes the user's emotion, the user's call voice is collected; if the customer service agent's emotion needs to be recognized, the agent's call voice can be collected instead, and in other scenes both can be collected at the same time. The computer device is provided with a text pronunciation database in advance, from which the pronunciation of each text character can be determined. The Chinese syllable table shown in FIG. 2, which contains 402 entries covering single-letter, two-letter and three-letter pinyin sounds, is then used to determine the syllables of each text character. Different pronunciations correspond to different numbers of syllables and different tone values, where the tone value is the number of the tone. For example, the text character "wide" is pronounced "guang", which corresponds to 3 syllables and the 3rd tone, so its tone value is 3; the text character "east" is pronounced "dong", which corresponds to 2 syllables and the 1st tone, so its tone value is 1.
In an alternative embodiment, the sound sample value of each text character may be calculated from the number of syllables and the tone value using the following formula: D = (X1² + X2²)^(1/2), where D is the sound sample value, X1 represents the number of syllables, and X2 represents the tone value.
Because different text characters are pronounced differently, the time their pronunciation consumes also differs; for example, the text character "wide" takes more time to pronounce than the text character "east". The sound sample value calculated from the number of syllables and the tone value of the pronunciation is used to represent how much time the pronunciation of a text character consumes: the larger the number of syllables and the tone value, the larger the calculated sound sample value and the more time the pronunciation takes; the smaller the number of syllables and the tone value, the smaller the sound sample value and the less time the pronunciation takes.
The computer device extracts the sound sample value of each text character and sums the sound sample values of all the text characters of each call text to obtain the speech output quantity of that call text, i.e. the speech output quantity corresponding to each A; this quantity is not simply the number of text characters.
In an alternative embodiment, the speech output quantity may be calculated as B = sum(L(i)), where B denotes the speech output quantity, i indexes the text characters of the call text, and L(i) is the sound sample value of the i-th text character.
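Taken together, the two formulas above can be sketched as follows. The pronunciation lookup is assumed to return the number of syllables and the tone value of each character, which in practice would come from the pre-built text pronunciation database and the syllable table of FIG. 2.

```python
import math

def sound_sample_value(num_syllables: int, tone_value: int) -> float:
    """D = (X1^2 + X2^2)^(1/2), with X1 the number of syllables and X2 the tone value."""
    return math.sqrt(num_syllables ** 2 + tone_value ** 2)

def speech_output_quantity(call_text: str, pronunciation_db: dict) -> float:
    """B = sum(L(i)): the speech output quantity of one call text.

    pronunciation_db is assumed to map a character to (num_syllables, tone_value),
    e.g. "wide" (guang, 3rd tone) -> (3, 3) and "east" (dong, 1st tone) -> (2, 1).
    """
    total = 0.0
    for char in call_text:
        num_syllables, tone_value = pronunciation_db[char]
        total += sound_sample_value(num_syllables, tone_value)
    return total
```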
Generally, the more stable a person's emotion, the more stable the speech rate and the more balanced the speech output. When a person is angry or agitated, however, the speech rate may remain stable, for example when speaking word by word, yet the tone of the speech becomes heavier, and that tone is reflected in the syllables and tones of the pronunciation. The speech output quantity calculated from the number of syllables and the tone value can therefore adequately reflect a person's emotion.
In the prior art, features of voice are mostly extracted by training an emotion recognition model and emotion is obtained by emotion mapping, however, the extracted features do not always reflect emotion of a user obviously, and emotion recognition accuracy is low. In the embodiment, the change of the emotion of the user is effectively represented by calculating the voice output quantity of each call voice segment and comparing the voice output quantities of different call voice segments, rather than directly calculating a specific emotion value, so that the emotion recognition and analysis are more objective and higher in accuracy.
And S15, segmenting the voice output quantity into a first voice output quantity and a second voice output quantity according to the voice output quantity and the speed of each conversation voice segment.
Each A corresponds to one B. Taking A as the abscissa and B as the ordinate, a speech output fluctuation curve can be created. The speech output quantity is segmented according to this curve: all B values are sorted in chronological order, a segmentation point is searched for, and the B values are split, with the B values before the segmentation point called the first speech output quantity and the B values from the segmentation point onward called the second speech output quantity.
In an optional embodiment, the segmenting the speech output quantity into a first speech output quantity and a second speech output quantity according to the speech output quantity and the speech speed of each call speech segment includes:
calculating a first voice output quantity amplitude difference value of the voice output quantity of the second call voice segment and the voice output quantity of the first call voice segment;
calculating a first speech speed amplitude difference value of the speech speed of the second call speech segment and the speech speed of the first call speech segment;
when a first mean value of the first voice output quantity amplitude difference value and the first speech speed amplitude difference value is smaller than a preset threshold value, calculating a second voice output quantity amplitude difference value of the voice output quantity of a third conversation voice fragment and the voice output quantity of the first conversation voice fragment, and calculating a second speech speed amplitude difference value of the speech speed of the third conversation voice fragment and the speech speed of the first conversation voice fragment;
and when a second average value of the second speech output quantity amplitude difference value and the second speech speed amplitude difference value is greater than the preset threshold value, segmenting the speech output quantity into a first speech output quantity and a second speech output quantity by taking the third conversation speech segment as a segmentation point.
After calculating the first speech speed amplitude difference value between the speech speed of the second call voice segment and that of the first call voice segment, the computer device judges whether the first mean value of the first speech output quantity amplitude difference value and the first speech speed amplitude difference value is smaller than the preset threshold value. When the first mean value is not smaller than the preset threshold value, the computer device segments the speech output quantity into a first speech output quantity and a second speech output quantity by taking the second call voice segment as the segmentation point. At this time, the first speech output quantity comprises the speech output quantity of the first call voice segment, and the second speech output quantity comprises the speech output quantity of the second call voice segment and of the call voice segments after it.
When the second mean value of the second speech output quantity amplitude difference value and the second speech speed amplitude difference value is smaller than the preset threshold value, a third speech output quantity amplitude difference value between the speech output quantity of the fourth call voice segment and that of the first call voice segment is calculated, and a third speech speed amplitude difference value between the speech speed of the fourth call voice segment and that of the first call voice segment is calculated; when the third mean value of the third speech output quantity amplitude difference value and the third speech speed amplitude difference value is greater than the preset threshold value, the speech output quantity is segmented into the first speech output quantity and the second speech output quantity by taking the fourth call voice segment as the segmentation point. At this time, the first speech output quantity includes the speech output quantities of the first, second and third call voice segments, and the second speech output quantity includes the speech output quantity of the fourth call voice segment and of the call voice segments after it.
And when a second average value of the second speech output quantity amplitude difference value and the second speech speed amplitude difference value is greater than the preset threshold value, segmenting the speech output quantity into a first speech output quantity and a second speech output quantity by taking the third conversation speech segment as a segmentation point. At this time, the first voice output quantity includes a voice output quantity of a first call voice segment and a voice output quantity of a second call voice segment, and the second voice output quantity includes a voice output quantity of a third call voice segment and a voice output quantity of a call voice segment after the third call voice segment.
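Generalising the three cases above, the segmentation point is the first call voice segment whose mean amplitude difference against the first segment reaches the preset threshold. A sketch under that reading, assuming the per-segment speech output quantities and speech speeds are already available and that the amplitude differences are taken as absolute values:

```python
def split_speech_output(outputs, speech_speeds, threshold):
    """Split per-segment speech output quantities into a baseline part and a real-time part.

    outputs[k] and speech_speeds[k] describe the (k+1)-th call voice segment; the
    first segment serves as the reference point. Returns (first_outputs, second_outputs).
    """
    b0, v0 = outputs[0], speech_speeds[0]
    for k in range(1, len(outputs)):
        output_diff = abs(outputs[k] - b0)        # speech output amplitude difference
        speed_diff = abs(speech_speeds[k] - v0)   # speech speed amplitude difference
        if (output_diff + speed_diff) / 2 >= threshold:
            # Segment k is the segmentation point: the segments before it form the
            # first speech output quantity, the rest form the second.
            return outputs[:k], outputs[k:]
    # No segment reached the threshold: treat the whole call as baseline.
    return outputs, []
```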
Generally, in the first few seconds or even minutes of a call between the user and the customer service agent, the user's emotion is relatively stable: the speech output quantity is not very large and the speech speed is not very fast. The first call voice segment can therefore be used as a reference point, and the difference between the speech output quantity of each call voice segment and that of the reference point, and between the speech speed of each call voice segment and that of the reference point, can be calculated. If the speech output amplitude difference or the speech speed amplitude difference is large, the user's emotion is relatively agitated; if both are small, the user's emotion is relatively stable.
In this optional embodiment, the mean value of the speech output quantity amplitude difference and the speech speed amplitude difference is calculated, so that the correlation of both the speech output quantity and the speech speed with emotion is considered jointly; segmenting the speech output quantity in this way is more accurate than considering the speech speed alone.
And S16, calculating a reference emotion value based on the first voice output quantity, and calculating a real-time emotion value based on the second voice output quantity.
The mean value of the first speech output quantity is taken as the reference emotion value. For example, if the first speech output quantity includes the speech output quantities B1, B2 and B3 of the first 3 call voice segments, the reference emotion value is C0 = (B1 + B2 + B3)/3.
In an alternative embodiment, said calculating a real-time emotion value based on said second speech output quantity comprises:
taking the current second voice output quantity as the center of a preset window, and acquiring the remaining second voice output quantities corresponding to the preset window;
calculating the variance of the current second voice output quantity and the remaining second voice output quantities;
and calculating a real-time emotion value according to the variance and the current second voice output quantity.
The real-time emotion value of each subsequent second speech output quantity is denoted Cn; for example, the real-time emotion value corresponding to the 4th call voice segment is denoted C4, and the real-time emotion value corresponding to the 5th call voice segment is denoted C5.
The real-time emotion value Cn is calculated as Cn = Bn × Sn, where Bn is the second speech output quantity corresponding to the nth call voice segment, and Sn is the variance of Bn together with the two adjacent second speech output quantities on each side (i.e. the sequence Bn-2, Bn-1, Bn, Bn+1, Bn+2). The larger Sn is, the larger the variation of Bn relative to the values before and after it; the smaller Sn is, the smaller that variation.
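A sketch of this step under the formulas above, using a five-value window (two neighbours on each side) for the variance; how segments near the boundaries, which have fewer neighbours, should be handled is not specified in the text and is treated loosely here.

```python
import statistics

def reference_emotion_value(first_outputs):
    """C0: the mean of the baseline (first) speech output quantities."""
    return sum(first_outputs) / len(first_outputs)

def real_time_emotion_values(all_outputs, split_index):
    """Cn = Bn * Sn for every segment from the segmentation point onward.

    Sn is the variance of Bn together with its two neighbours on each side,
    i.e. the window Bn-2, Bn-1, Bn, Bn+1, Bn+2, truncated at the call boundaries.
    """
    values = {}
    for n in range(split_index, len(all_outputs)):
        window = all_outputs[max(0, n - 2):n + 3]
        sn = statistics.pvariance(window)
        values[n] = all_outputs[n] * sn
    return values
```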
And S17, analyzing the emotion of the user according to the reference emotion value and the real-time emotion value.
The computer device calculates the emotion change rate from the reference emotion value and the real-time emotion value, and analyzes the user's emotion according to the emotion change rate, which is calculated as Xn = Cn - C0. The larger the emotion change rate Xn is, the more the user's emotion tends to be intense; the smaller Xn is, the milder the user's emotion tends to be.
When the user's emotion tends to be intense, the emotion can be displayed in real time on the customer service agent's terminal as a reminder, so that the agent can adjust the conversation strategy.
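A minimal continuation of the sketch above for this analysis step; the alert threshold used to decide when to remind the agent is an illustrative assumption.

```python
def analyse_emotion(c0, real_time_values, alert_threshold=50.0):
    """Xn = Cn - C0; larger values indicate the user's emotion is trending stronger."""
    alerts = []
    for n, cn in sorted(real_time_values.items()):
        xn = cn - c0
        if xn > alert_threshold:     # threshold value is an assumption
            alerts.append((n, xn))   # segments worth surfacing on the agent's screen
    return alerts
```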
In an optional embodiment, the method further comprises: mapping the user emotion corresponding to each conversation voice segment into an emotion score; calculating the weighted sum of the emotion scores according to the weights corresponding to the call content categories to obtain a first evaluation score; acquiring customer service conversation accuracy corresponding to each conversation voice fragment in the conversation process; calculating the weighted sum of the customer service conversation accuracy according to the weight corresponding to the conversation content category to obtain a second evaluation score; and calculating to obtain a service quality evaluation result of the customer service according to the first evaluation score and the second evaluation score.
Each user emotion has a corresponding emotion score, for example, a pleasant emotion corresponds to an emotion score of 90, an unpleasant emotion corresponds to an emotion score of 60, and an angry emotion corresponds to an emotion score of 30. In practical application, the emotion score and the accuracy can be normalized respectively, and then the evaluation score can be calculated. And calculating the sum of the first evaluation score and the second evaluation score to obtain a service quality evaluation result of the customer service of the call.
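A sketch of this optional quality evaluation, assuming the emotion scores and accuracies have already been normalised to comparable scales and that each call voice segment carries a call content category with an associated weight; the data layout is hypothetical.

```python
def service_quality_score(emotions, accuracies, categories, category_weights, emotion_scores):
    """Weighted first and second evaluation scores, summed into the final quality result.

    emotions[i], accuracies[i] and categories[i] describe the i-th call voice segment;
    emotion_scores maps an emotion label to a score (e.g. pleasant -> 90, angry -> 30).
    """
    first_score = sum(
        category_weights[c] * emotion_scores[e] for e, c in zip(emotions, categories)
    )
    second_score = sum(
        category_weights[c] * a for a, c in zip(accuracies, categories)
    )
    return first_score + second_score
```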
In an optional embodiment, the method further comprises: creating a user portrait; classifying the user emotions and counting the number of each kind of user emotion; using each user emotion and its corresponding count as an emotion label of the user portrait; and matching a target customer service agent according to the emotion labels of the user portrait when an incoming call from the user is detected again.
The computer device may obtain multiple items of data about a user, create a user portrait based on these data, and, after each call, add the user emotions recognized during that call to the user portrait as emotion labels.
Matching a target customer service according to the emotion labels of the user portraits, distributing the target customer service to serve the user, and displaying the user portraits added with one or more emotion labels on a user terminal of the target customer service, so that the target customer service can better provide services.
In this optional embodiment, the target customer service agent is matched according to the emotion labels of the user portrait, so that the target agent provides service for the user; this realizes dynamic scheduling of customer service and ensures that an appropriate service can be provided for each user. For users whose emotions are mostly intense, more experienced agents can be assigned, providing more professional service quality and improving the user experience.
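A sketch of this optional portrait-based routing. The portrait structure and the matching rule (many strong-emotion calls route to senior agents) are illustrative assumptions rather than details fixed by the description.

```python
from collections import Counter

def add_emotion_labels(user_portrait: dict, call_emotions: list) -> dict:
    """Attach each user emotion and its occurrence count to the user portrait as labels."""
    counts = Counter(call_emotions)
    user_portrait.setdefault("emotion_labels", {}).update(counts)
    return user_portrait

def match_target_agent(user_portrait: dict, agents: list):
    """Pick a target customer service agent for a returning caller from the emotion labels."""
    labels = user_portrait.get("emotion_labels", {})
    strong_calls = labels.get("angry", 0)
    # Assumption: users with several strong-emotion calls go to more experienced agents.
    pool = [a for a in agents if a.get("senior")] if strong_calls >= 3 else agents
    return pool[0] if pool else None
```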
In a customer service scene, the agent and the user generally converse sentence by sentence, and the business side usually focuses on how the user's emotion changes dynamically in each turn and on whether a particular reply causes emotion fluctuation. The method is therefore suited to such turn-based conversation scenes: emotion fluctuation is analyzed through changes in the speech output quantity during the call, so that the dynamic change of the user's emotion in the call is obtained, rather than a simple count of the user's emotion value. Specifically, a plurality of sliding window lengths are verified with the emotion recognition model to determine the target sliding window length most suitable for segmenting speech, and the user's call voice is sampled with the target sliding window length to obtain call voice segments. The number of syllables and the tone value of the pronunciation of each text character in the call text are then determined, a sound sample value is calculated from them, and the speech output quantity of each call voice segment is calculated from the sound sample values. The speech output quantity is segmented into a first speech output quantity and a second speech output quantity according to the speech output quantity and the speech speed of each call voice segment, and a reference emotion value of the user is calculated from the first speech output quantity. Because the reference emotion value is calculated dynamically from the user's own call voice, the reference emotion values obtained for different users differ, which gives the reference emotion value real reference significance; in addition, both the speech speed and the speech output quantity are considered when calculating the reference emotion value, which improves its accuracy. Analyzing the user's emotion in real time against this reference standard is therefore more objective and accurate.
It is emphasized that, to further ensure privacy and security, the user call voice and the calculation formula of the real-time emotion value may be stored in nodes of a blockchain.
Fig. 3 is a structural diagram of a real-time call emotion recognition apparatus according to a second embodiment of the present invention.
In some embodiments, the real-time call emotion recognition apparatus 30 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the real-time call emotion recognition apparatus 30 may be stored in a memory of a computer device and executed by at least one processor to perform the function of real-time call emotion recognition (described in detail in fig. 1).
In this embodiment, the real-time call emotion recognition apparatus 30 may be divided into a plurality of functional modules according to the functions executed by the apparatus. The functional module may include: a voice segmentation module 301, a window length determination module 302, a call sampling module 303, a speech amount calculation module 304, a speech amount segmentation module 305, an emotion calculation module 306, an emotion analysis module 307, a quality evaluation module 308, and a target matching module 309. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The speech segmentation module 301 is configured to set a plurality of sliding window lengths, and segment a plurality of training speeches using each sliding window length to obtain a plurality of training speech segments of each training speech.
The training speech is pre-collected speech used to determine a suitable sliding window length. The plurality of sliding window lengths may form an arithmetic progression. Each sliding window length is slid over each training speech without overlap, thereby segmenting each training speech into multiple training speech segments.
The window length determining module 302 is configured to identify the training speech segments by using an emotion recognition model for each sliding window length to obtain an emotion recognition result, calculate emotion recognition accuracy according to the emotion recognition result, and determine a sliding window length corresponding to a highest value in the emotion recognition accuracy as a target sliding window length.
The emotion recognition model is a model for recognizing the emotion of a speech segment, which is trained in advance by using a neural network (e.g., a convolutional neural network) as a network framework, and the training process is the prior art and is not described in detail. And for any sliding window length, recognizing the training voice segments of each training voice by using an emotion recognition model to obtain a plurality of emotion recognition results of each training voice, and calculating the emotion recognition accuracy of each training voice according to the emotion recognition results and the actual emotion result corresponding to each emotion recognition result.
And for any sliding window length, carrying out average calculation on the emotion recognition accuracy of all the training voices to obtain the average emotion recognition accuracy. And one sliding window length corresponds to one average emotion recognition accuracy, and the sliding window length corresponding to the maximum average emotion recognition accuracy is taken as the target sliding window length.
The call sampling module 303 is configured to sample a call voice of a user according to the target sliding window length to obtain a plurality of call voice segments, and identify a call text of each call voice segment.
In the prior art, the window used to sample a user's call voice is a preset fixed value that cannot be changed: if it is set too large, the accuracy of subsequent emotion recognition suffers; if it is set too small, subsequent emotion recognition becomes too time-consuming to run in real time. In this embodiment, a plurality of sliding window lengths are set and verified with the emotion recognition model to determine the most suitable target sliding window length. Sampling the user's call voice with the target sliding window length yields call voice segments that can be regarded as having the minimum granularity best suited to analyzing the user's emotion, which ensures both the accuracy and the real-time performance of user emotion recognition.
The speech volume calculating module 304 is configured to determine the number of syllables and the tone value of the pronunciation of each text character in the call text, calculate a sound sample value according to the number of syllables and the tone value of each character, and calculate the speech output quantity of each call voice segment based on the sound sample value.
The computer device may perform voice separation on the complete call, for example, the voice separation technology may be used to perform voice separation on the complete call between the customer service and the user to obtain a customer service call voice segment and a user call voice segment, and then the voice recognition technology is used to recognize a call text of each call voice segment, where each call text includes a plurality of text characters. Sampling is carried out from the initial position of the user call voice, the user call voice is collected once every target sliding window length, and the call voice segment collected each time is marked as A.
Since the invention recognizes the user's emotion, the user's call voice is collected; if the customer service agent's emotion needs to be recognized, the agent's call voice can be collected instead, and in other scenes both can be collected at the same time. The computer device is provided with a text pronunciation database in advance, from which the pronunciation of each text character can be determined. The Chinese syllable table shown in FIG. 2, which contains 402 entries covering single-letter, two-letter and three-letter pinyin sounds, is then used to determine the syllables of each text character. Different pronunciations correspond to different numbers of syllables and different tone values, where the tone value is the number of the tone. For example, the text character "wide" is pronounced "guang", which corresponds to 3 syllables and the 3rd tone, so its tone value is 3; the text character "east" is pronounced "dong", which corresponds to 2 syllables and the 1st tone, so its tone value is 1.
In an alternative embodiment, the sound sample value of each text character may be calculated from the number of syllables and the tone value using the following formula: D = (X1² + X2²)^(1/2), where D is the sound sample value, X1 represents the number of syllables, and X2 represents the tone value.
Because different text characters are pronounced differently, the time their pronunciation consumes also differs; for example, the text character "wide" takes more time to pronounce than the text character "east". The sound sample value calculated from the number of syllables and the tone value of the pronunciation is used to represent how much time the pronunciation of a text character consumes: the larger the number of syllables and the tone value, the larger the calculated sound sample value and the more time the pronunciation takes; the smaller the number of syllables and the tone value, the smaller the sound sample value and the less time the pronunciation takes.
The computer device extracts the sound sample value of each text character and sums the sound sample values of all the text characters of each call text to obtain the speech output quantity of that call text, i.e. the speech output quantity corresponding to each A; this quantity is not simply the number of text characters.
In an alternative embodiment, the speech output quantity may be calculated as B = sum(L(i)), where B denotes the speech output quantity, i indexes the text characters of the call text, and L(i) is the sound sample value of the i-th text character.
Generally, the more stable a person's emotion, the more stable the speech rate and the more balanced the speech output. When a person is angry or agitated, however, the speech rate may remain stable, for example when speaking word by word, yet the tone of the speech becomes heavier, and that tone is reflected in the syllables and tones of the pronunciation. The speech output quantity calculated from the number of syllables and the tone value can therefore adequately reflect a person's emotion.
In the prior art, features of voice are mostly extracted by training an emotion recognition model and emotion is obtained by emotion mapping, however, the extracted features do not always reflect emotion of a user obviously, and emotion recognition accuracy is low. In the embodiment, the change of the emotion of the user is effectively represented by calculating the voice output quantity of each call voice segment and comparing the voice output quantities of different call voice segments, rather than directly calculating a specific emotion value, so that the emotion recognition and analysis are more objective and higher in accuracy.
The speech volume segmentation module 305 is configured to segment the speech output volume into a first speech output volume and a second speech output volume according to the speech output volume and the speech speed of each speech segment.
Each A corresponds to one B. Taking A as the abscissa and B as the ordinate, a speech output fluctuation curve can be created. The speech output quantity is segmented according to this curve: all B values are sorted in chronological order, a segmentation point is searched for, and the B values are split, with the B values before the segmentation point called the first speech output quantity and the B values from the segmentation point onward called the second speech output quantity.
In an optional embodiment, the segmenting, by the speech volume segmentation module 305, of the speech output quantity into a first speech output quantity and a second speech output quantity according to the speech output quantity and the speech speed of each call speech segment includes:
calculating a first voice output quantity amplitude difference value of the voice output quantity of the second call voice segment and the voice output quantity of the first call voice segment;
calculating a first speech speed amplitude difference value of the speech speed of the second call speech segment and the speech speed of the first call speech segment;
when a first mean value of the first voice output quantity amplitude difference value and the first speech speed amplitude difference value is smaller than a preset threshold value, calculating a second voice output quantity amplitude difference value of the voice output quantity of a third conversation voice fragment and the voice output quantity of the first conversation voice fragment, and calculating a second speech speed amplitude difference value of the speech speed of the third conversation voice fragment and the speech speed of the first conversation voice fragment;
and when a second average value of the second speech output quantity amplitude difference value and the second speech speed amplitude difference value is greater than the preset threshold value, segmenting the speech output quantity into a first speech output quantity and a second speech output quantity by taking the third conversation speech segment as a segmentation point.
After calculating the first speech speed amplitude difference value between the speech speed of the second call voice segment and that of the first call voice segment, the computer device judges whether the first mean value of the first speech output quantity amplitude difference value and the first speech speed amplitude difference value is smaller than the preset threshold value. When the first mean value is not smaller than the preset threshold value, the computer device segments the speech output quantity into a first speech output quantity and a second speech output quantity by taking the second call voice segment as the segmentation point. At this time, the first speech output quantity comprises the speech output quantity of the first call voice segment, and the second speech output quantity comprises the speech output quantity of the second call voice segment and of the call voice segments after it.
When the second mean value of the second speech output quantity amplitude difference value and the second speech speed amplitude difference value is smaller than the preset threshold value, a third speech output quantity amplitude difference value between the speech output quantity of the fourth call voice segment and that of the first call voice segment is calculated, and a third speech speed amplitude difference value between the speech speed of the fourth call voice segment and that of the first call voice segment is calculated; when the third mean value of the third speech output quantity amplitude difference value and the third speech speed amplitude difference value is greater than the preset threshold value, the speech output quantity is segmented into the first speech output quantity and the second speech output quantity by taking the fourth call voice segment as the segmentation point. At this time, the first speech output quantity includes the speech output quantities of the first, second and third call voice segments, and the second speech output quantity includes the speech output quantity of the fourth call voice segment and of the call voice segments after it.
And when a second average value of the second speech output quantity amplitude difference value and the second speech speed amplitude difference value is greater than the preset threshold value, segmenting the speech output quantity into a first speech output quantity and a second speech output quantity by taking the third conversation speech segment as a segmentation point. At this time, the first voice output quantity includes a voice output quantity of a first call voice segment and a voice output quantity of a second call voice segment, and the second voice output quantity includes a voice output quantity of a third call voice segment and a voice output quantity of a call voice segment after the third call voice segment.
Generally, in the first few seconds or even minutes of a call between the user and the customer service agent, the user's emotion is relatively stable: the speech output quantity is not very large and the speech speed is not very fast. The first call voice segment can therefore be used as a reference point, and the difference between the speech output quantity of each call voice segment and that of the reference point, and between the speech speed of each call voice segment and that of the reference point, can be calculated. If the speech output amplitude difference or the speech speed amplitude difference is large, the user's emotion is relatively agitated; if both are small, the user's emotion is relatively stable.
In this optional embodiment, the mean value of the speech output quantity amplitude difference and the speech speed amplitude difference is calculated, so that the correlation of both the speech output quantity and the speech speed with emotion is considered jointly; segmenting the speech output quantity in this way is more accurate than considering the speech speed alone.
The emotion calculating module 306 is configured to calculate a reference emotion value based on the first speech output quantity, and calculate a real-time emotion value based on the second speech output quantity.
The mean of the first voice output quantity is taken as the reference emotion value. For example, if the first voice output quantity includes the voice output quantities B1, B2 and B3 of the first three call voice segments, the reference emotion value is C0 = (B1 + B2 + B3)/3.
In an alternative embodiment, the emotion calculation module 306 calculating the real-time emotion value based on the second speech output quantity includes:
taking the current second voice output quantity as the center of a preset window, and acquiring the rest second voice output quantity corresponding to the preset window;
calculating the variance of the current second voice output quantity and the rest second voice output quantity;
and calculating a real-time emotion value according to the variance and the current second voice output quantity.
The real-time emotion value of each second voice output quantity is denoted as Cn; for example, the real-time emotion value corresponding to the 4th call voice segment is denoted as C4, and the real-time emotion value corresponding to the 5th call voice segment is denoted as C5.
The real-time emotion value Cn is calculated as Cn = Bn × Sn, where Bn is the second voice output quantity corresponding to the nth call voice segment and Sn is the variance of Bn together with the two second voice output quantities on each side of it (Bn-2, Bn-1, Bn+1 and Bn+2). The larger Sn is, the larger the variation of Bn relative to the values before and after it; the smaller Sn is, the smaller that variation.
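The sliding-window variance and the product Cn = Bn × Sn can be sketched in Python as below. This is a minimal sketch under stated assumptions: the function names, the use of population variance, and the truncation of the window at the ends of the sequence are illustrative choices not spelled out in the text; half_window=2 mirrors the two neighbouring values on each side mentioned above.

```python
import statistics

def reference_emotion(first_outputs):
    """Reference emotion value C0: the mean of the baseline voice output quantities."""
    return statistics.mean(first_outputs)

def realtime_emotions(second_outputs, half_window=2):
    """Real-time emotion value Cn = Bn * Sn for every second voice output quantity Bn,
    where Sn is the variance over a window centred on Bn."""
    values = []
    for i, bn in enumerate(second_outputs):
        start = max(0, i - half_window)
        end = min(len(second_outputs), i + half_window + 1)
        sn = statistics.pvariance(second_outputs[start:end])  # variance of Bn and its neighbours
        values.append(bn * sn)
    return values
```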
And the emotion analyzing module 307 is configured to analyze the emotion of the user according to the reference emotion value and the real-time emotion value.
The computer equipment calculates the emotion change rate from the reference emotion value and the real-time emotion value, and analyzes the user's emotion according to the emotion change rate. The emotion change rate is calculated as Xn = Cn − C0. The larger the emotion change rate Xn, the more the user's emotion tends to be strong; the smaller Xn, the milder the user's emotion tends to be.
When the user's emotion tends to be strong, it can be displayed in real time on the customer service agent's terminal as a reminder, so that the agent can adjust the conversation strategy.
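Continuing the sketch above, the change rate Xn = Cn − C0 and the decision to remind the agent could look like the following; the alert_threshold parameter is purely illustrative, since the patent does not specify how a "strong" emotion is thresholded.

```python
def emotion_change_rates(c0, realtime_values):
    """Emotion change rate Xn = Cn - C0 for each real-time emotion value Cn."""
    return [cn - c0 for cn in realtime_values]

def segments_to_flag(change_rates, alert_threshold):
    """Indices of call voice segments whose change rate suggests strong emotion,
    suitable for pushing a real-time reminder to the customer service terminal."""
    return [i for i, xn in enumerate(change_rates) if xn > alert_threshold]
```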
In an optional embodiment, the quality evaluation module 308 is configured to map the user emotion corresponding to each call voice segment into an emotion score; calculate the weighted sum of the emotion scores according to the weights corresponding to the call content categories to obtain a first evaluation score; acquire the customer service conversation accuracy corresponding to each call voice segment in the call; calculate the weighted sum of the customer service conversation accuracies according to the weights corresponding to the call content categories to obtain a second evaluation score; and calculate the service quality evaluation result of the customer service agent according to the first evaluation score and the second evaluation score.
Each user emotion has a corresponding emotion score; for example, a pleasant emotion corresponds to an emotion score of 90, an unpleasant emotion to 60, and an angry emotion to 30. In practical application, the emotion scores and the accuracies can each be normalized before the evaluation scores are calculated. The sum of the first evaluation score and the second evaluation score is then the service quality evaluation result of the customer service agent for this call.
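A minimal sketch of the weighted evaluation described above, assuming the per-segment emotion scores, customer service conversation accuracies, and call-content-category weights have already been obtained; the function name and the plain summation (without the optional normalization mentioned above) are illustrative.

```python
def service_quality(emotion_scores, conversation_accuracies, category_weights):
    """Service quality evaluation result for one call.

    All three sequences are indexed by call voice segment; category_weights[i]
    is the weight of the call content category of the i-th segment."""
    first_score = sum(w * s for w, s in zip(category_weights, emotion_scores))
    second_score = sum(w * a for w, a in zip(category_weights, conversation_accuracies))
    return first_score + second_score
```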
In an alternative embodiment, the target matching module 309 is used to create a user portrait; classify the user emotions and count the number of each kind of user emotion; use each user emotion and its count as an emotion label of the user portrait; and match a target customer service agent according to the emotion labels of the user portrait when an incoming call from the user is detected again.
The computer device may obtain multiple items of data about a user, create a user portrait based on that data, and add the user emotion recognized in each of the user's calls to the user portrait as an emotion tag. A target customer service agent is matched according to the emotion tags of the user portrait and assigned to serve the user, and the user portrait with its one or more emotion tags is displayed on the target agent's terminal, so that the target agent can provide better service.
In this optional embodiment, a target customer service agent is matched according to the emotion tags of the user portrait so that the target agent serves the user, which realizes dynamic scheduling of customer service and allows the corresponding service to be provided for each user. Users whose emotions are mostly intense can be assigned more experienced agents, so that more professional service is provided and the user experience is improved.
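The tagging and matching step could be sketched as below. The emotion label names, the agent record fields, and the routing rule (send users dominated by negative emotions to the most experienced available agent) are all illustrative assumptions; the patent only states that a target customer service is matched according to the emotion tags of the user portrait.

```python
from collections import Counter

def emotion_tags(call_emotions):
    """Count similar user emotions across past calls and use the
    (emotion, count) pairs as emotion tags of the user portrait."""
    return dict(Counter(call_emotions))        # e.g. {"angry": 3, "pleasant": 1}

def match_target_agent(tags, agents):
    """Pick a target customer service agent for an incoming call.
    `agents` is a list of dicts with an 'experience_years' field (assumed schema)."""
    negative = tags.get("angry", 0) + tags.get("unpleasant", 0)
    positive = tags.get("pleasant", 0)
    by_experience = sorted(agents, key=lambda a: a["experience_years"], reverse=True)
    # route users whose emotions are mostly intense to the most experienced agent
    return by_experience[0] if negative > positive else by_experience[-1]
```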
Because the customer service and the user generally converse sentence by sentence in a customer service scenario, the business usually focuses on the dynamic change of the user's emotion in each turn and on whether a particular reply causes emotion fluctuation. The method is therefore suited to turn-based conversation scenes: emotion fluctuation is analyzed through the change of the voice output quantity during the call, so that the dynamic change of the user's emotion in the conversation is obtained rather than a simple statistic of the user's emotion value. Specifically, a plurality of sliding window lengths are verified by using an emotion recognition model, and the target sliding window length most suitable for segmenting the voice is determined, so that the user's call voice is sampled with the target sliding window length to obtain call voice segments. The syllable number and the tone value of the pronunciation of each text character in the call text are determined, a voice sample value is calculated from the syllable number and the tone value of each text character, and the voice output quantity of each call voice segment is calculated based on the voice sample values. The voice output quantity is segmented into a first voice output quantity and a second voice output quantity according to the voice output quantity and the speech speed of each call voice segment, and the reference emotion value of the user is calculated based on the first voice output quantity. Because the reference emotion value is calculated dynamically from the user's own call voice, the reference emotion values obtained from the call voices of different users differ, which makes the reference emotion value a meaningful benchmark; in addition, both the speech speed and the voice output quantity are considered when calculating it, which improves its accuracy. Analyzing the user's emotion in real time against this reference emotion value is therefore more objective and accurate.
Fig. 4 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 4 includes a memory 41, at least one processor 42, at least one communication bus 43, and a transceiver 44.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 4 does not limit the embodiments of the present invention; it may be a bus-type or a star-type configuration, and the computer device 4 may include more or fewer hardware or software components than those shown, or a different arrangement of components.
In some embodiments, the computer device 4 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 4 may also include a user device, which includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.
It should be noted that the computer device 4 is only an example, and other electronic products that are currently available or may come into existence in the future, such as electronic products that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.
In some embodiments, the memory 41 stores a computer program which, when executed by the at least one processor 42, implements all or part of the steps of the call emotion real-time recognition method described above. The memory 41 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some embodiments, the at least one processor 42 is the Control Unit of the computer device 4; it is connected to the various components of the whole computer device 4 by various interfaces and lines, and executes various functions of the computer device 4 and processes data by running or executing the programs or modules stored in the memory 41 and calling the data stored in the memory 41. For example, the at least one processor 42, when executing the computer program stored in the memory, implements all or part of the steps of the call emotion real-time recognition method described in the embodiment of the present invention, or all or part of the functions of the call emotion real-time recognition apparatus. The at least one processor 42 may be formed by an integrated circuit, for example a single packaged integrated circuit, or by a plurality of integrated circuits packaged with the same or different functions, and includes one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 43 is arranged to enable connection and communication between the memory 41, the at least one processor 42, and the other components.
Although not shown, the computer device 4 may further include a power source (such as a battery) for supplying power to the various components. Preferably, the power source may be logically connected to the at least one processor 42 through a power management device, so that functions such as charging management, discharging management and power consumption management are implemented through the power management device. The power source may also include one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and other components. The computer device 4 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the present invention can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A conversation emotion real-time recognition method is characterized by comprising the following steps:
setting a plurality of sliding window lengths, and dividing a plurality of training voices by using each sliding window length to obtain a plurality of training voice fragments of each training voice;
for each sliding window length, recognizing the training voice segment by using an emotion recognition model to obtain an emotion recognition result, calculating emotion recognition accuracy according to the emotion recognition result, and determining the sliding window length corresponding to the highest value in the emotion recognition accuracy as a target sliding window length;
sampling user call voice according to the target sliding window length to obtain a plurality of call voice fragments, and identifying the call text of each call voice fragment;
determining the syllable number and the tone value of the pronunciation of each text character in the call text, calculating a tone sample value according to the syllable number and the tone value of each text character, and calculating the voice output quantity of each call voice segment based on the tone sample value;
segmenting the voice output quantity into a first voice output quantity and a second voice output quantity according to the voice output quantity and the voice speed of each conversation voice fragment;
calculating a reference emotion value based on the first voice output quantity, and calculating a real-time emotion value based on the second voice output quantity;
and analyzing the emotion of the user according to the reference emotion value and the real-time emotion value.
2. The method of claim 1, wherein the segmenting the voice output quantity into a first voice output quantity and a second voice output quantity according to the voice output quantity and the speech speed of each call voice segment comprises:
calculating a first voice output quantity amplitude difference value of the voice output quantity of the second call voice segment and the voice output quantity of the first call voice segment;
calculating a first speech speed amplitude difference value of the speech speed of the second call speech segment and the speech speed of the first call speech segment;
when a first mean value of the first voice output quantity amplitude difference value and the first speech speed amplitude difference value is smaller than a preset threshold value, calculating a second voice output quantity amplitude difference value of the voice output quantity of a third conversation voice fragment and the voice output quantity of the first conversation voice fragment, and calculating a second speech speed amplitude difference value of the speech speed of the third conversation voice fragment and the speech speed of the first conversation voice fragment;
and when a second average value of the second speech output quantity amplitude difference value and the second speech speed amplitude difference value is greater than the preset threshold value, segmenting the speech output quantity into a first speech output quantity and a second speech output quantity by taking the third conversation speech segment as a segmentation point.
3. The method of claim 2, wherein the tone sample value of each text character is calculated according to the syllable number and the tone value by using the following formula:
D = (X1² + X2²)^(1/2), where D is the tone sample value, X1 represents the syllable number, and X2 represents the tone value.
4. The real-time call emotion recognition method of claim 3, wherein said calculating a real-time emotion value based on the second speech output quantity comprises:
taking the current second voice output quantity as the center of a preset window, and acquiring the rest second voice output quantity corresponding to the preset window;
calculating the variance of the current second voice output quantity and the rest second voice output quantity;
and calculating a real-time emotion value according to the variance and the current second voice output quantity.
5. The method of claim 4, wherein the real-time emotion value is calculated according to the variance and the current second speech output quantity by using the following formula:
Cn = Bn × Sn, wherein Bn is the second voice output quantity corresponding to the nth call voice segment, and Sn is the variance of Bn and the two adjacent second voice output quantities before and after it.
6. The method for real-time recognition of call emotion according to any one of claims 1 to 5, wherein the method further comprises:
mapping the user emotion corresponding to each conversation voice segment into an emotion score;
calculating the weighted sum of the emotion scores according to the weights corresponding to the call content categories to obtain a first evaluation score;
acquiring customer service conversation accuracy corresponding to each conversation voice fragment in the conversation process;
calculating the weighted sum of the customer service conversation accuracy according to the weight corresponding to the conversation content category to obtain a second evaluation score;
and calculating to obtain a service quality evaluation result of the customer service according to the first evaluation score and the second evaluation score.
7. The method for real-time recognition of call emotion according to any one of claims 1 to 5, wherein the method further comprises:
creating a user representation;
classifying the user emotions, and calculating the number of similar user emotions;
using the user emotion and the corresponding number as emotion labels of the user portrait;
and matching the target customer service according to the emotion label of the user portrait when the incoming call of the user is detected again.
8. A conversation emotion real-time recognition apparatus, comprising:
the voice segmentation module is used for setting a plurality of sliding window lengths, and segmenting a plurality of training voices by using each sliding window length to obtain a plurality of training voice segments of each training voice;
the window length determining module is used for recognizing the training voice fragments by using an emotion recognition model according to each sliding window length to obtain emotion recognition results, calculating emotion recognition accuracy according to the emotion recognition results, and determining the sliding window length corresponding to the highest value in the emotion recognition accuracy as a target sliding window length;
the call sampling module is used for sampling the call voice of the user according to the target sliding window length to obtain a plurality of call voice fragments and identifying the call text of each call voice fragment;
the speech volume calculation module is used for determining the syllable number and the tone value of the pronunciation of each text character in the call text, calculating a speech sample value according to the syllable number and the tone value of each text character, and calculating the voice output volume of each call voice segment based on the speech sample value;
the voice volume segmentation module is used for segmenting the voice output volume into a first voice output volume and a second voice output volume according to the voice output volume and the voice speed of each conversation voice segment;
the emotion calculation module is used for calculating a reference emotion value based on the first voice output quantity and calculating a real-time emotion value based on the second voice output quantity;
and the emotion analysis module is used for analyzing the emotion of the user according to the reference emotion value and the real-time emotion value.
9. A computer device, characterized in that the computer device comprises a processor for implementing the call emotion real-time recognition method as claimed in any one of claims 1 to 7 when executing a computer program stored in a memory.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the call emotion real-time recognition method according to any one of claims 1 to 7.
CN202110706524.2A 2021-06-24 2021-06-24 Conversation emotion real-time recognition method and device, computer equipment and storage medium Active CN113241095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110706524.2A CN113241095B (en) 2021-06-24 2021-06-24 Conversation emotion real-time recognition method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113241095A true CN113241095A (en) 2021-08-10
CN113241095B CN113241095B (en) 2023-04-11

Family

ID=77140723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110706524.2A Active CN113241095B (en) 2021-06-24 2021-06-24 Conversation emotion real-time recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113241095B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160162807A1 (en) * 2014-12-04 2016-06-09 Carnegie Mellon University, A Pennsylvania Non-Profit Corporation Emotion Recognition System and Method for Modulating the Behavior of Intelligent Systems
US20160329043A1 (en) * 2014-01-21 2016-11-10 Lg Electronics Inc. Emotional-speech synthesizing device, method of operating the same and mobile terminal including the same
US20160372116A1 (en) * 2012-01-24 2016-12-22 Auraya Pty Ltd Voice authentication and speech recognition system and method
CN110097894A (en) * 2019-05-21 2019-08-06 焦点科技股份有限公司 A kind of method and system of speech emotion recognition end to end
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
US20210142820A1 (en) * 2019-11-07 2021-05-13 Sling Media Pvt Ltd Method and system for speech emotion recognition

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160372116A1 (en) * 2012-01-24 2016-12-22 Auraya Pty Ltd Voice authentication and speech recognition system and method
US20160329043A1 (en) * 2014-01-21 2016-11-10 Lg Electronics Inc. Emotional-speech synthesizing device, method of operating the same and mobile terminal including the same
US20160162807A1 (en) * 2014-12-04 2016-06-09 Carnegie Mellon University, A Pennsylvania Non-Profit Corporation Emotion Recognition System and Method for Modulating the Behavior of Intelligent Systems
CN110097894A (en) * 2019-05-21 2019-08-06 焦点科技股份有限公司 A kind of method and system of speech emotion recognition end to end
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
WO2021051577A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Speech emotion recognition method, apparatus, device, and storage medium
US20210142820A1 (en) * 2019-11-07 2021-05-13 Sling Media Pvt Ltd Method and system for speech emotion recognition

Also Published As

Publication number Publication date
CN113241095B (en) 2023-04-11

Similar Documents

Publication Publication Date Title
CN108197115B (en) Intelligent interaction method and device, computer equipment and computer readable storage medium
Weninger et al. The voice of leadership: Models and performances of automatic analysis in online speeches
CN108564968A (en) A kind of method and device of evaluation customer service
CN110085221A (en) Speech emotional exchange method, computer equipment and computer readable storage medium
WO2021047319A1 (en) Voice-based personal credit assessment method and apparatus, terminal and storage medium
CN110874716A (en) Interview evaluation method and device, electronic equipment and storage medium
CN113807103B (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN112417128B (en) Method and device for recommending dialect, computer equipment and storage medium
CN111901627B (en) Video processing method and device, storage medium and electronic equipment
CN113436634B (en) Voice classification method and device based on voiceprint recognition and related equipment
CN114007131A (en) Video monitoring method and device and related equipment
CN111462761A (en) Voiceprint data generation method and device, computer device and storage medium
CN108962243A (en) arrival reminding method and device, mobile terminal and computer readable storage medium
CN112863529A (en) Speaker voice conversion method based on counterstudy and related equipment
CN113591489B (en) Voice interaction method and device and related equipment
CN113255362B (en) Method and device for filtering and identifying human voice, electronic device and storage medium
CN112489628B (en) Voice data selection method and device, electronic equipment and storage medium
CN113077821A (en) Audio quality detection method and device, electronic equipment and storage medium
CN113241095B (en) Conversation emotion real-time recognition method and device, computer equipment and storage medium
CN112466337A (en) Audio data emotion detection method and device, electronic equipment and storage medium
CN113436617B (en) Voice sentence breaking method, device, computer equipment and storage medium
CN113221990A (en) Information input method and device and related equipment
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN114842880A (en) Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium
Jacob et al. Prosodic feature based speech emotion recognition at segmental and supra segmental levels

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant