CN111081279A - Voice emotion fluctuation analysis method and device
- Publication number: CN111081279A (application CN201911341679.XA)
- Authority: CN (China)
- Prior art keywords: emotion, audio, character, recognition result, feature
- Prior art date: 2019-12-24
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
All classifications fall under G10L (G - Physics; G10 - Musical instruments; Acoustics; G10L - Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding):
- G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
- G10L15/05 - Speech recognition; segmentation; word boundary detection
- G10L15/06 - Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L21/0272 - Speech enhancement, e.g. noise reduction or echo cancellation; voice signal separating
- G10L25/87 - Detection of presence or absence of voice signals; detection of discrete points within a voice signal
Abstract
The embodiment of the invention provides a voice emotion fluctuation analysis method, which comprises the following steps: acquiring a first audio feature and a first character feature of voice data to be detected; extracting a second audio feature from the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character feature based on a character feature extraction network in a pre-trained character recognition model; recognizing the second audio feature to obtain an audio emotion recognition result; recognizing the second character feature to obtain a character emotion recognition result; and fusing the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal. Through two-channel voice emotion recognition and the drawing of emotion value heat maps, the method provides a visualized reference for customer service quality inspection, makes the evaluation result more objective, helps enterprises improve customer service quality, and improves the customer experience.
Description
Technical Field
The invention relates to the technical field of the Internet, and in particular to a voice emotion fluctuation analysis method and device.
Background
With the development of artificial intelligence technology, emotion fluctuation analysis is being applied in more and more commercial scenarios, for example to track the emotions of both parties when a customer service agent talks with a client. In the prior art, emotion fluctuation analysis of audio generally relies only on the acoustic signal of the voice, such as intonation, frequency and the amplitude variation of the sound waves. This analysis mode is one-sided: the acoustic signals of different people differ, so emotion analysis based on the acoustic signal alone has low accuracy.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for analyzing speech emotion fluctuation, a computer device, and a computer-readable storage medium, which are used for solving the problem of low accuracy in analyzing emotion fluctuation.
The embodiment of the invention solves the technical problems through the following technical scheme:
a speech emotion fluctuation analysis method includes:
acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
identifying the second audio features to obtain an audio emotion identification result; identifying the second character features to obtain character emotion identification results;
and carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
Further, the acquiring a first audio feature and a first character feature of the voice data to be detected includes:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
Further, the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
Further, the acquiring of the first audio feature and the first character feature of the voice data to be detected further includes:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively performing word vector mapping on the L word segments to obtain a d-dimensional word vector matrix corresponding to the L word segments, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is the first character feature of the voice data to be detected.
Further, the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
Further, the method further comprises:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
Further, the fusion processing of the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and the sending of the emotion recognition result to the associated terminal includes:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
In order to achieve the above object, an embodiment of the present invention further provides a speech emotion analyzing apparatus, including:
the first voice feature acquisition module is used for acquiring a first audio feature and a first character feature of the voice data to be detected;
the second voice feature extraction module is used for extracting a second audio feature in the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
the voice feature recognition module is used for recognizing the second audio features and acquiring an audio emotion recognition result; identifying the second character features to obtain character emotion identification results;
and the recognition result acquisition module is used for carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and sending the emotion recognition result to the associated terminal.
In order to achieve the above object, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the speech emotion fluctuation analysis method as described above when executing the computer program.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the speech emotion fluctuation analysis method as described above.
According to the voice emotion fluctuation analysis method and device, the computer device and the computer-readable storage medium, voice emotion is analyzed through two channels: the emotion is analyzed from the acoustic prosody of the audio, and the speaker's emotion is further judged from the spoken content, thereby improving the accuracy of emotion analysis.
The invention is described in detail below with reference to the drawings and specific examples, but the invention is not limited thereto.
Drawings
FIG. 1 is a flowchart illustrating a method for analyzing speech emotion fluctuation according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of obtaining voice data to be tested;
FIG. 3 is a detailed flowchart of extracting a first audio feature from the voice data to be detected;
FIG. 4 is a detailed flowchart of extracting a first text feature from the voice data to be detected;
FIG. 5 is a flowchart illustrating the specific process of identifying the second audio feature and obtaining the audio emotion recognition result;
fig. 6 is a specific flowchart for identifying the second character feature and obtaining a character emotion identification result;
fig. 7 is a specific flowchart for performing fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal;
FIG. 8 is a schematic diagram of a second embodiment of a speech emotion analyzing apparatus according to the present invention;
FIG. 9 is a diagram of a hardware structure of a third embodiment of the computer apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when a combination of technical solutions would be contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
Example one
Referring to fig. 1, a flowchart illustrating steps of a speech emotion analyzing method according to an embodiment of the present invention is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is given by taking a computer device as an execution subject, specifically as follows:
s100: acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
referring to fig. 2, the method for analyzing speech emotion fluctuation according to the embodiment of the present invention further includes:
s110: and acquiring voice data to be detected.
The acquiring of the voice data to be tested further comprises:
S110A: acquiring offline or online voice data;
specifically, the voice data includes online voice data and offline voice data, the online voice data refers to voice data obtained in real time in a call process, the offline voice data refers to call voice data stored in a system background, and the voice data to be tested is a recording file in a wav format.
S110B: and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
Specifically, after the voice data is acquired, the voice data to be detected is divided into multiple segments of first user voice data and second user voice data according to the silent portions of the call. Endpoint detection and voice separation techniques are used to remove the silent portions from the voice data to be detected; based on a preset duration threshold for the silence between utterances, the start point and end point of each segment of conversation are marked, and the audio is cut at these time points to obtain several short audio segments. A voiceprint recognition tool then labels the speaker identity and speaking time of each short audio segment, and the speakers are distinguished by numbers. The duration threshold is determined empirically; as an embodiment, the threshold of this scheme is 0.25 to 0.3 seconds.
The number includes, but is not limited to, the job number of the customer service, the landline number of the customer service, and the cell phone number of the customer.
Specifically, the voiceprint recognition tool is the LIUM_SpkDiarization toolkit, through which the first user voice data and the second user voice data are distinguished, for example as follows:
start_time | end_time | speaker |
---|---|---|
0 | 3 | 1 |
4 | 8 | 2 |
8.3 | 12.5 | 1 |
The person who speaks first is naturally taken to be the first user (speaker 1 in the table above), and the next distinct speaker is the second user (speaker 2 in the table).
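As a rough illustration of this segmentation and speaker-labelling step, the following sketch splits a recording at silent gaps with pydub and then maps a diarization table of (start_time, end_time, speaker) rows to first-user/second-user roles by order of first appearance. It is a minimal sketch only: the 250 ms silence threshold comes from the embodiment above, while the dB silence threshold and the tabular diarization format are assumptions rather than the exact LIUM_SpkDiarization interface.

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def split_call(wav_path, min_silence_ms=250, silence_thresh_db=-40):
    """Cut a call recording into short audio segments at silent gaps.

    min_silence_ms=250 matches the 0.25 s threshold of the embodiment;
    silence_thresh_db is an assumed value that would be tuned per channel.
    """
    audio = AudioSegment.from_wav(wav_path)
    spans = detect_nonsilent(audio, min_silence_len=min_silence_ms,
                             silence_thresh=silence_thresh_db)
    return [audio[start:end] for start, end in spans]   # one segment per span

def label_speakers(diarization_rows):
    """Map diarization rows (start_time, end_time, speaker_id) to user roles.

    The speaker who opens the call is treated as the first user and the next
    distinct speaker as the second user, matching the table above.
    """
    roles, role_by_id, labelled = ["first_user", "second_user"], {}, []
    for start, end, spk in sorted(diarization_rows, key=lambda r: r[0]):
        if spk not in role_by_id and len(role_by_id) < len(roles):
            role_by_id[spk] = roles[len(role_by_id)]
        labelled.append((start, end, role_by_id.get(spk, f"speaker_{spk}")))
    return labelled

# Example with the diarization table shown above:
print(label_speakers([(0, 3, 1), (4, 8, 2), (8.3, 12.5, 1)]))
```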
Referring to fig. 3, the acquiring the first audio feature of the to-be-detected speech data further includes:
S100A1: performing frame windowing on the voice data to be detected to obtain voice analysis frames;
Specifically, a voice data signal is short-time stationary, so it can be divided into frames, each audio frame being a set of N sampling points. In this embodiment N is 256 or 512, covering about 20 to 30 ms. After framing, each audio frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame, yielding the voice analysis frames.
S100B1: carrying out Fourier transform on the voice analysis frames to obtain the corresponding frequency spectra;
Specifically, because the characteristics of the voice signal are difficult to observe from its variation in the time domain, the signal is converted into an energy distribution in the frequency domain: each voice analysis frame is Fourier-transformed to obtain its frequency spectrum.
S100C1: the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
S100D1: and performing cepstrum analysis on the Mel frequency spectrum to obtain the first audio feature of the voice data to be detected.
Specifically, cepstrum analysis is performed on the mel frequency spectrum to obtain 36 1024-dimensional audio vectors, and the audio vectors are the first audio features of the voice data to be detected.
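The four steps above follow the standard mel-cepstral pipeline. The sketch below reproduces it with numpy, scipy and librosa's mel filter bank; the 512-sample frame and Hamming window come from the embodiment, while the sample rate, the number of mel filters and the number of cepstral coefficients are placeholder values, not the exact configuration that yields the 36 1024-dimensional vectors described above.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def first_audio_feature(signal, sr=16000, frame_len=512, hop=256,
                        n_mels=40, n_ceps=13):
    """Framing + Hamming window -> FFT -> mel filter bank -> cepstral analysis."""
    # 1. Framing and windowing: frames of frame_len samples, Hamming-windowed.
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 2. Fourier transform: power spectrum of each voice analysis frame.
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2
    # 3. Mel filter bank projection.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
    mel_spec = power @ mel_fb.T
    # 4. Cepstral analysis: log followed by a DCT, keeping n_ceps coefficients.
    return dct(np.log(mel_spec + 1e-10), type=2, axis=1, norm="ortho")[:, :n_ceps]
```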
Referring to fig. 4, the acquiring the first text feature of the voice data to be detected further includes:
S100A2: converting the voice data to be tested into characters;
Specifically, the multiple segments of first user voice data and second user voice data are converted into characters using a voice dictation interface. As an embodiment, the dictation interface is the iFlytek voice dictation interface.
S100B2: performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
Specifically, the word segmentation is performed with a dictionary-based word segmentation algorithm, which includes, but is not limited to, the forward maximum matching method, the reverse maximum matching method and the bidirectional maximum matching method; word segmentation may also be based on hidden Markov models (HMM), conditional random fields (CRF), support vector machines (SVM) or deep learning algorithms.
S100C2: and respectively performing word vector mapping on the L word segments to obtain a d-dimensional word vector matrix corresponding to the L word segments, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is the first character feature of the voice data to be detected.
Specifically, a 128-dimensional word vector for each word segment is obtained through word2vec or a similar model.
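A minimal sketch of this text-side preprocessing, assuming jieba for dictionary-based word segmentation and gensim's word2vec implementation for the 128-dimensional word vectors; the training corpus, the model hyperparameters and the handling of out-of-vocabulary segments are illustrative assumptions, not the patent's exact setup.

```python
import numpy as np
import jieba
from gensim.models import Word2Vec

def first_text_feature(transcript, w2v_model, dim=128):
    """Segment a transcript into L word segments and map them to an L x d matrix."""
    segments = jieba.lcut(transcript)                 # the L word segments
    rows = [w2v_model.wv[seg] if seg in w2v_model.wv  # d-dimensional word vector
            else np.zeros(dim)                        # assumed handling of OOV segments
            for seg in segments]
    return np.stack(rows) if rows else np.zeros((0, dim))

# Illustrative training of the word-vector model on a segmented corpus:
# corpus = [jieba.lcut(utterance) for utterance in all_transcripts]
# w2v_model = Word2Vec(corpus, vector_size=128, window=5, min_count=1)
```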
S102: extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; and extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model.
Specifically, the second audio feature and the second character feature are semantic feature vectors that the feature extraction networks of the emotion recognition models extract from the first audio feature and the first character feature; they have fewer dimensions and focus more on the parts that express emotion. Extracting the second audio feature and the second character feature gives the models better learning capacity and makes the final classification more accurate.
S104: identifying the second audio features to obtain an audio emotion identification result; and identifying the second character features to obtain a character emotion identification result.
Specifically, the audio emotion recognition result is obtained by inputting the audio features into the audio recognition model, and the character emotion recognition result is obtained by inputting the character features into the character recognition model. Both the audio recognition model and the character recognition model comprise a feature extraction network and a classification network. The feature extraction network extracts lower-dimensional semantic feature vectors, namely the second audio feature and the second character feature, from the first audio feature and the first character feature; the classification network outputs a confidence for each preset emotion category, where the preset emotion categories can be divided according to business requirements, for example positive and negative. The character recognition model is a deep neural network comprising an embedding layer and a long short-term memory (LSTM) layer, and the audio recognition model is a neural network comprising a self-attention layer and a bidirectional long short-term memory layer (a forward LSTM and a backward LSTM).
The long short-term memory network handles dependencies across long spans of a sequence and is therefore well suited to tasks with long-range dependencies in text.
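As a rough sketch of the two networks described above, an embedding layer plus LSTM for the character channel and a self-attention layer plus bidirectional LSTM for the audio channel, the PyTorch code below is illustrative only; the layer sizes, the single attention head and the two-class output are assumptions rather than the patent's exact architecture.

```python
import torch.nn as nn

class CharacterEmotionNet(nn.Module):
    """Character channel: embedding layer + LSTM layer + classification head."""
    def __init__(self, vocab_size, emb_dim=128, hidden=128, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):                 # (batch, L) word-segment ids
        h, _ = self.lstm(self.embedding(token_ids))
        return self.classifier(h[:, -1])          # logits; softmax gives the confidences

class AudioEmotionNet(nn.Module):
    """Audio channel: self-attention layer + bidirectional LSTM + classification head."""
    def __init__(self, feat_dim=13, hidden=128, n_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=1, batch_first=True)
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames):                    # (batch, T, feat_dim) audio features
        attended, _ = self.attn(frames, frames, frames)
        h, _ = self.bilstm(attended)
        return self.classifier(h[:, -1])          # logits; softmax gives the confidences
```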
Further, the embodiment of the present invention further includes training the audio recognition model and the character recognition model, where the training process includes:
acquiring a training set and a validation set corresponding to the target domain;
The acquiring of the training set and the validation set corresponding to the target domain comprises the following steps:
acquiring voice data for the training set and the validation set;
Specifically, the voice data for the training set and the validation set can be acquired by, but is not limited to, recording data from the company's own call center, customer service recordings provided by clients, and customer service recordings purchased directly from a data platform.
Marking the emotion type of the recorded data;
Specifically, the labelling process is as follows: the pause time points of each recording are manually marked to obtain several short audio segments (conversation segments) per recording; each short audio segment is then labelled with its emotional tendency (i.e., positive emotion or negative emotion). In the present embodiment, the audio annotation tool audio-annotator is used for the start and end time point labelling and the emotion labelling of the audio segments.
Separating the training set and the validation set;
Specifically, the training set and the validation set are separated as follows: all labelled audio segment samples are randomly shuffled and then divided into two data sets at a ratio of 4:1; the larger part is used for model training (the training set) and the smaller part for model validation (the validation set).
Adjusting the voice emotion recognition model and the character emotion recognition model based on the emotion labels of the training set;
and evaluating the voice emotion recognition model and the character emotion recognition model on the validation set to determine their accuracy.
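A small sketch of the 4:1 split described above, assuming the labelled segments are held in a Python list of (segment, label) pairs; the fixed random seed and the container format are illustrative.

```python
import random

def split_train_validation(labelled_segments, seed=42):
    """Randomly shuffle labelled audio segments and split them 4:1.

    labelled_segments: list of (audio_segment, emotion_label) pairs.
    Returns (training_set, validation_set).
    """
    samples = list(labelled_segments)
    random.Random(seed).shuffle(samples)
    cut = len(samples) * 4 // 5          # 4/5 for training, 1/5 for validation
    return samples[:cut], samples[cut:]
```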
Referring to fig. 5, the identifying the second audio feature and obtaining the audio emotion recognition result further includes:
S104A1: identifying the second audio feature based on the audio classification network in the pre-trained audio recognition model, and acquiring a plurality of audio emotion classifications and the first confidence corresponding to each audio emotion classification;
The extracted second audio feature is input into the audio classification network of the audio recognition model, and the classification network analyzes it to obtain a plurality of audio emotion classifications and a first confidence for each. For example, the first confidence of "positive emotion" is 0.3 and the first confidence of "negative emotion" is 0.7.
S104B1: and selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter.
Correspondingly, the target audio emotion is classified as "negative emotion", and the target audio emotion classification parameter is 0.7.
S104C1: and performing numerical value mapping on the target audio emotion classification parameter to obtain the audio emotion recognition result.
Numerical value mapping means mapping the raw output into a specific numerical value according to the emotion category, which makes it easier to observe emotion fluctuation later. In one embodiment, the emotion classification is mapped to a specific number through a fixed functional relation: after the first confidence of each preset emotion category of the voice data to be detected is obtained, the target audio emotion classification parameter X corresponding to the category with the highest confidence is selected, and the final audio emotion recognition result Y is calculated with the following formula.
In this embodiment the mapping is: Y = 0.5X when the recognized emotion is "positive", and Y = 0.5(1 + X) when it is "negative", so the final audio emotion recognition result is a floating-point number between 0 and 1.
Continuing the example above (negative emotion with X = 0.7), the final audio emotion recognition result is Y = 0.5 × (1 + 0.7) = 0.85.
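A minimal sketch of this confidence-to-value mapping; the two category names are the preset classes used in the example above.

```python
def map_emotion_value(category, confidence):
    """Map the winning emotion category and its confidence X to a value Y in [0, 1].

    Y = 0.5 * X for a positive emotion and Y = 0.5 * (1 + X) for a negative one,
    so values near 0 are positive and values near 1 are negative.
    """
    return 0.5 * confidence if category == "positive" else 0.5 * (1 + confidence)

print(map_emotion_value("negative", 0.7))   # 0.85, matching the example above
```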
Referring to fig. 6, recognizing the second text feature, and obtaining a text emotion recognition result further includes:
S104A2: and recognizing the second character feature based on the character classification network in the pre-trained character recognition model, and acquiring the second confidences corresponding to a plurality of character emotion classifications.
The extracted second character feature is input into the character classification network of the character recognition model, and the classification network analyzes it to obtain a plurality of character emotion classifications and a second confidence for each. For example, the second confidence of "positive emotion" is 0.2, and the second confidence of "negative emotion" is 0.8.
S104B2: and selecting the character emotion classification with the highest second confidence coefficient as the target character emotion classification, wherein the corresponding second confidence coefficient is the target character emotion classification parameter.
Correspondingly, the target character emotion classification is "negative emotion", and the target character emotion classification parameter is 0.8.
S104C2: and performing numerical value mapping on the target character emotion classification parameter to obtain the character emotion recognition result.
Using the same mapping (negative emotion with X = 0.8), the final character emotion recognition result is Y = 0.5 × (1 + 0.8) = 0.9.
And S106, carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
Referring to fig. 7, the step S106 may further include:
S106A: weighting the audio emotion recognition result and the character emotion recognition result of each segment of the first user voice data to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each segment of the second user voice data to obtain a second emotion value;
Specifically, the two emotion values of the same audio segment are combined by numerical weighting. The emotion values are floating-point numbers between 0 and 1; the closer the value is to 1, the more negative the emotion, and the closer it is to 0, the more positive the emotion.
As an example, the weight of the emotion value obtained by the speech emotion recognition channel is 0.7; the weight of the emotion value obtained by the character emotion recognition channel is 0.3.
Following the example above, the final output emotion value is 0.7 × 0.85 + 0.3 × 0.9 = 0.865.
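A minimal sketch of this weighted fusion, using the 0.7/0.3 channel weights from the example above:

```python
def fuse_emotion_values(audio_value, text_value, audio_weight=0.7, text_weight=0.3):
    """Fuse the audio-channel and character-channel emotion values of one segment."""
    return audio_weight * audio_value + text_weight * text_value

print(fuse_emotion_values(0.85, 0.9))   # approximately 0.865, as in the example above
```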
S106B: generating a first emotion value heat map according to the first emotion values and a second emotion value heat map according to the second emotion values;
Specifically, each segment of the voice to be detected is numbered in chronological order and drawn into an emotion value heat map, which gathers the emotion of each time period into a single view.
Specifically, the emotion value heat map is plotted with the heatmap function of the Python seaborn library, with different colors representing different emotions; for example, the more positive the emotion, the darker the color.
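A rough sketch of this drawing step with seaborn's heatmap function; the single-row layout, the default color map, the output file name and the extra segment values in the usage line are assumptions made for illustration.

```python
import seaborn as sns
import matplotlib.pyplot as plt

def plot_emotion_heatmap(emotion_values, title, out_path="emotion_heatmap.png"):
    """Plot one user's per-segment emotion values (0 positive .. 1 negative) as a heat map.

    Segments are numbered in chronological order along the x axis; each cell's
    color encodes that segment's emotion value.
    """
    ax = sns.heatmap([emotion_values], vmin=0.0, vmax=1.0, cbar=True,
                     xticklabels=list(range(1, len(emotion_values) + 1)),
                     yticklabels=False)
    ax.set_title(title)
    ax.set_xlabel("conversation segment")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()

plot_emotion_heatmap([0.865, 0.42, 0.31], "first user emotion values")
```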
S106C: sending the first emotion value heat map and the second emotion value heat map to the associated terminals.
Specifically, the associated terminals include a first user terminal and a second user terminal. As an embodiment, when the first user and the second user are a customer and a customer service agent respectively, the associated terminals also include a customer service quality supervision terminal and the customer service agent's supervisor terminal, in addition to the customer terminal and the customer service terminal, so that the service quality of the customer service can be supervised and corrected.
The embodiment of the invention analyzes voice emotion through two channels: the emotion is analyzed from the acoustic prosody of the audio, and the speaker's emotion is further judged from the spoken content, which improves the accuracy of emotion analysis. Combined with the conversation separation technique, the emotion value of each segment of conversation is analyzed, so the speaker's emotion in each time period of the complete call is obtained and the speaker's emotion fluctuation can be analyzed. This provides a visualized reference for customer service quality inspection, makes the evaluation result more objective, and ultimately helps enterprises improve customer service quality and the customer experience.
Example two
With continued reference to fig. 8, a schematic diagram of the program modules of the speech emotion fluctuation analysis apparatus of the present invention is shown. In the present embodiment, the speech emotion fluctuation analysis apparatus 20 may include, or be divided into, one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the present invention and the speech emotion fluctuation analysis method described above. A program module in the embodiments of the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is better suited than the program itself for describing the execution process of the speech emotion fluctuation analysis apparatus 20 in the storage medium. The following description specifically introduces the functions of the program modules of this embodiment:
the first voice feature obtaining module 200 is configured to obtain a first audio feature and a first text feature of the voice data to be detected.
Further, the first speech feature obtaining module 200 is further configured to:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
The second voice feature extraction module 202: configured to extract a second audio feature from the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model, and to extract a second character feature from the first character feature based on a character feature extraction network in a pre-trained character recognition model.
Further, the second speech feature extraction module 202 is further configured to:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
Further, the second speech feature extraction module 202 is further configured to:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively performing word vector mapping on the L word segments to obtain a d-dimensional word vector matrix corresponding to the L word segments, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is the first character feature of the voice data to be detected.
The voice feature recognition module 204: configured to recognize the second audio feature to obtain an audio emotion recognition result, and to recognize the second character feature to obtain a character emotion recognition result.
Further, the speech feature recognition module 204 is further configured to:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
Further, the speech feature recognition module 204 is further configured to:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
The recognition result acquisition module 206: configured to fuse the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and send the emotion recognition result to the associated terminal.
Further, the recognition result obtaining module 206 is further configured to:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
EXAMPLE III
Fig. 9 is a schematic diagram of the hardware architecture of a computer device according to a third embodiment of the present invention. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing in accordance with preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), and the like. As shown in fig. 9, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23 and the speech emotion fluctuation analysis apparatus 20, which are communicatively connected to one another via a system bus. Wherein:
in this embodiment, the memory 21 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the storage 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device 2. Of course, the memory 21 may also comprise both internal and external memory units of the computer device 2. In this embodiment, the memory 21 is generally used for storing various application software and operating system devices installed in the computer device 2, such as the program codes of the speech emotion analyzing device 20 in the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The network interface 23 may comprise a wireless network interface or a wired network interface, and is generally used for establishing a communication connection between the computer device 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth or Wi-Fi.
It is noted that fig. 9 only shows the computer device 2 with components 20-23, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
In this embodiment, the speech emotion analyzing apparatus 20 stored in the memory 21 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention.
For example, fig. 8 shows a schematic diagram of the program modules of the second embodiment of the speech emotion fluctuation analysis apparatus 20. In this embodiment, the speech emotion fluctuation analysis apparatus 20 can be divided into a first voice feature acquisition module 200, a second voice feature extraction module 202, a voice feature recognition module 204 and a recognition result acquisition module 206. A program module referred to in the present invention is a series of computer program instruction segments capable of performing specific functions, and is better suited than the program itself for describing the execution process of the speech emotion fluctuation analysis apparatus 20 in the computer device 2. The specific functions of the program modules 200 to 206 have been described in detail in the second embodiment and are not repeated here.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of the present embodiment is used for storing a speech emotion fluctuation analysis device 20, and when executed by a processor, implements the speech emotion fluctuation analysis method of the above-described embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A speech emotion fluctuation analysis method is characterized by comprising the following steps:
acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
identifying the second audio features to obtain an audio emotion identification result; identifying the second character features to obtain character emotion identification results;
and carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
2. The method for analyzing speech emotion fluctuation according to claim 1, wherein the obtaining of the first audio feature and the first text feature of the speech data to be detected includes:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
3. The speech emotion fluctuation analysis method according to claim 2, wherein the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result comprise:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
4. The method for analyzing speech emotion fluctuation according to claim 1, wherein the obtaining of the first audio feature and the first text feature of the speech data to be tested further includes:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively performing word vector mapping on the L word segments to obtain a d-dimensional word vector matrix corresponding to the L word segments, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is the first character feature of the voice data to be detected.
5. The speech emotion fluctuation analysis method according to claim 4, wherein the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result comprise:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
6. The speech emotion fluctuation analysis method of claim 1, wherein the method further comprises:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
7. The voice emotion fluctuation analysis method of claim 6, wherein the fusing the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and the sending the emotion recognition result to the association terminal includes:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
8. A speech emotion fluctuation analysis apparatus, comprising:
the first voice feature acquisition module is used for acquiring a first audio feature and a first character feature of the voice data to be detected;
the second voice feature extraction module is used for extracting a second audio feature in the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
the voice feature recognition module is used for recognizing the second audio features and acquiring an audio emotion recognition result; identifying the second character features to obtain character emotion identification results;
and the recognition result acquisition module is used for carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and sending the emotion recognition result to the associated terminal.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the speech emotion fluctuation analysis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the speech emotion fluctuation analysis method according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911341679.XA CN111081279A (en) | 2019-12-24 | 2019-12-24 | Voice emotion fluctuation analysis method and device |
PCT/CN2020/094338 WO2021128741A1 (en) | 2019-12-24 | 2020-06-04 | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911341679.XA CN111081279A (en) | 2019-12-24 | 2019-12-24 | Voice emotion fluctuation analysis method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111081279A true CN111081279A (en) | 2020-04-28 |
Family
ID=70317032
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911341679.XA Pending CN111081279A (en) | 2019-12-24 | 2019-12-24 | Voice emotion fluctuation analysis method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111081279A (en) |
WO (1) | WO2021128741A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114373455A (en) * | 2021-12-08 | 2022-04-19 | 北京声智科技有限公司 | Emotion recognition method and device, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108305641B (en) * | 2017-06-30 | 2020-04-07 | 腾讯科技(深圳)有限公司 | Method and device for determining emotion information |
CN111081279A (en) * | 2019-12-24 | 2020-04-28 | 深圳壹账通智能科技有限公司 | Voice emotion fluctuation analysis method and device |
- 2019-12-24: CN application CN201911341679.XA, published as CN111081279A (active, Pending)
- 2020-06-04: WO application PCT/CN2020/094338, published as WO2021128741A1 (active, Application Filing)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102779510A (en) * | 2012-07-19 | 2012-11-14 | 东南大学 | Speech emotion recognition method based on feature space self-adaptive projection |
CN106228977A (en) * | 2016-08-02 | 2016-12-14 | 合肥工业大学 | The song emotion identification method of multi-modal fusion based on degree of depth study |
CN108305643A (en) * | 2017-06-30 | 2018-07-20 | 腾讯科技(深圳)有限公司 | The determination method and apparatus of emotion information |
CN108305642A (en) * | 2017-06-30 | 2018-07-20 | 腾讯科技(深圳)有限公司 | The determination method and apparatus of emotion information |
US20190325897A1 (en) * | 2018-04-21 | 2019-10-24 | International Business Machines Corporation | Quantifying customer care utilizing emotional assessments |
CN110390956A (en) * | 2019-08-15 | 2019-10-29 | 龙马智芯(珠海横琴)科技有限公司 | Emotion recognition network model, method and electronic equipment |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021128741A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳壹账通智能科技有限公司 | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium |
CN111739559A (en) * | 2020-05-07 | 2020-10-02 | 北京捷通华声科技股份有限公司 | Speech early warning method, device, equipment and storage medium |
CN111739559B (en) * | 2020-05-07 | 2023-02-28 | 北京捷通华声科技股份有限公司 | Speech early warning method, device, equipment and storage medium |
CN111916112A (en) * | 2020-08-19 | 2020-11-10 | 浙江百应科技有限公司 | Emotion recognition method based on voice and characters |
CN111938674A (en) * | 2020-09-07 | 2020-11-17 | 南京宇乂科技有限公司 | Emotion recognition control system for conversation |
CN112215927A (en) * | 2020-09-18 | 2021-01-12 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for synthesizing face video |
CN112215927B (en) * | 2020-09-18 | 2023-06-23 | 腾讯科技(深圳)有限公司 | Face video synthesis method, device, equipment and medium |
CN112100337A (en) * | 2020-10-15 | 2020-12-18 | 平安科技(深圳)有限公司 | Emotion recognition method and device in interactive conversation |
CN112100337B (en) * | 2020-10-15 | 2024-03-05 | 平安科技(深圳)有限公司 | Emotion recognition method and device in interactive dialogue |
CN112527994A (en) * | 2020-12-18 | 2021-03-19 | 平安银行股份有限公司 | Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium |
CN112837702A (en) * | 2020-12-31 | 2021-05-25 | 萨孚凯信息系统(无锡)有限公司 | Voice emotion distributed system and voice signal processing method |
CN112911072A (en) * | 2021-01-28 | 2021-06-04 | 携程旅游网络技术(上海)有限公司 | Call center volume identification method and device, electronic equipment and storage medium |
CN113053409A (en) * | 2021-03-12 | 2021-06-29 | 科大讯飞股份有限公司 | Audio evaluation method and device |
CN113053409B (en) * | 2021-03-12 | 2024-04-12 | 科大讯飞股份有限公司 | Audio evaluation method and device |
CN113129927A (en) * | 2021-04-16 | 2021-07-16 | 平安科技(深圳)有限公司 | Voice emotion recognition method, device, equipment and storage medium |
CN113129927B (en) * | 2021-04-16 | 2023-04-07 | 平安科技(深圳)有限公司 | Voice emotion recognition method, device, equipment and storage medium |
CN114049902A (en) * | 2021-10-27 | 2022-02-15 | 广东万丈金数信息技术股份有限公司 | Aricloud-based recording uploading recognition and emotion analysis method and system |
WO2023246076A1 (en) * | 2022-06-24 | 2023-12-28 | 上海哔哩哔哩科技有限公司 | Emotion category recognition method, apparatus, storage medium and electronic device |
CN115430155A (en) * | 2022-09-06 | 2022-12-06 | 北京中科心研科技有限公司 | Team cooperation capability assessment method and system based on audio analysis |
CN117688344A (en) * | 2024-02-04 | 2024-03-12 | 北京大学 | Multi-mode fine granularity trend analysis method and system based on large model |
CN117688344B (en) * | 2024-02-04 | 2024-05-07 | 北京大学 | Multi-mode fine granularity trend analysis method and system based on large model |
Also Published As
Publication number | Publication date |
---|---|
WO2021128741A1 (en) | 2021-07-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111081279A (en) | Voice emotion fluctuation analysis method and device | |
CN107680582B (en) | Acoustic model training method, voice recognition method, device, equipment and medium | |
CN108198547B (en) | Voice endpoint detection method and device, computer equipment and storage medium | |
US9536547B2 (en) | Speaker change detection device and speaker change detection method | |
US10878823B2 (en) | Voiceprint recognition method, device, terminal apparatus and storage medium | |
US10388279B2 (en) | Voice interaction apparatus and voice interaction method | |
CN110457432B (en) | Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium | |
CN109256150B (en) | Speech emotion recognition system and method based on machine learning | |
US9368116B2 (en) | Speaker separation in diarization | |
CN111311327A (en) | Service evaluation method, device, equipment and storage medium based on artificial intelligence | |
US20180122377A1 (en) | Voice interaction apparatus and voice interaction method | |
CN111785275A (en) | Voice recognition method and device | |
CN110390946A (en) | A kind of audio signal processing method, device, electronic equipment and storage medium | |
CN110675862A (en) | Corpus acquisition method, electronic device and storage medium | |
CN108899033B (en) | Method and device for determining speaker characteristics | |
US11837236B2 (en) | Speaker recognition based on signal segments weighted by quality | |
Pao et al. | A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition | |
CN106710588B (en) | Speech data sentence recognition method, device and system | |
CN110556098A (en) | voice recognition result testing method and device, computer equipment and medium | |
CN118035411A (en) | Customer service voice quality inspection method, customer service voice quality inspection device, customer service voice quality inspection equipment and storage medium | |
CN116741155A (en) | Speech recognition method, training method, device and equipment of speech recognition model | |
CN111933153B (en) | Voice segmentation point determining method and device | |
CN111326161B (en) | Voiceprint determining method and device | |
CN113421552A (en) | Audio recognition method and device | |
CN114446284A (en) | Speaker log generation method and device, computer equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20240209 |