CN111081279A - Voice emotion fluctuation analysis method and device - Google Patents

Voice emotion fluctuation analysis method and device

Info

Publication number
CN111081279A
CN111081279A (application number CN201911341679.XA)
Authority
CN
China
Prior art keywords
emotion
audio
character
recognition result
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911341679.XA
Other languages
Chinese (zh)
Inventor
朱锦祥
单以磊
臧磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN201911341679.XA priority Critical patent/CN111081279A/en
Publication of CN111081279A publication Critical patent/CN111081279A/en
Priority to PCT/CN2020/094338 priority patent/WO2021128741A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a voice emotion fluctuation analysis method, which comprises the following steps: acquiring a first audio feature and a first character feature of voice data to be detected; extracting a second audio feature from the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character feature based on a character feature extraction network in a pre-trained character recognition model; recognizing the second audio feature to obtain an audio emotion recognition result; recognizing the second character feature to obtain a character emotion recognition result; and fusing the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal. By combining dual-channel voice emotion recognition with the drawing of emotion value heat maps, the method provides a visualized reference for customer service quality inspection, makes the evaluation result more objective, helps enterprises improve customer service quality, and improves the customer experience.

Description

Voice emotion fluctuation analysis method and device
Technical Field
The invention relates to the technical field of internet, in particular to a voice emotion fluctuation analysis method and device.
Background
With the development of artificial intelligence technology, emotion fluctuation analysis is applied in more and more commercial scenarios, for example to track the emotion fluctuation of both parties when a customer service agent talks with a client. In the prior art, emotion fluctuation analysis of audio generally relies only on the acoustic signal, such as intonation and the frequency and amplitude variation of the sound waves. This analysis mode is one-sided, and because the acoustic signals of different people differ, using the acoustic signal alone yields low emotion analysis accuracy.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for analyzing speech emotion fluctuation, a computer device, and a computer-readable storage medium, which are used for solving the problem of low accuracy in analyzing emotion fluctuation.
The embodiment of the invention solves the technical problems through the following technical scheme:
a speech emotion fluctuation analysis method includes:
acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
identifying the second audio features to obtain an audio emotion identification result; identifying the second character features to obtain character emotion identification results;
and carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
Further, the acquiring a first audio feature and a first character feature of the voice data to be detected includes:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
Further, the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
Further, the acquiring of the first audio feature and the first character feature of the voice data to be detected further includes:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively carrying out word vector mapping on the L participles to obtain a d-dimensional word vector matrix corresponding to the L participles, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is a first character characteristic of the voice data to be detected.
Further, the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
Further, the method further comprises:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
Further, the fusion processing of the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and the sending of the emotion recognition result to the associated terminal includes:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
In order to achieve the above object, an embodiment of the present invention further provides a speech emotion analyzing apparatus, including:
the first voice feature acquisition module is used for acquiring a first audio feature and a first character feature of the voice data to be detected;
the second voice feature extraction module is used for extracting a second audio feature in the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
the voice feature recognition module is used for recognizing the second audio features and acquiring an audio emotion recognition result; identifying the second character features to obtain character emotion identification results;
and the recognition result acquisition module is used for carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and sending the emotion recognition result to the associated terminal.
In order to achieve the above object, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the speech emotion fluctuation analysis method as described above when executing the computer program.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the speech emotion fluctuation analysis method as described above.
According to the voice emotion fluctuation analysis method, the voice emotion fluctuation analysis device, the computer equipment and the computer readable storage medium, the voice emotion is analyzed through two channels, the voice emotion is analyzed through the audio acoustic rhythm, the emotion of a speaker is further judged through the speaking content, and therefore the emotion analysis accuracy is improved.
The invention is described in detail below with reference to the drawings and specific examples, but the invention is not limited thereto.
Drawings
FIG. 1 is a flowchart illustrating a method for analyzing speech emotion fluctuation according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of obtaining voice data to be tested;
FIG. 3 is a detailed flowchart of extracting a first audio feature from the voice data to be detected;
FIG. 4 is a detailed flowchart of extracting a first text feature from the voice data to be detected;
FIG. 5 is a flowchart illustrating the specific process of identifying the second audio feature and obtaining the audio emotion recognition result;
fig. 6 is a specific flowchart for identifying the second character feature and obtaining a character emotion identification result;
fig. 7 is a specific flowchart for performing fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal;
FIG. 8 is a schematic diagram of a second embodiment of a speech emotion analyzing apparatus according to the present invention;
FIG. 9 is a diagram of a hardware structure of a third embodiment of the computer apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Technical solutions of the various embodiments may be combined with each other, but only on the basis of what can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
Example one
Referring to fig. 1, a flowchart illustrating steps of a speech emotion analyzing method according to an embodiment of the present invention is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is given by taking a computer device as an execution subject, specifically as follows:
s100: acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
referring to fig. 2, the method for analyzing speech emotion fluctuation according to the embodiment of the present invention further includes:
s110: and acquiring voice data to be detected.
The acquiring of the voice data to be tested further comprises:
S110A: acquiring offline or online voice data;
specifically, the voice data includes online voice data and offline voice data, the online voice data refers to voice data obtained in real time in a call process, the offline voice data refers to call voice data stored in a system background, and the voice data to be tested is a recording file in a wav format.
S110B: and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
Specifically, after the voice data is acquired, the voice data to be detected is divided into multiple segments of first user voice data and second user voice data according to the silent portions of the call. An endpoint detection technique and a voice separation technique are used to remove the silent portions of the call; the start point and end point of each dialogue segment are marked based on a set duration threshold for the silence between speaking intervals, and the audio is cut and separated at these time points to obtain several short audio segments. A voiceprint recognition tool then marks the speaker identity and the speaking time of each short audio segment, and the identities are distinguished by numbers. The duration threshold is determined from empirical values; as an embodiment, the duration threshold of this scheme is 0.25 to 0.3 seconds.
The number includes, but is not limited to, the job number of the customer service, the landline number of the customer service, and the cell phone number of the customer.
Specifically, the voiceprint recognition tool is the LIUM_SpkDiarization toolkit, through which the first user voice data and the second user voice data are distinguished, for example as follows:
start_time end_time speaker
0 3 1
4 8 2
8.3 12.5 1
We naturally consider the person who speaks first to be the first user (i.e., speaker 1 in the table) and the person who speaks second to be the second user (i.e., speaker 2 in the table).
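As an illustration of how such diarization output can be used downstream, the sketch below cuts a call recording into per-speaker segments from a (start_time, end_time, speaker) table like the one above. The pydub library, the file name call.wav and the hard-coded segment list are assumptions made for the example; the patent itself only names the LIUM_SpkDiarization toolkit and does not prescribe any particular code.

# Minimal sketch: cut a call recording into per-speaker segments based on a
# diarization table. pydub, "call.wav" and the segment list are illustrative.
from pydub import AudioSegment

# (start_time, end_time, speaker) in seconds, as in the table above
segments = [(0, 3, 1), (4, 8, 2), (8.3, 12.5, 1)]

audio = AudioSegment.from_wav("call.wav")

for i, (start, end, speaker) in enumerate(segments):
    clip = audio[int(start * 1000):int(end * 1000)]  # pydub slices in milliseconds
    clip.export(f"segment_{i:03d}_speaker{speaker}.wav", format="wav")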
Referring to fig. 3, the acquiring the first audio feature of the to-be-detected speech data further includes:
S100A 1: performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
Specifically, the voice data signal has short-time stationarity, so it can be divided into frames to obtain a plurality of audio frames, where an audio frame is a set of N sampling points. In this embodiment, N is 256 or 512, covering 20 to 30 ms. After the audio frames are obtained, each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame, yielding the speech analysis frames.
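For concreteness, a minimal NumPy sketch of this framing and Hamming-windowing step is shown below; the frame length of 512 samples, the hop of 256 samples and the one-second 16 kHz placeholder signal are illustrative assumptions rather than values fixed by the patent.

# Minimal sketch: split a signal into overlapping frames and apply a Hamming window.
import numpy as np

def frame_and_window(signal, frame_len=512, hop=256):
    window = np.hamming(frame_len)                      # Hamming window of one frame length
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * window                              # each row is one speech analysis frame

speech = np.random.randn(16000)                         # placeholder: 1 s of audio at 16 kHz
print(frame_and_window(speech).shape)                   # (61, 512)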
S100B 1: carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
Specifically, because the characteristics of the voice data signal are difficult to observe in the time domain, the signal needs to be converted into an energy distribution in the frequency domain; the speech analysis frames are therefore subjected to a Fourier transform to obtain the frequency spectrum of each speech analysis frame.
S100C 1: the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
S100D 1: and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
Specifically, cepstrum analysis is performed on the mel frequency spectrum to obtain 36 1024-dimensional audio vectors, and the audio vectors are the first audio features of the voice data to be detected.
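Steps S100A1 to S100D1 together correspond to the standard Mel-frequency cepstral coefficient (MFCC) extraction pipeline. A minimal sketch using the librosa library is shown below; librosa itself, the 16 kHz sample rate and the choice of 36 coefficients per frame are assumptions made for illustration, since the patent describes the steps but does not name a library.

# Minimal sketch: framing/windowing -> FFT -> Mel filter bank -> cepstral analysis,
# performed in one call by librosa. Parameters are illustrative.
import librosa

y, sr = librosa.load("segment.wav", sr=16000)  # one short audio segment from the separation step

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=36,          # number of cepstral coefficients per frame (illustrative)
    n_fft=512,          # frame length in samples, matching the 256/512-sample frames above
    hop_length=256,     # frame shift
    window="hamming",   # Hamming windowing as described above
)
print(mfcc.shape)        # (36, number_of_frames) -> used as the first audio feature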
Referring to fig. 4, the acquiring the first text feature of the voice data to be detected further includes:
S100A 2: converting the voice data to be tested into characters;
Specifically, the multiple segments of first user voice data and second user voice data are converted into characters by using a voice dictation interface. As an embodiment, the dictation interface is the iFlytek voice dictation interface.
S100B 2: performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
Specifically, the word segmentation is completed through a dictionary-based word segmentation algorithm, which includes but is not limited to the forward maximum matching method, the reverse maximum matching method, and the bidirectional matching method; it may also be based on hidden Markov models (HMM), conditional random fields (CRF), support vector machines (SVM), or deep learning algorithms.
S100C 2: and respectively carrying out word vector mapping on the L participles to obtain a d-dimensional word vector matrix corresponding to the L participles, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is a first character characteristic of the voice data to be detected.
Specifically, a 128-dimensional word vector of each participle is obtained through word2vec and other models.
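As an illustration of steps S100A2 to S100C2, the sketch below segments a transcribed sentence with jieba and maps each word segment to a 128-dimensional vector with gensim's word2vec. jieba, gensim and the toy one-sentence training corpus are assumptions for the example, since the patent refers only generically to dictionary word segmentation algorithms and to "word2vec and other models".

# Minimal sketch: word segmentation plus word-vector mapping into an L x d matrix.
import jieba
import numpy as np
from gensim.models import Word2Vec

text = "这个服务太让人失望了"                    # illustrative transcription of one segment
tokens = jieba.lcut(text)                       # L word segments

# In practice the word2vec model would be trained on a large corpus; a toy model
# is trained here only so that the sketch is self-contained.
w2v = Word2Vec([tokens], vector_size=128, min_count=1)          # d = 128

matrix = np.stack([w2v.wv[t] for t in tokens])  # shape (L, 128): the first character feature
print(matrix.shape)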
S102: extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; and extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model.
Specifically, the second audio feature and the second character feature are semantic feature vectors extracted from the first audio feature and the first character feature by the feature extraction networks of the emotion recognition models; they have fewer dimensions and focus more on the words that express emotion. By extracting the second audio feature and the second character feature, the models can learn better and the final classification accuracy is higher.
S104: identifying the second audio features to obtain an audio emotion identification result; and identifying the second character features to obtain a character emotion identification result.
Specifically, the audio emotion recognition result is obtained by inputting the audio features into an audio recognition model, and the character emotion recognition result is obtained by inputting the character features into a character recognition model. Both the audio recognition model and the character recognition model comprise a feature extraction network and a classification network: the feature extraction network extracts lower-dimensional semantic feature vectors, namely the second audio feature and the second character feature, from the first audio feature and the first character feature, and the classification network outputs the confidence of each preset emotion category, where the preset emotion categories can be divided according to business requirements, such as positive and negative. The character emotion recognition model is a deep neural network model comprising an embedding layer and a long short-term memory (LSTM) layer, and the audio emotion recognition model is a neural network model comprising a self-attention layer and a bidirectional long short-term memory layer (forward LSTM and backward LSTM).
The long short-term memory network handles sequence dependencies over long spans and is suitable for tasks involving dependencies across long text.
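A minimal Keras sketch of the two model structures described above is given below: an Embedding layer followed by an LSTM layer for the character emotion recognition model, and a self-attention layer followed by a bidirectional LSTM for the audio emotion recognition model. All layer sizes, the vocabulary size and the sequence lengths are illustrative assumptions; only the layer types are taken from the description.

# Minimal sketch of the two emotion recognition models (sizes are illustrative).
from tensorflow.keras import layers, Model

# Character emotion recognition model: Embedding layer + LSTM layer + classifier.
txt_in = layers.Input(shape=(50,), name="token_ids")           # up to 50 word segments
x = layers.Embedding(input_dim=20000, output_dim=128)(txt_in)  # embedding layer
x = layers.LSTM(64)(x)                                         # character feature extraction network
txt_out = layers.Dense(2, activation="softmax")(x)             # character classification network (positive/negative)
text_model = Model(txt_in, txt_out)

# Audio emotion recognition model: self-attention + bidirectional LSTM + classifier.
aud_in = layers.Input(shape=(36, 1024), name="audio_vectors")  # 36 x 1024-dimensional first audio feature
a = layers.MultiHeadAttention(num_heads=2, key_dim=64)(aud_in, aud_in)  # self-attention layer
a = layers.Bidirectional(layers.LSTM(64))(a)                   # forward LSTM + backward LSTM
aud_out = layers.Dense(2, activation="softmax")(a)             # audio classification network
audio_model = Model(aud_in, aud_out)

text_model.summary()
audio_model.summary()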
Further, the embodiment of the present invention further includes training the audio recognition model and the character recognition model, where the training process includes:
acquiring a training set and a calibration set corresponding to the target field;
the method for acquiring the training set and the check set corresponding to the target field comprises the following steps:
acquiring voice data of a training set and a check set;
Specifically, the ways of acquiring the voice data for the training set and the check set include, but are not limited to, recordings from the company's call center, customer service recordings provided by clients, and customer service recordings purchased directly from a data platform.
Marking the emotion type of the recorded data;
Specifically, the labeling process is as follows: the pause time points of each recording are marked manually to obtain several short audio segments (dialogue segments) of each recording; each short audio segment is then labeled with an emotion tendency (i.e., positive emotion or negative emotion). In this embodiment, the audio annotation tool audio-annotator is used to label the start and end time points and the emotion of each audio segment.
Separating a training set and a check set;
Specifically, the process of separating the training set and the check set includes: randomly shuffling all labeled audio segment samples and then dividing them into two data sets in a ratio of 4:1, where the larger part is used for model training (the training set) and the smaller part for model verification (the check set); a code sketch of this step is given after the training procedure below.
Adjusting the voice emotion recognition model and the character emotion recognition model based on the emotion types of the training set;
and testing the voice emotion recognition model and the character emotion recognition model with the check set to determine the accuracy of the voice emotion recognition model and the character emotion recognition model.
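As referenced above, the following is a minimal sketch of the shuffle-and-split step that divides the labeled audio segments into a training set and a check set in a 4:1 ratio; the placeholder sample list is an assumption for the example.

# Minimal sketch: random shuffle followed by a 4:1 split into training and check sets.
import random

labeled_segments = [(f"seg_{i:03d}.wav", random.choice(["positive", "negative"]))
                    for i in range(100)]       # placeholder labeled samples

random.shuffle(labeled_segments)               # randomly disorder all labeled samples
split = len(labeled_segments) * 4 // 5         # 4:1 ratio
train_set = labeled_segments[:split]           # larger part: model training
check_set = labeled_segments[split:]           # smaller part: model verification
print(len(train_set), len(check_set))          # 80 20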
Referring to fig. 5, the identifying the second audio feature and obtaining the audio emotion recognition result further includes:
S104A 1: identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring a plurality of audio emotion classifications and first confidence degrees corresponding to the audio emotion classifications;
and inputting the extracted second audio features into an audio classification network in the audio recognition model, and analyzing the second audio features by a classification network layer to obtain a plurality of audio emotion classifications corresponding to the second audio features and a first confidence coefficient corresponding to each audio emotion classification. For example, the first confidence of "positive emotions" is 0.3, and the first confidence of "negative emotions" is 0.7.
S104B 1: and selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter.
Correspondingly, the target audio emotion is classified as "negative emotion", and the target audio emotion classification parameter is 0.7.
S104C 1: and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
Numerical value mapping means that the original output result is mapped to a specific numerical value according to the emotion type, which makes it convenient to observe emotion fluctuation later. In an embodiment, the emotion classification is mapped to a specific number through a functional relation: after the first confidence of each preset emotion classification of the voice data to be detected is obtained, the target audio emotion classification vector parameter X corresponding to the emotion classification with the highest confidence is selected, and the finally output audio emotion recognition result Y is calculated using the following formula.
In this embodiment, the numerical mapping relationship is: when the recognized emotion type is "positive", Y = 0.5X; when the recognized emotion type is "negative", Y = 0.5(1 + X). The finally output audio emotion recognition result is therefore a floating-point number between 0 and 1.
Specifically, the final output audio emotion recognition result is 0.85.
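The numerical mapping and the example results of this embodiment can be reproduced with a few lines of Python, shown below as a sketch of the formula Y = 0.5X for "positive" and Y = 0.5(1 + X) for "negative".

# Minimal sketch of the numerical mapping used in this embodiment.
def map_emotion(label, confidence):
    if label == "positive":
        return 0.5 * confidence          # Y = 0.5X
    return 0.5 * (1 + confidence)        # Y = 0.5(1 + X) for "negative"

print(map_emotion("negative", 0.7))      # 0.85, the audio emotion recognition result above
print(map_emotion("negative", 0.8))      # 0.9, the character emotion recognition result below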
Referring to fig. 6, recognizing the second text feature, and obtaining a text emotion recognition result further includes:
S104A 2: and identifying the second character features based on a character classification network in a pre-trained character identification model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors.
The extracted second character features are input into the character classification network of the character recognition model, and the classification network analyzes them to obtain several character emotion classifications corresponding to the second character features and the second confidence corresponding to each character emotion classification. For example, the second confidence of "positive emotion" is 0.2, and the second confidence of "negative emotion" is 0.8.
S104B 2: and selecting the audio emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter.
Correspondingly, the target character emotion classification is "negative emotion", and the target character emotion classification parameter is 0.8.
S104C 2: and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
Specifically, the final output character emotion recognition result is 0.9.
And S106, carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
Referring to fig. 7, the step S106 may further include:
S106A: weighting the audio emotion recognition result and the character emotion recognition result of each segment of the first user voice data to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each segment of the second user voice data to obtain a second emotion value;
specifically, two emotion values of the same audio segment are processed by a numerical value weighting method, wherein the emotion values are floating point numbers between 0 and 1, the emotion is more negative when the emotion values are closer to 1, and the emotion values are more positive when the emotion values are closer to 0.
As an example, the weight of the emotion value obtained by the speech emotion recognition channel is 0.7; the weight of the emotion value obtained by the character emotion recognition channel is 0.3.
Further, the final output emotion value is 0.865, as described in the above embodiment.
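The weighted fusion of this embodiment (weight 0.7 for the audio channel and 0.3 for the character channel) can likewise be sketched in a few lines; the function below reproduces the 0.865 value from the audio result 0.85 and the character result 0.9.

# Minimal sketch of the weighted fusion of the two channel results for one segment.
AUDIO_WEIGHT, TEXT_WEIGHT = 0.7, 0.3

def fuse(audio_score, text_score):
    return AUDIO_WEIGHT * audio_score + TEXT_WEIGHT * text_score

print(round(fuse(0.85, 0.9), 3))         # 0.865: the emotion value of this segment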
S106B: generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;
Specifically, each segment of the voice to be detected is numbered and plotted in the emotion value heat map in chronological order, and the heat map aggregates the emotion of each time segment.
Specifically, the emotion value heat map is plotted using the heatmap function of Python's seaborn library, with different colors representing different emotions; for example, the more positive the emotion, the darker the color.
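A minimal sketch of such a plot with seaborn's heatmap function is shown below; the per-segment emotion values, the color map and the output file name are illustrative assumptions, since the patent only specifies that the heatmap function of the seaborn library is used.

# Minimal sketch: draw an emotion value heat map for one speaker with seaborn.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

first_user_values = [0.2, 0.35, 0.865, 0.7, 0.4]    # one emotion value per dialogue segment, in time order
data = np.array(first_user_values).reshape(1, -1)   # one row, one column per time segment

sns.heatmap(data, vmin=0.0, vmax=1.0, cmap="RdYlGn_r", annot=True,
            xticklabels=[f"seg {i + 1}" for i in range(data.shape[1])],
            yticklabels=["first user"])
plt.title("Emotion value heat map")
plt.savefig("first_user_heatmap.png")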
S106C: sending the first emotion value heat map and the second emotion value heat map to the associated terminal.
Specifically, the association terminal includes a first user terminal and a second user terminal, and as an embodiment, when the first user and the second user are a client and a customer service respectively, the association terminal includes a customer service quality supervision and management terminal and a customer service superior terminal in addition to the client and the customer service terminal, so as to supervise and correct the service quality of the customer service.
The embodiment of the invention analyzes voice emotion through two channels: the voice emotion is analyzed through the acoustic prosody of the audio, and the emotion of the speaker is further judged from the speaking content, which improves the accuracy of emotion analysis. Combined with the dialogue separation technique, the emotion value of each dialogue segment is analyzed, so that the speaker's emotion in each time period of the complete conversation is obtained and the speaker's emotion fluctuation can be further analyzed. This provides a visualized reference for customer service quality inspection, makes the evaluation result more objective, and ultimately helps enterprises improve customer service quality and the customer experience.
Example two
With continued reference to fig. 8, a schematic diagram of program modules of the speech emotion analyzing apparatus according to the present invention is shown. In the present embodiment, the speech emotion analyzing apparatus 20 may include or be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to implement the present invention and implement the speech emotion analyzing method described above. The program module referred to in the embodiments of the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable for describing the execution process of the speech emotion fluctuation analysis apparatus 20 in the storage medium than the program itself. The following description will specifically describe the functions of the program modules of the present embodiment:
the first voice feature obtaining module 200 is configured to obtain a first audio feature and a first text feature of the voice data to be detected.
Further, the first speech feature obtaining module 200 is further configured to:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
The second speech feature extraction module 202 is configured to extract a second audio feature from the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model, and to extract a second character feature from the first character feature based on a character feature extraction network in a pre-trained character recognition model.
Further, the second speech feature extraction module 202 is further configured to:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
Further, the second speech feature extraction module 202 is further configured to:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively carrying out word vector mapping on the L participles to obtain a d-dimensional word vector matrix corresponding to the L participles, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is a first character characteristic of the voice data to be detected.
The speech feature recognition module 204 is configured to recognize the second audio feature to obtain an audio emotion recognition result, and to recognize the second character feature to obtain a character emotion recognition result.
Further, the speech feature recognition module 204 is further configured to:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
Further, the speech feature recognition module 204 is further configured to:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
The recognition result acquisition module 206 is configured to fuse the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and send the emotion recognition result to the associated terminal.
Further, the recognition result obtaining module 206 is further configured to:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
EXAMPLE III
Fig. 9 is a schematic diagram of a hardware architecture of a computer device according to a third embodiment of the present invention. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing in accordance with a preset or stored instruction. The computer device 2 may be a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), and the like. As shown in fig. 9, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a speech emotion analyzing device 20, which are communicatively connected to each other via a system bus. Wherein:
in this embodiment, the memory 21 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the storage 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device 2. Of course, the memory 21 may also comprise both internal and external memory units of the computer device 2. In this embodiment, the memory 21 is generally used for storing various application software and operating system devices installed in the computer device 2, such as the program codes of the speech emotion analyzing device 20 in the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In the present embodiment, the processor 22 is configured to run the program code stored in the memory 21 or process data, for example, run the speech emotion analyzing apparatus 20, so as to implement the speech emotion analyzing method of the above-described embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is generally used for establishing communication connection between the computer device 2 and other electronic system devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), Wi-Fi, and the like.
It is noted that fig. 9 only shows the computer device 2 with components 20-23, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
In this embodiment, the speech emotion analyzing apparatus 20 stored in the memory 21 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention.
For example, fig. 8 shows a schematic diagram of program modules of a second embodiment of implementing the speech emotion fluctuation analysis apparatus 20, in this embodiment, the speech emotion fluctuation analysis apparatus 20 can be divided into a first speech feature acquisition module 200, a second speech feature extraction module 202, a speech feature recognition module 204, and a recognition result acquisition module 206. The program module referred to in the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable than a program for describing the execution process of the speech emotion fluctuation analysis device 20 in the computer device 2. The specific functions of the program modules, i.e., the first speech feature obtaining module 200 and the recognition result obtaining module 206, have been described in detail in the second embodiment, and are not described herein again.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of the present embodiment is used for storing a speech emotion fluctuation analysis device 20, and when executed by a processor, implements the speech emotion fluctuation analysis method of the above-described embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A speech emotion fluctuation analysis method is characterized by comprising the following steps:
acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
identifying the second audio features to obtain an audio emotion identification result; identifying the second character features to obtain character emotion identification results;
and carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
2. The method for analyzing speech emotion fluctuation according to claim 1, wherein the obtaining of the first audio feature and the first text feature of the speech data to be detected includes:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
3. The voice emotion analyzing method according to claim 2, wherein the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
4. The method for analyzing speech emotion fluctuation according to claim 1, wherein the obtaining of the first audio feature and the first text feature of the speech data to be tested further includes:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively carrying out word vector mapping on the L participles to obtain a d-dimensional word vector matrix corresponding to the L participles, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is a first character characteristic of the voice data to be detected.
5. The method for analyzing speech emotion fluctuation according to claim 4, wherein the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
6. The speech emotion fluctuation analysis method of claim 1, wherein the method further comprises:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
7. The voice emotion fluctuation analysis method of claim 6, wherein the fusing the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and the sending the emotion recognition result to the association terminal includes:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
8. A speech emotion fluctuation analysis apparatus, comprising:
the first voice feature acquisition module is used for acquiring a first audio feature and a first character feature of the voice data to be detected;
the second voice feature extraction module is used for extracting a second audio feature in the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
the voice feature recognition module is used for recognizing the second audio features and acquiring an audio emotion recognition result; identifying the second character features to obtain character emotion identification results;
and the recognition result acquisition module is used for carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and sending the emotion recognition result to the associated terminal.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the speech emotion fluctuation analysis method according to any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which is executable by at least one processor to cause the at least one processor to perform the steps of the speech emotion fluctuation analysis method according to any one of claims 1 to 7.
CN201911341679.XA 2019-12-24 2019-12-24 Voice emotion fluctuation analysis method and device Pending CN111081279A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911341679.XA CN111081279A (en) 2019-12-24 2019-12-24 Voice emotion fluctuation analysis method and device
PCT/CN2020/094338 WO2021128741A1 (en) 2019-12-24 2020-06-04 Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911341679.XA CN111081279A (en) 2019-12-24 2019-12-24 Voice emotion fluctuation analysis method and device

Publications (1)

Publication Number Publication Date
CN111081279A true CN111081279A (en) 2020-04-28

Family

ID=70317032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341679.XA Pending CN111081279A (en) 2019-12-24 2019-12-24 Voice emotion fluctuation analysis method and device

Country Status (2)

Country Link
CN (1) CN111081279A (en)
WO (1) WO2021128741A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373455A (en) * 2021-12-08 2022-04-19 北京声智科技有限公司 Emotion recognition method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305641B (en) * 2017-06-30 2020-04-07 腾讯科技(深圳)有限公司 Method and device for determining emotion information
CN111081279A (en) * 2019-12-24 2020-04-28 深圳壹账通智能科技有限公司 Voice emotion fluctuation analysis method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779510A (en) * 2012-07-19 2012-11-14 东南大学 Speech emotion recognition method based on feature space self-adaptive projection
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 Multi-modal fusion song emotion recognition method based on deep learning
CN108305643A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 Method and apparatus for determining emotion information
CN108305642A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 Method and apparatus for determining emotion information
US20190325897A1 (en) * 2018-04-21 2019-10-24 International Business Machines Corporation Quantifying customer care utilizing emotional assessments
CN110390956A (en) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 Emotion recognition network model, method and electronic equipment

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021128741A1 (en) * 2019-12-24 2021-07-01 深圳壹账通智能科技有限公司 Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN111739559A (en) * 2020-05-07 2020-10-02 北京捷通华声科技股份有限公司 Speech early warning method, device, equipment and storage medium
CN111739559B (en) * 2020-05-07 2023-02-28 北京捷通华声科技股份有限公司 Speech early warning method, device, equipment and storage medium
CN111916112A (en) * 2020-08-19 2020-11-10 浙江百应科技有限公司 Emotion recognition method based on voice and characters
CN111938674A (en) * 2020-09-07 2020-11-17 南京宇乂科技有限公司 Emotion recognition control system for conversation
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN112215927B (en) * 2020-09-18 2023-06-23 腾讯科技(深圳)有限公司 Face video synthesis method, device, equipment and medium
CN112100337A (en) * 2020-10-15 2020-12-18 平安科技(深圳)有限公司 Emotion recognition method and device in interactive conversation
CN112100337B (en) * 2020-10-15 2024-03-05 平安科技(深圳)有限公司 Emotion recognition method and device in interactive dialogue
CN112527994A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium
CN112837702A (en) * 2020-12-31 2021-05-25 萨孚凯信息系统(无锡)有限公司 Voice emotion distributed system and voice signal processing method
CN112911072A (en) * 2021-01-28 2021-06-04 携程旅游网络技术(上海)有限公司 Call center volume identification method and device, electronic equipment and storage medium
CN113053409A (en) * 2021-03-12 2021-06-29 科大讯飞股份有限公司 Audio evaluation method and device
CN113053409B (en) * 2021-03-12 2024-04-12 科大讯飞股份有限公司 Audio evaluation method and device
CN113129927A (en) * 2021-04-16 2021-07-16 平安科技(深圳)有限公司 Voice emotion recognition method, device, equipment and storage medium
CN113129927B (en) * 2021-04-16 2023-04-07 平安科技(深圳)有限公司 Voice emotion recognition method, device, equipment and storage medium
CN114049902A (en) * 2021-10-27 2022-02-15 广东万丈金数信息技术股份有限公司 Alibaba Cloud-based recording upload, recognition and emotion analysis method and system
WO2023246076A1 (en) * 2022-06-24 2023-12-28 上海哔哩哔哩科技有限公司 Emotion category recognition method, apparatus, storage medium and electronic device
CN115430155A (en) * 2022-09-06 2022-12-06 北京中科心研科技有限公司 Team cooperation capability assessment method and system based on audio analysis
CN117688344A (en) * 2024-02-04 2024-03-12 北京大学 Multi-mode fine granularity trend analysis method and system based on large model
CN117688344B (en) * 2024-02-04 2024-05-07 北京大学 Multi-mode fine granularity trend analysis method and system based on large model

Also Published As

Publication number Publication date
WO2021128741A1 (en) 2021-07-01

Similar Documents

Publication Publication Date Title
CN111081279A (en) Voice emotion fluctuation analysis method and device
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
US9536547B2 (en) Speaker change detection device and speaker change detection method
US10878823B2 (en) Voiceprint recognition method, device, terminal apparatus and storage medium
US10388279B2 (en) Voice interaction apparatus and voice interaction method
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN109256150B (en) Speech emotion recognition system and method based on machine learning
US9368116B2 (en) Speaker separation in diarization
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
US20180122377A1 (en) Voice interaction apparatus and voice interaction method
CN111785275A (en) Voice recognition method and device
CN110390946A (en) A kind of audio signal processing method, device, electronic equipment and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN108899033B (en) Method and device for determining speaker characteristics
US11837236B2 (en) Speaker recognition based on signal segments weighted by quality
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
CN106710588B (en) Speech data sentence recognition method, device and system
CN110556098A (en) voice recognition result testing method and device, computer equipment and medium
CN118035411A (en) Customer service voice quality inspection method, customer service voice quality inspection device, customer service voice quality inspection equipment and storage medium
CN116741155A (en) Speech recognition method, training method, device and equipment of speech recognition model
CN111933153B (en) Voice segmentation point determining method and device
CN111326161B (en) Voiceprint determining method and device
CN113421552A (en) Audio recognition method and device
CN114446284A (en) Speaker log generation method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20240209)