CN111081279A - Voice emotion fluctuation analysis method and device - Google Patents

Voice emotion fluctuation analysis method and device

Info

Publication number
CN111081279A
CN111081279A (application number CN201911341679.XA)
Authority
CN
China
Prior art keywords
emotion
audio
character
recognition result
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911341679.XA
Other languages
Chinese (zh)
Inventor
朱锦祥
单以磊
臧磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN201911341679.XA priority Critical patent/CN111081279A/en
Publication of CN111081279A publication Critical patent/CN111081279A/en
Priority to PCT/CN2020/094338 priority patent/WO2021128741A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a voice emotion fluctuation analysis method, which comprises the following steps: acquiring a first audio feature and a first character feature of voice data to be detected; extracting a second audio feature from the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character feature based on a character feature extraction network in a pre-trained character recognition model; recognizing the second audio feature to obtain an audio emotion recognition result; recognizing the second character feature to obtain a character emotion recognition result; and fusing the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal. By combining dual-channel voice emotion recognition with the drawing of emotion value heat maps, the method provides a visualized reference for customer service quality inspection, makes the evaluation result more objective, helps enterprises improve customer service quality, and improves the customer experience.

Description

Voice emotion fluctuation analysis method and device
Technical Field
The invention relates to the technical field of internet, in particular to a voice emotion fluctuation analysis method and device.
Background
With the development of artificial intelligence technology, emotion fluctuation analysis is applied in more and more commercial scenarios, for example to track the emotion fluctuation of both parties when a customer service agent talks with a client. In the prior art, emotion fluctuation analysis of audio generally relies only on the acoustic signal, such as intonation and the frequency and amplitude variation of the sound waves. This analysis mode is one-sided, and because the acoustic signals of different people differ, using the acoustic signal alone yields low emotion analysis accuracy.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for analyzing speech emotion fluctuation, a computer device, and a computer-readable storage medium, which are used for solving the problem of low accuracy in analyzing emotion fluctuation.
The embodiment of the invention solves the technical problems through the following technical scheme:
a speech emotion fluctuation analysis method includes:
acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
identifying the second audio features to obtain an audio emotion identification result; identifying the second character features to obtain character emotion identification results;
and carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
Further, the acquiring a first audio feature and a first character feature of the voice data to be detected includes:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
Further, the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
Further, the acquiring of the first audio feature and the first character feature of the voice data to be detected further includes:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively carrying out word vector mapping on the L participles to obtain a d-dimensional word vector matrix corresponding to the L participles, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is a first character characteristic of the voice data to be detected.
Further, the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
Further, the method further comprises:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
Further, the fusion processing of the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and the sending of the emotion recognition result to the associated terminal includes:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
In order to achieve the above object, an embodiment of the present invention further provides a speech emotion analyzing apparatus, including:
the first voice feature acquisition module is used for acquiring a first audio feature and a first character feature of the voice data to be detected;
the second voice feature extraction module is used for extracting a second audio feature in the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
the voice feature recognition module is used for recognizing the second audio features and acquiring an audio emotion recognition result; identifying the second character features to obtain character emotion identification results;
and the recognition result acquisition module is used for carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and sending the emotion recognition result to the associated terminal.
In order to achieve the above object, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the speech emotion fluctuation analysis method as described above when executing the computer program.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the speech emotion fluctuation analysis method as described above.
According to the voice emotion fluctuation analysis method, the voice emotion fluctuation analysis device, the computer equipment and the computer readable storage medium, the voice emotion is analyzed through two channels, the voice emotion is analyzed through the audio acoustic rhythm, the emotion of a speaker is further judged through the speaking content, and therefore the emotion analysis accuracy is improved.
The invention is described in detail below with reference to the drawings and specific examples, but the invention is not limited thereto.
Drawings
FIG. 1 is a flowchart illustrating a method for analyzing speech emotion fluctuation according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of obtaining voice data to be tested;
FIG. 3 is a detailed flowchart of extracting a first audio feature from the voice data to be detected;
FIG. 4 is a detailed flowchart of extracting a first text feature from the voice data to be detected;
FIG. 5 is a flowchart illustrating the specific process of identifying the second audio feature and obtaining the audio emotion recognition result;
fig. 6 is a specific flowchart for identifying the second character feature and obtaining a character emotion identification result;
fig. 7 is a specific flowchart for performing fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal;
FIG. 8 is a schematic diagram of a second embodiment of a speech emotion analyzing apparatus according to the present invention;
FIG. 9 is a diagram of a hardware structure of a third embodiment of the computer apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Technical solutions of the various embodiments may be combined with each other, but only on the basis of what can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
Example one
Referring to fig. 1, a flowchart illustrating steps of a speech emotion analyzing method according to an embodiment of the present invention is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is given by taking a computer device as an execution subject, specifically as follows:
s100: acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
referring to fig. 2, the method for analyzing speech emotion fluctuation according to the embodiment of the present invention further includes:
s110: and acquiring voice data to be detected.
The acquiring of the voice data to be tested further comprises:
S110A: acquiring offline or online voice data;
specifically, the voice data includes online voice data and offline voice data, the online voice data refers to voice data obtained in real time in a call process, the offline voice data refers to call voice data stored in a system background, and the voice data to be tested is a recording file in a wav format.
S110B: and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
Specifically, after the voice data is acquired, the voice data to be detected is divided into multiple segments of first user voice data and second user voice data according to the silent portions of the call. An endpoint detection technique and a voice separation technique are used to remove the silent portions of the call; the start point and end point of each dialogue segment are marked based on a set duration threshold for the silence between speaking intervals, and the audio is cut and separated at these time points to obtain several short audio segments. A voiceprint recognition tool then marks the speaker identity and the speaking time of each short audio segment, and the identities are distinguished by numbers. The duration threshold is determined from empirical values; as an embodiment, the duration threshold of this scheme is 0.25 to 0.3 seconds.
The number includes, but is not limited to, the job number of the customer service, the landline number of the customer service, and the cell phone number of the customer.
Specifically, the voiceprint recognition tool is the LIUM_SpkDiarization toolkit, through which the first user voice data and the second user voice data are distinguished, for example as follows:
start_time end_time speaker
0 3 1
4 8 2
8.3 12.5 1
We naturally consider the person who speaks first to be the first user (i.e., speaker 1 in the table) and the person who speaks second to be the second user (i.e., speaker 2 in the table).
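As an illustration of how such diarization output can be used downstream, the sketch below cuts a call recording into per-speaker segments from a (start_time, end_time, speaker) table like the one above. The pydub library, the file name call.wav and the hard-coded segment list are assumptions made for the example; the patent itself only names the LIUM_SpkDiarization toolkit and does not prescribe any particular code.

# Minimal sketch: cut a call recording into per-speaker segments based on a
# diarization table. pydub, "call.wav" and the segment list are illustrative.
from pydub import AudioSegment

# (start_time, end_time, speaker) in seconds, as in the table above
segments = [(0, 3, 1), (4, 8, 2), (8.3, 12.5, 1)]

audio = AudioSegment.from_wav("call.wav")

for i, (start, end, speaker) in enumerate(segments):
    clip = audio[int(start * 1000):int(end * 1000)]  # pydub slices in milliseconds
    clip.export(f"segment_{i:03d}_speaker{speaker}.wav", format="wav")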
Referring to fig. 3, the acquiring the first audio feature of the to-be-detected speech data further includes:
S100A 1: performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
Specifically, the voice data signal has short-time stationarity, so it can be divided into frames to obtain a plurality of audio frames, where an audio frame is a set of N sampling points. In this embodiment, N is 256 or 512, covering 20 to 30 ms. After the audio frames are obtained, each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame, yielding the speech analysis frames.
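For concreteness, a minimal NumPy sketch of this framing and Hamming-windowing step is shown below; the frame length of 512 samples, the hop of 256 samples and the one-second 16 kHz placeholder signal are illustrative assumptions rather than values fixed by the patent.

# Minimal sketch: split a signal into overlapping frames and apply a Hamming window.
import numpy as np

def frame_and_window(signal, frame_len=512, hop=256):
    window = np.hamming(frame_len)                      # Hamming window of one frame length
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * window                              # each row is one speech analysis frame

speech = np.random.randn(16000)                         # placeholder: 1 s of audio at 16 kHz
print(frame_and_window(speech).shape)                   # (61, 512)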
S100B 1: carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
Specifically, because the characteristics of the voice data signal are difficult to observe in the time domain, the signal needs to be converted into an energy distribution in the frequency domain; the speech analysis frames are therefore subjected to a Fourier transform to obtain the frequency spectrum of each speech analysis frame.
S100C 1: the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
S100D 1: and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
Specifically, cepstrum analysis is performed on the mel frequency spectrum to obtain 36 1024-dimensional audio vectors, and the audio vectors are the first audio features of the voice data to be detected.
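Steps S100A1 to S100D1 together correspond to the standard Mel-frequency cepstral coefficient (MFCC) extraction pipeline. A minimal sketch using the librosa library is shown below; librosa itself, the 16 kHz sample rate and the choice of 36 coefficients per frame are assumptions made for illustration, since the patent describes the steps but does not name a library.

# Minimal sketch: framing/windowing -> FFT -> Mel filter bank -> cepstral analysis,
# performed in one call by librosa. Parameters are illustrative.
import librosa

y, sr = librosa.load("segment.wav", sr=16000)  # one short audio segment from the separation step

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=36,          # number of cepstral coefficients per frame (illustrative)
    n_fft=512,          # frame length in samples, matching the 256/512-sample frames above
    hop_length=256,     # frame shift
    window="hamming",   # Hamming windowing as described above
)
print(mfcc.shape)        # (36, number_of_frames) -> used as the first audio feature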
Referring to fig. 4, the acquiring the first text feature of the voice data to be detected further includes:
S100A 2: converting the voice data to be tested into characters;
Specifically, the multiple segments of first user voice data and second user voice data are converted into characters by using a voice dictation interface. As an embodiment, the dictation interface is the iFlytek voice dictation interface.
S100B 2: performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
Specifically, the word segmentation is completed through a dictionary-based word segmentation algorithm, which includes but is not limited to the forward maximum matching method, the reverse maximum matching method, and the bidirectional matching method; it may also be based on hidden Markov models (HMM), conditional random fields (CRF), support vector machines (SVM), or deep learning algorithms.
S100C 2: and respectively carrying out word vector mapping on the L participles to obtain a d-dimensional word vector matrix corresponding to the L participles, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is a first character characteristic of the voice data to be detected.
Specifically, a 128-dimensional word vector of each participle is obtained through word2vec and other models.
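As an illustration of steps S100A2 to S100C2, the sketch below segments a transcribed sentence with jieba and maps each word segment to a 128-dimensional vector with gensim's word2vec. jieba, gensim and the toy one-sentence training corpus are assumptions for the example, since the patent refers only generically to dictionary word segmentation algorithms and to "word2vec and other models".

# Minimal sketch: word segmentation plus word-vector mapping into an L x d matrix.
import jieba
import numpy as np
from gensim.models import Word2Vec

text = "这个服务太让人失望了"                    # illustrative transcription of one segment
tokens = jieba.lcut(text)                       # L word segments

# In practice the word2vec model would be trained on a large corpus; a toy model
# is trained here only so that the sketch is self-contained.
w2v = Word2Vec([tokens], vector_size=128, min_count=1)          # d = 128

matrix = np.stack([w2v.wv[t] for t in tokens])  # shape (L, 128): the first character feature
print(matrix.shape)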
S102: extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; and extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model.
Specifically, the second audio feature and the second character feature are semantic feature vectors extracted from the first audio feature and the first character feature by the feature extraction networks of the emotion recognition models; they have fewer dimensions and focus more on the words that express emotion. By extracting the second audio feature and the second character feature, the models can learn better and the final classification accuracy is higher.
S104: identifying the second audio features to obtain an audio emotion identification result; and identifying the second character features to obtain a character emotion identification result.
Specifically, the audio emotion recognition result is obtained by inputting the audio features into an audio recognition model, and the character emotion recognition result is obtained by inputting the character features into a character recognition model. Both the audio recognition model and the character recognition model comprise a feature extraction network and a classification network: the feature extraction network extracts lower-dimensional semantic feature vectors, namely the second audio feature and the second character feature, from the first audio feature and the first character feature, and the classification network outputs the confidence of each preset emotion category, where the preset emotion categories can be divided according to business requirements, such as positive and negative. The character emotion recognition model is a deep neural network model comprising an embedding layer and a long short-term memory (LSTM) layer, and the audio emotion recognition model is a neural network model comprising a self-attention layer and a bidirectional long short-term memory layer (forward LSTM and backward LSTM).
The long short-term memory network handles sequence dependencies over long spans and is suitable for tasks involving dependencies across long text.
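A minimal Keras sketch of the two model structures described above is given below: an Embedding layer followed by an LSTM layer for the character emotion recognition model, and a self-attention layer followed by a bidirectional LSTM for the audio emotion recognition model. All layer sizes, the vocabulary size and the sequence lengths are illustrative assumptions; only the layer types are taken from the description.

# Minimal sketch of the two emotion recognition models (sizes are illustrative).
from tensorflow.keras import layers, Model

# Character emotion recognition model: Embedding layer + LSTM layer + classifier.
txt_in = layers.Input(shape=(50,), name="token_ids")           # up to 50 word segments
x = layers.Embedding(input_dim=20000, output_dim=128)(txt_in)  # embedding layer
x = layers.LSTM(64)(x)                                         # character feature extraction network
txt_out = layers.Dense(2, activation="softmax")(x)             # character classification network (positive/negative)
text_model = Model(txt_in, txt_out)

# Audio emotion recognition model: self-attention + bidirectional LSTM + classifier.
aud_in = layers.Input(shape=(36, 1024), name="audio_vectors")  # 36 x 1024-dimensional first audio feature
a = layers.MultiHeadAttention(num_heads=2, key_dim=64)(aud_in, aud_in)  # self-attention layer
a = layers.Bidirectional(layers.LSTM(64))(a)                   # forward LSTM + backward LSTM
aud_out = layers.Dense(2, activation="softmax")(a)             # audio classification network
audio_model = Model(aud_in, aud_out)

text_model.summary()
audio_model.summary()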
Further, the embodiment of the present invention further includes training the audio recognition model and the character recognition model, where the training process includes:
acquiring a training set and a calibration set corresponding to the target field;
the method for acquiring the training set and the check set corresponding to the target field comprises the following steps:
acquiring voice data of a training set and a check set;
Specifically, the ways of acquiring the voice data for the training set and the check set include, but are not limited to, recordings from the company's call center, customer service recordings provided by clients, and customer service recordings purchased directly from a data platform.
Marking the emotion type of the recorded data;
Specifically, the labeling process is as follows: the pause time points of each recording are marked manually to obtain several short audio segments (dialogue segments) of each recording; each short audio segment is then labeled with an emotion tendency (i.e., positive emotion or negative emotion). In this embodiment, the audio annotation tool audio-annotator is used to label the start and end time points and the emotion of each audio segment.
Separating a training set and a check set;
Specifically, the process of separating the training set and the check set includes: randomly shuffling all labeled audio segment samples and then dividing them into two data sets in a ratio of 4:1, where the larger part is used for model training (the training set) and the smaller part for model verification (the check set); a code sketch of this step is given after the training procedure below.
Adjusting the voice emotion recognition model and the character emotion recognition model based on the emotion types of the training set;
and testing the voice emotion recognition model and the character emotion recognition model with the check set to determine the accuracy of the voice emotion recognition model and the character emotion recognition model.
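As referenced above, the following is a minimal sketch of the shuffle-and-split step that divides the labeled audio segments into a training set and a check set in a 4:1 ratio; the placeholder sample list is an assumption for the example.

# Minimal sketch: random shuffle followed by a 4:1 split into training and check sets.
import random

labeled_segments = [(f"seg_{i:03d}.wav", random.choice(["positive", "negative"]))
                    for i in range(100)]       # placeholder labeled samples

random.shuffle(labeled_segments)               # randomly disorder all labeled samples
split = len(labeled_segments) * 4 // 5         # 4:1 ratio
train_set = labeled_segments[:split]           # larger part: model training
check_set = labeled_segments[split:]           # smaller part: model verification
print(len(train_set), len(check_set))          # 80 20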
Referring to fig. 5, the identifying the second audio feature and obtaining the audio emotion recognition result further includes:
S104A 1: identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring a plurality of audio emotion classifications and first confidence degrees corresponding to the audio emotion classifications;
and inputting the extracted second audio features into an audio classification network in the audio recognition model, and analyzing the second audio features by a classification network layer to obtain a plurality of audio emotion classifications corresponding to the second audio features and a first confidence coefficient corresponding to each audio emotion classification. For example, the first confidence of "positive emotions" is 0.3, and the first confidence of "negative emotions" is 0.7.
S104B 1: and selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter.
Correspondingly, the target audio emotion is classified as "negative emotion", and the target audio emotion classification parameter is 0.7.
S104C 1: and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
Numerical value mapping means that the original output result is mapped to a specific numerical value according to the emotion type, which makes it convenient to observe emotion fluctuation later. In an embodiment, the emotion classification is mapped to a specific number through a functional relation: after the first confidence of each preset emotion classification of the voice data to be detected is obtained, the target audio emotion classification vector parameter X corresponding to the emotion classification with the highest confidence is selected, and the finally output audio emotion recognition result Y is calculated using the following formula.
In this embodiment, the numerical mapping relationship is: when the recognized emotion type is "positive", Y = 0.5X; when the recognized emotion type is "negative", Y = 0.5(1 + X). The finally output audio emotion recognition result is therefore a floating-point number between 0 and 1.
Specifically, the final output audio emotion recognition result is 0.85.
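The numerical mapping and the example results of this embodiment can be reproduced with a few lines of Python, shown below as a sketch of the formula Y = 0.5X for "positive" and Y = 0.5(1 + X) for "negative".

# Minimal sketch of the numerical mapping used in this embodiment.
def map_emotion(label, confidence):
    if label == "positive":
        return 0.5 * confidence          # Y = 0.5X
    return 0.5 * (1 + confidence)        # Y = 0.5(1 + X) for "negative"

print(map_emotion("negative", 0.7))      # 0.85, the audio emotion recognition result above
print(map_emotion("negative", 0.8))      # 0.9, the character emotion recognition result below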
Referring to fig. 6, recognizing the second text feature, and obtaining a text emotion recognition result further includes:
S104A 2: and identifying the second character features based on a character classification network in a pre-trained character identification model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors.
The extracted second character features are input into the character classification network of the character recognition model, and the classification network analyzes them to obtain several character emotion classifications corresponding to the second character features and the second confidence corresponding to each character emotion classification. For example, the second confidence of "positive emotion" is 0.2, and the second confidence of "negative emotion" is 0.8.
S104B 2: and selecting the audio emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter.
Correspondingly, the target character emotion classification is "negative emotion", and the target character emotion classification parameter is 0.8.
S104C 2: and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
Specifically, the final output character emotion recognition result is 0.9.
And S106, carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
Referring to fig. 7, the step S106 may further include:
S106A: weighting the audio emotion recognition result and the character emotion recognition result of each segment of the first user voice data to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each segment of the second user voice data to obtain a second emotion value;
specifically, two emotion values of the same audio segment are processed by a numerical value weighting method, wherein the emotion values are floating point numbers between 0 and 1, the emotion is more negative when the emotion values are closer to 1, and the emotion values are more positive when the emotion values are closer to 0.
As an example, the weight of the emotion value obtained by the speech emotion recognition channel is 0.7; the weight of the emotion value obtained by the character emotion recognition channel is 0.3.
Further, the final output emotion value is 0.865, as described in the above embodiment.
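The weighted fusion of this embodiment (weight 0.7 for the audio channel and 0.3 for the character channel) can likewise be sketched in a few lines; the function below reproduces the 0.865 value from the audio result 0.85 and the character result 0.9.

# Minimal sketch of the weighted fusion of the two channel results for one segment.
AUDIO_WEIGHT, TEXT_WEIGHT = 0.7, 0.3

def fuse(audio_score, text_score):
    return AUDIO_WEIGHT * audio_score + TEXT_WEIGHT * text_score

print(round(fuse(0.85, 0.9), 3))         # 0.865: the emotion value of this segment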
S106B: generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;
Specifically, each segment of the voice to be detected is numbered and plotted in the emotion value heat map in chronological order, and the heat map aggregates the emotion of each time segment.
Specifically, the emotion value heat map is plotted using the heatmap function of Python's seaborn library, with different colors representing different emotions; for example, the more positive the emotion, the darker the color.
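A minimal sketch of such a plot with seaborn's heatmap function is shown below; the per-segment emotion values, the color map and the output file name are illustrative assumptions, since the patent only specifies that the heatmap function of the seaborn library is used.

# Minimal sketch: draw an emotion value heat map for one speaker with seaborn.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

first_user_values = [0.2, 0.35, 0.865, 0.7, 0.4]    # one emotion value per dialogue segment, in time order
data = np.array(first_user_values).reshape(1, -1)   # one row, one column per time segment

sns.heatmap(data, vmin=0.0, vmax=1.0, cmap="RdYlGn_r", annot=True,
            xticklabels=[f"seg {i + 1}" for i in range(data.shape[1])],
            yticklabels=["first user"])
plt.title("Emotion value heat map")
plt.savefig("first_user_heatmap.png")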
S106C: sending the first emotion value heat map and the second emotion value heat map to the associated terminal.
Specifically, the association terminal includes a first user terminal and a second user terminal, and as an embodiment, when the first user and the second user are a client and a customer service respectively, the association terminal includes a customer service quality supervision and management terminal and a customer service superior terminal in addition to the client and the customer service terminal, so as to supervise and correct the service quality of the customer service.
The embodiment of the invention analyzes voice emotion through two channels: the voice emotion is analyzed through the acoustic prosody of the audio, and the emotion of the speaker is further judged from the speaking content, which improves the accuracy of emotion analysis. Combined with the dialogue separation technique, the emotion value of each dialogue segment is analyzed, so that the speaker's emotion in each time period of the complete conversation is obtained and the speaker's emotion fluctuation can be further analyzed. This provides a visualized reference for customer service quality inspection, makes the evaluation result more objective, and ultimately helps enterprises improve customer service quality and the customer experience.
Example two
With continued reference to fig. 8, a schematic diagram of program modules of the speech emotion analyzing apparatus according to the present invention is shown. In the present embodiment, the speech emotion analyzing apparatus 20 may include or be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to implement the present invention and implement the speech emotion analyzing method described above. The program module referred to in the embodiments of the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable for describing the execution process of the speech emotion fluctuation analysis apparatus 20 in the storage medium than the program itself. The following description will specifically describe the functions of the program modules of the present embodiment:
the first voice feature obtaining module 200 is configured to obtain a first audio feature and a first text feature of the voice data to be detected.
Further, the first speech feature obtaining module 200 is further configured to:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
The second speech feature extraction module 202 is configured to extract a second audio feature from the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model, and to extract a second character feature from the first character feature based on a character feature extraction network in a pre-trained character recognition model.
Further, the second speech feature extraction module 202 is further configured to:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
Further, the second speech feature extraction module 202 is further configured to:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively carrying out word vector mapping on the L participles to obtain a d-dimensional word vector matrix corresponding to the L participles, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is a first character characteristic of the voice data to be detected.
The speech feature recognition module 204 is configured to recognize the second audio feature to obtain an audio emotion recognition result, and to recognize the second character feature to obtain a character emotion recognition result.
Further, the speech feature recognition module 204 is further configured to:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
Further, the speech feature recognition module 204 is further configured to:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
The recognition result acquisition module 206 is configured to fuse the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and send the emotion recognition result to the associated terminal.
Further, the recognition result obtaining module 206 is further configured to:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
EXAMPLE III
Fig. 9 is a schematic diagram of a hardware architecture of a computer device according to a third embodiment of the present invention. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing in accordance with a preset or stored instruction. The computer device 2 may be a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), and the like. As shown in fig. 9, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a speech emotion analyzing device 20, which are communicatively connected to each other via a system bus. Wherein:
in this embodiment, the memory 21 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the storage 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device 2. Of course, the memory 21 may also comprise both internal and external memory units of the computer device 2. In this embodiment, the memory 21 is generally used for storing various application software and operating system devices installed in the computer device 2, such as the program codes of the speech emotion analyzing device 20 in the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In the present embodiment, the processor 22 is configured to run the program code stored in the memory 21 or process data, for example, run the speech emotion analyzing apparatus 20, so as to implement the speech emotion analyzing method of the above-described embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is generally used for establishing communication connection between the computer device 2 and other electronic system devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), Wi-Fi, and the like.
It is noted that fig. 9 only shows the computer device 2 with components 20-23, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
In this embodiment, the speech emotion analyzing apparatus 20 stored in the memory 21 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention.
For example, fig. 8 shows a schematic diagram of program modules of a second embodiment of implementing the speech emotion fluctuation analysis apparatus 20, in this embodiment, the speech emotion fluctuation analysis apparatus 20 can be divided into a first speech feature acquisition module 200, a second speech feature extraction module 202, a speech feature recognition module 204, and a recognition result acquisition module 206. The program module referred to in the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable than a program for describing the execution process of the speech emotion fluctuation analysis device 20 in the computer device 2. The specific functions of the program modules, i.e., the first speech feature obtaining module 200 and the recognition result obtaining module 206, have been described in detail in the second embodiment, and are not described herein again.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of the present embodiment is used for storing a speech emotion fluctuation analysis device 20, and when executed by a processor, implements the speech emotion fluctuation analysis method of the above-described embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A speech emotion fluctuation analysis method is characterized by comprising the following steps:
acquiring a first audio characteristic and a first character characteristic of voice data to be detected;
extracting a second audio feature from the first audio features based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
identifying the second audio features to obtain an audio emotion identification result; identifying the second character features to obtain character emotion identification results;
and carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
2. The method for analyzing speech emotion fluctuation according to claim 1, wherein the obtaining of the first audio feature and the first text feature of the speech data to be detected includes:
performing frame windowing on the voice data to be detected to obtain a voice analysis frame;
carrying out Fourier transform on the voice analysis frame to obtain a corresponding frequency spectrum;
the frequency spectrum is processed by a Mel filter bank to obtain a Mel frequency spectrum;
and performing cepstrum analysis on the Mel frequency spectrum to obtain a first audio characteristic of the voice data to be detected.
3. The voice emotion analyzing method according to claim 2, wherein the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
identifying the second audio features based on an audio classification network in a pre-trained audio identification model, and acquiring first confidence degrees corresponding to a plurality of audio emotion classification vectors;
selecting the audio emotion classification with the highest first confidence coefficient as a target audio emotion classification, wherein the corresponding first confidence coefficient is a target audio emotion classification parameter;
and carrying out numerical value mapping on the target audio emotion classification vector parameters to obtain an audio emotion recognition result.
4. The method for analyzing speech emotion fluctuation according to claim 1, wherein the obtaining of the first audio feature and the first text feature of the speech data to be tested further includes:
converting the voice data to be tested into characters;
performing word segmentation processing on the characters to obtain L word segments, wherein L is a natural number greater than 0;
and respectively carrying out word vector mapping on the L participles to obtain a d-dimensional word vector matrix corresponding to the L participles, wherein d is a natural number greater than 0, and the d-dimensional word vector matrix is a first character characteristic of the voice data to be detected.
5. The method for analyzing speech emotion fluctuation according to claim 4, wherein the recognizing of the second audio feature to obtain an audio emotion recognition result and the recognizing of the second character feature to obtain a character emotion recognition result include:
recognizing the second character features based on a character classification network in a pre-trained character recognition model, and acquiring second confidence degrees corresponding to a plurality of character emotion classification vectors;
selecting the character emotion classification with the highest second confidence coefficient as a target character emotion classification, wherein the corresponding second confidence coefficient is a target character emotion classification parameter;
and carrying out numerical value mapping on the target character emotion classification vector parameters to obtain character emotion recognition results.
6. The speech emotion fluctuation analysis method of claim 1, wherein the method further comprises:
acquiring offline or online voice data to be detected;
and separating the voice data to obtain voice data to be detected, wherein the voice data to be detected comprises a plurality of sections of first user voice data and second user voice data.
7. The voice emotion fluctuation analysis method of claim 6, wherein the fusing the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result, and the sending the emotion recognition result to the association terminal includes:
weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the first user to obtain a first emotion value, and weighting the audio emotion recognition result and the character emotion recognition result of each section of the voice data of the second user to obtain a second emotion value;
generating a first sentiment value heat map according to the first sentiment value and a second sentiment value heat map according to the second sentiment value;
and sending the first emotion value heat map and the second emotion value heat map to a related terminal.
8. A speech emotion fluctuation analysis apparatus, comprising:
the first voice feature acquisition module is used for acquiring a first audio feature and a first character feature of the voice data to be detected;
the second voice feature extraction module is used for extracting a second audio feature in the first audio feature based on an audio feature extraction network in a pre-trained audio recognition model; extracting a second character feature from the first character features based on a character feature extraction network in a pre-trained character recognition model;
the voice feature recognition module is used for recognizing the second audio features and acquiring an audio emotion recognition result; identifying the second character features to obtain character emotion identification results;
and the recognition result acquisition module is used for carrying out fusion processing on the audio emotion recognition result and the character emotion recognition result to obtain an emotion recognition result and sending the emotion recognition result to the associated terminal.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the speech emotion fluctuation analysis method according to any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which is executable by at least one processor to cause the at least one processor to perform the steps of the speech emotion fluctuation analysis method according to any one of claims 1 to 7.
CN201911341679.XA 2019-12-24 2019-12-24 Voice emotion fluctuation analysis method and device Pending CN111081279A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911341679.XA CN111081279A (en) 2019-12-24 2019-12-24 Voice emotion fluctuation analysis method and device
PCT/CN2020/094338 WO2021128741A1 (en) 2019-12-24 2020-06-04 Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911341679.XA CN111081279A (en) 2019-12-24 2019-12-24 Voice emotion fluctuation analysis method and device

Publications (1)

Publication Number Publication Date
CN111081279A true CN111081279A (en) 2020-04-28

Family

ID=70317032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341679.XA Pending CN111081279A (en) 2019-12-24 2019-12-24 Voice emotion fluctuation analysis method and device

Country Status (2)

Country Link
CN (1) CN111081279A (en)
WO (1) WO2021128741A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373455A (en) * 2021-12-08 2022-04-19 北京声智科技有限公司 Emotion recognition method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305641B (en) * 2017-06-30 2020-04-07 腾讯科技(深圳)有限公司 Method and device for determining emotion information
CN111081279A (en) * 2019-12-24 2020-04-28 深圳壹账通智能科技有限公司 Voice emotion fluctuation analysis method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779510A (en) * 2012-07-19 2012-11-14 东南大学 Speech emotion recognition method based on feature space self-adaptive projection
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 Multi-modal fusion song emotion recognition method based on deep learning
CN108305643A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 Method and apparatus for determining emotion information
CN108305642A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 Method and apparatus for determining emotion information
US20190325897A1 (en) * 2018-04-21 2019-10-24 International Business Machines Corporation Quantifying customer care utilizing emotional assessments
CN110390956A (en) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 Emotion recognition network model, method and electronic equipment

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021128741A1 (en) * 2019-12-24 2021-07-01 深圳壹账通智能科技有限公司 Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN111739559A (en) * 2020-05-07 2020-10-02 北京捷通华声科技股份有限公司 Speech early warning method, device, equipment and storage medium
CN111739559B (en) * 2020-05-07 2023-02-28 北京捷通华声科技股份有限公司 Speech early warning method, device, equipment and storage medium
CN111916112A (en) * 2020-08-19 2020-11-10 浙江百应科技有限公司 Emotion recognition method based on voice and characters
CN111938674A (en) * 2020-09-07 2020-11-17 南京宇乂科技有限公司 Emotion recognition control system for conversation
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN112215927B (en) * 2020-09-18 2023-06-23 腾讯科技(深圳)有限公司 Face video synthesis method, device, equipment and medium
CN112100337A (en) * 2020-10-15 2020-12-18 平安科技(深圳)有限公司 Emotion recognition method and device in interactive conversation
CN112100337B (en) * 2020-10-15 2024-03-05 平安科技(深圳)有限公司 Emotion recognition method and device in interactive dialogue
CN112527994A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium
CN112837702A (en) * 2020-12-31 2021-05-25 萨孚凯信息系统(无锡)有限公司 Voice emotion distributed system and voice signal processing method
CN112911072A (en) * 2021-01-28 2021-06-04 携程旅游网络技术(上海)有限公司 Call center volume identification method and device, electronic equipment and storage medium
CN113053409A (en) * 2021-03-12 2021-06-29 科大讯飞股份有限公司 Audio evaluation method and device
CN113053409B (en) * 2021-03-12 2024-04-12 科大讯飞股份有限公司 Audio evaluation method and device
CN113129927A (en) * 2021-04-16 2021-07-16 平安科技(深圳)有限公司 Voice emotion recognition method, device, equipment and storage medium
CN113129927B (en) * 2021-04-16 2023-04-07 平安科技(深圳)有限公司 Voice emotion recognition method, device, equipment and storage medium
CN114049902A (en) * 2021-10-27 2022-02-15 广东万丈金数信息技术股份有限公司 Alibaba Cloud-based recording upload, recognition and emotion analysis method and system
WO2023246076A1 (en) * 2022-06-24 2023-12-28 上海哔哩哔哩科技有限公司 Emotion category recognition method, apparatus, storage medium and electronic device
CN115430155A (en) * 2022-09-06 2022-12-06 北京中科心研科技有限公司 Team cooperation capability assessment method and system based on audio analysis
CN117688344A (en) * 2024-02-04 2024-03-12 北京大学 Multi-mode fine granularity trend analysis method and system based on large model
CN117688344B (en) * 2024-02-04 2024-05-07 北京大学 Multi-mode fine granularity trend analysis method and system based on large model

Also Published As

Publication number Publication date
WO2021128741A1 (en) 2021-07-01

Similar Documents

Publication Publication Date Title
CN111081279A (en) Voice emotion fluctuation analysis method and device
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
US9536547B2 (en) Speaker change detection device and speaker change detection method
US10878823B2 (en) Voiceprint recognition method, device, terminal apparatus and storage medium
US10388279B2 (en) Voice interaction apparatus and voice interaction method
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN109256150B (en) Speech emotion recognition system and method based on machine learning
US9368116B2 (en) Speaker separation in diarization
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
US20180122377A1 (en) Voice interaction apparatus and voice interaction method
CN111785275A (en) Voice recognition method and device
CN110390946A (en) A kind of audio signal processing method, device, electronic equipment and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN108899033B (en) Method and device for determining speaker characteristics
US11837236B2 (en) Speaker recognition based on signal segments weighted by quality
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
CN106710588B (en) Speech data sentence recognition method, device and system
CN110556098A (en) voice recognition result testing method and device, computer equipment and medium
CN118035411A (en) Customer service voice quality inspection method, customer service voice quality inspection device, customer service voice quality inspection equipment and storage medium
CN116741155A (en) Speech recognition method, training method, device and equipment of speech recognition model
CN111933153B (en) Voice segmentation point determining method and device
CN111326161B (en) Voiceprint determining method and device
CN113421552A (en) Audio recognition method and device
CN114446284A (en) Speaker log generation method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20240209)