CN110910903A - Speech emotion recognition method, device, equipment and computer readable storage medium - Google Patents

Speech emotion recognition method, device, equipment and computer readable storage medium

Info

Publication number
CN110910903A
Authority
CN
China
Prior art keywords
phoneme
voice
emotion
data
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911228396.4A
Other languages
Chinese (zh)
Other versions
CN110910903B (en)
Inventor
吴学阳
姜迪
汤耀华
徐倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN201911228396.4A priority Critical patent/CN110910903B/en
Publication of CN110910903A publication Critical patent/CN110910903A/en
Application granted granted Critical
Publication of CN110910903B publication Critical patent/CN110910903B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for estimating an emotional state
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08: Speech classification or search
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631: Creating reference templates; Clustering
    • G10L2015/0635: Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech emotion recognition method, device, equipment and computer readable storage medium, wherein the method comprises the following steps: performing phoneme conversion on the voice data to be recognized to obtain a phoneme sequence to be recognized; inputting the phoneme sequence to be recognized into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is pre-trained at least on phoneme sequences converted from text data; inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result; and fusing the phoneme emotion classification result and the voice emotion classification result to obtain an emotion recognition result of the voice data to be recognized. The invention makes full use of the emotion information in the voice data, improves the accuracy of the emotion recognition result, and improves the emotion recognition effect.

Description

Speech emotion recognition method, device, equipment and computer readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice emotion recognition method, a voice emotion recognition device, voice emotion recognition equipment and a computer readable storage medium.
Background
Emotion recognition plays a very important role in intelligent human-computer interaction systems, especially in automated customer service systems. For example, an automatic customer service system needs to immediately recognize the emotion revealed in the user's conversation so that it can take corresponding measures, for example pacifying the user when the user is angry, which is very important for improving user experience and application efficiency. Intelligent human-computer interaction systems are now moving toward voice-based interaction, so emotion recognition from speech is very important.
The existing speech emotion recognition method mainly converts speech data into text through machine recognition and then performs emotion recognition on the text with a text-based emotion recognition method. However, this approach only uses the emotion information reflected by the textual content of the voice data and loses the non-textual emotion information in the voice data, so the emotion recognition effect is poor.
Disclosure of Invention
The invention mainly aims to provide a speech emotion recognition method, a speech emotion recognition device, speech emotion recognition equipment and a computer readable storage medium, and aims to solve the technical problem that the existing method for converting speech into text and then recognizing emotion based on the text is poor in recognition effect.
In order to achieve the above object, the present invention provides a speech emotion recognition method, including:
performing phoneme conversion on the voice data to be recognized to obtain a phoneme sequence to be recognized;
inputting the phoneme sequence to be recognized into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is pre-trained at least on phoneme sequences converted from text data;
inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result;
and fusing the phoneme emotion classification result and the voice emotion classification result to obtain an emotion recognition result of the voice data to be recognized.
Optionally, before the step of performing phoneme conversion on the speech data to be recognized to obtain a phoneme sequence to be recognized, the method further includes:
acquiring first text training data, first voice training data and first emotion marks corresponding to the training data;
performing phoneme conversion on the first text training data to obtain a first phoneme sequence, and converting the first voice training data to obtain a second phoneme sequence;
and training a phoneme classifier to be trained by adopting the first phoneme sequence, the second phoneme sequence and the first emotion label to obtain the phoneme classifier.
Optionally, after the step of training the phoneme classifier to be trained by using the first phoneme sequence, the second phoneme sequence and the first emotion label to obtain the phoneme classifier, the method further includes:
acquiring second voice training data, second text training data forming parallel linguistic data together with the second voice training data, and second emotion marks corresponding to the second voice training data;
and adopting the second voice training data as input data of the preset voice classifier, adopting phoneme sequences respectively converted from the second voice training data and the second text training data as input data of the phoneme classifier, fusing output data of the preset voice classifier and the phoneme classifier, and carrying out fusion fine adjustment on the preset voice classifier and the phoneme classifier based on the second emotion marking and fusion result.
Optionally, the step of performing phoneme conversion on the first text training data to obtain a first phoneme sequence includes:
and converting the first text training data according to a preset mapping relation between words and phonemes to obtain a first phoneme sequence.
Optionally, the step of inputting the speech data to be recognized into a preset speech classifier to obtain a speech emotion classification result includes:
extracting audio features from the voice data to be recognized, wherein the audio features at least comprise one of a log Mel cepstrum, pitch, volume and intensity;
and inputting the audio features into a preset voice classifier to obtain a voice emotion classification result.
Optionally, the step of fusing the phoneme emotion classification result and the speech emotion classification result to obtain an emotion recognition result of the speech data to be recognized includes:
and carrying out weighted average on the phoneme emotion classification result and the voice emotion classification result, and obtaining an emotion recognition result of the voice data to be recognized according to the result of the weighted average.
Optionally, the step of fusing the phoneme emotion classification result and the speech emotion classification result to obtain an emotion recognition result of the speech data to be recognized includes:
vector splicing is carried out on the phoneme emotion classification result and the voice emotion classification result;
and inputting the vector splicing result into a preset neural network to obtain an emotion recognition result of the voice data to be recognized.
In addition, to achieve the above object, the present invention also provides a speech emotion recognition apparatus, including:
the conversion module is used for carrying out phoneme conversion on the voice data to be recognized to obtain a phoneme sequence to be recognized;
the first input module is used for inputting the phoneme sequence to be recognized into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is obtained by training in advance at least a phoneme sequence converted from text data;
the second input module is used for inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result;
and the fusion module is used for fusing the phoneme emotion classification result and the voice emotion classification result to obtain an emotion recognition result of the voice data to be recognized.
Furthermore, to achieve the above object, the present invention also provides a speech emotion recognition apparatus comprising a memory, a processor and a speech emotion recognition program stored on the memory and executable on the processor, wherein the speech emotion recognition program, when executed by the processor, implements the steps of the speech emotion recognition method as described above.
Further, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon a speech emotion recognition program, which when executed by a processor, implements the steps of the speech emotion recognition method as described above.
In the present embodiment, a phoneme sequence to be recognized is obtained by performing phoneme conversion on the voice data to be recognized; the phoneme sequence to be recognized is input into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is pre-trained at least on phoneme sequences converted from text data; the voice data to be recognized is input into a preset voice classifier to obtain a voice emotion classification result; and the phoneme emotion classification result and the voice emotion classification result are fused to obtain an emotion recognition result of the voice data to be recognized. Because the phoneme classifier is trained on phoneme sequences converted from text data, it learns the semantic information in the phoneme sequences, so the output phoneme emotion classification result contains not only the emotion information reflected by the pronunciation characteristics of the phoneme sequence but also the emotion information reflected by its semantic information; that is, the information of the text modality is completed for the single-modality voice data through cross-modal migration. Because the final emotion recognition result fuses the phoneme emotion classification result and the voice emotion classification result, the emotion information contained in the text semantics, the pronunciation characteristics and the audio characteristics of the voice data to be recognized is all taken into account and reflected in the final result; the emotion information in the voice data is therefore fully utilized, the accuracy of the emotion recognition result is improved, and the emotion recognition effect is improved.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a speech emotion recognition method according to the present invention;
fig. 3 is a schematic diagram of a process of emotion recognition of voice data according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a training process of a phoneme classifier and a speech classifier according to an embodiment of the present invention;
FIG. 5 is a block diagram of a preferred embodiment of the speech emotion recognition apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
An embodiment of the present invention provides a speech emotion recognition device, and referring to fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that fig. 1 is a schematic structural diagram of a hardware operating environment of the speech emotion recognition apparatus. The speech emotion recognition device can be a PC (personal computer), and can also be a terminal device with a display function, such as a smart phone, a smart television, a tablet computer and a portable computer.
As shown in fig. 1, the speech emotion recognition apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the speech emotion recognition device may further include a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like. Those skilled in the art will appreciate that the speech emotion recognition device structure shown in fig. 1 does not constitute a limitation of the speech emotion recognition device, and may include more or less components than those shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a speech emotion recognition program.
In the speech emotion recognition apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and communicating with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the speech emotion recognition program stored in the memory 1005, and perform the following operations:
performing phoneme conversion on the voice data to be recognized to obtain a phoneme sequence to be recognized;
inputting the phoneme sequence to be identified into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is obtained by pre-training a phoneme sequence converted based on text data at least;
inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result;
and fusing the phoneme emotion classification result and the voice emotion classification result to obtain an emotion recognition result of the voice data to be recognized.
Further, before the step of performing phoneme conversion on the speech data to be recognized to obtain a phoneme sequence to be recognized, the processor 1001 may be configured to call the speech emotion recognition program stored in the memory 1005, and further perform the following operations:
acquiring first text training data, first voice training data and first emotion marks corresponding to the training data;
performing phoneme conversion on the first text training data to obtain a first phoneme sequence, and converting the first voice training data to obtain a second phoneme sequence;
and training a phoneme classifier to be trained by adopting the first phoneme sequence, the second phoneme sequence and the first emotion label to obtain the phoneme classifier.
Further, after the step of training the phoneme classifier to be trained by using the first phoneme sequence, the second phoneme sequence and the first emotion label to obtain the phoneme classifier, the processor 1001 may be configured to call the speech emotion recognition program stored in the memory 1005, and further perform the following operations:
acquiring second voice training data, second text training data forming parallel linguistic data together with the second voice training data, and second emotion marks corresponding to the second voice training data;
and adopting the second voice training data as input data of the preset voice classifier, adopting phoneme sequences respectively converted from the second voice training data and the second text training data as input data of the phoneme classifier, fusing output data of the preset voice classifier and the phoneme classifier, and carrying out fusion fine adjustment on the preset voice classifier and the phoneme classifier based on the second emotion marking and fusion result.
Further, the step of performing phoneme conversion on the first text training data to obtain a first phoneme sequence includes:
and converting the first text training data according to a preset mapping relation between words and phonemes to obtain a first phoneme sequence.
Further, the step of inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result includes:
extracting audio features from the voice data to be recognized, wherein the audio features at least comprise one of a log Mel cepstrum, pitch, volume and intensity;
and inputting the audio features into a preset voice classifier to obtain a voice emotion classification result.
Further, the step of fusing the phoneme emotion classification result and the speech emotion classification result to obtain an emotion recognition result of the speech data to be recognized includes:
and carrying out weighted average on the phoneme emotion classification result and the voice emotion classification result, and obtaining an emotion recognition result of the voice data to be recognized according to the result of the weighted average.
Further, the step of fusing the phoneme emotion classification result and the speech emotion classification result to obtain an emotion recognition result of the speech data to be recognized includes:
vector splicing is carried out on the phoneme emotion classification result and the voice emotion classification result;
and inputting the vector splicing result into a preset neural network to obtain an emotion recognition result of the voice data to be recognized.
Based on the hardware structure, the invention provides various embodiments of the speech emotion recognition method.
Referring to fig. 2, a first embodiment of the speech emotion recognition method of the present invention is provided. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order. The execution subject of each embodiment of the speech emotion recognition method of the present invention may be a terminal device such as a PC, a smart phone, a smart television, a tablet computer or a portable computer; for convenience of description, the execution subject is omitted in the following embodiments. The speech emotion recognition method comprises the following steps:
step S10, performing phoneme conversion on the voice data to be recognized to obtain a phoneme sequence to be recognized;
in the current mode of converting voice into text and then recognizing emotion based on the text, only emotion information reflected by text information in voice data is utilized, but non-text emotion information in the voice data is not utilized, so that the emotion recognition effect is poor. Specifically, because the pronunciation of the character changes when the emotion of the voice changes, but the character does not change, the voice is converted into the text, and the pronunciation characteristics in the voice are lost, so that the emotion recognition is inaccurate and the recognition effect is poor.
Based on this, the present embodiment proposes a phoneme-based cross-modal speech emotion recognition method to solve the above technical problem. In this embodiment, a phoneme refers to a sound unit constituting speech; for example, the phonemes of the syllable "xian" (xiān) may be "x", "i", "ā" and "n". A phoneme represents a physical pronunciation unit of a character or word and reflects a specific pronunciation in the voice. There is no uniform standard for the exact definition of phonemes, but using a consistent scheme within the system is sufficient.
The embodiment of the invention involves the concept of cross-modal emotion recognition. Cross-modal refers to spanning different modalities; in the present invention it refers to spanning the "text modality" and the "speech modality", one carried by text and one by sound. A related concept is "multi-modal", which means that different modalities are used as inputs and exploited simultaneously. The cross-modal method of this embodiment differs from multi-modal methods: it uses one modality as input and predicts the information of another modality through machine learning to assist the emotion recognition task. In a speech emotion recognition scenario, the aim is to use the speech modality as input, complete the text-modality information through machine learning, and use that information to assist emotion recognition.
Specifically, referring to fig. 3, in the present embodiment the voice data to be recognized is subjected to phoneme conversion to obtain a phoneme sequence to be recognized. The voice data to be recognized is the voice data that requires emotion recognition, and its source differs across application scenarios; in an intelligent customer service scenario, for example, it may be the user's voice data received by the system. The phoneme sequence to be recognized is the phoneme sequence obtained by converting the voice data to be recognized. A phoneme sequence is a sequence composed of phonemes, and its concrete representation may be a vector. In this embodiment, the voice data may be converted into a phoneme sequence using the phoneme conversion step of existing automatic speech recognition (ASR) technology. Existing speech recognition typically involves two parts: the first step converts speech into a phoneme sequence using an acoustic model, which reflects the speaker's physical pronunciation; the second step converts the phonemes into text by incorporating a language model.
Step S20, inputting the phoneme sequence to be recognized into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is obtained by training in advance at least a phoneme sequence converted from text data;
and inputting the phoneme sequence to be identified into a phoneme classifier to obtain a phoneme emotion classification result. The phoneme classifier can be a neural network, such as a deep neural network, the input data can be a phoneme sequence, and the output phoneme emotion classification result can be an emotion category, such as anger, happiness, sadness and the like, and can also be a vector representing emotion category characteristics; training a phoneme classifier in advance through training data comprising a phoneme sequence and emotion labels, wherein the training mode can adopt a general supervised training mode of a neural network; the phoneme sequence adopted by the training phoneme classifier at least comprises a phoneme sequence converted by the text data. Specifically, text data used for training the phoneme classifier may be collected in advance, emotion classification labeling may be performed on the collected text data, and an artificial labeling manner may be adopted. Compared with the method for marking the voice, the method for marking the text is simple, and a large amount of labor and financial cost can be saved; and correspondingly converting the acquired text data into a phoneme sequence, and training a phoneme classifier by adopting the phoneme sequence converted from the text data.
It should be noted that the phoneme sequences converted from voice data may share the same phoneme space as the phoneme sequences converted from text data, or the sequences may be processed so that their phoneme spaces coincide, so that the two can share one phoneme classifier.
In this embodiment, the phoneme classifier is trained on phoneme sequences converted from text data; because the text modality gives the phoneme sequence semantic information, the trained phoneme classifier learns semantic information from the text modality. When the phoneme sequence converted from the voice to be recognized is input into this phoneme classifier, the output phoneme emotion classification result contains not only the emotion information reflected by the pronunciation characteristics of the phoneme sequence but also the emotion information reflected by its semantic information. That is, in this embodiment, text semantic information is migrated across modalities from the text modality to the speech modality with phonemes as the intermediary; in other words, the information of the text modality is completed for the single-modality voice data through cross-modal migration, so that semantic information learned from the text modality assists emotion recognition of the voice data.
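For illustration only (not part of the patent disclosure), the following is a minimal sketch of what such a phoneme classifier could look like, assuming a PyTorch implementation with an embedding layer, a GRU encoder and a linear output layer; the class name, network structure, dimensions and placeholder data are illustrative assumptions rather than details specified by this embodiment.

```python
import torch
import torch.nn as nn

class PhonemeEmotionClassifier(nn.Module):
    """Illustrative phoneme-sequence emotion classifier: embedding + GRU + linear."""

    def __init__(self, num_phonemes=60, num_emotions=4, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, seq_len) integer-encoded phoneme sequence
        embedded = self.embedding(phoneme_ids)
        _, last_hidden = self.encoder(embedded)         # (1, batch, hidden_dim)
        return self.classifier(last_hidden.squeeze(0))  # (batch, num_emotions) logits

# Example: 60 phoneme symbols, 4 emotion classes (e.g. angry / happy / sad / neutral)
model = PhonemeEmotionClassifier()
dummy_batch = torch.randint(1, 60, (8, 30))   # 8 sequences of 30 phoneme ids
print(model(dummy_batch).shape)               # torch.Size([8, 4])
```

In practice the phoneme vocabulary size, emotion categories and network depth would be chosen to match the training data.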
Step S30, inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result;
and inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result. The preset voice classifier can be a neural network, such as a deep neural network, the input data can be voice data or preprocessed voice data, and the output voice emotion classification result can be an emotion class, such as anger, happiness, sadness and the like, and can also be a vector representing emotion class characteristics; the speech classifier can be trained in advance through training data containing speech data and emotion labels, and the training mode can be a general supervised training mode of a neural network. Because the input data of the voice classifier is original voice data, emotion classification is carried out based on the audio features of the voice data, and the output voice emotion classification result contains emotion information reflected by the audio features of the voice data.
Further, the step S30 includes:
step S301, extracting audio features from the voice data to be recognized, wherein the audio features at least comprise one of logarithmic Mel inverse spectrogram, tone, volume and intensity;
and extracting audio features from the voice data to be recognized, wherein the audio features at least comprise one of logarithmic Mel inverse spectrogram, tone, volume and intensity. To enable the speech classifier to perform extraction and classification of emotional features based on richer audio features, the audio features may also include other audio features besides log mel-frequency cepstrum, pitch, volume and intensity.
And step S302, inputting the audio characteristics into a preset voice classifier to obtain a voice emotion classification result.
The audio features are input into a preset voice classifier to obtain a voice emotion classification result, specifically, the voice classifier extracts emotion features contained in the audio features based on the audio features of the voice data, and then the voice emotion classification result is obtained according to the emotion features.
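As a rough illustration only, the following sketch shows one way such audio features might be extracted before being passed to the voice classifier, assuming the librosa library is available; only the log Mel feature and a simple energy measure are computed, the parameter values are arbitrary, and pitch extraction is omitted. This is not the feature front-end prescribed by the embodiment.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=16000, n_mels=64):
    """Return a per-frame feature matrix: log Mel bands stacked with RMS energy."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Log Mel spectrogram, shape (n_mels, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # Frame-level RMS energy as a rough proxy for volume / intensity, shape (1, frames)
    rms = librosa.feature.rms(y=y)

    # Align frame counts and stack; a fuller front-end would also add pitch (F0), etc.
    frames = min(log_mel.shape[1], rms.shape[1])
    return np.vstack([log_mel[:, :frames], rms[:, :frames]])

# features = extract_audio_features("utterance.wav")  # then fed to the voice classifier
```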
And step S40, fusing the phoneme emotion classification result and the voice emotion classification result to obtain an emotion recognition result of the voice data to be recognized.
The phoneme emotion classification result and the voice emotion classification result are fused to obtain an emotion recognition result of the voice data to be recognized. There are many ways to fuse them; for example, when the phoneme emotion classification result and the voice emotion classification result are vectors representing emotion category features, the two vectors may be averaged and the result input into a fusion classifier, which outputs the final emotion recognition result, such as an emotion category. The fusion classifier may also be a neural network, obtained in advance by joint training with the phoneme classifier and the voice classifier.
Further, the step S40 includes:
step S401, carrying out weighted average on the phoneme emotion classification result and the voice emotion classification result, and obtaining an emotion recognition result of the voice data to be recognized according to the result of the weighted average.
Specifically, the phoneme emotion classification result and the voice emotion classification result may also be probability values representing a certain emotion category, the two probability values may be weighted and averaged to obtain a fusion probability value, and the emotion category of the voice data to be recognized is determined according to the fusion probability value. The weight value of the weighted average may be set in advance, for example, a ratio of training data used for training the phoneme classifier and the speech classifier may be used as a weight ratio.
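As a simple numeric illustration of this weighted-average fusion (the probability values, emotion classes and weights below are made up for the example and are not from the patent):

```python
import numpy as np

# Hypothetical per-class probabilities over (angry, happy, sad, neutral)
phoneme_probs = np.array([0.60, 0.10, 0.20, 0.10])   # from the phoneme classifier
speech_probs  = np.array([0.40, 0.05, 0.45, 0.10])   # from the voice classifier

# Weights set in advance, e.g. mirroring the ratio of training data for each classifier
w_phoneme, w_speech = 0.6, 0.4
fused = w_phoneme * phoneme_probs + w_speech * speech_probs   # [0.52, 0.08, 0.30, 0.10]

emotions = ["angry", "happy", "sad", "neutral"]
print(emotions[int(np.argmax(fused))])   # -> "angry" for these example values
```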
In this embodiment, a phoneme sequence to be recognized is obtained by performing phoneme conversion on the voice data to be recognized; the phoneme sequence to be recognized is input into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is pre-trained at least on phoneme sequences converted from text data; the voice data to be recognized is input into a preset voice classifier to obtain a voice emotion classification result; and the phoneme emotion classification result and the voice emotion classification result are fused to obtain an emotion recognition result of the voice data to be recognized. Because the phoneme classifier is trained on phoneme sequences converted from text data, it learns the semantic information in the phoneme sequences, so the output phoneme emotion classification result contains not only the emotion information reflected by the pronunciation characteristics of the phoneme sequence but also the emotion information reflected by its semantic information; that is, the information of the text modality is completed for the single-modality voice data through cross-modal migration. Because the final emotion recognition result fuses the phoneme emotion classification result and the voice emotion classification result, the emotion information contained in the text semantics, the pronunciation characteristics and the audio characteristics of the voice data to be recognized is all taken into account and reflected in the final result; the emotion information in the voice data is therefore fully utilized, the accuracy of the emotion recognition result is improved, and the emotion recognition effect is improved.
In addition, the existing method of converting voice into text and then recognizing emotion from the text is limited by the quality of machine transcription: when the machine transcription is poor, the final emotion recognition effect is also poor.
In addition, the phoneme-based cross-modal emotion recognition method of this embodiment only requires voice data as input. Compared with multi-modal emotion recognition methods, it can therefore be applied to intelligent customer service systems that interact only through voice, without requiring multi-modal input such as video, which expands the application range of emotion recognition.
Further, based on the first embodiment, a second embodiment of the speech emotion recognition method of the present invention provides a speech emotion recognition method. In this embodiment, the speech emotion recognition method further includes:
step S50, acquiring first text training data, first voice training data and first emotion marks corresponding to the training data;
further, in this embodiment, the phoneme classifier may be trained separately. Specifically, first text training data, first voice training data, and first emotion labels corresponding to the training data are obtained. The first text training data and the first voice training data may be non-corresponding, that is, the text and the voice do not necessarily correspond to each other one to one, and the data size of the first text training data may be larger than that of the first voice training data.
Step S60, performing phoneme conversion on the first text training data to obtain a first phoneme sequence, and converting the first voice training data to obtain a second phoneme sequence;
and performing phoneme conversion on the first text training data to obtain a first phoneme sequence. And converting the first voice training data to obtain a second phoneme sequence. Specifically, the manner of converting the speech training data into the phoneme sequence is the same as the manner of converting speech into phoneme adopted in the first embodiment, and is not described in detail herein.
Further, the step S60 includes:
step S601, converting the first text training data according to a preset mapping relation between words and phonemes to obtain a first phoneme sequence.
A mapping relation between words and phonemes may be preset, for example recorded in a dictionary. The words that make up the sentences in the first text training data are converted into phonemes according to this mapping relation, and the phonemes are concatenated in the order of the words in the sentence to obtain the first phoneme sequence.
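For illustration only, the following sketch shows one way such a preset word-to-phoneme mapping could be applied; the tiny dictionary and the phoneme symbols are illustrative assumptions, not a real pronunciation lexicon.

```python
# Minimal sketch of a "preset mapping relation between words and phonemes".
# The entries and phoneme symbols below are illustrative, not a real lexicon.
WORD_TO_PHONEMES = {
    "你": ["n", "i3"],
    "好": ["h", "a3", "o"],
    "先": ["x", "i", "a1", "n"],
    "生": ["sh", "e1", "ng"],
}

def text_to_phoneme_sequence(text):
    """Convert a sentence into a phoneme sequence by looking up each character."""
    phonemes = []
    for char in text:
        # Unknown characters are mapped to a special <unk> phoneme
        phonemes.extend(WORD_TO_PHONEMES.get(char, ["<unk>"]))
    return phonemes

print(text_to_phoneme_sequence("你好先生"))
# ['n', 'i3', 'h', 'a3', 'o', 'x', 'i', 'a1', 'n', 'sh', 'e1', 'ng']
```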
Step S70, training a phoneme classifier to be trained by using the first phoneme sequence, the second phoneme sequence and the first emotion label to obtain the phoneme classifier.
The phoneme classifier to be trained is trained with the first phoneme sequence, the second phoneme sequence and the first emotion labels to obtain the phoneme classifier. The specific training process is similar to the supervised training of a general neural network and is not described in detail here. Since the first phoneme sequence is converted from text training data, the phoneme classifier can learn semantic information in phoneme sequences. Moreover, because the text training data and the voice training data can be independent of each other, the phoneme classifier can be fully trained with a large amount of text training data. Converting text into phonemes yields the standard pronunciation, which may deviate from actual speech; in this embodiment the phoneme classifier is therefore also trained with the second phoneme sequence converted from the first voice training data, so that this deviation can be corrected and the phoneme classifier gives more accurate results.
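As an illustrative sketch of this supervised training step (assuming PyTorch; the toy classifier, tensor shapes and random placeholder tensors stand in for the real phoneme classifier and training corpus and are not specified by the patent):

```python
import torch
import torch.nn as nn

class TinyPhonemeClassifier(nn.Module):
    """Toy stand-in for the phoneme classifier: mean-pooled embeddings + linear layer."""
    def __init__(self, num_phonemes=60, embed_dim=64, num_emotions=4):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, embed_dim, padding_idx=0)
        self.classifier = nn.Linear(embed_dim, num_emotions)

    def forward(self, phoneme_ids):
        # Mean-pool the phoneme embeddings over the sequence, then classify
        return self.classifier(self.embedding(phoneme_ids).mean(dim=1))

model = TinyPhonemeClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batches: first_phonemes converted from text, second_phonemes from speech
first_phonemes  = torch.randint(1, 60, (32, 30))
second_phonemes = torch.randint(1, 60, (32, 30))
labels = torch.randint(0, 4, (64,))      # first emotion labels for both parts

inputs = torch.cat([first_phonemes, second_phonemes], dim=0)
for _ in range(5):                       # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
```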
Further, after the step S70, the method further includes:
step S80, acquiring second voice training data, second text training data forming parallel corpora with the second voice training data, and second emotion labels corresponding to the second voice training data;
Referring to the training process diagram shown in fig. 4, after the phoneme classifier and the voice classifier are trained separately, fusion fine-tuning can be performed on them. Specifically, second voice training data, second text training data that forms a parallel corpus with the second voice training data, and second emotion labels corresponding to the second voice training data may be acquired. A parallel corpus refers to text data and voice data corresponding to the same utterances. The second voice training data may be different from the first voice training data, the second text training data may be different from the first text training data, and the data amount of the second voice and text training data may be smaller than that of the first voice and text training data.
And step S90, using the second speech training data as input data of the preset speech classifier, using phoneme sequences respectively converted from the second speech training data and the second text training data as input data of the phoneme classifier, fusing output data of the preset speech classifier and the phoneme classifier, and performing fusion fine tuning on the preset speech classifier and the phoneme classifier based on the second emotion labeling and fusion result.
The second voice training data is used as input data of the preset voice classifier. The second voice training data is converted to obtain a phoneme sequence, the second text training data is converted to obtain another phoneme sequence, and the two phoneme sequences are used as input data of the phoneme classifier. The output data of the voice classifier and the phoneme classifier are then fused; specifically, the two outputs are vector-spliced or weighted-averaged, and the result is input into a fusion classifier implemented as a neural network to obtain a fusion result. A loss function and gradient values are computed from the second emotion labels and the fusion result, and the parameters of the fusion classifier, the voice classifier and the phoneme classifier are fine-tuned accordingly. The specific fusion fine-tuning process can follow the supervised training procedure of a general neural network.
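For illustration only, the following PyTorch sketch mirrors this fusion fine-tuning step: the outputs of a voice classifier and a phoneme classifier are vector-spliced, passed through a fusion classifier, and the loss computed against the emotion labels is backpropagated through all three networks. The toy networks, feature dimensions and placeholder batches are assumptions for the sketch, not the architectures used by this embodiment.

```python
import torch
import torch.nn as nn

num_emotions = 4

# Toy stand-ins for the pre-trained voice, phoneme and fusion classifiers
voice_classifier   = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, num_emotions))
phoneme_classifier = nn.Sequential(nn.Linear(60, 32), nn.ReLU(), nn.Linear(32, num_emotions))
fusion_classifier  = nn.Linear(2 * num_emotions, num_emotions)

params = (list(voice_classifier.parameters())
          + list(phoneme_classifier.parameters())
          + list(fusion_classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)    # small learning rate for fine-tuning
loss_fn = nn.CrossEntropyLoss()

# Placeholder parallel-corpus batch: audio features, phoneme features, second emotion labels
audio_feats   = torch.randn(16, 64)
phoneme_feats = torch.randn(16, 60)
labels        = torch.randint(0, num_emotions, (16,))

voice_out   = voice_classifier(audio_feats)
phoneme_out = phoneme_classifier(phoneme_feats)
fused = fusion_classifier(torch.cat([voice_out, phoneme_out], dim=1))   # vector splicing

loss = loss_fn(fused, labels)    # loss between fusion result and second emotion labels
optimizer.zero_grad()
loss.backward()                  # gradients flow into all three networks
optimizer.step()
```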
Further, step S40 includes:
s402, carrying out vector splicing on the phoneme emotion classification result and the voice emotion classification result;
Corresponding to the above fusion fine-tuning process, this embodiment proposes another way of fusing the phoneme emotion classification result and the voice emotion classification result. Specifically, after the phoneme emotion classification result and the voice emotion classification result of the voice data to be recognized are obtained, the two results are vector-spliced, and a commonly used vector concatenation method may be adopted.
And S403, inputting the vector splicing result into a preset neural network to obtain the emotion recognition result of the voice data to be recognized.
And inputting the vector splicing result into a preset neural network to obtain an emotion recognition result of the voice data to be recognized. The preset neural network may be the fusion classifier finely tuned by the fusion.
In this embodiment, the phoneme classifier and the voice classifier are fused and fine-tuned, so that the final emotion recognition result obtained by fusing the phoneme emotion classification result obtained by the phoneme classifier and the voice classifier result obtained by the voice classifier is more accurate, and the voice emotion recognition effect is improved.
In addition, an embodiment of the present invention further provides a speech emotion recognition apparatus, and referring to fig. 5, the speech emotion recognition apparatus includes:
the conversion module 10 is configured to perform phoneme conversion on the speech data to be recognized to obtain a phoneme sequence to be recognized;
a first input module 20, configured to input the phoneme sequence to be recognized into a phoneme classifier to obtain a phoneme emotion classification result, where the phoneme classifier is obtained by pre-training a phoneme sequence converted based on at least text data;
the second input module 30 is configured to input the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result;
and the fusion module 40 is configured to fuse the phoneme emotion classification result and the speech emotion classification result to obtain an emotion recognition result of the speech data to be recognized.
Further, the speech emotion recognition apparatus further includes:
the acquisition module is used for acquiring first text training data, first voice training data and first emotion marks corresponding to the training data;
the conversion module 10 is further configured to perform phoneme conversion on the first text training data to obtain a first phoneme sequence, and convert the first speech training data to obtain a second phoneme sequence;
and the training module is used for training a phoneme classifier to be trained by adopting the first phoneme sequence, the second phoneme sequence and the first emotion label to obtain the phoneme classifier.
Further, the obtaining module is further configured to obtain second voice training data, second text training data forming a parallel corpus with the second voice training data, and a second emotion label corresponding to the second voice training data;
the training module is further used for adopting the second voice training data as input data of the preset voice classifier, adopting phoneme sequences converted from the second voice training data and the second text training data respectively as input data of the phoneme classifier, fusing output data of the preset voice classifier and output data of the phoneme classifier, and fusing and finely adjusting the preset voice classifier and the phoneme classifier based on the second emotion labeling and fusing results.
Further, the conversion module 10 is further configured to:
and converting the first text training data according to a preset mapping relation between words and phonemes to obtain a first phoneme sequence.
Further, the second input module 30 includes:
an extracting unit, configured to extract audio features from the voice data to be recognized, where the audio features at least include one of a log Mel cepstrum, pitch, volume, and intensity;
and the input unit is used for inputting the audio features into a preset voice classifier to obtain a voice emotion classification result.
Further, the fusion module is further configured to perform weighted averaging on the phoneme emotion classification result and the speech emotion classification result, and obtain an emotion recognition result of the speech data to be recognized according to a result of the weighted averaging.
Further, the fusion module is further configured to:
vector splicing is carried out on the phoneme emotion classification result and the voice emotion classification result;
and inputting the vector splicing result into a preset neural network to obtain an emotion recognition result of the voice data to be recognized.
The specific implementation of the speech emotion recognition apparatus of the present invention is basically the same as the embodiments of the speech emotion recognition method described above, and is not repeated here.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a speech emotion recognition program is stored on the computer-readable storage medium, and when executed by a processor, the speech emotion recognition program implements the steps of the speech emotion recognition method described above.
The specific implementation of the speech emotion recognition device and the computer-readable storage medium of the present invention is basically the same as the embodiments of the speech emotion recognition method described above, and will not be described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A speech emotion recognition method, characterized in that the speech emotion recognition method comprises:
performing phoneme conversion on the voice data to be recognized to obtain a phoneme sequence to be recognized;
inputting the phoneme sequence to be identified into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is obtained by pre-training a phoneme sequence converted based on text data at least;
inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result;
and fusing the phoneme emotion classification result and the voice emotion classification result to obtain an emotion recognition result of the voice data to be recognized.
2. The speech emotion recognition method of claim 1, wherein, before the step of performing phoneme conversion on the speech data to be recognized to obtain a phoneme sequence to be recognized, the method further comprises:
acquiring first text training data, first voice training data and first emotion marks corresponding to the training data;
performing phoneme conversion on the first text training data to obtain a first phoneme sequence, and converting the first voice training data to obtain a second phoneme sequence;
and training a phoneme classifier to be trained by adopting the first phoneme sequence, the second phoneme sequence and the first emotion label to obtain the phoneme classifier.
3. The speech emotion recognition method of claim 2, wherein after the step of training the phoneme classifier to be trained by using the first phoneme sequence, the second phoneme sequence and the first emotion label to obtain the phoneme classifier, the method further comprises:
acquiring second voice training data, second text training data forming parallel linguistic data together with the second voice training data, and second emotion marks corresponding to the second voice training data;
and adopting the second voice training data as input data of the preset voice classifier, adopting phoneme sequences respectively converted from the second voice training data and the second text training data as input data of the phoneme classifier, fusing output data of the preset voice classifier and the phoneme classifier, and carrying out fusion fine adjustment on the preset voice classifier and the phoneme classifier based on the second emotion marking and fusion result.
4. The speech emotion recognition method of claim 2, wherein the step of subjecting the first text training data to phoneme conversion to obtain a first phoneme sequence comprises:
and converting the first text training data according to a preset mapping relation between words and phonemes to obtain a first phoneme sequence.
5. The speech emotion recognition method of claim 1, wherein the step of inputting the speech data to be recognized into a preset speech classifier to obtain a speech emotion classification result comprises:
extracting audio features from the voice data to be recognized, wherein the audio features at least comprise one of a log Mel cepstrum, pitch, volume and intensity;
and inputting the audio features into a preset voice classifier to obtain a voice emotion classification result.
6. The speech emotion recognition method of any one of claims 1 to 5, wherein the step of fusing the phoneme emotion classification result and the speech emotion classification result to obtain an emotion recognition result of the speech data to be recognized includes:
and carrying out weighted average on the phoneme emotion classification result and the voice emotion classification result, and obtaining an emotion recognition result of the voice data to be recognized according to the result of the weighted average.
7. The speech emotion recognition method of any one of claims 1 to 5, wherein the step of fusing the phoneme emotion classification result and the speech emotion classification result to obtain an emotion recognition result of the speech data to be recognized includes:
vector splicing is carried out on the phoneme emotion classification result and the voice emotion classification result;
and inputting the vector splicing result into a preset neural network to obtain an emotion recognition result of the voice data to be recognized.
8. A speech emotion recognition apparatus, characterized in that the speech emotion recognition apparatus comprises:
the conversion module is used for carrying out phoneme conversion on the voice data to be recognized to obtain a phoneme sequence to be recognized;
the first input module is used for inputting the phoneme sequence to be recognized into a phoneme classifier to obtain a phoneme emotion classification result, wherein the phoneme classifier is obtained by training in advance at least a phoneme sequence converted from text data;
the second input module is used for inputting the voice data to be recognized into a preset voice classifier to obtain a voice emotion classification result;
and the fusion module is used for fusing the phoneme emotion classification result and the voice emotion classification result to obtain an emotion recognition result of the voice data to be recognized.
9. A speech emotion recognition device, characterized in that the speech emotion recognition device comprises a memory, a processor and a speech emotion recognition program stored on the memory and executable on the processor, wherein the speech emotion recognition program, when executed by the processor, implements the steps of the speech emotion recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has a speech emotion recognition program stored thereon, wherein the speech emotion recognition program, when executed by a processor, implements the steps of the speech emotion recognition method according to any one of claims 1 to 7.
CN201911228396.4A 2019-12-04 2019-12-04 Speech emotion recognition method, device, equipment and computer readable storage medium Active CN110910903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911228396.4A CN110910903B (en) 2019-12-04 2019-12-04 Speech emotion recognition method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110910903A (en) 2020-03-24
CN110910903B (en) 2023-03-21

Family

ID=69822245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911228396.4A Active CN110910903B (en) 2019-12-04 2019-12-04 Speech emotion recognition method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110910903B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090313019A1 (en) * 2006-06-23 2009-12-17 Yumiko Kato Emotion recognition apparatus
CN103456314A (en) * 2013-09-03 2013-12-18 广州创维平面显示科技有限公司 Emotion recognition method and device
CN106503646A (en) * 2016-10-19 2017-03-15 竹间智能科技(上海)有限公司 Multi-modal emotion identification system and method
CN107944008A (en) * 2017-12-08 2018-04-20 神思电子技术股份有限公司 A kind of method that Emotion identification is carried out for natural language
US20180277145A1 (en) * 2017-03-22 2018-09-27 Casio Computer Co., Ltd. Information processing apparatus for executing emotion recognition
CN108806667A (en) * 2018-05-29 2018-11-13 重庆大学 The method for synchronously recognizing of voice and mood based on neural network
CN109599128A (en) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech-emotion recognition method, device, electronic equipment and readable medium
CN109767765A (en) * 2019-01-17 2019-05-17 平安科技(深圳)有限公司 Talk about art matching process and device, storage medium, computer equipment
US20190213400A1 (en) * 2018-01-05 2019-07-11 Samsung Electronics Co., Ltd. Method and apparatus with emotion recognition
CN110085211A (en) * 2018-01-26 2019-08-02 上海智臻智能网络科技股份有限公司 Speech recognition exchange method, device, computer equipment and storage medium
CN110097894A (en) * 2019-05-21 2019-08-06 焦点科技股份有限公司 A kind of method and system of speech emotion recognition end to end

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110910903B (en) * 2019-12-04 2023-03-21 深圳前海微众银行股份有限公司 Speech emotion recognition method, device, equipment and computer readable storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110910903B (en) * 2019-12-04 2023-03-21 深圳前海微众银行股份有限公司 Speech emotion recognition method, device, equipment and computer readable storage medium
CN111653265A (en) * 2020-04-26 2020-09-11 北京大米科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111653265B (en) * 2020-04-26 2023-08-18 北京大米科技有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN112215927B (en) * 2020-09-18 2023-06-23 腾讯科技(深圳)有限公司 Face video synthesis method, device, equipment and medium
CN114267374A (en) * 2021-11-24 2022-04-01 北京百度网讯科技有限公司 Phoneme detection method and device, training method and device, equipment and medium
CN114267374B (en) * 2021-11-24 2022-10-18 北京百度网讯科技有限公司 Phoneme detection method and device, training method and device, equipment and medium
CN115512698A (en) * 2022-06-13 2022-12-23 南方电网数字电网研究院有限公司 Voice semantic analysis method
CN115456114A (en) * 2022-11-04 2022-12-09 之江实验室 Method, device, medium and equipment for model training and business execution

Also Published As

Publication number Publication date
CN110910903B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN107657017B (en) Method and apparatus for providing voice service
CN112100349B (en) Multi-round dialogue method and device, electronic equipment and storage medium
US10438586B2 (en) Voice dialog device and voice dialog method
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
KR101634086B1 (en) Method and computer system of analyzing communication situation based on emotion information
CN109344231B (en) Method and system for completing corpus of semantic deformity
US20190164540A1 (en) Voice recognition system and voice recognition method for analyzing command having multiple intents
CN111967224A (en) Method and device for processing dialog text, electronic equipment and storage medium
CN110827803A (en) Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN107844470B (en) Voice data processing method and equipment thereof
CN110826637A (en) Emotion recognition method, system and computer-readable storage medium
CN113380222B (en) Speech synthesis method, device, electronic equipment and storage medium
EP3550449A1 (en) Search method and electronic device using the method
CN113254613A (en) Dialogue question-answering method, device, equipment and storage medium
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN112860871A (en) Natural language understanding model training method, natural language understanding method and device
CN114944149A (en) Speech recognition method, speech recognition apparatus, and computer-readable storage medium
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
KR20200095947A (en) Electronic device and Method for controlling the electronic device thereof
CN113051384A (en) User portrait extraction method based on conversation and related device
CN111400443A (en) Information processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant