CN111696579B - Speech emotion recognition method, device, equipment and computer storage medium - Google Patents

Speech emotion recognition method, device, equipment and computer storage medium Download PDF

Info

Publication number
CN111696579B
CN111696579B · CN202010554606.5A
Authority
CN
China
Prior art keywords
features
emotion
phoneme
spectrogram
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010554606.5A
Other languages
Chinese (zh)
Other versions
CN111696579A (en)
Inventor
陈剑超
肖龙源
李稀敏
刘晓葳
叶志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010554606.5A priority Critical patent/CN111696579B/en
Publication of CN111696579A publication Critical patent/CN111696579A/en
Application granted granted Critical
Publication of CN111696579B publication Critical patent/CN111696579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a speech emotion recognition method, apparatus, device and computer storage medium. The method comprises the following steps: acquiring a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary; extracting a one-hot vector for each phoneme label and splicing the one-hot vectors into a two-dimensional matrix along the time domain to generate phoneme features; acquiring speech features of the user; and respectively inputting the spectrogram features and the phoneme features into a neural network model to obtain emotion prediction output features, and identifying the speech emotion of the user according to the emotion prediction output features. The method identifies speech emotion by combining spectrogram features and phoneme features, improves the classification accuracy with the aid of phoneme information, and reduces manual interference.

Description

Speech emotion recognition method, device, equipment and computer storage medium
Technical Field
The present invention relates to the field of speech research, and in particular, to a speech emotion recognition method, apparatus, device, and computer storage medium.
Background
Speech emotion recognition is an important branch of the speech research field. It is the technology of judging a person's emotion from his or her speech, and it touches core problems of several areas of speech research such as signal processing, feature extraction and pattern recognition. In recent years, with the rapid development of information technology, speech emotion recognition has found important applications in many scenarios, for example: 1. Call-center systems and large commercial establishments need to handle thousands of customer calls every day, and keeping telephone customers satisfied is an important measure for preventing customer loss; the dissatisfaction of customers during calls therefore needs to be detected and flagged in time. 2. In education, research shows that a learner's learning performance is strongly related to his or her emotional state, and mildly negative emotion during learning can help the generation of critical thinking. 3. In advertising, advertisers could previously only place advertisements over a wide audience in order to maximize coverage of potential customers, a placement mode that is costly and poorly targeted. The emotional tendency of the audience is their most direct feedback on an advertisement. Based on a speech emotion recognition system, the emotional state of the audience can be acquired, which helps the advertiser obtain this evaluation feedback, adjust the placement strategy and reduce cost.
However, because emotion is complex, there are two ways of defining it: discrete emotion definition and continuous emotion definition. Discrete emotion definition is intuitive and simple: evaluators label an utterance, according to their own subjective feeling, with one of a set of well-defined emotion categories such as "happy", "sad" or "angry". Continuous emotion definition measures emotion not with categories but with scores along certain psychological dimensions; a common model is the intensity-valence (arousal-valence) model, in which intensity reflects certain characteristics of the utterance: more intense speech usually carries more energy in its high-frequency part and has a higher pitch. However, some emotions cannot be distinguished by intensity alone and must additionally be separated by valence. In the prior art, a complete speech emotion recognition framework mainly comprises three steps, namely speech feature extraction, acquisition of emotion-discriminative information, and classifier training, after which an emotion label prediction is obtained. Referring to fig. 1: the first step of a speech emotion system is to extract, from the raw waveform, speech features that can be used for model training; the features used in speech emotion recognition are varied and can be grouped into vocal-tract features, prosodic features, statistical features and the like. Information capable of distinguishing the emotion categories is then further obtained from the extracted speech features. Traditional methods usually increase the discriminability of emotions through carefully designed feature combinations; with the development of deep learning, this is increasingly accomplished by the high-level outputs of a neural network. Finally, once the emotion-discriminative information is obtained, a classifier can be trained to produce emotion predictions on test data; classifiers fall into generative classifiers and discriminative classifiers, and in a neural network a fully connected layer is used for classification. The conventional approach, however, has a major drawback: it is highly sensitive to the features, so most studies have to go through a complicated feature selection process before training the classifier, and the extraction of emotion-discriminative information is to some extent delegated to the feature selection algorithm, which introduces more manual interference.
Disclosure of Invention
In view of the foregoing problems, an object of the present invention is to provide a speech emotion recognition method, apparatus, device and storage medium that recognize speech emotion by combining spectrogram features and phoneme features, improve the classification accuracy with the aid of phoneme information, and reduce manual interference.
The embodiment of the invention provides a speech emotion recognition method, which comprises the following steps:
acquiring a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
acquiring voice characteristics of a user;
extracting a one-hot vector for each phoneme label, and splicing the one-hot vectors into a two-dimensional matrix along the time domain to generate phoneme features;
and respectively inputting the voice features and the phoneme features into a neural network model to obtain emotion prediction output features, and identifying the voice emotion of the user according to the emotion prediction output features.
Preferably, the emotion data set comprises at least one of: happy, sad, and upset; the speech features are spectrogram features; the phoneme label is defined using 39 phonemes.
Preferably, the speech features and the phoneme features are respectively input into a neural network model to obtain emotion prediction output features, and the speech emotion of the user is recognized according to the emotion prediction output features, specifically:
normalizing the spectrogram features to extract spectrogram image texture features;
segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and respectively inputting the spectrogram features and the phoneme features of the training batch into a neural network model, where they are spliced to obtain emotion prediction output features, and recognizing the speech emotion of the user according to the emotion prediction output features.
In a second aspect, an embodiment of the present invention provides a speech emotion recognition apparatus, including:
a phoneme label obtaining unit, configured to obtain a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
the voice feature acquisition unit is used for acquiring the voice features of the user;
a one-hot vector extraction unit, configured to extract one-hot vectors of each phoneme label, and splice each one-hot vector into a two-dimensional matrix according to a time domain to generate a phoneme feature;
and the voice emotion recognition unit is used for respectively inputting the voice features and the phoneme features into a neural network model so as to obtain emotion prediction output features, and recognizing the voice emotion of the user according to the emotion prediction output features.
Preferably, the emotion data set comprises at least one of: happy, sad, and upset; the speech features are spectrogram features; the phoneme label is defined using 39 phonemes.
Preferably, the speech emotion recognition unit includes:
the normalization module is used for performing normalization processing on the spectrogram features so as to extract spectrogram image texture features;
the segmentation module is used for segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and the speech emotion recognition module is used for respectively inputting the spectrogram features and the phoneme features of the training batch into the neural network model, where they are spliced to obtain emotion prediction output features, and recognizing the speech emotion of the user according to the emotion prediction output features.
The embodiment of the invention also provides voice emotion recognition equipment which comprises a processor, a memory and a computer program stored in the memory, wherein the computer program can be executed by the processor to realize the voice emotion recognition method in the embodiment.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the device where the computer-readable storage medium is located is controlled to execute the speech emotion recognition method described in the above embodiment.
In this embodiment, the speech emotion is recognized by combining the spectrogram features and the phoneme features, the classification accuracy is improved with the aid of the phoneme information, and manual interference is reduced.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart illustrating a speech emotion recognition method according to a first embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a CNN speech emotion recognition network combined with phoneme information according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of texture features of a spectrogram image provided in an embodiment of the present invention.
FIG. 4 is a schematic structural diagram of a speech emotion recognition apparatus according to a second embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 to fig. 3, a first embodiment of the present invention provides a speech emotion recognition method, which can be executed by a speech emotion recognition device, and in particular, executed by one or more processors in the speech emotion recognition device, and at least includes the following steps:
s101, acquiring a phoneme label of a user; the phoneme label is a pronunciation dictionary for judging the emotion data set of the user;
in this embodiment, the phoneme label of the user is obtained by the G2P tool, and it should be noted that the phoneme definition used in the phoneme label text in the present application is from the university of kanji merlon pronunciation dictionary, and 39 phoneme definitions are used. Of course, it is to be appreciated that the emotion data set includes at least one of: happiness, sadness, difficulty and happiness, and the invention will not be described herein.
S102, acquiring the voice characteristics of the user.
In this embodiment, the candidate speech features include MFCC features, Mel filterbank features, PLP features, prosodic features and spectrogram features. The features extracted from the spectrogram are finer and give better results, while the other features are coarser and shallower; they may work well for a particular kind of emotion, such as happiness, but worse for others. Therefore, in the present application the speech features are spectrogram features. The candidate features are listed in Table 1:

Table 1:

Feature           | Parameters                                                                       | Dimension
MFCC              | 25 ms window length, 10 ms window shift, 26 Mel filter banks                     | 39
Mel Filterbank    | 25 ms window length, 10 ms window shift, 40 Mel filter banks                     | 40
PLP               | 25 ms window length, 10 ms window shift, 5th-order perceptual linear prediction  | 18
Prosodic features | Fundamental frequency, voicing probability, loudness curve                       | 3
Spectrogram       | 40 ms window length, 10 ms window shift, 1600 FFT points                         | 800
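As an illustration of the spectrogram row of Table 1, the sketch below computes a magnitude spectrogram with a 40 ms window, a 10 ms shift and 1600 FFT points, which yields the 800 retained frequency bins; the 16 kHz sampling rate and the Hann window are assumptions not stated in the table.

```python
import numpy as np

def magnitude_spectrogram(wave: np.ndarray, sr: int = 16000,
                          win_ms: float = 40.0, hop_ms: float = 10.0,
                          n_fft: int = 1600) -> np.ndarray:
    """Frame the waveform, window it, and take the FFT magnitude (Table 1 parameters)."""
    win = int(sr * win_ms / 1000)          # 640 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)          # 160 samples at 16 kHz
    window = np.hanning(win)
    frames = []
    for start in range(0, len(wave) - win + 1, hop):
        frame = wave[start:start + win] * window
        spec = np.abs(np.fft.rfft(frame, n=n_fft))[:800]   # keep 800 frequency bins
        frames.append(spec)
    return np.asarray(frames)               # shape: (n_frames, 800)

# Example on one second of random noise
spec = magnitude_spectrogram(np.random.randn(16000))
print(spec.shape)                            # (97, 800)
```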
S103, extracting a one-hot vector for each phoneme label, and splicing the one-hot vectors into a two-dimensional matrix along the time domain to generate the phoneme features.
In this embodiment, the phoneme features are described using one-hot vectors; that is, each phoneme p is described by a 39-dimensional zero-one vector x defined as follows:

$$x_i = \begin{cases} 1, & \text{if } p \text{ is the } i\text{-th of the 39 defined phonemes} \\ 0, & \text{otherwise} \end{cases} \qquad i = 1, \dots, 39$$

That is, if the phoneme p is the i-th of the 39 defined phonemes, the i-th dimension of the vector x is 1 and all remaining dimensions are 0.
It should be noted that, for the phoneme sequence of each sentence, the obtained one-hot vectors are spliced along the time domain into a two-dimensional matrix to form the phoneme features. The phoneme features are also trained in batches, and the order of sentences in each batch is kept consistent with the order used for the spectrogram features, so that the speech emotion can be identified accurately.
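A minimal sketch of this step, assuming an ARPAbet-style 39-phoneme inventory (the exact phoneme list is an assumption): each phoneme becomes a 39-dimensional one-hot vector, and the vectors of one sentence are stacked along the time axis into a two-dimensional matrix.

```python
import numpy as np

# Assumed 39-phoneme inventory (ARPAbet-style base phonemes without stress markers).
PHONEMES = [
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER", "EY",
    "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW", "OY", "P",
    "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH",
]
PHONEME_INDEX = {p: i for i, p in enumerate(PHONEMES)}

def one_hot(phoneme: str) -> np.ndarray:
    """Return the 39-dimensional zero-one vector x for a single phoneme p."""
    x = np.zeros(len(PHONEMES), dtype=np.float32)
    x[PHONEME_INDEX[phoneme]] = 1.0
    return x

def phoneme_feature_matrix(phoneme_sequence) -> np.ndarray:
    """Stack the one-hot vectors of one sentence along the time axis into a (T, 39) matrix."""
    return np.stack([one_hot(p) for p in phoneme_sequence], axis=0)

feats = phoneme_feature_matrix(["HH", "AH", "L", "OW"])
print(feats.shape)    # (4, 39)
```

When batches are formed, the utterances would be fed in the same order as the spectrogram batches, as required above.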
And S104, respectively inputting the voice features and the phoneme features into a neural network model to obtain emotion prediction output features, and identifying the voice emotion of the user according to the emotion prediction output features.
In this embodiment, a convolutional neural network is also used to learn the phoneme features, and a single convolutional layer followed by global mean pooling is used to obtain the high-level feature output. The design details of the phoneme convolution network are shown in fig. 2. The convolution kernels have a height of 39 so as to cover all phonemes, and a width of 3, so each kernel covers the information of 3 adjacent phonemes. The output dimension of the convolutional layer is 32; the output features pass through a ReLU activation function and then through global mean pooling to obtain a 32-dimensional feature vector, which is concatenated with the output of the speech branch into a 112-dimensional vector, from which the emotion prediction output features are obtained with a fully connected layer.
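The following is a minimal PyTorch sketch of the phoneme branch described above; the input layout (batch, 1, T, 39) and the time-axis padding are assumptions.

```python
import torch
import torch.nn as nn

class PhonemeBranch(nn.Module):
    """One conv layer whose kernel spans all 39 phoneme dimensions and 3 adjacent
    phonemes in time, then ReLU and global mean pooling to a 32-d vector."""

    def __init__(self, n_phonemes: int = 39, out_channels: int = 32):
        super().__init__()
        # Treat the (T, 39) one-hot matrix as a one-channel image; kernel size (3, 39).
        self.conv = nn.Conv2d(1, out_channels, kernel_size=(3, n_phonemes), padding=(1, 0))
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T, 39)
        h = self.relu(self.conv(x))    # (batch, 32, T, 1)
        return h.mean(dim=(2, 3))      # global mean pooling -> (batch, 32)

out = PhonemeBranch()(torch.randn(2, 1, 17, 39))
print(out.shape)                       # torch.Size([2, 32])
```

The 32-dimensional output is then concatenated with the output of the speech branch to form the 112-dimensional vector mentioned above, which implies an 80-dimensional output from the speech branch.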
Specifically, the S104 includes the following steps:
s1041, performing normalization processing on the spectrogram characteristics to extract spectrogram image texture characteristics;
referring to fig. 3, in the present embodiment, the normalization is to take the logarithm of the spectrogram, but the μ rate compression is used in the present application, and the formula is as follows:
Figure BDA0002543838140000062
the mu rate compression can improve the proportion of the low-amplitude part of the spectrogram, on one hand, the numerical difference of the spectrogram is reduced, the training process is more stable, and on the other hand, the neural network can utilize more information.
S1042, segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch (a minimal segmentation sketch is given below);
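A possible implementation of this segmentation, assuming a segment length of 200 frames (chosen to match the 400x200 network input described below, not stated explicitly as the segment length):

```python
import numpy as np

def segment_spectrogram(spec: np.ndarray, segment_len: int = 200) -> np.ndarray:
    """Cut a (freq, time) spectrogram into equal-length segments along the time axis,
    zero-padding the tail so every segment in a batch has the same shape."""
    n_freq, n_frames = spec.shape
    n_segments = int(np.ceil(n_frames / segment_len))
    padded = np.zeros((n_freq, n_segments * segment_len), dtype=spec.dtype)
    padded[:, :n_frames] = spec                  # zero filling for the short tail
    # (n_freq, n_segments * segment_len) -> (n_segments, n_freq, segment_len)
    return padded.reshape(n_freq, n_segments, segment_len).transpose(1, 0, 2)

segments = segment_spectrogram(np.random.rand(400, 430))
print(segments.shape)    # (3, 400, 200)
```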
and S1043, respectively inputting the spectrogram characteristics and the phoneme characteristics after the training batch is formed into a neural network model for splicing to obtain emotion prediction output characteristics, and recognizing the speech emotion of the user according to the emotion prediction output characteristics.
In this embodiment, the network input features are normalized amplitude spectra of size 400x200. The first convolutional layer transforms them into 16-dimensional features, and the output dimension of each subsequent layer grows by 16 so as to further extract high-level features; a max pooling layer with a 2x2 pooling window follows each of the first three convolutional layers. After the fifth convolutional layer, a global mean pooling layer downsamples each high-level emotional feature map to a single value. The loss function used is the cross-entropy loss:

$$L = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

where C is the number of emotion categories, y_c is the one-hot emotion label and \hat{y}_c is the predicted probability of class c.
the network training adopts a random gradient descent method, a cosine attenuation function is adopted for the setting of the learning rate, and the initial learning rate is set to be 0.05.
In summary, by acquiring a phoneme label of a user and the speech features of the user, extracting the phoneme features from the phoneme label, then respectively inputting the speech features and the phoneme features into a neural network model to obtain emotion prediction output features, and recognizing the speech emotion of the user from these output features, the invention recognizes speech emotion by combining spectrogram features and phoneme features, improves the classification accuracy with the aid of phoneme information, and reduces manual interference.
The second embodiment of the present invention:
referring to fig. 4, a second embodiment of the present invention provides a speech emotion recognition apparatus, including:
a phoneme label acquiring unit 100, configured to acquire a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
a voice feature obtaining unit 200, configured to obtain a voice feature of a user;
a one-hot vector extracting unit 300, configured to extract one-hot vectors of each phoneme label, and splice each one-hot vector into a two-dimensional matrix according to a time domain to generate a phoneme feature;
and a speech emotion recognition unit 400, configured to input the speech features and the phoneme features into a neural network model respectively to obtain emotion prediction output features, and recognize speech emotion of the user according to the emotion prediction output features.
In the foregoing embodiment, in a preferred embodiment of the present invention, the emotion data set includes at least one of: happy, sad, and upset; the speech features are spectrogram features; and the phoneme label is defined using 39 phonemes.
In the foregoing embodiment, in a preferred embodiment of the present invention, the speech emotion recognition unit includes:
the normalization module is used for performing normalization processing on the spectrogram characteristics so as to extract spectrogram image texture characteristics;
the segmentation module is used for segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and the speech emotion recognition module is used for respectively inputting the spectrogram features and the phoneme features of the training batch into the neural network model, where they are spliced to obtain emotion prediction output features, and recognizing the speech emotion of the user according to the emotion prediction output features.
The embodiment of the invention also provides voice emotion recognition equipment which comprises a processor, a memory and a computer program stored in the memory, wherein the computer program can be executed by the processor to realize the voice emotion recognition method in the embodiment.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the device where the computer-readable storage medium is located is controlled to execute the speech emotion recognition method described in the above embodiment.
Illustratively, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to accomplish the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the speech emotion recognition device.
The speech emotion recognition device can include, but is not limited to, a processor and a memory. It will be understood by those skilled in the art that the schematic diagram is merely an example of the speech emotion recognition device and does not constitute a limitation; the device may include more or fewer components than those shown, combine certain components, or use different components; for example, the speech emotion recognition device may further include an input/output device, a network access device, a bus, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the speech emotion recognition device and connects the various parts of the entire device using various interfaces and lines.
The memory can be used for storing the computer programs and/or modules, and the processor realizes the various functions of the speech emotion recognition device by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the integrated unit of the speech emotion recognition device is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments can be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (6)

1. A speech emotion recognition method is characterized by comprising the following steps:
acquiring a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
acquiring voice characteristics of a user;
extracting one-hot vectors of each phoneme label, and splicing each one-hot vector into a two-dimensional matrix according to a time domain to generate phoneme characteristics;
respectively inputting the voice features and the phoneme features into a neural network model to obtain emotion prediction output features, and identifying the voice emotion of the user according to the emotion prediction output features; the voice features are spectrogram features; the method comprises the following specific steps:
normalizing the spectrogram features to extract spectrogram image texture features;
segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and respectively inputting the spectrogram features and the phoneme features of the training batch into the neural network model, where they are spliced to obtain the emotion prediction output features, and identifying the speech emotion of the user according to the emotion prediction output features.
2. The speech emotion recognition method of claim 1, wherein the emotion data set includes at least one of: happy, sad, and upset; and the phoneme label is defined using 39 phonemes.
3. A speech emotion recognition apparatus, comprising:
a phoneme label obtaining unit, configured to obtain a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
the voice feature acquisition unit is used for acquiring the voice features of the user;
a one-hot vector extraction unit, configured to extract one-hot vectors of each phoneme label, and splice each one-hot vector into a two-dimensional matrix according to a time domain to generate phoneme features;
the voice emotion recognition unit is used for respectively inputting the voice features and the phoneme features into a neural network model so as to obtain emotion prediction output features, and recognizing the voice emotion of the user according to the emotion prediction output features; the voice features are spectrogram features;
the speech emotion recognition unit comprises:
the normalization module is used for performing normalization processing on the spectrogram characteristics so as to extract spectrogram image texture characteristics;
the segmentation module is used for segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and the speech emotion recognition module is used for respectively inputting the spectrogram features and the phoneme features of the training batch into the neural network model, where they are spliced to obtain emotion prediction output features, and recognizing the speech emotion of the user according to the emotion prediction output features.
4. The speech emotion recognition device of claim 3, wherein the emotion data set includes at least one of: happy, sad, and upset; and the phoneme label is defined using 39 phonemes.
5. A speech emotion recognition device comprising a processor, a memory and a computer program stored in the memory, the computer program being executable by the processor to implement the speech emotion recognition method as claimed in any one of claims 1 to 2.
6. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the speech emotion recognition method according to any one of claims 1-2.
CN202010554606.5A 2020-06-17 2020-06-17 Speech emotion recognition method, device, equipment and computer storage medium Active CN111696579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010554606.5A CN111696579B (en) 2020-06-17 2020-06-17 Speech emotion recognition method, device, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010554606.5A CN111696579B (en) 2020-06-17 2020-06-17 Speech emotion recognition method, device, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN111696579A CN111696579A (en) 2020-09-22
CN111696579B true CN111696579B (en) 2022-10-28

Family

ID=72481723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010554606.5A Active CN111696579B (en) 2020-06-17 2020-06-17 Speech emotion recognition method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN111696579B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750468A (en) * 2020-12-28 2021-05-04 厦门嘉艾医疗科技有限公司 Parkinson disease screening method, device, equipment and storage medium
CN113257225B (en) * 2021-05-31 2021-11-02 之江实验室 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN114566189B (en) * 2022-04-28 2022-10-04 之江实验室 Speech emotion recognition method and system based on three-dimensional depth feature fusion

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107845390A (en) * 2017-09-21 2018-03-27 太原理工大学 A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
WO2019139430A1 (en) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
CN110050302A (en) * 2016-10-04 2019-07-23 纽昂斯通讯有限公司 Speech synthesis
CN110148406A (en) * 2019-04-12 2019-08-20 北京搜狗科技发展有限公司 A kind of data processing method and device, a kind of device for data processing
CN111079794A (en) * 2019-11-21 2020-04-28 华南师范大学 Sound data enhancement method based on inter-category mutual fusion
CN111210807A (en) * 2020-02-21 2020-05-29 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10249294B2 (en) * 2016-09-09 2019-04-02 Electronics And Telecommunications Research Institute Speech recognition system and method
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110050302A (en) * 2016-10-04 2019-07-23 纽昂斯通讯有限公司 Speech synthesis
CN107845390A (en) * 2017-09-21 2018-03-27 太原理工大学 A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features
WO2019139430A1 (en) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN110148406A (en) * 2019-04-12 2019-08-20 北京搜狗科技发展有限公司 A kind of data processing method and device, a kind of device for data processing
CN111079794A (en) * 2019-11-21 2020-04-28 华南师范大学 Sound data enhancement method based on inter-category mutual fusion
CN111210807A (en) * 2020-02-21 2020-05-29 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech emotion recognition algorithm based on extracting deep spatial attention features from spectrograms; Wang Jinhua et al.; Telecommunications Science; No. 7, July 2019; pp. 100-108 *
Research on spectrogram feature extraction algorithms for speech emotion recognition; Tang Guichen et al.; Computer Engineering and Applications; Vol. 52, No. 21, 2016; pp. 152-156, 174 *

Also Published As

Publication number Publication date
CN111696579A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111696579B (en) Speech emotion recognition method, device, equipment and computer storage medium
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
US20190005943A1 (en) Speech recognition system using machine learning to classify phone posterior context information and estimate boundaries in speech from combined boundary posteriors
US8676574B2 (en) Method for tone/intonation recognition using auditory attention cues
CN109036381A (en) Method of speech processing and device, computer installation and readable storage medium storing program for executing
KR20130133858A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN110083716A (en) Multi-modal affection computation method and system based on Tibetan language
Sethu et al. Speech based emotion recognition
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
WO2021175031A1 (en) Information prompting method and apparatus, electronic device, and medium
Patnaik Speech emotion recognition by using complex MFCC and deep sequential model
Chittaragi et al. Automatic dialect identification system for Kannada language using single and ensemble SVM algorithms
CN111179910A (en) Speed of speech recognition method and apparatus, server, computer readable storage medium
Airaksinen et al. Data augmentation strategies for neural network F0 estimation
CN108899046A (en) A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
Shah et al. Speech emotion recognition based on SVM using MATLAB
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Trouvain et al. Canary song decoder: Transduction and implicit segmentation with ESNs and LTSMs
Noroozi et al. A study of language and classifier-independent feature analysis for vocal emotion recognition
China Bhanja et al. Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system
Sinha et al. Fusion of multi-stream speech features for dialect classification
CN114566156A (en) Keyword speech recognition method and device
Yadav et al. Speech Emotion Recognition using Convolutional Recurrent Neural Network
Bakshi et al. Improving Indian spoken-language identification by feature selection in duration mismatch framework

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant