CN113362857A - Real-time speech emotion recognition method based on CapcNN and application device - Google Patents
- Publication number
- CN113362857A (application CN202110663975.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- spectrogram
- data
- speech
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Abstract
A real-time speech emotion recognition method based on CapCNN and an application device thereof relate to the technical field of biometric recognition. The method comprises the following steps: step one, collecting voice data from a number of autistic children and preprocessing the extracted data, the preprocessing comprising endpoint detection and frame-windowing operations; step two, extracting the spectral features of the preprocessed voice data as input data; step three, constructing a CapCNN-based model, training it on the input data, and judging the emotion of the input speech; step four, combining the input data with the emotion classification in the model to interact with the recognized subject. Compared with other speech emotion recognition methods, the method achieves higher accuracy, performs better on short utterances, shows better robustness across several data sets, grasps both the position information and the overall characteristics of the spectrogram, and is efficient and stable.
Description
Technical Field
The invention relates to the technical field of pattern recognition, and in particular to a real-time speech emotion recognition method based on CapCNN and an application device thereof.
Background
Voice is the most common, effective and convenient way for human beings to communicate. Through speech, people express not only basic semantic information but also the speaker's emotion and mood; the emotional information contained in speech signals is an important information resource and one of the essential cues by which people perceive things. Speech emotion recognition is a key technology for realizing intelligent human-computer interaction and is widely applied in many fields; the present invention focuses mainly on assistive treatment for children with autism.
Autism has a serious impact on children's growth and development, and studies show that its prevalence keeps rising. Under such circumstances, current treatment of autism relies mainly on human intervention, which becomes very difficult to carry out for the large population of autistic patients. Combining treatment with computer-based intervention can therefore help autistic children acquire self-knowledge and step out of their closed-off world through human-computer interaction. During interaction between a doctor and an autistic patient, the computer can capture emotion changes in their spoken language in time, i.e., perform real-time speech emotion recognition, and feed the patient's emotional state back to the doctor in real time, so that the doctor can promptly and correctly guide and stabilize the patient's emotional changes.
Speech emotion recognition technology still has some shortcomings. First, domestic research on speech emotion recognition is still at an early stage, and because of the complexity of speech and the diversity of languages there are few large, high-quality speech databases available for research. Second, speech does not always correspond to the emotional state: some emotions are not expressed through obvious emotional changes in speech, and even humans can hardly judge a person's emotional state accurately from speech alone, often needing the specific environment and context at the time; this makes computer-based speech emotion recognition challenging. Finally, although current emotion recognition methods are various, each has merits and demerits, and the most efficient and stable methods remain to be found. CNNs, which share convolution kernels and extract features automatically, show unique advantages for processing high-dimensional data, but their pooling layers lose much valuable information and attend only to local features, so CNNs handle time series only moderately well. For time-sensitive tasks, LSTMs are generally more appropriate, yet LSTM networks have limitations when the time dependencies span large distances on the time axis. Current speech emotion recognition therefore suffers from a shortage of samples and high recognition difficulty.
Disclosure of Invention
In view of the above-mentioned drawbacks and deficiencies of the prior art, an object of the present invention is to provide a CapCNN-based real-time speech emotion recognition method and an application device thereof, which improve the accuracy, precision and generalization ability of recognition. Compared with other speech emotion recognition methods, it shows better robustness and better emotion classification across several data sets, grasps both the position information and the overall characteristics of the spectrogram, and is an efficient and stable speech emotion recognition method.
In order to solve the above problem, in a first aspect, the present invention provides a CapCNN-based real-time speech emotion recognition method, including:
collecting voice data with different emotions, framing the audio into 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames), and then applying a Hamming window to each speech segment;
extracting the spectral features of the voice data: first, performing short-time Fourier analysis on each frame of the preprocessed voice signal and converting it into a spectrogram; then, before the spectrogram is used as input data, linearly normalizing the original features and quantizing the spectrogram into a 0-255 gray-scale image; finally, converting the gray-scale image into a matrix of size [1000, 40]. The specific steps are as follows:
(1) The speech features are 2-D, so the input data is expanded from 2-D to 3-D.
(2) Three consecutive convolution operations are applied in the convolutional layers to prepare for the subsequent capsule routing algorithm.
(3) The matrix output by the convolutional layers is reshaped and passed to the capsule layer.
(4) The data is fed into the capsule neural network, with a dynamic routing algorithm run between capsule layers.
(5) Three fully connected layers and the Adam optimizer are used.
Constructing a CapCNN-based model, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion;
and judging the emotional state of the subject by combining the input data with the model's emotion classification, so as to perform targeted human-computer interaction.
In a second aspect, the present application further provides a distributed microphone array for collecting voice data, the distributed microphone array comprising a plurality of microphone array nodes, each node being provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method described in the embodiments of the present application.
In a third aspect, an embodiment of the present application provides a highly integrated hardware DSP for an embedded speech recognition system, comprising a microcontroller (MCU), A/D and D/A converters, RAM and ROM integrated on a chip, and a computer program stored in the ROM and operable on the MCU; when executing the computer program, the MCU implements the method described in the embodiments of the present application. The device features small size, high integration, good reliability, strong interrupt-handling capability, high cost-effectiveness, powerful functions, an efficient instruction system, low power consumption and low voltage.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which may include: ROM, RAM, a magnetic disk, an optical disk, a magneto-optical disk, or semiconductor memory, etc., on which a computer program is stored; when executed by a processor, the program implements a method as described in the embodiments of the present application.
Drawings
Embodiments of the present invention will now be described with reference to the accompanying drawings, in which
FIG. 1 is a schematic diagram illustrating a CapCNN-based real-time speech emotion recognition process according to the present application;
fig. 2 shows a schematic diagram of the CapCNN network framework of the present application.
Detailed Description
In order to illustrate the objects, technical processes and innovations of the present invention more clearly, the invention is further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. In addition, the technical features in the embodiments described below may be combined with each other as long as they do not conflict.
In order to achieve the above object, the present invention provides a real-time speech emotion recognition method based on CapCNN, the main flow of which is shown in fig. 1, the method comprising:
Step one, collecting voice data with different emotions, framing the audio into 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames), and then applying a Hamming window to each speech segment.
Specifically, a microphone or similar audio-acquisition device is used with a recording application on a PC terminal to record the speech signal. The frequency range of the speech signal is 300-3400 Hz; to improve sampling precision while keeping the data volume manageable, the speech of the autistic child is captured as a single channel at a sampling rate of 11025 Hz, and the extracted data is then preprocessed. The audio is first divided into 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames). Then, to locate the start and end of speech accurately and distinguish speech segments from non-speech segments, the average short-time energy of each speech segment is computed as E_a = (1/n) Σ_{i=1}^{n} E_i, with threshold E_h = E_a / 2, where E_a is the average short-time energy, n is the total number of frames, E_i is the short-time energy of the current frame, and E_h is half the average short-time energy. Likewise, the average short-time zero-crossing rate Z_a = (1/n) Σ_{i=1}^{n} Z_i is computed, where Z_i is the zero-crossing rate of each frame. Double-threshold endpoint detection is then performed using these two values as thresholds;
finally, to reduce the spectral leakage caused by framing, a Hamming window w(n) = (1 − α) − α·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, α = 0.46, is applied to each frame of speech, where N is the number of samples in a frame and α is a fixed coefficient.
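As an illustration of the framing, windowing and endpoint-feature computation described above, a minimal NumPy sketch follows (not part of the patent; the function names and the exact energy and zero-crossing-rate definitions are assumptions):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (tail samples are dropped)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def preprocess(x, sr=11025, frame_ms=25, shift_ratio=0.5, alpha=0.46):
    """Frame the signal, compute per-frame short-time energy E_i and
    zero-crossing rate Z_i, and apply the Hamming window
    w(n) = (1 - alpha) - alpha*cos(2*pi*n / (N - 1))."""
    frame_len = int(sr * frame_ms / 1000)   # 25 ms -> 275 samples at 11025 Hz
    hop = int(frame_len * shift_ratio)      # frame shift = half the frame length
    frames = frame_signal(x, frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)    # short-time energy E_i per frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # Z_i
    e_half = energy.mean() / 2              # energy threshold E_h = E_a / 2
    n = np.arange(frame_len)
    window = (1 - alpha) - alpha * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window, energy, zcr, e_half
```

The double-threshold endpoint decision itself (comparing each frame's energy and zero-crossing rate against the thresholds) is omitted here for brevity.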
Step two, extracting the spectral features of the voice data: first, short-time Fourier analysis is performed on each frame of the preprocessed voice signal, which is then converted into a spectrogram; before the spectrogram is used as input data, its original features are linearly normalized and the spectrogram is quantized into a 0-255 gray-scale image, which is finally converted into a matrix of size [1000, 40].
Specifically, the spectral features of the voice data are extracted and a spectrogram is drawn. First, short-time Fourier analysis is applied to each frame of the preprocessed signal: X(n, k) = Σ_m x(m)·w(n − m)·e^(−j2πkm/N), 0 ≤ k ≤ N − 1, where X(n, k) is the short-time amplitude spectrum estimate of x(n). The extracted spectral features are then converted into a spectrogram: the spectral energy density at frame n is P(n, k) = |X(n, k)|² = X(n, k)·conj(X(n, k)); the two-dimensional image obtained from this formula is the spectrogram, with n on the abscissa, k on the ordinate, and P(n, k) rendered as color intensity. Because the emotional features used by the method have different meanings and widely different value ranges, the original features are linearly normalized before feature selection so that the features are comparable: P′(a, b) = (P(a, b) − P_min) / (P_max − P_min), where P(a, b) is the value of each point in the spectrogram and P_max and P_min are the maximum and minimum values of the spectrogram matrix. After normalization, the spectrogram is quantized into a 0-255 gray-scale image, which is finally converted into a matrix of size [1000, 40].
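The spectrogram extraction and 0-255 quantization above can be sketched as follows (a hedged illustration: the FFT size and the log compression are assumptions, and the final resizing to [1000, 40] is omitted):

```python
import numpy as np

def spectrogram_matrix(frames, n_fft=512):
    """Per-frame FFT -> energy density P = |X|^2, then min-max normalization
    and quantization to a 0-255 gray-scale matrix, as described above."""
    X = np.fft.rfft(frames, n=n_fft, axis=1)        # short-time spectrum X(n, k)
    P = np.abs(X) ** 2                              # P(n, k) = X(n, k) * conj(X(n, k))
    P = np.log1p(P)                                 # log compression (assumption, common practice)
    P_norm = (P - P.min()) / (P.max() - P.min())    # linear normalization to [0, 1]
    return np.round(P_norm * 255).astype(np.uint8)  # 0-255 gray-scale image
```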
Step three, constructing a CapCNN-based model, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion.
the conventional convolutional neural network is always understood from partial features of an image, and positional features of the whole matrix are ignored. For a data set with short voice time, the network structure of the method (called CapcNN) constructed based on the Convolutional Neural Network (CNN) and the capsule neural network (CapsNet) not only focuses on local features, but also understands image features from the whole and is prominent in the aspect of extracting position information. On the basis of a convolutional network, a routing algorithm is used as a representation enhancement structure, and the accuracy of speech emotion recognition is enhanced. For the part of the convolutional neural network, the next operation is carried out by reducing the size of the high-dimensional features to obtain a compact matrix;
the voice features are 2D, and input data needs to be expanded from 2D to 3D;
two consecutive convolutions with stride 2, kernel size 13 and 8 output channels are applied to the data, each convolutional layer followed by a ReLU activation and batch normalization;
one more convolution with stride 2, kernel size 13 and 64 output channels is applied; every 8 convolution units are packed together as a new unit in preparation for the subsequent capsule routing algorithm, the tensor is max-pooled again to match the input of the capsule layer, and its 2nd dimension is set to 1;
since the speech emotion recognition task requires accurate extraction of feature position information, the matrix output by the convolutional layers is reshaped to [-1, 16] and passed to the capsule layer;
the lengths of the input and output vectors of the capsule neural network represent the probability of an entity, taking values between 0 and 1; the Squash non-linearity v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖) is used, where j denotes a given capsule and s_j is its input vector; Squash shrinks short vectors to almost zero length, while long vectors approach, but never exceed, length 1;
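A minimal sketch of the Squash non-linearity described above (the epsilon guard is an implementation assumption to avoid division by zero):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    Short vectors shrink toward zero; long vectors approach unit length."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```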
in the dynamic routing algorithm between capsule layers, each lower-layer capsule i scales its output vector by a scalar weight c_ij, which is determined by the iterative dynamic routing algorithm; the weighted output vector is then sent to the higher-layer capsule j as its input;
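The routing-by-agreement loop between capsule layers can be sketched as follows (an illustrative NumPy version with assumed shapes; three iterations is a common choice, not stated in the text):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity applied to each upper-capsule vector."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iters=3):
    """Dynamic routing between capsule layers.
    u_hat: prediction vectors, shape [n_lower, n_upper, dim_upper]."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))               # routing logits b_ij
    for _ in range(n_iters):
        c = softmax(b, axis=1)                     # coupling coefficients c_ij
        s = np.einsum('ij,ijk->jk', c, u_hat)      # weighted sum per upper capsule
        v = squash(s)                              # squashed upper-capsule output
        b = b + np.einsum('ijk,jk->ij', u_hat, v)  # agreement update
    return v
```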
three layers (512, 1024, 784) are expanded through the full connection layer, wherein the two layers are activation functions of ReLu, and the last layer is an activation function of Sigmoid;
the Adam optimizer is used with an initial learning rate of 0.001, a weight decay of 1.0×10⁻⁶, an exponential decay rate of 0.9 for the first-moment estimate, and 0.999 for the second-moment estimate.
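For concreteness, one Adam update with the hyperparameters stated above could look like this (a sketch; applying the weight decay as L2 regularization on the gradient is an assumption):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-6):
    """One Adam update with lr=0.001, beta1=0.9, beta2=0.999, decay=1e-6.
    t is the 1-based step counter used for bias correction."""
    grad = grad + weight_decay * theta          # L2-style weight decay (assumption)
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```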
Step four, judging the emotional state of the autistic child by combining the input data with the model's emotion classification, so as to perform targeted human-computer interaction.
Compared with state-of-the-art speech emotion recognition models, the method achieves higher classification accuracy not only on the whole data set but also on independent data sets. Grasping position information and overall characteristics is essential for speech emotion recognition, and the capsule neural network shows clear development potential in this application.
For a better understanding of the present invention, the foregoing has been described in detail with reference to specific examples thereof, but the invention is not limited thereto. Any simple modifications of the above embodiments according to the technical essence of the present invention still fall within the scope of the technical solution of the present invention.
As another aspect, the present application further provides a distributed microphone array for collecting voice data, the distributed microphone array comprising a plurality of microphone array nodes, each node being provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method described in the embodiments of the present application.
As another aspect, the present application further provides a highly integrated hardware DSP for an embedded speech recognition system, comprising a microcontroller (MCU), A/D and D/A converters, RAM and ROM integrated on a chip, and a computer program stored in the ROM and capable of running on the MCU; when executing the computer program, the MCU implements the method described in the embodiments of the present application. The device features small size, high integration, good reliability, strong interrupt-handling capability, high cost-effectiveness, powerful functions, an efficient instruction system, low power consumption and low voltage.
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the foregoing device in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present application.
Any storage medium referenced by embodiments of the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), optical disks (including compact disk read-only memory (CD-ROM) and digital versatile disks (DVD)), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit with logic gates implementing logic functions on data signals, an application-specific integrated circuit (ASIC) with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware form, and can also be realized in a software and firmware form. The integrated module, if implemented in software or firmware and sold or used as a stand-alone product, may be transferred from a storage medium or a network into a computer having a dedicated hardware structure for functional implementation.
It is also to be noted that the steps of performing the above-described series of processes may naturally be performed chronologically in the order of description, but need not necessarily be performed chronologically. Some steps may be performed in parallel or independently of each other.
While embodiments of the present application have been illustrated and described above, it should be understood that they have been presented by way of example only, and not limitation. Variations, modifications, substitutions and alterations of the above-described embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and it is therefore intended that the scope of the invention not be limited thereto, but rather by the claims appended hereto.
Claims (7)
1. A real-time speech emotion recognition method based on CapCNN, the method comprising the following steps:
collecting voice data with different emotions, framing the audio into 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames), and then applying a Hamming window to each speech segment;
extracting the spectral features of the voice data: first, performing short-time Fourier analysis on each frame of the preprocessed voice signal and converting it into a spectrogram; then, before the spectrogram is used as input data, linearly normalizing the original features and quantizing the spectrogram into a 0-255 gray-scale image; finally, converting the gray-scale image into a matrix of size [1000, 40];
constructing a CapCNN-based model, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion;
and judging the emotional state of the subject by combining the input data with the model's emotion classification, so as to perform targeted human-computer interaction.
2. The method of claim 1, wherein preprocessing the collected audio of different emotions comprises: framing a segment of speech using 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames); then, to accurately locate the start and end of speech and distinguish speech segments from non-speech segments, calculating the average short-time energy and average short-time zero-crossing rate of each speech segment and performing double-threshold endpoint detection with these as thresholds; and finally applying a Hamming window to each frame of speech to reduce the spectral leakage caused by framing.
3. The method according to claim 1, wherein the extracting of the speech features specifically comprises:
step one, extracting spectral features: performing short-time Fourier analysis on each frame of the preprocessed speech signal, X(n, k) = Σ_m x(m)·w(n − m)·e^(−j2πkm/N), where 0 ≤ k ≤ N − 1, N is the number of frequency points, k is the frequency-bin index, and X(n, k) is the short-time amplitude spectrum estimate of x(n);
step two, drawing the spectrogram: converting the spectral features extracted in step one, the spectral energy density function at frame n being P(n, k) = |X(n, k)|² = X(n, k) × conj(X(n, k)), where n is the abscissa, k is the ordinate, and P(n, k) is rendered as color intensity; the two-dimensional image thus obtained is the spectrogram;
step three, normalization and graying of the spectrogram: the emotional features used have different meanings and widely differing value ranges, so to make the various features commensurable, linear normalization is applied to the original features before feature selection, P′(a, b) = (P(a, b) − Pmin) / (Pmax − Pmin), where P(a, b) is the gray value at each point of the spectrogram and Pmax and Pmin are the maximum and minimum values of the spectrogram matrix; after amplitude normalization, the spectrogram is quantized into a 0-255 gray-scale map;
and step four, converting the spectrogram into a matrix.
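Steps one through four of claim 3 can be sketched in NumPy as follows. This is an assumed sketch: the claim does not specify the FFT length or whether a log scale is applied before normalization, so the `rfft` over the frame length and the log-power floor used here are illustrative choices.

```python
import numpy as np

def spectrogram_matrix(frames):
    """Spectral energy density per claim 3: P(n, k) = X(n, k) * conj(X(n, k))."""
    X = np.fft.rfft(frames, axis=1)   # short-time Fourier transform of each frame
    P = (X * np.conj(X)).real         # power spectrum, one row per frame
    P = np.log10(P + 1e-10)           # log scale (assumption); floor avoids log(0)
    return P.T                        # rows = frequency bins k, columns = frames n

def to_grayscale(P):
    """Linear min-max normalization, then quantization to a 0-255 grayscale matrix."""
    P_norm = (P - P.min()) / (P.max() - P.min())
    return np.round(P_norm * 255).astype(np.uint8)
```

The resulting `uint8` matrix is the grayscale spectrogram of step four, ready to be fed to the network as input data.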
4. The method of claim 1, wherein the constructing of the speech emotion recognition network comprises: for the convolutional neural network part, reducing the size of the high-dimensional features to obtain a compact matrix for the subsequent operations;
step one, the speech features are 2D, so the input data is expanded from 2D to 3D;
step two, performing two consecutive convolution operations on the data, each with stride 2, kernel size 13, and 8 output channels, and applying a ReLU activation function and batch normalization after each convolution layer;
step three, performing a convolution operation with stride 2, kernel size 13, and 64 output channels, wherein every 8 convolution units are packaged together as a new unit in preparation for the subsequent capsule routing algorithm; max pooling is then applied to the tensor to match the input of the capsule layer, and the 2nd dimension of the tensor is set to 1;
step four, the speech emotion recognition task requires accurately extracting feature position information, so the matrix output by the convolution layers is reshaped to [−1, 16] and fed to the capsule layer;
step five, the length of the input and output vectors of the capsule neural network represents the probability that an entity exists, with a value between 0 and 1; the Squash nonlinear function v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖) is used, where j denotes a given capsule and s_j is the input vector of that capsule; the Squash function shrinks short vectors to nearly zero length, while long vectors approach, but never exceed, length 1;
step six, a dynamic routing algorithm is run among the capsules: a lower-layer capsule i adjusts a scalar weight c_ij, which is determined by the iterative dynamic routing algorithm; the output vector of the lower-layer capsule is multiplied by this weight and sent to the higher-layer capsule as its input;
step seven, three fully connected layers (512, 1024, and 784 units) are appended, the first two with ReLU activation functions and the last with a Sigmoid activation function;
step eight, the Adam optimizer is used, with an initial learning rate of 0.001, a weight decay of 1.0 × 10⁻⁶, an exponential decay rate of 0.9 for the first-moment estimate, and an exponential decay rate of 0.999 for the second-moment estimate.
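The Squash nonlinearity of step five and the capsule routing of step six can be sketched in NumPy as follows. The tensor shapes, the three routing iterations, and the softmax over routing logits are illustrative assumptions; the claim fixes neither the iteration count nor these dimensions.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash nonlinearity: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).

    Shrinks short vectors toward zero; long vectors approach but never reach length 1."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(norm2 + eps)
    return (norm2 / (1.0 + norm2)) * (s / norm)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement between a lower and a higher capsule layer.

    u_hat: prediction vectors, shape (n_lower, n_higher, dim)."""
    n_lower, n_higher, dim = u_hat.shape
    b = np.zeros((n_lower, n_higher))                 # routing logits b_ij
    for _ in range(iterations):
        # coupling coefficients c_ij: softmax over higher capsules per lower capsule
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        s = (c[:, :, None] * u_hat).sum(axis=0)       # weighted sum into higher capsules
        v = squash(s)                                 # higher-capsule outputs
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)  # agreement update
    return v
```

Because the output vectors pass through `squash`, their lengths stay strictly below 1 and can be read as entity probabilities, as step five describes.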
5. A distributed microphone array for collecting speech data, the array comprising a plurality of microphone array nodes, each provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method of any one of claims 1 to 4.
6. An embedded speech recognition application device comprising a microcontroller with an MCU, A/D and D/A converters, RAM, and ROM integrated on one chip, and a computer program stored in said ROM and executable on said MCU, wherein the MCU, when executing said computer program, implements the method of any one of claims 1 to 4.
7. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110663975.2A CN113362857A (en) | 2021-06-15 | 2021-06-15 | Real-time speech emotion recognition method based on CapcNN and application device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113362857A true CN113362857A (en) | 2021-09-07 |
Family
ID=77534417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110663975.2A Pending CN113362857A (en) | 2021-06-15 | 2021-06-15 | Real-time speech emotion recognition method based on CapcNN and application device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113362857A (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104637497A (en) * | 2015-01-16 | 2015-05-20 | 南京工程学院 | Speech spectrum characteristic extracting method facing speech emotion identification |
CN105047194A (en) * | 2015-07-28 | 2015-11-11 | 东南大学 | Self-learning spectrogram feature extraction method for speech emotion recognition |
CN107845390A (en) * | 2017-09-21 | 2018-03-27 | 太原理工大学 | A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features |
CN108597539A (en) * | 2018-02-09 | 2018-09-28 | 桂林电子科技大学 | Speech-emotion recognition method based on parameter migration and sound spectrograph |
CN108899051A (en) * | 2018-06-26 | 2018-11-27 | 北京大学深圳研究生院 | A kind of speech emotion recognition model and recognition methods based on union feature expression |
CN109523994A (en) * | 2018-11-13 | 2019-03-26 | 四川大学 | A kind of multitask method of speech classification based on capsule neural network |
US20200159778A1 (en) * | 2018-06-19 | 2020-05-21 | Priyadarshini Mohanty | Methods and systems of operating computerized neural networks for modelling csr-customer relationships |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111785301A (en) * | 2020-06-28 | 2020-10-16 | 重庆邮电大学 | Residual error network-based 3DACRNN speech emotion recognition method and storage medium |
CN111883178A (en) * | 2020-07-17 | 2020-11-03 | 渤海大学 | Double-channel voice-to-image-based emotion recognition method |
CN112562725A (en) * | 2020-12-09 | 2021-03-26 | 山西财经大学 | Mixed voice emotion classification method based on spectrogram and capsule network |
CN112800225A (en) * | 2021-01-28 | 2021-05-14 | 南京邮电大学 | Microblog comment emotion classification method and system |
Non-Patent Citations (1)
Title |
---|
XIN-CHENG WEN et al.: "The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition", pages 9356-9362 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116577037A (en) * | 2023-07-12 | 2023-08-11 | 上海电机学院 | Air duct leakage signal detection method based on non-uniform frequency spectrogram |
CN116577037B (en) * | 2023-07-12 | 2023-09-12 | 上海电机学院 | Air duct leakage signal detection method based on non-uniform frequency spectrogram |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sehgal et al. | A convolutional neural network smartphone app for real-time voice activity detection | |
JP6198872B2 (en) | Detection of speech syllable / vowel / phoneme boundaries using auditory attention cues | |
US9020822B2 (en) | Emotion recognition using auditory attention cues extracted from users voice | |
Qamhan et al. | Digital audio forensics: microphone and environment classification using deep learning | |
CN111341319B (en) | Audio scene identification method and system based on local texture features | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
CN114566189B (en) | Speech emotion recognition method and system based on three-dimensional depth feature fusion | |
CN111128222A (en) | Speech separation method, speech separation model training method, and computer-readable medium | |
CN112562725A (en) | Mixed voice emotion classification method based on spectrogram and capsule network | |
Kuang et al. | Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks | |
Kaushik et al. | SLINet: Dysphasia detection in children using deep neural network | |
CN113362857A (en) | Real-time speech emotion recognition method based on CapcNN and application device | |
CN113257279A (en) | GTCN-based real-time voice emotion recognition method and application device | |
Sun | Digital audio scene recognition method based on machine learning technology | |
Sharma et al. | HindiSpeech-Net: a deep learning based robust automatic speech recognition system for Hindi language | |
Dehghani et al. | Time-frequency localization using deep convolutional maxout neural network in Persian speech recognition | |
Tailor et al. | Deep learning approach for spoken digit recognition in Gujarati language | |
Meng et al. | A lightweight CNN and Transformer hybrid model for mental retardation screening among children from spontaneous speech | |
Manjutha et al. | An optimized cepstral feature selection method for dysfluencies classification using Tamil speech dataset | |
Anguraj et al. | Analysis of influencing features with spectral feature extraction and multi-class classification using deep neural network for speech recognition system | |
Kayal et al. | Multilingual vocal emotion recognition and classification using back propagation neural network | |
Gul et al. | Single channel speech enhancement by colored spectrograms | |
US20240038215A1 (en) | Method executed by electronic device, electronic device and storage medium | |
Fartash et al. | A scale–rate filter selection method in the spectro-temporal domain for phoneme classification | |
Chen et al. | Separation of Speech from Speech Interference Based on EGG |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||