CN113362857A - Real-time speech emotion recognition method based on CapcNN and application device - Google Patents
- Publication number
- CN113362857A (application CN202110663975.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- spectrogram
- data
- speech
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Abstract
A real-time speech emotion recognition method based on CapCNN and an application device thereof relate to the technical field of biometric recognition. The method comprises the following steps: step one, collecting voice data from a number of autistic children and preprocessing the extracted data, the preprocessing comprising endpoint detection and frame-windowing operations; step two, extracting the spectral features of the preprocessed voice data as input data; step three, constructing a CapCNN-based model, training it on the input data, and judging the emotion of the input speech; step four, combining the input data with the emotion classification in the model to interact with the recognized subject. Compared with other speech emotion recognition methods, the method achieves higher accuracy, performs better on short utterances, shows better robustness across several data sets, grasps both the position information and the overall characteristics of the spectrogram, and is efficient and stable.
Description
Technical Field
The invention relates to the technical field of pattern recognition, and in particular to a real-time speech emotion recognition method based on CapCNN and an application device thereof.
Background
Voice is the most common, effective and convenient way for human beings to communicate. Through speech, people express not only basic semantic information but also the speaker's emotion and mood; the emotional information contained in speech signals is an important information resource and one of the essential cues by which people perceive things. Speech emotion recognition is a key technology for realizing intelligent human-computer interaction and is widely applied in many fields; the present invention focuses mainly on assistive treatment for children with autism.
Autism has a serious impact on children's growth and development, and studies show that its prevalence keeps rising. Under such circumstances, current treatment of autism relies mainly on human intervention, which becomes very difficult to carry out for the large population of autistic patients. Combining treatment with computer-based intervention can therefore help autistic children acquire self-knowledge and step out of their closed-off world through human-computer interaction. During interaction between a doctor and an autistic patient, the computer can capture emotion changes in their spoken language in time, i.e., perform real-time speech emotion recognition, and feed the patient's emotional state back to the doctor in real time, so that the doctor can promptly and correctly guide and stabilize the patient's emotional changes.
Speech emotion recognition technology still has some shortcomings. First, domestic research on speech emotion recognition is still at an early stage, and because of the complexity of speech and the diversity of languages there are few large, high-quality speech databases available for research. Second, speech does not always correspond to the emotional state: some emotions are not expressed through obvious emotional changes in speech, and even humans can hardly judge a person's emotional state accurately from speech alone, often needing the specific environment and context at the time; this makes computer-based speech emotion recognition challenging. Finally, although current emotion recognition methods are various, each has merits and demerits, and the most efficient and stable methods remain to be found. CNNs, which share convolution kernels and extract features automatically, show unique advantages for processing high-dimensional data, but their pooling layers lose much valuable information and attend only to local features, so CNNs handle time series only moderately well. For time-sensitive tasks, LSTMs are generally more appropriate, yet LSTM networks have limitations when the time dependencies span large distances on the time axis. Current speech emotion recognition therefore suffers from a shortage of samples and high recognition difficulty.
Disclosure of Invention
In view of the above-mentioned drawbacks and deficiencies of the prior art, an object of the present invention is to provide a CapCNN-based real-time speech emotion recognition method and an application device thereof, which improve the accuracy, precision and generalization ability of recognition. Compared with other speech emotion recognition methods, it shows better robustness and better emotion classification across several data sets, grasps both the position information and the overall characteristics of the spectrogram, and is an efficient and stable speech emotion recognition method.
In order to solve the above problem, in a first aspect, the present invention provides a CapCNN-based real-time speech emotion recognition method, including:
collecting voice data with different emotions, framing the audio into 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames), and then applying a Hamming window to each speech segment;
extracting the spectral features of the voice data: first, performing short-time Fourier analysis on each frame of the preprocessed voice signal and converting it into a spectrogram; then, before the spectrogram is used as input data, linearly normalizing the original features and quantizing the spectrogram into a 0-255 gray-scale image; finally, converting the gray-scale image into a matrix of size [1000, 40]. The specific steps are as follows:
(1) The speech features are 2-D, so the input data is expanded from 2-D to 3-D.
(2) Three consecutive convolution operations are applied in the convolutional layers to prepare for the subsequent capsule routing algorithm.
(3) The matrix output by the convolutional layers is reshaped and passed to the capsule layer.
(4) The data is fed into the capsule neural network, with a dynamic routing algorithm run between capsule layers.
(5) Three fully connected layers and the Adam optimizer are used.
Constructing a CapCNN-based model, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion;
and judging the emotional state of the subject by combining the input data with the model's emotion classification, so as to perform targeted human-computer interaction.
In a second aspect, the present application further provides a distributed microphone array for collecting voice data, the distributed microphone array comprising a plurality of microphone array nodes, each node being provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method described in the embodiments of the present application.
In a third aspect, an embodiment of the present application provides a highly integrated hardware DSP for an embedded speech recognition system, comprising a microcontroller (MCU), A/D and D/A converters, RAM and ROM integrated on a chip, and a computer program stored in the ROM and operable on the MCU; when executing the computer program, the MCU implements the method described in the embodiments of the present application. The device features small size, high integration, good reliability, strong interrupt-handling capability, high cost-effectiveness, powerful functions, an efficient instruction system, low power consumption and low voltage.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which may include: ROM, RAM, a magnetic disk, an optical disk, a magneto-optical disk, or semiconductor memory, etc., on which a computer program is stored; when executed by a processor, the program implements a method as described in the embodiments of the present application.
Drawings
Embodiments of the present invention will now be described with reference to the accompanying drawings, in which
FIG. 1 is a schematic diagram illustrating a CapCNN-based real-time speech emotion recognition process according to the present application;
fig. 2 shows a schematic diagram of the CapCNN network framework of the present application.
Detailed Description
In order to illustrate the objects, technical processes and innovations of the present invention more clearly, the invention is further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. In addition, the technical features in the embodiments described below may be combined with each other as long as they do not conflict.
In order to achieve the above object, the present invention provides a real-time speech emotion recognition method based on CapCNN, the main flow of which is shown in fig. 1, the method comprising:
Step one, collecting voice data with different emotions, framing the audio into 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames), and then applying a Hamming window to each speech segment.
Specifically, a microphone or similar audio-acquisition device is used with a recording application on a PC terminal to record the speech signal. The frequency range of the speech signal is 300-3400 Hz; to improve sampling precision while keeping the data volume manageable, the speech of the autistic child is captured as a single channel at a sampling rate of 11025 Hz, and the extracted data is then preprocessed. The audio is first divided into 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames). Then, to locate the start and end of speech accurately and distinguish speech segments from non-speech segments, the average short-time energy of each speech segment is computed as E_a = (1/n) Σ_{i=1}^{n} E_i, with threshold E_h = E_a / 2, where E_a is the average short-time energy, n is the total number of frames, E_i is the short-time energy of the current frame, and E_h is half the average short-time energy. Likewise, the average short-time zero-crossing rate Z_a = (1/n) Σ_{i=1}^{n} Z_i is computed, where Z_i is the zero-crossing rate of each frame. Double-threshold endpoint detection is then performed using these two values as thresholds;
finally, to reduce the spectral leakage caused by framing, a Hamming window w(n) = (1 − α) − α·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, α = 0.46, is applied to each frame of speech, where N is the number of samples in a frame and α is a fixed coefficient.
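As an illustration of the framing, windowing and endpoint-feature computation described above, a minimal NumPy sketch follows (not part of the patent; the function names and the exact energy and zero-crossing-rate definitions are assumptions):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (tail samples are dropped)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def preprocess(x, sr=11025, frame_ms=25, shift_ratio=0.5, alpha=0.46):
    """Frame the signal, compute per-frame short-time energy E_i and
    zero-crossing rate Z_i, and apply the Hamming window
    w(n) = (1 - alpha) - alpha*cos(2*pi*n / (N - 1))."""
    frame_len = int(sr * frame_ms / 1000)   # 25 ms -> 275 samples at 11025 Hz
    hop = int(frame_len * shift_ratio)      # frame shift = half the frame length
    frames = frame_signal(x, frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)    # short-time energy E_i per frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # Z_i
    e_half = energy.mean() / 2              # energy threshold E_h = E_a / 2
    n = np.arange(frame_len)
    window = (1 - alpha) - alpha * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window, energy, zcr, e_half
```

The double-threshold endpoint decision itself (comparing each frame's energy and zero-crossing rate against the thresholds) is omitted here for brevity.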
Step two, extracting the spectral features of the voice data: first, short-time Fourier analysis is performed on each frame of the preprocessed voice signal, which is then converted into a spectrogram; before the spectrogram is used as input data, its original features are linearly normalized and the spectrogram is quantized into a 0-255 gray-scale image, which is finally converted into a matrix of size [1000, 40].
Specifically, the spectral features of the voice data are extracted and a spectrogram is drawn. First, short-time Fourier analysis is applied to each frame of the preprocessed signal: X(n, k) = Σ_m x(m)·w(n − m)·e^(−j2πkm/N), 0 ≤ k ≤ N − 1, where X(n, k) is the short-time amplitude spectrum estimate of x(n). The extracted spectral features are then converted into a spectrogram: the spectral energy density at frame n is P(n, k) = |X(n, k)|² = X(n, k)·conj(X(n, k)); the two-dimensional image obtained from this formula is the spectrogram, with n on the abscissa, k on the ordinate, and P(n, k) rendered as color intensity. Because the emotional features used by the method have different meanings and widely different value ranges, the original features are linearly normalized before feature selection so that the features are comparable: P′(a, b) = (P(a, b) − P_min) / (P_max − P_min), where P(a, b) is the value of each point in the spectrogram and P_max and P_min are the maximum and minimum values of the spectrogram matrix. After normalization, the spectrogram is quantized into a 0-255 gray-scale image, which is finally converted into a matrix of size [1000, 40].
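The spectrogram extraction and 0-255 quantization above can be sketched as follows (a hedged illustration: the FFT size and the log compression are assumptions, and the final resizing to [1000, 40] is omitted):

```python
import numpy as np

def spectrogram_matrix(frames, n_fft=512):
    """Per-frame FFT -> energy density P = |X|^2, then min-max normalization
    and quantization to a 0-255 gray-scale matrix, as described above."""
    X = np.fft.rfft(frames, n=n_fft, axis=1)        # short-time spectrum X(n, k)
    P = np.abs(X) ** 2                              # P(n, k) = X(n, k) * conj(X(n, k))
    P = np.log1p(P)                                 # log compression (assumption, common practice)
    P_norm = (P - P.min()) / (P.max() - P.min())    # linear normalization to [0, 1]
    return np.round(P_norm * 255).astype(np.uint8)  # 0-255 gray-scale image
```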
Step three, constructing a CapCNN-based model, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion.
the conventional convolutional neural network is always understood from partial features of an image, and positional features of the whole matrix are ignored. For a data set with short voice time, the network structure of the method (called CapcNN) constructed based on the Convolutional Neural Network (CNN) and the capsule neural network (CapsNet) not only focuses on local features, but also understands image features from the whole and is prominent in the aspect of extracting position information. On the basis of a convolutional network, a routing algorithm is used as a representation enhancement structure, and the accuracy of speech emotion recognition is enhanced. For the part of the convolutional neural network, the next operation is carried out by reducing the size of the high-dimensional features to obtain a compact matrix;
the voice features are 2D, and input data needs to be expanded from 2D to 3D;
two consecutive convolutions with stride 2, kernel size 13 and 8 output channels are applied to the data, each convolutional layer followed by a ReLU activation and batch normalization;
one more convolution with stride 2, kernel size 13 and 64 output channels is applied; every 8 convolution units are packed together as a new unit in preparation for the subsequent capsule routing algorithm, the tensor is max-pooled again to match the input of the capsule layer, and its 2nd dimension is set to 1;
since the speech emotion recognition task requires accurate extraction of feature position information, the matrix output by the convolutional layers is reshaped to [-1, 16] and passed to the capsule layer;
the lengths of the input and output vectors of the capsule neural network represent the probability of an entity, taking values between 0 and 1; the Squash non-linearity v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖) is used, where j denotes a given capsule and s_j is its input vector; Squash shrinks short vectors to almost zero length, while long vectors approach, but never exceed, length 1;
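A minimal sketch of the Squash non-linearity described above (the epsilon guard is an implementation assumption to avoid division by zero):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    Short vectors shrink toward zero; long vectors approach unit length."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```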
in the dynamic routing algorithm between capsule layers, each lower-layer capsule i scales its output vector by a scalar weight c_ij, which is determined by the iterative dynamic routing algorithm; the weighted output vector is then sent to the higher-layer capsule j as its input;
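The routing-by-agreement loop between capsule layers can be sketched as follows (an illustrative NumPy version with assumed shapes; three iterations is a common choice, not stated in the text):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity applied to each upper-capsule vector."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iters=3):
    """Dynamic routing between capsule layers.
    u_hat: prediction vectors, shape [n_lower, n_upper, dim_upper]."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))               # routing logits b_ij
    for _ in range(n_iters):
        c = softmax(b, axis=1)                     # coupling coefficients c_ij
        s = np.einsum('ij,ijk->jk', c, u_hat)      # weighted sum per upper capsule
        v = squash(s)                              # squashed upper-capsule output
        b = b + np.einsum('ijk,jk->ij', u_hat, v)  # agreement update
    return v
```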
three layers (512, 1024, 784) are expanded through the full connection layer, wherein the two layers are activation functions of ReLu, and the last layer is an activation function of Sigmoid;
the Adam optimizer is used with an initial learning rate of 0.001, a weight decay of 1.0×10⁻⁶, an exponential decay rate of 0.9 for the first-moment estimate, and 0.999 for the second-moment estimate.
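For concreteness, one Adam update with the hyperparameters stated above could look like this (a sketch; applying the weight decay as L2 regularization on the gradient is an assumption):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-6):
    """One Adam update with lr=0.001, beta1=0.9, beta2=0.999, decay=1e-6.
    t is the 1-based step counter used for bias correction."""
    grad = grad + weight_decay * theta          # L2-style weight decay (assumption)
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```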
Step four, judging the emotional state of the autistic child by combining the input data with the model's emotion classification, so as to perform targeted human-computer interaction.
Compared with state-of-the-art speech emotion recognition models, the method achieves higher classification accuracy not only on the whole data set but also on independent data sets. Grasping position information and overall characteristics is essential for speech emotion recognition, and the capsule neural network shows clear development potential in this application.
For a better understanding of the present invention, the foregoing has been described in detail with reference to specific examples thereof, but the invention is not limited thereto. Any simple modifications of the above embodiments according to the technical essence of the present invention still fall within the scope of the technical solution of the present invention.
As another aspect, the present application further provides a distributed microphone array for collecting voice data, the distributed microphone array comprising a plurality of microphone array nodes, each node being provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method described in the embodiments of the present application.
As another aspect, the present application further provides a highly integrated hardware DSP for an embedded speech recognition system, comprising a microcontroller (MCU), A/D and D/A converters, RAM and ROM integrated on a chip, and a computer program stored in the ROM and capable of running on the MCU; when executing the computer program, the MCU implements the method described in the embodiments of the present application. The device features small size, high integration, good reliability, strong interrupt-handling capability, high cost-effectiveness, powerful functions, an efficient instruction system, low power consumption and low voltage.
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the foregoing device in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present application.
Any storage medium referenced by embodiments of the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), optical disks (including compact disk read-only memory (CD-ROM) and digital versatile disks (DVD)), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit with logic gates implementing logic functions on data signals, an application-specific integrated circuit (ASIC) with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware form, and can also be realized in a software and firmware form. The integrated module, if implemented in software or firmware and sold or used as a stand-alone product, may be transferred from a storage medium or a network into a computer having a dedicated hardware structure for functional implementation.
It is also to be noted that the steps of performing the above-described series of processes may naturally be performed chronologically in the order of description, but need not necessarily be performed chronologically. Some steps may be performed in parallel or independently of each other.
While embodiments of the present application have been illustrated and described above, it should be understood that they have been presented by way of example only, and not limitation. Variations, modifications, substitutions and alterations of the above-described embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and it is therefore intended that the scope of the invention not be limited thereto, but rather by the claims appended hereto.
Claims (7)
1. A real-time speech emotion recognition method based on CapCNN, the method comprising the following steps:
collecting voice data with different emotions, framing the audio into 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames), and then applying a Hamming window to each speech segment;
extracting the spectral features of the voice data: first, performing short-time Fourier analysis on each frame of the preprocessed voice signal and converting it into a spectrogram; then, before the spectrogram is used as input data, linearly normalizing the original features and quantizing the spectrogram into a 0-255 gray-scale image; finally, converting the gray-scale image into a matrix of size [1000, 40];
constructing a CapCNN-based model, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion;
and judging the emotional state of the subject by combining the input data with the model's emotion classification, so as to perform targeted human-computer interaction.
2. The method of claim 1, wherein preprocessing the collected audio of different emotions comprises: framing a segment of speech using 25 ms frames with a frame-shift-to-frame-length ratio of 0.5 (overlapping frames); then, to accurately locate the start and end of speech and distinguish speech segments from non-speech segments, calculating the average short-time energy and average short-time zero-crossing rate of each speech segment and performing double-threshold endpoint detection with these as thresholds; and finally applying a Hamming window to each frame of speech to reduce the spectral leakage caused by framing.
3. The method according to claim 1, wherein the extracting of the speech features specifically comprises:
step one, extracting spectral features: performing short-time Fourier analysis on each frame of the preprocessed speech signal, X(n, k) = Σ_m x(m)·w(n − m)·e^(−j2πkm/N), where 0 ≤ k ≤ N − 1, N is the number of frequency points, k is the frequency-bin index, and X(n, k) is the short-time amplitude spectrum estimate of x(n);
step two, drawing the spectrogram: converting the spectral features extracted in step one, the spectral energy density function at frame n being P(n, k) = |X(n, k)|² = X(n, k) × conj(X(n, k)), where n is the abscissa, k is the ordinate, and P(n, k) is rendered as color intensity; the two-dimensional image thus obtained is the spectrogram;
step three, normalization and graying of the spectrogram: the emotional features used have different meanings and widely differing value ranges, so to make the various features commensurable, linear normalization is applied to the original features before feature selection, P′(a, b) = (P(a, b) − Pmin) / (Pmax − Pmin), where P(a, b) is the gray value at each point of the spectrogram and Pmax and Pmin are the maximum and minimum values of the spectrogram matrix; after amplitude normalization, the spectrogram is quantized into a 0-255 gray-scale map;
and step four, converting the spectrogram into a matrix.
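Steps one through four of claim 3 can be sketched in NumPy as follows. This is an assumed sketch: the claim does not specify the FFT length or whether a log scale is applied before normalization, so the `rfft` over the frame length and the log-power floor used here are illustrative choices.

```python
import numpy as np

def spectrogram_matrix(frames):
    """Spectral energy density per claim 3: P(n, k) = X(n, k) * conj(X(n, k))."""
    X = np.fft.rfft(frames, axis=1)   # short-time Fourier transform of each frame
    P = (X * np.conj(X)).real         # power spectrum, one row per frame
    P = np.log10(P + 1e-10)           # log scale (assumption); floor avoids log(0)
    return P.T                        # rows = frequency bins k, columns = frames n

def to_grayscale(P):
    """Linear min-max normalization, then quantization to a 0-255 grayscale matrix."""
    P_norm = (P - P.min()) / (P.max() - P.min())
    return np.round(P_norm * 255).astype(np.uint8)
```

The resulting `uint8` matrix is the grayscale spectrogram of step four, ready to be fed to the network as input data.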
4. The method of claim 1, wherein the constructing of the speech emotion recognition network comprises: for the convolutional neural network part, reducing the size of the high-dimensional features to obtain a compact matrix for the subsequent operations;
step one, the speech features are 2D, so the input data is expanded from 2D to 3D;
step two, performing two consecutive convolution operations on the data, each with stride 2, kernel size 13, and 8 output channels, and applying a ReLU activation function and batch normalization after each convolution layer;
step three, performing a convolution operation with stride 2, kernel size 13, and 64 output channels, wherein every 8 convolution units are packaged together as a new unit in preparation for the subsequent capsule routing algorithm; max pooling is then applied to the tensor to match the input of the capsule layer, and the 2nd dimension of the tensor is set to 1;
step four, the speech emotion recognition task requires accurately extracting feature position information, so the matrix output by the convolution layers is reshaped to [−1, 16] and fed to the capsule layer;
step five, the length of the input and output vectors of the capsule neural network represents the probability that an entity exists, with a value between 0 and 1; the Squash nonlinear function v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖) is used, where j denotes a given capsule and s_j is the input vector of that capsule; the Squash function shrinks short vectors to nearly zero length, while long vectors approach, but never exceed, length 1;
step six, a dynamic routing algorithm is run among the capsules: a lower-layer capsule i adjusts a scalar weight c_ij, which is determined by the iterative dynamic routing algorithm; the output vector of the lower-layer capsule is multiplied by this weight and sent to the higher-layer capsule as its input;
step seven, three fully connected layers (512, 1024, and 784 units) are appended, the first two with ReLU activation functions and the last with a Sigmoid activation function;
step eight, the Adam optimizer is used, with an initial learning rate of 0.001, a weight decay of 1.0 × 10⁻⁶, an exponential decay rate of 0.9 for the first-moment estimate, and an exponential decay rate of 0.999 for the second-moment estimate.
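The Squash nonlinearity of step five and the capsule routing of step six can be sketched in NumPy as follows. The tensor shapes, the three routing iterations, and the softmax over routing logits are illustrative assumptions; the claim fixes neither the iteration count nor these dimensions.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash nonlinearity: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).

    Shrinks short vectors toward zero; long vectors approach but never reach length 1."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(norm2 + eps)
    return (norm2 / (1.0 + norm2)) * (s / norm)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement between a lower and a higher capsule layer.

    u_hat: prediction vectors, shape (n_lower, n_higher, dim)."""
    n_lower, n_higher, dim = u_hat.shape
    b = np.zeros((n_lower, n_higher))                 # routing logits b_ij
    for _ in range(iterations):
        # coupling coefficients c_ij: softmax over higher capsules per lower capsule
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        s = (c[:, :, None] * u_hat).sum(axis=0)       # weighted sum into higher capsules
        v = squash(s)                                 # higher-capsule outputs
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)  # agreement update
    return v
```

Because the output vectors pass through `squash`, their lengths stay strictly below 1 and can be read as entity probabilities, as step five describes.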
5. A distributed microphone array for collecting speech data, the array comprising a plurality of microphone array nodes, each provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method of any one of claims 1 to 4.
6. An embedded speech recognition application device comprising a microcontroller with an MCU, A/D and D/A converters, RAM, and ROM integrated on one chip, and a computer program stored in said ROM and executable on said MCU, wherein the MCU, when executing said computer program, implements the method of any one of claims 1 to 4.
7. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110663975.2A CN113362857A (en) | 2021-06-15 | 2021-06-15 | Real-time speech emotion recognition method based on CapcNN and application device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113362857A true CN113362857A (en) | 2021-09-07 |
Family
ID=77534417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110663975.2A Pending CN113362857A (en) | 2021-06-15 | 2021-06-15 | Real-time speech emotion recognition method based on CapcNN and application device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113362857A (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104637497A (en) * | 2015-01-16 | 2015-05-20 | 南京工程学院 | Speech spectrum characteristic extracting method facing speech emotion identification |
CN105047194A (en) * | 2015-07-28 | 2015-11-11 | 东南大学 | Self-learning spectrogram feature extraction method for speech emotion recognition |
CN107845390A (en) * | 2017-09-21 | 2018-03-27 | 太原理工大学 | A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features |
CN108597539A (en) * | 2018-02-09 | 2018-09-28 | 桂林电子科技大学 | Speech-emotion recognition method based on parameter migration and sound spectrograph |
CN108899051A (en) * | 2018-06-26 | 2018-11-27 | 北京大学深圳研究生院 | A kind of speech emotion recognition model and recognition methods based on union feature expression |
CN109523994A (en) * | 2018-11-13 | 2019-03-26 | 四川大学 | A kind of multitask method of speech classification based on capsule neural network |
US20200159778A1 (en) * | 2018-06-19 | 2020-05-21 | Priyadarshini Mohanty | Methods and systems of operating computerized neural networks for modelling csr-customer relationships |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111785301A (en) * | 2020-06-28 | 2020-10-16 | 重庆邮电大学 | Residual error network-based 3DACRNN speech emotion recognition method and storage medium |
CN111883178A (en) * | 2020-07-17 | 2020-11-03 | 渤海大学 | Double-channel voice-to-image-based emotion recognition method |
CN112562725A (en) * | 2020-12-09 | 2021-03-26 | 山西财经大学 | Mixed voice emotion classification method based on spectrogram and capsule network |
CN112800225A (en) * | 2021-01-28 | 2021-05-14 | 南京邮电大学 | Microblog comment emotion classification method and system |
Non-Patent Citations (1)
Title |
---|
XIN-CHENG WEN et al.: "The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition", pages 9356-9362 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116577037A (en) * | 2023-07-12 | 2023-08-11 | 上海电机学院 | Air duct leakage signal detection method based on non-uniform frequency spectrogram |
CN116577037B (en) * | 2023-07-12 | 2023-09-12 | 上海电机学院 | Air duct leakage signal detection method based on non-uniform frequency spectrogram |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sehgal et al. | A convolutional neural network smartphone app for real-time voice activity detection | |
JP6198872B2 (en) | Detection of speech syllable / vowel / phoneme boundaries using auditory attention cues | |
US9020822B2 (en) | Emotion recognition using auditory attention cues extracted from users voice | |
Qamhan et al. | Digital audio forensics: microphone and environment classification using deep learning | |
CN111341319B (en) | Audio scene identification method and system based on local texture features | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
CN114566189B (en) | Speech emotion recognition method and system based on three-dimensional depth feature fusion | |
CN111128222A (en) | Speech separation method, speech separation model training method, and computer-readable medium | |
CN112562725A (en) | Mixed voice emotion classification method based on spectrogram and capsule network | |
Kuang et al. | Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks | |
Kaushik et al. | SLINet: Dysphasia detection in children using deep neural network | |
CN113362857A (en) | Real-time speech emotion recognition method based on CapcNN and application device | |
CN113257279A (en) | GTCN-based real-time voice emotion recognition method and application device | |
Sun | Digital audio scene recognition method based on machine learning technology | |
Sharma et al. | HindiSpeech-Net: a deep learning based robust automatic speech recognition system for Hindi language | |
Dehghani et al. | Time-frequency localization using deep convolutional maxout neural network in Persian speech recognition | |
Tailor et al. | Deep learning approach for spoken digit recognition in Gujarati language | |
Meng et al. | A lightweight CNN and Transformer hybrid model for mental retardation screening among children from spontaneous speech | |
Manjutha et al. | An optimized cepstral feature selection method for dysfluencies classification using Tamil speech dataset | |
Anguraj et al. | Analysis of influencing features with spectral feature extraction and multi-class classification using deep neural network for speech recognition system | |
Kayal et al. | Multilingual vocal emotion recognition and classification using back propagation neural network | |
Gul et al. | Single channel speech enhancement by colored spectrograms | |
US20240038215A1 (en) | Method executed by electronic device, electronic device and storage medium | |
Fartash et al. | A scale–rate filter selection method in the spectro-temporal domain for phoneme classification | |
Chen et al. | Separation of Speech from Speech Interference Based on EGG |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||