CN113362857A - Real-time speech emotion recognition method based on CapcNN and application device - Google Patents

Real-time speech emotion recognition method based on CapcNN and application device

Info

Publication number
CN113362857A
CN113362857A (Application CN202110663975.2A)
Authority
CN
China
Prior art keywords
voice
spectrogram
data
speech
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110663975.2A
Other languages
Chinese (zh)
Inventor
文昕成
刘昆宏
叶嘉鑫
罗妍
王煊泽
吴昌鲡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202110663975.2A priority Critical patent/CN113362857A/en
Publication of CN113362857A publication Critical patent/CN113362857A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

A real-time speech emotion recognition method based on CapcNN and an application device thereof relate to the technical field of biometric feature recognition. The method comprises the following steps: step one, collecting voice data from a number of autistic children and preprocessing the extracted data, where the preprocessing includes endpoint detection, framing and windowing; step two, extracting the spectral features of the preprocessed voice data as input data; step three, constructing a CapcNN-based model, training it on the input data, and judging the emotion of the input voice; and step four, combining the input data with the emotion classification produced by the model to interact with the recognized subject. Compared with other speech emotion recognition methods, this method achieves higher accuracy, performs better on short utterances, shows better robustness across several data sets, and grasps both the position information and the overall characteristics of the spectrogram, making it an efficient and stable speech emotion recognition method.

Description

Real-time speech emotion recognition method based on CapcNN and application device
Technical Field
The invention relates to the technical field of pattern recognition, in particular to a real-time speech emotion recognition method based on CapcNN and an application device.
Background
Voice is the most common, effective and convenient way for human beings to communicate. Through speech, people express not only basic semantic information but also the speaker's emotions, moods and the like; the emotional information contained in speech signals is an important information resource and one of the essential cues people use to perceive things. Speech emotion recognition is a key technology for realizing intelligent human-computer interaction and is widely applied in many fields; the present invention mainly focuses on assisting the treatment of children with autism.
Autism in children seriously affects their growth and development, and studies show that its prevalence keeps rising. Under such circumstances, the current treatment of autism relies mainly on human intervention, which becomes very difficult to deliver to the large population of autistic patients. Combining treatment with computer intervention can therefore help autistic children acquire self-knowledge and step out of their closed world through human-computer interaction. During the interaction between the doctor and the autistic patient, the computer can promptly capture the emotional changes in the patient's spoken language, that is, perform real-time speech emotion recognition, and feed the patient's emotional state back to the doctor in real time, so that the doctor can guide and stabilize the patient's emotional changes promptly and correctly.
Speech emotion recognition technology still has some shortcomings. First, domestic research on speech emotion recognition is still at an early stage, and because of the complexity of speech and the diversity of languages, no large number of high-quality speech databases is available for research. Second, the spoken expression sometimes does not match the underlying emotional state: some emotions are not conveyed by obvious changes in the voice, and even humans can hardly judge a person's emotional state accurately from speech alone, often needing the specific environment and context of the moment, which makes computer-based speech emotion recognition challenging. Finally, although current emotion recognition methods are numerous, each has its own merits and shortcomings, and the most efficient and stable recognition method remains to be found. CNNs, which share convolution kernels and extract features automatically, show unique advantages for processing high-dimensional data, but their pooling layers discard a lot of valuable information and attend only to local features, so CNNs perform only moderately when learning time series; for time-series-sensitive problems and tasks, LSTM is generally more appropriate, yet LSTM networks in turn have limitations when the task involves dependencies that span large distances on the time axis. Therefore, current speech emotion recognition suffers from sample shortage and high recognition difficulty.
Disclosure of Invention
In view of the above-mentioned drawbacks and deficiencies of the prior art, an object of the present invention is to provide a CapCNN-based real-time speech emotion recognition method and an application apparatus thereof, which can improve the accuracy, precision and generalization capability of recognition. Compared with other speech emotion recognition methods, it shows better robustness and better emotion classification results across several data sets, grasps both the position information and the overall characteristics of the spectrogram well, and is an efficient and stable speech emotion recognition method.
In order to solve the above problem, in a first aspect, the present invention provides a CapCNN-based real-time speech emotion recognition method, including:
collecting voice data with different emotions, framing the audio information into overlapping frames of 25 ms with a frame-shift to frame-length ratio of 0.5, and then applying a Hamming window to each speech frame;
extracting the spectral features of the voice data: firstly, performing short-time Fourier analysis on each frame of the preprocessed voice signal, then converting it into a spectrogram, applying linear normalization to the raw features before the spectrogram is used as input data, quantizing the spectrogram into a 0-255 grayscale image, and finally converting the grayscale image into a matrix of size [1000, 40]; the specific method comprises the following steps:
(1) The speech features are 2D, so the input data is expanded from 2D to 3D.
(2) Three successive convolution operations are performed in the convolutional layers to prepare for the subsequent capsule routing algorithm.
(3) The matrix produced by the convolutional layers is reshaped and output to the capsule layer.
(4) The data is fed into the capsule neural network, and a dynamic routing algorithm is run between capsule layers.
(5) Three fully connected layers and the Adam optimizer are used.
Constructing a model based on the CapcNN, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion;
and judging the emotional state of the subject by combining the input data with the emotion classification from the model, thereby performing targeted human-computer interaction.
In a second aspect, the present application further provides a distributed microphone array for collecting voice data, the distributed microphone array including a plurality of microphone array nodes, each of the microphone array nodes being provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method described in the embodiments of the present application.
In a third aspect, an embodiment of the present application provides a highly integrated hardware DSP for an embedded speech recognition system, comprising a microcontroller (MCU), A/D and D/A converters, a RAM and a ROM integrated on a single chip, and a computer program stored in the ROM and executable on the MCU, wherein the MCU implements the method described in the embodiments of the present application when executing the computer program; the device features small size, high integration, good reliability, strong interrupt handling capability, high cost-performance ratio, powerful functions, an efficient instruction set, low power consumption and low voltage.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium may include: ROM, RAM, magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, etc., which stores a computer program that, when executed by a processor, implements the method as described in the embodiments of the present application.
Drawings
Embodiments of the present invention will now be described with reference to the accompanying drawings, in which
FIG. 1 is a schematic diagram illustrating a CapCNN-based real-time speech emotion recognition process according to the present application;
FIG. 2 shows a schematic diagram of the CapCNN network framework of the present application.
Detailed Description
In order to make the objects, technical processes and technical innovation points of the present invention more clearly illustrated, the present invention is further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In order to achieve the above object, the present invention provides a real-time speech emotion recognition method based on CapCNN, the main flow of which is shown in fig. 1, the method comprising:
step one, collecting voice data with different emotions, framing the audio information into overlapping frames of 25 ms with a frame-shift to frame-length ratio of 0.5, and then applying a Hamming window to each speech frame.
Specifically, audio is acquired with a microphone or similar equipment, and an audio recording application on a PC terminal is used to record the voice signals. The frequency range of the voice signals is 300-3400 Hz; in order to improve sampling precision while keeping the data volume moderate, the voice signals of the autistic children are acquired at a mono sampling rate of 11025 Hz, and the extracted data are then preprocessed. The audio information is first framed into overlapping frames of 25 ms with a frame-shift to frame-length ratio of 0.5. Then, in order to accurately locate the beginning and end of the speech and to distinguish speech segments from non-speech segments, the average short-time energy of each segment of speech is calculated as
E_a = (1/n) · Σ_{i=1}^{n} E_i,   E_h = E_a / 2
where E_a is the average short-time energy, n is the total number of sub-frames, E_i is the short-time energy of the current frame, and E_h is half the average short-time energy. At the same time, the average short-time zero-crossing rate is calculated as
Z_a = (1/n) · Σ_{i=1}^{n} Z_i
where Z_i is the zero-crossing rate of each frame; double-threshold endpoint detection is then performed using these two quantities as threshold points;
finally, in order to reduce the spectral leakage caused by framing, a Hamming window is applied to each frame of speech, where N is the total number of frames from the framing and α is a fixed constant,
w(n) = (1 - α) - α · cos(2πn / (N - 1)),   (0 ≤ n ≤ N - 1, α = 0.46).
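For illustration only (this is not part of the original disclosure), the preprocessing above, framing, short-time energy, zero-crossing rate, double-threshold endpoint detection and Hamming windowing, might be sketched in Python as follows. The 11025 Hz sampling rate, 25 ms frame length, 0.5 frame-shift ratio and α = 0.46 come from the description; the exact thresholding rule and all function and variable names are illustrative assumptions.

```python
import numpy as np

def frame_signal(signal, sr=11025, frame_ms=25, shift_ratio=0.5):
    """Split a mono speech signal into overlapping 25 ms frames (50% frame shift)."""
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(frame_len * shift_ratio)
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    return np.stack([signal[i * frame_shift:i * frame_shift + frame_len]
                     for i in range(n_frames)])

def short_time_energy(frames):
    """Short-time energy E_i of each frame and the average energy E_a."""
    E = np.sum(frames.astype(np.float64) ** 2, axis=1)
    return E, E.mean()

def zero_crossing_rate(frames):
    """Zero-crossing rate Z_i of each frame and the average rate Z_a."""
    Z = 0.5 * np.abs(np.diff(np.sign(frames), axis=1)).sum(axis=1) / frames.shape[1]
    return Z, Z.mean()

def endpoint_detect(frames):
    """Simplified double-threshold endpoint detection (assumption): keep a frame as
    speech if its energy exceeds E_h = E_a / 2 or its ZCR exceeds the average ZCR."""
    E, E_a = short_time_energy(frames)
    Z, Z_a = zero_crossing_rate(frames)
    return frames[(E > 0.5 * E_a) | (Z > Z_a)]

def hamming_window(frames, alpha=0.46):
    """Apply w(n) = (1 - alpha) - alpha * cos(2*pi*n / (N - 1)) to every frame."""
    N = frames.shape[1]
    n = np.arange(N)
    return frames * ((1 - alpha) - alpha * np.cos(2 * np.pi * n / (N - 1)))
```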
step two, extracting the spectral features of the voice data: firstly, short-time Fourier analysis is performed on each frame of the preprocessed voice signal, the result is converted into a spectrogram, linear normalization is applied to the raw features before the spectrogram is used as input data, the spectrogram is quantized into a 0-255 grayscale image, and finally the grayscale image is converted into a matrix of size [1000, 40].
Specifically, the spectral features of the voice data are extracted and a spectrogram is drawn. Firstly, each frame of the preprocessed voice signal is subjected to short-time Fourier analysis
X(n, k) = Σ_{m=0}^{N-1} x_n(m) · e^(-j2πkm/N)
where 0 ≤ k ≤ N-1 and X(n, k) is the short-time amplitude spectrum estimate of x(n). The extracted spectral features are then converted into a spectrogram. The spectral energy density function at frame n is P(n, k) = X(n, k) · conj(X(n, k)) = |X(n, k)|²; the two-dimensional image obtained from this formula is the spectrogram, where n is the abscissa, k is the ordinate, and the magnitude of P(n, k) is represented by the shade of color. Because the emotional features used by the method have different meanings and widely different value ranges, in order to make the various features comparable, linear normalization is applied to the raw features before feature selection,
P'(a, b) = (P(a, b) - P_min(a, b)) / (P_max(a, b) - P_min(a, b))
where P(a, b) is the gray value of each point in the spectrogram, and P_max(a, b) and P_min(a, b) are the maximum and minimum values of the spectrogram matrix. After the amplitudes are normalized, the spectrogram is quantized into a 0-255 grayscale map, which is finally converted into a matrix of size [1000, 40].
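A possible numpy sketch of this spectral feature extraction step is given below, assuming the framing and windowing above. The power-spectrum computation, min-max normalization, 0-255 quantization and the [1000, 40] target size follow the description; the FFT size and the pad-or-truncate resizing are illustrative assumptions, since the patent does not specify them.

```python
import numpy as np

def spectrogram_features(frames, target_shape=(1000, 40)):
    """Short-time Fourier analysis of each windowed frame, spectral energy density
    P(n, k) = X(n, k) * conj(X(n, k)), linear normalization, 0-255 quantization
    and conversion into a fixed-size [1000, 40] matrix."""
    X = np.fft.rfft(frames, axis=1)                         # X(n, k) per frame
    P = (X * np.conj(X)).real                               # |X(n, k)|^2
    P_norm = (P - P.min()) / (P.max() - P.min() + 1e-12)    # linear (min-max) normalization
    gray = np.round(P_norm * 255).astype(np.uint8)          # 0-255 grayscale map
    out = np.zeros(target_shape, dtype=np.uint8)            # pad or truncate (assumed strategy)
    rows = min(gray.shape[0], target_shape[0])
    cols = min(gray.shape[1], target_shape[1])
    out[:rows, :cols] = gray[:rows, :cols]
    return out
```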
step three, constructing a model based on the CapcNN, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion.
the conventional convolutional neural network is always understood from partial features of an image, and positional features of the whole matrix are ignored. For a data set with short voice time, the network structure of the method (called CapcNN) constructed based on the Convolutional Neural Network (CNN) and the capsule neural network (CapsNet) not only focuses on local features, but also understands image features from the whole and is prominent in the aspect of extracting position information. On the basis of a convolutional network, a routing algorithm is used as a representation enhancement structure, and the accuracy of speech emotion recognition is enhanced. For the part of the convolutional neural network, the next operation is carried out by reducing the size of the high-dimensional features to obtain a compact matrix;
the voice features are 2D, and input data needs to be expanded from 2D to 3D;
two successive convolution operations with stride 2, kernel size 13 and 8 output channels are applied to the data, with a ReLU activation function and batch normalization after each convolutional layer;
a further convolution with stride 2, kernel size 13 and 64 output channels is performed, in which every 8 convolution units are packaged together as a new unit in preparation for the subsequent capsule routing algorithm; the tensor is then max-pooled and rearranged to match the input of the capsule layer, with its 2nd dimension set to 1;
since the speech emotion recognition task requires accurately extracting feature position information, the matrix produced by the convolutional layers is reshaped into [-1, 16] and output to the capsule layer;
the length of the input vector and the output vector of the capsule neural network represents the probability of an entity, and the length value is between 0 and 1; using the Squash non-linear function,
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)
where j denotes a given capsule and s_j is its input vector; the squash nonlinear function ensures that short vectors shrink to almost zero length while long vectors approach, but do not exceed, length 1;
a dynamic routing algorithm is run between capsules: each lower-layer capsule i adjusts a scalar weight c_ij, which is determined by the iterative dynamic routing algorithm; the output vector of the lower-layer capsule is multiplied by this weight and sent to the higher-layer capsule as its input;
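The squash nonlinearity and the routing-by-agreement step described above follow the standard capsule-network formulation; a hedged PyTorch sketch is shown below, where tensor shapes and the number of routing iterations are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """v_j = (||s_j||^2 / (1 + ||s_j||^2)) * (s_j / ||s_j||): short vectors shrink
    towards zero length, long vectors approach but never exceed length 1."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Dynamic routing between a lower and a higher capsule layer.
    u_hat: prediction vectors, shape [batch, n_lower, n_upper, dim_upper]."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits b_ij
    for _ in range(n_iters):
        c = F.softmax(b, dim=2)                              # coupling coefficients c_ij
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)             # weighted sum over lower capsules
        v = squash(s)                                        # output of the upper capsules
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)         # agreement update
    return v
```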
the output is then expanded through three fully connected layers (512, 1024, 784), of which the first two use the ReLU activation function and the last uses the Sigmoid activation function;
the Adam optimizer is used, with an initial learning rate of 0.001, a weight decay of 1.0 × 10^-6, an exponential decay rate of 0.9 for the first-moment estimate and 0.999 for the second-moment estimate.
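Putting the convolutional front end, the capsule routing and the fully connected decoder together, a hedged PyTorch sketch of the overall structure and the stated Adam configuration might look as follows (it reuses the squash and dynamic_routing sketch above). The stride-2 kernel-13 convolutions, the [-1, 16] capsule reshape, the 512-1024-784 fully connected layers and the optimizer settings follow the description; the padding, the assumed [1, 1000, 40] input, the number of emotion classes and the routing weight matrices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CapCNN(nn.Module):
    """Illustrative sketch of the CapcNN structure described above (not the patented
    implementation): three stride-2, kernel-13 convolutions, a reshape into
    16-dimensional primary capsules, dynamic routing to one capsule per emotion
    class, and a 512-1024-784 fully connected reconstruction decoder."""

    def __init__(self, n_classes=4, caps_dim=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 13, stride=2, padding=6), nn.BatchNorm2d(8), nn.ReLU(),
            nn.Conv2d(8, 8, 13, stride=2, padding=6), nn.BatchNorm2d(8), nn.ReLU(),
            nn.Conv2d(8, 64, 13, stride=2, padding=6),   # 8 units packaged per capsule
        )
        n_primary = 64 * 125 * 5 // caps_dim             # for an assumed [1, 1000, 40] input
        # assumed transformation matrices mapping each primary capsule to a
        # prediction vector for every emotion-class capsule before routing
        self.W = nn.Parameter(0.01 * torch.randn(n_primary, n_classes, caps_dim, caps_dim))
        self.decoder = nn.Sequential(
            nn.Linear(n_classes * caps_dim, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 784), nn.Sigmoid(),
        )

    def forward(self, x):                                # x: [batch, 1, 1000, 40]
        h = self.features(x)
        u = h.reshape(h.size(0), -1, self.W.shape[-2])   # primary capsules, [-1, 16]
        u_hat = torch.einsum('bnd,nkde->bnke', u, self.W)
        v = dynamic_routing(u_hat)                       # class capsules [batch, n_classes, 16]
        lengths = v.norm(dim=-1)                         # capsule length ~ class probability
        recon = self.decoder(v.flatten(start_dim=1))     # 784-dimensional reconstruction
        return lengths, recon

# Adam configuration as stated: learning rate 0.001, weight decay 1e-6, betas (0.9, 0.999)
model = CapCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-6, betas=(0.9, 0.999))
```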
And step four, judging the emotional state of the autistic child by combining the input data with the emotion classification from the model, thereby performing targeted human-computer interaction.
Compared with state-of-the-art speech emotion recognition models, the method not only achieves higher classification accuracy on the full data set but also performs well on independent data sets. Meanwhile, grasping position information and overall characteristics is essential for speech emotion recognition, and the capsule neural network shows clear development potential in speech emotion recognition applications.
For a better understanding of the present invention, the foregoing has been described in detail with reference to specific examples thereof, but the invention is not limited thereto. Any simple modifications of the above embodiments according to the technical essence of the present invention still fall within the scope of the technical solution of the present invention.
As another aspect, the present application further provides a distributed microphone array for collecting voice data, the distributed microphone array including a plurality of microphone array nodes, each of the microphone array nodes being provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method described in the embodiments of the present application.
As another aspect, the present application further provides a highly integrated hardware DSP for an embedded speech recognition system, comprising a microcontroller (MCU), A/D and D/A converters, a RAM and a ROM integrated on a single chip, and a computer program stored in the ROM and capable of running on the MCU, wherein the MCU implements the method described in the embodiments of the present application when executing the computer program; the device features small size, high integration, good reliability, strong interrupt handling capability, high cost-performance ratio, powerful functions, an efficient instruction set, low power consumption and low voltage.
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the foregoing device in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present application.
Any storage medium referenced by embodiments of the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), optical disks (including compact disc read-only memory (CD-ROM) and digital versatile discs (DVD)), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit (ASIC) having appropriate combinational logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), and the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware form, and can also be realized in a software and firmware form. The integrated module, if implemented in software or firmware and sold or used as a stand-alone product, may be transferred from a storage medium or a network into a computer having a dedicated hardware structure for functional implementation.
It is also to be noted that the steps of performing the above-described series of processes may naturally be performed chronologically in the order of description, but need not necessarily be performed chronologically. Some steps may be performed in parallel or independently of each other.
While embodiments of the present application have been illustrated and described above, it should be understood that they have been presented by way of example only, and not limitation. Variations, modifications, substitutions and alterations of the above-described embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and it is therefore intended that the scope of the invention not be limited thereto, but rather by the claims appended hereto.

Claims (7)

1. A real-time speech emotion recognition method based on CapCNN, the method comprising the following steps:
collecting voice data with different emotions, framing the audio information into overlapping frames of 25 ms with a frame-shift to frame-length ratio of 0.5, and then applying a Hamming window to each speech frame;
extracting the spectral features of the voice data: firstly, performing short-time Fourier analysis on each frame of the preprocessed voice signal, then converting it into a spectrogram, applying linear normalization to the raw features before the spectrogram is used as input data, quantizing the spectrogram into a 0-255 grayscale image, and finally converting the grayscale image into a matrix of size [1000, 40];
constructing a model based on the CapcNN, feeding the extracted and processed spectral features into the network, and training it to discriminate and classify speech emotion;
and judging the emotional state of the subject by combining the input data with the emotion classification from the model, thereby performing targeted human-computer interaction.
2. The method of claim 1, wherein preprocessing the collected audio of different emotions comprises: framing a segment of speech into overlapping frames of 25 ms with a frame-shift to frame-length ratio of 0.5; then, in order to accurately locate the beginning and end of the speech and distinguish speech segments from non-speech segments, calculating the average short-time energy and the average short-time zero-crossing rate of each segment of speech and performing double-threshold endpoint detection with these as threshold points; and finally, applying a Hamming window to each frame of speech in order to reduce the spectral leakage caused by framing.
3. The method according to claim 1, wherein the extracting of the speech features specifically comprises:
step one, extracting spectral features: performing short-time Fourier analysis on each frame of the preprocessed speech signal,
X(n, k) = Σ_{m=0}^{N-1} x_n(m) · e^(-j2πkm/N)
where 0 ≤ k ≤ N-1, N is the total number of frames, k is the index of the current frame, and X(n, k) is the short-time amplitude spectrum estimate of x(n);
step two, drawing a spectrogram: converting the spectral features extracted in step one, with the spectral energy density function P(n, k) = X(n, k) × conj(X(n, k)) = |X(n, k)|², where n is the abscissa, k is the ordinate, and the magnitude of P(n, k) is represented by the shade of color; the two-dimensional image obtained thereby is the spectrogram;
step three, normalization and graying of the spectrogram: the emotional features used have different meanings and widely different value ranges, so in order to make the various features comparable, linear normalization is applied to the raw features before feature selection,
P'(a, b) = (P(a, b) - P_min(a, b)) / (P_max(a, b) - P_min(a, b))
where P(a, b) is the gray value of each point in the spectrogram, and P_max(a, b) and P_min(a, b) are the maximum and minimum values of the spectrogram matrix; after the amplitudes are normalized, the spectrogram is quantized into a 0-255 grayscale map;
and step four, converting the spectrogram into a matrix.
4. The method of claim 1, wherein constructing the speech emotion recognition network comprises: in the convolutional part, reducing the size of the high-dimensional features to obtain a compact matrix for the subsequent operations;
step one, the voice features are 2D, and input data needs to be expanded from 2D to 3D;
step two, applying two successive convolution operations with stride 2, kernel size 13 and 8 output channels to the data, with a ReLU activation function and batch normalization after each convolutional layer;
step three, performing a further convolution with stride 2, kernel size 13 and 64 output channels, in which every 8 convolution units are packaged together as a new unit in preparation for the subsequent capsule routing algorithm; the tensor is then max-pooled and rearranged to match the input of the capsule layer, with its 2nd dimension set to 1;
step four, since the speech emotion recognition task requires accurately extracting feature position information, reshaping the matrix produced by the convolutional layers into [-1, 16] and outputting it to the capsule layer;
step five, the length of the input vector and the output vector of the capsule neural network represents the probability of an entity, and the value of the length is between 0 and 1; using the Squash non-linear function,
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)
where j denotes a given capsule and s_j is its input vector; the squash nonlinear function ensures that short vectors shrink to almost zero length while long vectors approach, but do not exceed, length 1;
step six, running a dynamic routing algorithm between capsules, in which each lower-layer capsule i adjusts a scalar weight c_ij determined by the iterative dynamic routing algorithm; the output vector of the lower-layer capsule is then multiplied by this weight and sent to the higher-layer capsule as its input;
step seven, expanding the output through three fully connected layers (512, 1024 and 784), of which the first two use the ReLU activation function and the last uses the Sigmoid activation function;
step eight, using the Adam optimizer with an initial learning rate of 0.001, a weight decay of 1.0 × 10^-6, an exponential decay rate of 0.9 for the first-moment estimate and 0.999 for the second-moment estimate.
5. A distributed microphone array for collecting speech data, the array comprising a plurality of microphone array nodes, each provided with one or more microphone audio collection modules, wherein the audio data collected by the microphone array is used in the method of any one of claims 1 to 4.
6. An embedded speech recognition application device comprising a microcontroller (MCU), A/D and D/A converters, a RAM and a ROM integrated on one chip, and a computer program stored in said ROM and executable on said MCU, wherein said MCU, when executing said computer program, implements the method according to any one of claims 1 to 4.
7. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-4.
CN202110663975.2A 2021-06-15 2021-06-15 Real-time speech emotion recognition method based on CapcNN and application device Pending CN113362857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110663975.2A CN113362857A (en) 2021-06-15 2021-06-15 Real-time speech emotion recognition method based on CapcNN and application device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110663975.2A CN113362857A (en) 2021-06-15 2021-06-15 Real-time speech emotion recognition method based on CapcNN and application device

Publications (1)

Publication Number Publication Date
CN113362857A true CN113362857A (en) 2021-09-07

Family

ID=77534417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110663975.2A Pending CN113362857A (en) 2021-06-15 2021-06-15 Real-time speech emotion recognition method based on CapcNN and application device

Country Status (1)

Country Link
CN (1) CN113362857A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104637497A (en) * 2015-01-16 2015-05-20 南京工程学院 Speech spectrum characteristic extracting method facing speech emotion identification
CN105047194A (en) * 2015-07-28 2015-11-11 东南大学 Self-learning spectrogram feature extraction method for speech emotion recognition
CN107845390A (en) * 2017-09-21 2018-03-27 太原理工大学 A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
US20200159778A1 (en) * 2018-06-19 2020-05-21 Priyadarshini Mohanty Methods and systems of operating computerized neural networks for modelling csr-customer relationships
CN108899051A (en) * 2018-06-26 2018-11-27 北京大学深圳研究生院 A kind of speech emotion recognition model and recognition methods based on union feature expression
CN109523994A (en) * 2018-11-13 2019-03-26 四川大学 A kind of multitask method of speech classification based on capsule neural network
CN111357051A (en) * 2019-12-24 2020-06-30 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111785301A (en) * 2020-06-28 2020-10-16 重庆邮电大学 Residual error network-based 3DACRNN speech emotion recognition method and storage medium
CN111883178A (en) * 2020-07-17 2020-11-03 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112562725A (en) * 2020-12-09 2021-03-26 山西财经大学 Mixed voice emotion classification method based on spectrogram and capsule network
CN112800225A (en) * 2021-01-28 2021-05-14 南京邮电大学 Microblog comment emotion classification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIN-CHENG WEN et al.: "The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition", pages 9356 - 9362 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116577037A (en) * 2023-07-12 2023-08-11 上海电机学院 Air duct leakage signal detection method based on non-uniform frequency spectrogram
CN116577037B (en) * 2023-07-12 2023-09-12 上海电机学院 Air duct leakage signal detection method based on non-uniform frequency spectrogram

Similar Documents

Publication Publication Date Title
Sehgal et al. A convolutional neural network smartphone app for real-time voice activity detection
JP6198872B2 (en) Detection of speech syllable / vowel / phoneme boundaries using auditory attention cues
US9020822B2 (en) Emotion recognition using auditory attention cues extracted from users voice
Qamhan et al. Digital audio forensics: microphone and environment classification using deep learning
CN111341319B (en) Audio scene identification method and system based on local texture features
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
CN111128222A (en) Speech separation method, speech separation model training method, and computer-readable medium
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Kuang et al. Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks
Kaushik et al. SLINet: Dysphasia detection in children using deep neural network
CN113362857A (en) Real-time speech emotion recognition method based on CapcNN and application device
CN113257279A (en) GTCN-based real-time voice emotion recognition method and application device
Sun Digital audio scene recognition method based on machine learning technology
Sharma et al. HindiSpeech-Net: a deep learning based robust automatic speech recognition system for Hindi language
Dehghani et al. Time-frequency localization using deep convolutional maxout neural network in Persian speech recognition
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Meng et al. A lightweight CNN and Transformer hybrid model for mental retardation screening among children from spontaneous speech
Manjutha et al. An optimized cepstral feature selection method for dysfluencies classification using Tamil speech dataset
Anguraj et al. Analysis of influencing features with spectral feature extraction and multi-class classification using deep neural network for speech recognition system
Kayal et al. Multilingual vocal emotion recognition and classification using back propagation neural network
Gul et al. Single channel speech enhancement by colored spectrograms
US20240038215A1 (en) Method executed by electronic device, electronic device and storage medium
Fartash et al. A scale–rate filter selection method in the spectro-temporal domain for phoneme classification
Chen et al. Separation of Speech from Speech Interference Based on EGG

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination