CN116230020A - Speech emotion recognition and classification method - Google Patents
- Publication number
- CN116230020A (application CN202211516305.9A)
- Authority
- CN
- China
- Prior art keywords
- emotion recognition
- voice data
- features
- speech
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The invention belongs to the technical field of speech recognition and provides a speech emotion recognition and classification method comprising the following steps: feeding speech data into an emotion recognition model; extracting WP-log-Mel spectrogram features from the speech data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic (secondary language) features of the segmented speech data; fusing the paralinguistic features with the deep features and obtaining the key features in the fused features; and performing emotion prediction and classification according to the key features with the emotion recognition model and outputting the prediction result. The invention improves the processing of non-stationary speech signals and, by exploiting the effective features in the speech signal, improves the precision and accuracy of speech emotion recognition.
Description
Technical Field
The invention belongs to the technical field of voice emotion recognition, and particularly relates to a voice emotion recognition classification method.
Background
Humans judge a person's emotion from expressions, speech, gestures and the like; among these, speech is the most direct and effective bridge of human communication and the fastest, most efficient medium in human-computer interaction. Speech emotion recognition and analysis has therefore gradually become an active research field, and its results are widely applied in artificial intelligence, robotics, natural human-computer interaction and other areas, for example speech emotion quality inspection of telephone customer service in various industries, which checks service quality and improves the efficiency and management level of quality inspection personnel.
In the prior art, the log-Mel spectrogram is usually extracted from raw speech with a linear, equally spaced frequency-band division of the speech signal (the Fourier transform); this extracts features well only from stationary speech and handles non-stationary speech signals poorly, so speech emotion recognition accuracy is low. In addition, the prior art usually extracts deep features on top of hand-crafted features and ignores effective features in the raw speech signal, such as the pitch (fundamental frequency) and formants, which further reduces speech emotion recognition accuracy.
Disclosure of Invention
The invention aims to solve at least the above technical problems in the prior art and provides a speech emotion recognition and classification method that improves the processing of non-stationary speech signals and, by combining the effective features in the speech signal, improves the precision and accuracy of speech emotion recognition.
In order to achieve the above object, according to a first aspect of the present invention there is provided a speech emotion recognition and classification method comprising the following steps: feeding speech data into an emotion recognition model; extracting WP-log-Mel spectrogram features from the speech data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic features of the segmented speech data; fusing the paralinguistic features with the deep features and obtaining the key features in the fused features; and performing emotion prediction and classification according to the key features with the emotion recognition model and outputting the prediction result.
Further, the step in which the first extraction network extracts the WP-log-Mel spectrogram features from the speech data is specifically: the WP-log-Mel spectrogram features are extracted from the speech signal of the speech data using the wavelet packet transform and audio processing.
Further, the output dimension of the fused features is (B, T, 64), where B denotes the batch size and T the number of frames.
Further, the emotion recognition model comprises a first extraction network, a second extraction network, a fusion network, a self-attention network and a softmax layer; the first extraction network comprises a Transformer encoder consisting of two or more identical blocks, with two or more heads in each block's multi-head attention mechanism.
Further, the second extraction network comprises a segmentation layer, a sinc-based convolution structure and a multi-pooling-based convolution structure in sequence.
Further, the convolution structure based on multi-pooling comprises a convolution layer, and an average pooling layer and a maximum pooling layer are arranged behind the convolution layer in parallel.
Further, the convolutional layers are layer-normalized, and their dropout value is between 0.2 and 0.5.
Further, the step of feeding the speech data into the emotion recognition model is specifically: performing classification preprocessing on the speech data, the classification preprocessing comprising silence-segment removal and denoising, and feeding the processed speech data into the emotion recognition model.
Further, the emotion recognition model is trained by the following steps: acquiring speech data and performing training preprocessing on it, the training preprocessing comprising silence-segment removal, denoising, data augmentation and data labelling; feeding the speech data into an initial model for training, the initial model being the untrained emotion recognition model; extracting WP-log-Mel spectrogram features from the speech data and inputting them into the encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic features of the segmented speech data; fusing the paralinguistic features with the deep features and obtaining the key features in the fused features; performing emotion prediction and classification according to the key features with the emotion recognition model and outputting the prediction result; and computing the loss between the prediction result and the data labels with a loss function until the model converges, thereby generating the emotion recognition model.
Further, the training preprocessing step further includes generating a plurality of voice copies from the voice data.
The basic principle and beneficial effects of the invention are as follows: the invention replaces the Fourier transform in log-Mel spectrogram feature extraction with the wavelet packet transform, yielding the WP-log-Mel spectrogram feature from which deep features of the speech signal are extracted; because the wavelet packet transform is multi-resolution, the signal can be observed progressively from coarse to fine. Compared with the prior art, the invention additionally extracts paralinguistic features from the speech data and fuses them with the deep features to recognize the emotion in the speech, which makes the emotion recognition model more robust and improves the precision and accuracy of the emotion recognition result.
Drawings
FIG. 1 is a schematic diagram of a speech emotion recognition method of the present invention;
FIG. 2 is a logical schematic of the WP-log-Mel acquisition process of the present invention;
FIG. 3 is a schematic diagram of the structure of a sinc-based convolution structure of the present invention;
FIG. 4 is a schematic diagram of the architecture of the present invention based on a multi-pooling convolution structure;
FIG. 5 is a schematic diagram of the emotion recognition model of the present invention;
FIG. 6 is a schematic diagram of the steps of the training pre-process of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and defined, the terms "mounted," "connected," and "coupled" are to be construed broadly: a connection may, for example, be mechanical or electrical, direct or indirect through an intermediary, or an internal communication between two elements; those skilled in the art will understand the specific meaning of these terms in the present invention according to the particular circumstances.
As shown in fig. 1, the invention provides a voice emotion recognition classification method, which comprises the following steps:
placing the voice data into an emotion recognition model; the emotion recognition model comprises a first extraction network, a second extraction network, a fusion network, a self-attention mechanism network and a softmax layer;
the first extraction network extracts WP-log-Mel spectrogram features from voice signals of voice data by wavelet packet transformation and audio processing, and inputs the spectrogram features into an encoder to obtain deep features;
Specifically, the WP-log-Mel spectrogram features are log-Mel spectrogram features based on the wavelet packet transform. Their extraction process is shown in fig. 2: the speech data is input; wavelet packet decomposition yields the wavelet packet coefficients of each frequency band; wavelet packet reconstruction yields the spectrum of each band's coefficients; the spectra of the bands are spliced in frequency order into a complete spectrum; a Mel filter bank is constructed with an audio processing library and the energy passing through the Mel filter bank is computed to obtain the Mel spectrogram; taking the logarithm of the Mel spectrogram gives the WP-log-Mel spectrogram features; and the spectrogram feature matrix is input into a Transformer encoder to obtain the deep features.
In this embodiment, the speech data is divided into 46 ms frames with a 23 ms frame shift, giving 260 frames; in other implementations the frame width and frame shift may be set to other values. In this embodiment the wavelet basis used in the wavelet packet transform is the Daubechies-4 (db4) wavelet and the number of filters in the filter bank is set to 64; in other embodiments other wavelet bases and filter counts may be chosen according to the application scenario. In this embodiment the audio processing library is librosa; in other embodiments any audio processing library with equivalent functionality may be used.
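For illustration only, the following is a minimal sketch of the WP-log-Mel extraction described above, assuming the PyWavelets (pywt) and librosa libraries and the parameters of this embodiment (46 ms frames, 23 ms shift, db4 wavelet, 64 Mel filters). The per-band reconstruction step is approximated by taking the magnitude spectrum of each sub-band's wavelet packet coefficients, so this is a simplified stand-in rather than the exact procedure of the embodiment.

```python
import numpy as np
import pywt
import librosa

def wp_log_mel(y, sr=16000, frame_ms=46, hop_ms=23,
               wavelet="db4", level=3, n_mels=64):
    frame_len = int(sr * frame_ms / 1000)            # 736 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)                # 368 samples at 16 kHz
    frames = librosa.util.frame(y, frame_length=frame_len,
                                hop_length=hop_len).T  # (n_frames, frame_len)

    spliced = []
    for frame in frames:
        wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
        # Leaf nodes in frequency order give the per-band wavelet packet coefficients.
        bands = [node.data for node in wp.get_level(level, order="freq")]
        # Spectrum of each band, spliced along the frequency axis into one spectrum.
        spliced.append(np.concatenate([np.abs(np.fft.rfft(b)) for b in bands]))
    spec = np.stack(spliced, axis=1)                 # (n_bins, n_frames)

    # Mel filter bank built with librosa and applied to the spliced spectrum.
    n_bins = spec.shape[0]
    mel_fb = librosa.filters.mel(sr=sr, n_fft=2 * (n_bins - 1), n_mels=n_mels)
    mel_spec = mel_fb @ (spec ** 2)                  # filter-bank energies
    return np.log(mel_spec + 1e-6)                   # (n_mels, n_frames) WP-log-Mel

# Usage: transposing the result gives the frames-by-Mel matrix (e.g. 260 x 64)
# fed to the encoder: feats = wp_log_mel(librosa.load("utt.wav", sr=16000)[0]).T
```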
Specifically, the first extraction network comprises a Transformer encoder consisting of two or more identical blocks, with two or more heads in each block's multi-head attention mechanism. In this embodiment the first extraction network uses layer normalization; the Transformer encoder consists of 6 identical blocks, each with 8 heads in its multi-head attention mechanism, and the 260×64 spectrogram feature matrix is input into the Transformer encoder to obtain the deep features.
The attention computation and the feed-forward activation are as follows:

Q_i = X·W_i^Q,  K_i = X·W_i^K,  V_i = X·W_i^V

H_i = softmax(Q_i·K_i^T / √h)·V_i

MHA = concat(H_1, H_2, …, H_8)·W

out = LayerNorm(out)

out = gelu(out)

where X denotes the input speech feature sequence, i denotes the i-th attention head, Q_i, K_i and V_i denote the query, key and value of the i-th head, K_i^T is the transpose of K_i, W_i^Q, W_i^K, W_i^V and W are learned linear transformation weight matrices, h is the hidden dimension of an attention head, H_i is the attention output of the i-th head, MHA is the multi-head attention output, and gelu is the activation function of the feed-forward layer.
In other embodiments, other activation functions may be selected for the feed-forward layer, and the number of blocks and the number of heads in each block's multi-head attention mechanism may be set as desired.
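For illustration, a minimal PyTorch sketch of such an encoder is given below; it uses the standard nn.TransformerEncoder with 6 blocks, 8 heads, GELU feed-forward activation and layer normalization, matching the 260×64 spectrogram input of this embodiment. The feed-forward width, dropout and the learned position embedding are illustrative assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class DeepFeatureEncoder(nn.Module):
    def __init__(self, d_model=64, n_heads=8, n_blocks=6, ff_dim=256, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ff_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_blocks)
        # Learned position embedding so frames need not be fed sequentially.
        self.pos = nn.Parameter(torch.zeros(1, 260, d_model))

    def forward(self, x):                  # x: (batch, 260 frames, 64 features)
        return self.encoder(x + self.pos[:, : x.size(1)])

# Usage: deep = DeepFeatureEncoder()(torch.randn(4, 260, 64))  # -> (4, 260, 64)
```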
The advantage of the first extraction network is as follows. In the prior art, deep features are usually extracted with convolutional or recurrent neural networks; such network structures suffer from long-term dependence problems that degrade extraction performance, and training is slow. When long speech data is processed it must be split into blocks before being fed to the model, and with a recurrent neural network the blocks must also be fed in sequence. This embodiment instead extracts the deep features of the speech data with the first extraction network, whose Transformer encoder uses a multi-head attention mechanism and thus avoids recursion entirely; thanks to multi-head attention and position embedding, deep features can be extracted accurately without feeding the speech data sequentially, which significantly reduces training time and resource requirements and avoids recurrence and convolution.
The second extraction network segments the speech data and extracts the paralinguistic features of the segmented speech data. Specifically, the second extraction network comprises, in order, a segmentation layer, a sinc-based convolution structure and a multi-pooling-based convolution structure (CMPU). The segmentation layer segments the speech data; the sinc-based convolution structure and the multi-pooling convolution structure (CMPU) extract the paralinguistic features of the segmented speech data, including the pitch (fundamental frequency) and formants of the speech.
in this embodiment, the segmentation process specifically segments the waveform of the voice data into blocks of 100ms, where each block overlaps 5ms, for example, the first block is 1ms to 100ms, the second block is 95ms to 195ms, and the third block is 190ms to 290ms; in other embodiments, the length of each block and the length of the overlap of the division processing may be set according to the need.
The structure of the sinc-based convolution is shown in fig. 3. The sinc-based convolution uses a bank of filters; in this embodiment the filter length is 251 and the number of filters is 80, while in other embodiments the length and number of filters may be chosen as needed. The sinc-based convolution is computed as follows:
y[n]=x[n]*g[n,f1,f2]
where y[n] is the output of the sinc-based convolution structure, x[n] is the input speech, n is the sample (time) index, g is the sinc band-pass filter, and f1 and f2 are the learned low and high cut-off frequencies respectively; the cut-off frequencies may be randomly initialized within [0, fs/2], where fs is the sampling frequency of the input signal.
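For illustration, a hedged PyTorch sketch of such a sinc-based convolution layer is shown below: each of the 80 filters of length 251 is an ideal band-pass filter g[n, f1, f2] parameterized only by learnable cut-off frequencies initialized randomly in [0, fs/2]. Windowing and other refinements used in published sinc-convolution layers are omitted, so this is a simplified stand-in rather than the exact structure of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    def __init__(self, n_filters=80, kernel_size=251, fs=16000):
        super().__init__()
        self.kernel_size = kernel_size
        f = torch.sort(torch.rand(n_filters, 2) * fs / 2, dim=1).values
        self.f1 = nn.Parameter(f[:, 0])            # learned low cut-offs
        self.f2 = nn.Parameter(f[:, 1])            # learned high cut-offs
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("n", n / fs)          # time axis in seconds

    def forward(self, x):                          # x: (batch, 1, samples)
        # Ideal band-pass filter: difference of two sinc low-pass filters.
        def lowpass(fc):                           # -> (n_filters, kernel_size)
            fc = fc.abs().unsqueeze(1)
            return 2 * fc * torch.sinc(2 * fc * self.n)
        g = lowpass(self.f2) - lowpass(self.f1)
        g = g / (g.abs().max(dim=1, keepdim=True).values + 1e-8)
        return F.conv1d(x, g.unsqueeze(1), padding=self.kernel_size // 2)

# Usage: out = SincConv()(torch.randn(4, 1, 1600))  # -> (4, 80, 1600)
```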
Preferably, in order to capture different characteristic information in the speech data through different pooling modes, the second extraction network comprises three multi-pooling-based convolution structures (CMPUs). The structure of each CMPU is shown in fig. 4: it comprises a convolution layer followed, in parallel, by an average pooling layer and a maximum pooling layer; the outputs of the two pooling layers are concatenated and fed into the next multi-pooling-based convolution structure. The multi-pooling-based convolution is computed as follows:
S = conv(x)

S1 = Maxpool1d(S)

S2 = Avgpool1d(S)

S = concat(S1, S2)
where x represents the output of a sinc-based convolution structure and S represents the output of a multi-pooling-based convolution structure.
Preferably, all convolution layers in the second extraction network are layer-normalized and the activation function is preferably the ReLU activation function; in other embodiments other activation functions may be selected. Preferably, the dropout value of the convolution layers of the second extraction network is between 0.2 and 0.5; in this embodiment it is preferably 0.5.
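For illustration, a sketch of one multi-pooling convolution unit (CMPU) with the layer normalization, ReLU activation and dropout described above is given below. Kernel sizes, pooling widths and channel counts are illustrative assumptions, the concatenation axis is assumed to be the channel axis, and GroupNorm with a single group is used as a per-sample layer normalization for 1-D convolution outputs.

```python
import torch
import torch.nn as nn

class CMPU(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=5, pool=2, dropout=0.5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
            nn.GroupNorm(1, out_ch),   # acts as per-sample layer normalization
            nn.ReLU(),
            nn.Dropout(dropout))       # dropout in the 0.2-0.5 range
        self.maxp = nn.MaxPool1d(pool)
        self.avgp = nn.AvgPool1d(pool)

    def forward(self, x):              # x: (batch, in_ch, time)
        s = self.conv(x)
        # Concatenate the max-pooled and average-pooled views along channels.
        return torch.cat([self.maxp(s), self.avgp(s)], dim=1)

# Three stacked CMPUs as in the embodiment; the input channel count (80)
# matches the sinc filters, other widths are assumptions.
paralinguistic_net = nn.Sequential(CMPU(80, 64), CMPU(128, 64), CMPU(128, 64))
```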
The second extraction network can extract LLD (low-level descriptor) features from the speech data more effectively. Compared with prior-art approaches that convolve the raw speech directly, the second extraction network convolves the speech with a sinc-function-based convolution module that acts as a filter bank suited to emotion recognition and therefore yields more effective features. This embodiment further provides the multi-pooling convolution unit (CMPU) to extract features, capturing different information through different pooling modes so that the paralinguistic features are captured more comprehensively.
There is no fixed order between the first extraction network and the second extraction network in extracting features from the speech data; the two extractions are performed simultaneously.
The fusion network fuses the paralinguistic features with the deep features. Specifically, the deep features and the paralinguistic features are each mapped to the same dimension by a multi-layer perceptron, as follows:
S1=MLP(X1,64)
S2=MLP(X2,64)
where X1 and X2 denote the deep features and the paralinguistic features respectively, and MLP is the multi-layer perceptron mapping them to 64 dimensions.
The mapped deep features and paralinguistic features are fused by a Concat operation; the fused features have output dimension (B, T, 64), where B denotes the batch size and T the number of frames. The fused features are input into the self-attention network, which captures the global speech representation and obtains the key features in the fused features; the self-attention computation follows the standard scaled dot-product attention described above for the multi-head attention mechanism.
The key features are input into the softmax layer for emotion prediction and classification, and the prediction result is output.
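For illustration, a hedged sketch of this fusion stage is given below. The two feature streams are assumed to be aligned to the same number of frames T; because the text does not state how concatenating two 64-dimensional streams yields the stated (B, T, 64) fused size, the sketch concatenates along the feature axis and projects back to 64 dimensions with a linear layer. The MLP form and the number of emotion classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, deep_dim=64, para_dim=128, d=64, n_classes=4):
        super().__init__()
        self.mlp_deep = nn.Sequential(nn.Linear(deep_dim, d), nn.ReLU())
        self.mlp_para = nn.Sequential(nn.Linear(para_dim, d), nn.ReLU())
        self.fuse = nn.Linear(2 * d, d)    # back to the stated (B, T, 64) size
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.head = nn.Linear(d, n_classes)

    def forward(self, deep, para):         # both assumed shaped (B, T, dim)
        fused = self.fuse(torch.cat(
            [self.mlp_deep(deep), self.mlp_para(para)], dim=-1))  # (B, T, 64)
        key, _ = self.attn(fused, fused, fused)  # self-attention over frames
        logits = self.head(key.mean(dim=1))      # pool over frames
        # logits.softmax(dim=-1) gives the predicted emotion distribution.
        return logits
```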
Preferably, in order to speed up feature extraction from the speech data, interference factors in the speech data are removed before extraction. Before the speech data is fed into the emotion recognition model it undergoes classification preprocessing, which comprises resampling, silence-segment removal and denoising; specifically, the speech data is resampled to 16 kHz, and silence removal and denoising are performed based on VAD (voice activity detection).
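A minimal sketch of this classification preprocessing, assuming librosa and a simple energy-based voice-activity detector in place of a dedicated VAD or denoiser, could be:

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, top_db=30):
    y, _ = librosa.load(path, sr=sr)                   # resample to 16 kHz
    voiced = librosa.effects.split(y, top_db=top_db)   # non-silent intervals
    return np.concatenate([y[s:e] for s, e in voiced]) if len(voiced) else y
```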
Preferably, as shown in fig. 5, the emotion recognition model is trained by the following steps:
Speech data is acquired and training preprocessing is performed on it; as shown in fig. 6, the training preprocessing comprises resampling, silence-segment removal, denoising, data augmentation and data labelling. Compared with the classification preprocessing, the training preprocessing adds two steps, data augmentation and data labelling: data augmentation captures more information in the speech data and thus improves training, while the data labels are used to evaluate the emotion recognition results output during training and improve the accuracy of the emotion recognition model. In this embodiment the data augmentation method is vocal-tract length perturbation.
The speech data is fed into an initial model for training, the initial model being the untrained emotion recognition model: the WP-log-Mel spectrogram features are extracted from the speech data and input into the encoder to obtain deep features; the speech data is segmented and the paralinguistic features of the segmented speech data are extracted; and the paralinguistic features are fused with the deep features to obtain the key features in the fused features.
The emotion recognition model performs emotion prediction and classification according to the key features and outputs the prediction result; the loss between the prediction result and the data labels is computed with a loss function until the model converges, generating the emotion recognition model. In this embodiment the loss function is preferably the PyTorch cross-entropy loss function, which performs better on imbalanced data; during training a dynamic learning-rate strategy is used, reducing the learning rate by a fixed proportion whenever there is no improvement within 20 epochs.
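For illustration, a hedged PyTorch training-loop sketch with the cross-entropy loss and a learning-rate schedule that decays the rate after 20 epochs without improvement is given below; the optimizer, learning rate and decay factor are illustrative assumptions, and the model is assumed to return class logits (the softmax layer being folded into the cross-entropy loss during training).

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=200, device="cpu"):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=20)
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / max(len(val_loader), 1)
        # Reduce the learning rate when validation loss stops improving.
        scheduler.step(val_loss)
```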
Preferably, if the amount of speech data available for training is small, the training preprocessing step further comprises generating a plurality of copies of the speech data with identical data labels, so as to increase the amount of training data.
The prior art also includes methods that first translate the speech data into text and then perform emotion recognition on the text. In this embodiment the speech data is input directly into the emotion recognition model, which avoids the speech-to-text step and the errors it introduces, thereby improving emotion recognition precision.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
Claims (10)
1. A speech emotion recognition and classification method, characterized by comprising the following steps:

placing the speech data into an emotion recognition model;

extracting WP-log-Mel spectrogram features from the speech data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic (secondary language) features of the segmented speech data;

fusing the paralinguistic features with the deep features to obtain the key features in the fused features; and performing, by the emotion recognition model, emotion prediction and classification according to the key features and outputting the prediction result.
2. The method of claim 1, wherein the step of the first extraction network extracting the WP-log-Mel spectrogram features from the speech data is specifically: extracting the WP-log-Mel spectrogram features from the speech signal of the speech data using the wavelet packet transform and audio processing.
3. The speech emotion recognition and classification method of claim 1, wherein the output dimension of the fused features is (B, T, 64), where B denotes the batch size and T the number of frames.
4. A method of classifying speech emotion recognition as claimed in claim 1, 2 or 3, characterized in that the emotion recognition model comprises a first extraction network, a second extraction network, a fusion network, a self-attention mechanism network and a softmax layer;
the first extraction network comprises a Transformer encoder consisting of two or more identical blocks, with two or more heads in each block's multi-head attention mechanism.
5. The speech emotion recognition classification method of claim 4, wherein the second extraction network comprises, in order, a segmentation layer, a sinc-based convolution structure, and a multi-pooling-based convolution structure.
6. The speech emotion recognition classification method of claim 5, wherein the multi-pooling-based convolution structure comprises a convolution layer followed by parallel average pooling layers and maximum pooling layers.
7. The speech emotion recognition classification method of claim 6, wherein the convolutional layers are layer-normalized and the dropout value of the convolutional layers is between 0.2 and 0.5.
8. The method for classifying speech emotion recognition of claim 1, 2, 3, 5, 6 or 7, wherein the step of placing the speech data into the emotion recognition model comprises: performing classification preprocessing on the speech data, the classification preprocessing comprising silence-segment removal and denoising, and placing the processed speech data into the emotion recognition model.
9. The speech emotion recognition classification method of claim 1, 2, 3, 5, 6 or 7, wherein the emotion recognition model is trained by the following steps:

acquiring speech data and performing training preprocessing on the speech data, the training preprocessing comprising silence-segment removal, denoising, data augmentation and data labelling;

feeding the speech data into an initial model for training, the initial model being the untrained emotion recognition model; extracting WP-log-Mel spectrogram features from the speech data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic features of the segmented speech data;

fusing the paralinguistic features with the deep features to obtain the key features in the fused features; performing, by the emotion recognition model, emotion prediction and classification according to the key features and outputting the prediction result;

and computing the loss between the prediction result and the data labels based on a loss function until the model converges, generating the emotion recognition model.
10. A method of speech emotion recognition classification as claimed in claim 8 or 9, wherein the training preprocessing step further comprises generating a plurality of speech copies from the speech data.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211516305.9A (CN116230020A) | 2022-11-29 | 2022-11-29 | Speech emotion recognition and classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116230020A true CN116230020A (en) | 2023-06-06 |
Family
ID=86588018
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211516305.9A (CN116230020A, pending) | Speech emotion recognition and classification method | 2022-11-29 | 2022-11-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116230020A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116504259A (en) * | 2023-06-30 | 2023-07-28 | 中汇丰(北京)科技有限公司 | Semantic recognition method based on natural language processing |
CN116504259B (en) * | 2023-06-30 | 2023-08-29 | 中汇丰(北京)科技有限公司 | Semantic recognition method based on natural language processing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ma et al. | Emotion recognition from variable-length speech segments using deep learning on spectrograms. | |
CN110808033B (en) | Audio classification method based on dual data enhancement strategy | |
Kim et al. | Emotion Recognition from Human Speech Using Temporal Information and Deep Learning. | |
CN111326178A (en) | Multi-mode speech emotion recognition system and method based on convolutional neural network | |
NL2029780B1 (en) | Speech separation method based on time-frequency cross-domain feature selection | |
CN114566189B (en) | Speech emotion recognition method and system based on three-dimensional depth feature fusion | |
CN118248177B (en) | Speech emotion recognition system and method based on approximate nearest neighbor search algorithm | |
CN112183582A (en) | Multi-feature fusion underwater target identification method | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN112927723A (en) | High-performance anti-noise speech emotion recognition method based on deep neural network | |
CN116230020A (en) | Speech emotion recognition and classification method | |
Gao et al. | Mixed-bandwidth cross-channel speech recognition via joint optimization of DNN-based bandwidth expansion and acoustic modeling | |
Kamal et al. | An innovative approach utilizing binary-view transformer for speech recognition task | |
CN113611286B (en) | Cross-language speech emotion recognition method and system based on common feature extraction | |
CN118280371A (en) | Voice interaction method and system based on artificial intelligence | |
Singh et al. | E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks | |
CN109346104A (en) | A kind of audio frequency characteristics dimension reduction method based on spectral clustering | |
Poojary et al. | Speech Emotion Recognition Using MLP Classifier | |
CN106228984A (en) | Voice recognition information acquisition methods | |
Zhou et al. | Environmental sound classification of western black-crowned gibbon habitat based on spectral subtraction and VGG16 | |
CN114626424A (en) | Data enhancement-based silent speech recognition method and device | |
CN114792518A (en) | Voice recognition system based on scheduling domain technology, method thereof and storage medium | |
Thomas et al. | Language identification using deep neural network for Indian languages | |
CN111312215A (en) | Natural speech emotion recognition method based on convolutional neural network and binaural representation | |
CN113327604B (en) | Method for identifying super-phrase voice and language |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |