CN116230020A - Speech emotion recognition and classification method - Google Patents

Speech emotion recognition and classification method

Info

Publication number
CN116230020A
CN116230020A (application CN202211516305.9A)
Authority
CN
China
Prior art keywords
emotion recognition
voice data
features
speech
data
Prior art date
Legal status
Pending
Application number
CN202211516305.9A
Other languages
Chinese (zh)
Inventor
王国伟
朱红坤
贺光华
李奇隆
Current Assignee
Chongqing Chuannan Environmental Protection Technology Co ltd
Original Assignee
Chongqing Chuannan Environmental Protection Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Chuannan Environmental Protection Technology Co ltd filed Critical Chongqing Chuannan Environmental Protection Technology Co ltd
Priority to CN202211516305.9A priority Critical patent/CN116230020A/en
Publication of CN116230020A publication Critical patent/CN116230020A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS; G10 — MUSICAL INSTRUMENTS; ACOUSTICS; G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/63 — Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L15/063 — Training of speech recognition systems (creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/08 — Speech classification or search
    • G10L21/0208 — Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L25/24 — Analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/30 — Analysis techniques using neural networks
    • G10L25/78 — Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of speech recognition and provides a speech emotion recognition and classification method comprising the following steps: placing voice data into an emotion recognition model; extracting WP-log-Mel spectrogram features from the voice data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the voice data and extracting paralinguistic features from the segmented voice data; fusing the paralinguistic features with the deep features and obtaining key features from the fused features; and performing emotion prediction classification with the emotion recognition model according to the key features, and outputting the prediction classification result. The invention improves the processing of non-stationary speech signals and, by combining the effective features in the speech signal, improves the precision and accuracy of speech emotion recognition results.

Description

Speech emotion recognition and classification method
Technical Field
The invention belongs to the technical field of voice emotion recognition, and particularly relates to a voice emotion recognition classification method.
Background
Humans judge emotion through expressions, voice, gestures and the like; among these, voice is the most direct and effective communication bridge between people and the fastest, most efficient medium in human-computer interaction. Speech emotion recognition and analysis has therefore gradually become an active research field, and its results are widely applied in artificial intelligence, robotics, natural human-computer interaction and other fields; for example, speech emotion quality inspection of telephone customer service in various industries checks service quality and improves the efficiency of quality inspectors and the level of quality-inspection management.
In the prior art, speech emotion recognition generally extracts a log-Mel spectrogram from the original speech by dividing the speech signal into linearly and equally spaced frequency bands. Such a method can only extract features from stationary speech and handles non-stationary speech signals poorly, so the accuracy of speech emotion recognition is low. In addition, the prior art usually extracts deep features on the basis of hand-designed features and ignores effective features in the original speech signal, such as pitch and formants, which reduces speech emotion recognition accuracy.
Disclosure of Invention
The invention aims to solve at least the above technical problems in the prior art, and provides a speech emotion recognition and classification method that improves the processing of non-stationary speech signals and, by combining the effective features in the speech signal, improves the precision and accuracy of speech emotion recognition results.
To achieve the above object, according to a first aspect of the present invention, there is provided a speech emotion recognition and classification method comprising the following steps: placing voice data into an emotion recognition model; extracting WP-log-Mel spectrogram features from the voice data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the voice data and extracting paralinguistic features from the segmented voice data; fusing the paralinguistic features and the deep features to obtain key features from the fused features; and performing emotion prediction classification with the emotion recognition model according to the key features, and outputting a prediction classification result.
Further, the step of extracting WP-log-Mel spectrogram features from the voice data by the first extraction network is specifically: extracting WP-log-Mel spectrogram features from the speech signal of the voice data using the wavelet packet transform and audio processing.
Further, the output dimension of the fused features is (B, T, 64), where B denotes the batch and T denotes the frame length.
Further, the emotion recognition model includes a first extraction network, a second extraction network, a fusion network, a self-attention mechanism network, and a softmax layer; the first extraction network comprises a Transformer encoder, the Transformer encoder comprises more than two identical blocks, and the number of heads in the multi-head attention mechanism of each block is more than two.
Further, the second extraction network comprises a segmentation layer, a sinc-based convolution structure and a multi-pooling-based convolution structure in sequence.
Further, the multi-pooling-based convolution structure comprises a convolution layer, followed in parallel by an average pooling layer and a maximum pooling layer.
Further, the convolution layers are normalized, and the dropout value of the convolution layers is between 0.2 and 0.5.
Further, the step of placing the voice data into the emotion recognition model specifically includes: performing classification preprocessing on the voice data, the classification preprocessing comprising silence segment removal and denoising, and placing the processed voice data into the emotion recognition model.
Further, the emotion recognition model is trained by the following steps: acquiring voice data and performing training preprocessing on the voice data, the training preprocessing comprising silence segment removal, denoising, data enhancement and data labeling; placing the voice data into an initial model for training, the initial model being an untrained emotion recognition model; extracting WP-log-Mel spectrogram features from the voice data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the voice data and extracting paralinguistic features from the segmented voice data; fusing the paralinguistic features and the deep features to obtain key features from the fused features; performing emotion prediction classification with the emotion recognition model according to the key features and outputting a prediction classification result; and computing the loss between the prediction classification result and the data labels based on a loss function until the model converges, generating the emotion recognition model.
Further, the training preprocessing step further includes generating a plurality of voice copies from the voice data.
The basic principle and beneficial effects of the invention are as follows. The invention replaces the Fourier transform in log-Mel spectrogram feature extraction with the wavelet packet transform to obtain WP-log-Mel spectrogram features from which deep features in the speech signal are extracted; the wavelet packet transform is multi-resolution, so the signal can be observed progressively from coarse to fine. Compared with the prior art, the invention also extracts paralinguistic features from the voice data and fuses the paralinguistic features with the deep features to recognize the emotion in the voice data, making the emotion recognition model more robust and improving the precision and accuracy of the emotion recognition results.
Drawings
FIG. 1 is a schematic diagram of a speech emotion recognition method of the present invention;
FIG. 2 is a logical schematic of the WP-log-Mel acquisition process of the present invention;
FIG. 3 is a schematic diagram of the structure of a sinc-based convolution structure of the present invention;
FIG. 4 is a schematic diagram of the multi-pooling-based convolution structure of the present invention;
FIG. 5 is a schematic diagram of the emotion recognition model of the present invention;
FIG. 6 is a schematic diagram of the steps of the training pre-process of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and defined, the terms "mounted," "connected," and "coupled" are to be construed broadly: a connection may be, for example, mechanical or electrical, direct or indirect through intermediaries, or an internal communication between two elements; those skilled in the art will understand the specific meaning of these terms in the present invention according to the specific circumstances.
As shown in fig. 1, the invention provides a voice emotion recognition classification method, which comprises the following steps:
placing the voice data into an emotion recognition model; the emotion recognition model comprises a first extraction network, a second extraction network, a fusion network, a self-attention mechanism network and a softmax layer;
the first extraction network extracts WP-log-Mel spectrogram features from the speech signal of the voice data using the wavelet packet transform and audio processing, and inputs the spectrogram features into an encoder to obtain deep features;
Specifically, the WP-log-Mel spectrogram features are log-Mel spectrogram features based on the wavelet packet transform. The extraction process is shown in fig. 2: the voice data is input; wavelet packet decomposition yields the wavelet packet coefficients of each frequency band; wavelet packet reconstruction yields the spectrum of each band's coefficients; the spectra of the bands are spliced in frequency order to obtain a complete spectrum; a Mel filter bank is constructed with an audio processing library and the energy passing through the Mel filter bank is computed to obtain the Mel spectrogram; the Mel spectrogram is then log-compressed to obtain the WP-log-Mel spectrogram features, and the spectrogram feature matrix is input into a Transformer encoder to obtain the deep features.
In this embodiment, the voice data is divided into 46 ms frames with a frame shift of 23 ms, yielding 260 frames; in other implementations, the frame width and frame shift can be set to other values. In this embodiment, the wavelet basis used in the wavelet packet transform is the Daubechies 4 (db4) wavelet and the number of Mel filters is set to 64; in other embodiments, other wavelet bases and filter counts may be selected according to the application scenario. In this embodiment, the audio processing library used is the librosa library; in other embodiments, an audio processing library with the same effect may be used.
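For illustration only, the following is a minimal Python sketch of the WP-log-Mel idea described above, using PyWavelets and librosa. It approximates the spliced wavelet-packet spectrum with the level-N coefficient bands ordered by frequency; the decomposition level, the helper name wp_log_mel and the n_fft-matching trick are assumptions and not part of the claimed method.

```python
import numpy as np
import pywt
import librosa

def wp_log_mel(y, sr=16000, level=6, n_mels=64):
    """Simplified WP-log-Mel sketch: wavelet-packet bands stand in for FFT bins."""
    wp = pywt.WaveletPacket(data=y, wavelet="db4", mode="symmetric", maxlevel=level)
    # Level-`level` leaf nodes ordered from low to high frequency (2**level bands).
    nodes = wp.get_level(level, order="freq")
    coeffs = np.stack([node.data for node in nodes])     # (2**level, frames)
    power = coeffs ** 2                                   # per-band energy over time
    # Mel filter bank built with librosa; n_fft is chosen so the number of FFT
    # bins equals the number of wavelet-packet bands (an illustrative assumption).
    n_bands = coeffs.shape[0]
    mel_fb = librosa.filters.mel(sr=sr, n_fft=2 * (n_bands - 1), n_mels=n_mels)
    mel = mel_fb @ power                                  # (n_mels, frames)
    return np.log(mel + 1e-10).T                          # (frames, n_mels)

features = wp_log_mel(np.random.randn(16000))             # one second of dummy audio
```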
Specifically, the first extraction network includes a Transformer encoder; the Transformer encoder includes more than two identical blocks, and the number of heads in the multi-head attention mechanism of each block is more than two. In this embodiment, the first extraction network uses layer normalization; the Transformer encoder consists of 6 identical blocks, the number of heads in the multi-head attention mechanism of each block is 8, and the 260×64 spectrogram feature matrix is input into the Transformer encoder to obtain the deep features.
The computations in each block, including the multi-head attention and the GELU-activated feed-forward layer, are as follows:

Q_i = X·W_i^Q
K_i = X·W_i^K
V_i = X·W_i^V
H_i = softmax(Q_i·K_i^T / √H)·V_i
MHA = concat(H_1, H_2, ..., H_8)·W
Out = LayerNorm(out)
Out = gelu(out)

wherein X represents the input speech feature sequence, i represents the i-th attention head, Q_i, K_i and V_i represent the query, key and value of the i-th head respectively, W represents a learned linear transformation weight matrix, H represents the hidden dimension of an attention head, H_i represents the attention score of the i-th head, K_i^T represents the transpose of K_i, and MHA represents the multi-head attention output.
In other embodiments, other activation functions may be selected for the feed-forward layer, and the number of blocks and the number of heads in the multi-head attention mechanism of each block may be set as required.
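As a hedged sketch of the first extraction network, the standard PyTorch Transformer encoder below mirrors the numbers stated in this embodiment (6 blocks, 8 heads, GELU feed-forward, layer normalization); the feed-forward width and dropout value are assumptions, since they are not specified above.

```python
import torch
import torch.nn as nn

# Six identical blocks with 8-head attention over the (B, 260, 64) spectrogram features.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=64,            # 64-dimensional WP-log-Mel frames
    nhead=8,               # heads per multi-head attention block
    dim_feedforward=256,   # assumed feed-forward width
    dropout=0.1,           # assumed dropout
    activation="gelu",     # GELU, as in the formulas above
    batch_first=True,
)
deep_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

spectrogram = torch.randn(8, 260, 64)       # (B, T=260 frames, 64 Mel bands)
deep_features = deep_encoder(spectrogram)   # (B, 260, 64) deep features
```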
The first extraction network has the following advantages. In the prior art, convolutional neural networks, recurrent neural networks and the like are generally used to extract deep features; such network structures suffer from the long-term dependence problem, which degrades extraction performance and lengthens training time. When processing long voice data, the data must be segmented into blocks before being input into the model, and with a recurrent neural network each block must also be input in sequence. The present embodiment instead extracts deep features of the voice data with a first extraction network built around the multi-head attention mechanism of a Transformer encoder, completely avoiding recursion. Thanks to the multi-head attention mechanism and position embedding, deep features in the voice data can be extracted accurately without feeding the data in sequence, and the multi-head attention mechanism markedly reduces training time and resource requirements while dispensing with recurrence and convolution.
The second extraction network segments the voice data and extracts paralinguistic features from the segmented voice data. Specifically, the second extraction network comprises, in order, a segmentation layer, a sinc-based convolution structure, and a multi-pooling-based convolution structure (CMPU). The segmentation layer segments the voice data; the sinc-based convolution structure and the multi-pooling-based convolution structure (CMPU) extract the paralinguistic features from the segmented voice data, the paralinguistic features including the pitch and formants in the voice data.
In this embodiment, the segmentation specifically divides the waveform of the voice data into 100 ms blocks with a 5 ms overlap between adjacent blocks; for example, the first block spans 1 ms to 100 ms, the second 95 ms to 195 ms, and the third 190 ms to 290 ms. In other embodiments, the block length and overlap length may be set as required.
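A minimal sketch of this segmentation step follows, assuming 16 kHz audio; the helper name segment_waveform is illustrative.

```python
import numpy as np

def segment_waveform(y, sr=16000, block_ms=100, overlap_ms=5):
    """Split a waveform into 100 ms blocks overlapping by 5 ms, as in this embodiment."""
    block = int(sr * block_ms / 1000)            # 1600 samples at 16 kHz
    step = block - int(sr * overlap_ms / 1000)   # hop of 1520 samples
    starts = range(0, max(len(y) - block, 0) + 1, step)
    return np.stack([y[s:s + block] for s in starts])   # (num_blocks, block_samples)

blocks = segment_waveform(np.random.randn(16000))        # one second -> ~10 blocks
```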
The structure of the sinc-based convolution structure is shown in fig. 3. The sinc-based convolution structure uses a bank of filters; in this embodiment the filter length is 251 and the number of filters is 80. In other embodiments the length and number of filters may be selected as required. The calculation process of the sinc-based convolution structure is as follows:
y[n]=x[n]*g[n,f1,f2]
g[n, f1, f2] = 2·f2·sinc(2π·f2·n) − 2·f1·sinc(2π·f1·n), with sinc(x) = sin(x)/x
wherein y[n] represents the output of the sinc-based convolution structure, x[n] represents the input speech, n represents the time index, and f1 and f2 represent the learned low and high cut-off frequencies respectively; the cut-off frequencies can be randomly initialized within the range [0, fs/2], where fs represents the sampling frequency of the input signal.
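For illustration, the sketch below generates one bank of sinc band-pass kernels of length 251 with cut-offs initialized at random in [0, fs/2], following the formulas above. The Hamming window and the helper name sinc_kernel are assumptions not stated in the text.

```python
import numpy as np

def sinc_kernel(f1, f2, length=251, fs=16000):
    """One band-pass kernel g[n, f1, f2]: difference of two low-pass sinc filters."""
    n = np.arange(-(length // 2), length // 2 + 1)
    def lowpass(fc):
        # 2*(fc/fs)*sinc(2*fc*n/fs); note np.sinc(x) = sin(pi*x)/(pi*x)
        return 2 * (fc / fs) * np.sinc(2 * fc * n / fs)
    return (lowpass(f2) - lowpass(f1)) * np.hamming(length)   # window is an assumption

# 80 filters with randomly initialized low/high cut-off pairs in [0, fs/2].
fs = 16000
cutoffs = np.sort(np.random.uniform(0, fs / 2, size=(80, 2)), axis=1)
filter_bank = np.stack([sinc_kernel(f1, f2, 251, fs) for f1, f2 in cutoffs])   # (80, 251)
```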
Preferably, in order to capture different characteristic information in the voice data through different pooling modes, the second extraction network comprises three multi-pooling-based convolution structures (CMPU). The structure of each CMPU is shown in fig. 4: it comprises a convolution layer followed by an average pooling layer and a maximum pooling layer arranged in parallel; the outputs of the two pooling layers are concatenated and then input into the next multi-pooling-based convolution structure. The calculation process of the multi-pooling-based convolution structure is as follows:
S=conv(x)
S1=Maxpool1d(S)
S2=Avgpool1d(S)
S=concat(S1,S2)
where x represents the output of a sinc-based convolution structure and S represents the output of a multi-pooling-based convolution structure.
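The following PyTorch sketch is one possible reading of a multi-pooling convolution unit and of stacking three of them; the channel counts, kernel size, pooling stride and the channel-axis concatenation are assumptions, since the text does not fix them.

```python
import torch
import torch.nn as nn

class CMPU(nn.Module):
    """Convolution followed by parallel average/max pooling whose outputs are concatenated."""
    def __init__(self, in_ch, out_ch, kernel=5, pool=2, dropout=0.5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=kernel, padding=kernel // 2)
        self.norm = nn.LayerNorm(out_ch)       # layer normalization, per the text
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)        # dropout in [0.2, 0.5]
        self.maxpool = nn.MaxPool1d(pool)
        self.avgpool = nn.AvgPool1d(pool)

    def forward(self, x):                      # x: (B, in_ch, T)
        s = self.conv(x)
        s = self.norm(s.transpose(1, 2)).transpose(1, 2)
        s = self.drop(self.act(s))
        return torch.cat([self.maxpool(s), self.avgpool(s)], dim=1)   # channel concat

# Three stacked units fed by the 80-channel sinc convolution output (sizes assumed).
cmpu_stack = nn.Sequential(CMPU(80, 64), CMPU(128, 64), CMPU(128, 64))
paraling = cmpu_stack(torch.randn(8, 80, 1600))   # (8, 128, 200) paralinguistic features
```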
Preferably, all convolution layers in the second extraction network are layer normalized, and the activation function is preferably the ReLU activation function; in other embodiments, other activation functions may be selected. Preferably, the dropout value of the convolution layers of the second extraction network is between 0.2 and 0.5; in this embodiment, the dropout value is preferably 0.5.
The second extraction network can extract LLD (low-level descriptor) features from voice data more effectively. Compared with the prior art, which convolves the voice data directly, the second extraction network convolves the voice data with a sinc-function-based convolution module that learns a filter bank suited to emotion recognition, providing more effective features. This embodiment further extracts features with the multi-pooling-based convolution unit (CMPU); different pooling modes capture different information, so the paralinguistic features are captured more comprehensively.
The first extraction network and the second extraction network do not operate in a fixed order when extracting from the voice data; they perform their extraction simultaneously.
The fusion network fuses the paralinguistic features and the deep features. Specifically, the deep features and the paralinguistic features are each mapped to the same dimension through a multi-layer perceptron; the mapping process is as follows:
S1=MLP(X1,64)
S2=MLP(X2,64)
wherein X1 and X2 represent the deep features and the paralinguistic features respectively, 64 is the target dimension, and MLP is a multi-layer perceptron.
The mapped deep features and paralinguistic features are fused through a concat operation; the output dimension of the fused features is (B, T, 64), where B denotes the batch and T denotes the frame length. The fused features are input into the self-attention mechanism network, which captures the global speech representation and obtains the key features in the fused features; the calculation process of the self-attention mechanism is as follows:
Q = S·W^Q
K = S·W^K
V = S·W^V
Attention(Q, K, V) = softmax(Q·K^T / √d)·V

wherein S represents the fused features, W^Q, W^K and W^V represent learned weight matrices, and d represents the feature dimension.
The key features are then input into the softmax layer for emotion prediction classification, and the prediction classification result is output.
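The sketch below ties the fusion network, the self-attention network and the softmax layer together in PyTorch. The extra projection back to 64 dimensions (to match the stated (B, T, 64) fused shape), the mean pooling over frames, the input dimensions and the number of emotion classes are all assumptions, and the two feature streams are assumed to be time-aligned.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, deep_dim=64, para_dim=128, n_classes=4):
        super().__init__()
        self.mlp_deep = nn.Linear(deep_dim, 64)   # map deep features to 64 dims
        self.mlp_para = nn.Linear(para_dim, 64)   # map paralinguistic features to 64 dims
        self.proj = nn.Linear(128, 64)            # assumed projection back to (B, T, 64)
        self.attn = nn.MultiheadAttention(embed_dim=64, num_heads=1, batch_first=True)
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, deep, para):                # both: (B, T, feature_dim)
        fused = torch.cat([self.mlp_deep(deep), self.mlp_para(para)], dim=-1)
        fused = self.proj(fused)                  # (B, T, 64) fused features
        attended, _ = self.attn(fused, fused, fused)   # self-attention over frames
        key = attended.mean(dim=1)                # pooled utterance-level key features
        return torch.softmax(self.classifier(key), dim=-1)   # emotion probabilities

clf = FusionClassifier()
probs = clf(torch.randn(8, 260, 64), torch.randn(8, 260, 128))   # (8, 4)
```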
Preferably, in order to speed up feature extraction, interference factors in the voice data are removed before extraction. Before the voice data is put into the emotion recognition model, classification preprocessing is performed on it; the classification preprocessing comprises sampling, silence segment removal and denoising. Specifically, the voice data is resampled to 16 kHz, and silence segments are removed and the data denoised based on VAD (voice activity detection).
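A hedged preprocessing sketch is given below: it resamples to 16 kHz and drops silent segments, with librosa's energy-based splitting standing in for the VAD step; the top_db threshold is an assumption and denoising is omitted for brevity.

```python
import numpy as np
import librosa

def classify_preprocess(path, target_sr=16000, top_db=30):
    """Resample to 16 kHz and remove silent segments before emotion recognition."""
    y, _ = librosa.load(path, sr=target_sr)               # resample to 16 kHz
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent intervals
    return np.concatenate([y[s:e] for s, e in intervals])
```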
Preferably, as shown in fig. 5, the emotion recognition model is trained by the following steps:
Voice data is acquired and training preprocessing is performed on it; as shown in fig. 6, the training preprocessing comprises sampling, silence segment removal, denoising, data enhancement and data labeling. Compared with the classification preprocessing, the training preprocessing adds two steps: data enhancement and data labeling. Data enhancement captures more information in the voice data and thereby improves training; data labeling provides the ground truth for judging the emotion recognition results output during training and improves the accuracy of the emotion recognition model. In this embodiment, the data enhancement method used is vocal tract length perturbation.
The voice data is placed into an initial model for training, the initial model being an untrained emotion recognition model. WP-log-Mel spectrogram features are extracted from the voice data and input into the encoder to obtain deep features; the voice data is segmented and paralinguistic features are extracted from the segmented voice data; the paralinguistic features and the deep features are fused to obtain key features from the fused features;
the emotion recognition model performs emotion prediction classification according to the key features and outputs a prediction classification result, and the loss between the prediction classification result and the data labels is computed based on a loss function until the model converges, generating the emotion recognition model. In this embodiment, the loss function is preferably the PyTorch cross-entropy loss function, which handles unbalanced data well; during training a dynamic learning-rate strategy is used, reducing the learning rate by a fixed proportion if there is no improvement within 20 epochs.
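The following training-loop sketch matches the description above: PyTorch cross-entropy loss and a learning rate that is reduced when 20 epochs bring no improvement. The placeholder model, dummy dataset, optimizer choice and all hyper-parameters are assumptions; the model is assumed to output raw logits ahead of the softmax layer, since CrossEntropyLoss applies log-softmax internally.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for the real emotion recognition model and labeled dataset (assumptions).
model = nn.Sequential(nn.Flatten(), nn.Linear(260 * 64, 4))            # placeholder network
dataset = TensorDataset(torch.randn(32, 260, 64), torch.randint(0, 4, (32,)))
train_loader = DataLoader(dataset, batch_size=8, shuffle=True)

criterion = nn.CrossEntropyLoss()                                      # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=20)                    # decay after 20 stagnant epochs

for epoch in range(100):                                               # assumed epoch budget
    epoch_loss = 0.0
    for features, labels in train_loader:
        logits = model(features)                 # prediction classification result (logits)
        loss = criterion(logits, labels)         # loss against the data labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)                   # dynamic learning-rate strategy
```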
Preferably, if the amount of voice data available for training is small, the training preprocessing step further includes generating a plurality of copies of each voice sample, the copies sharing the same data label, so as to increase the amount of voice data used for training.
The prior art also includes methods that transcribe voice data into text before performing emotion recognition. In this embodiment, the voice data is input directly into the emotion recognition model, avoiding the speech-to-text transcription step and the errors it introduces, and thereby improving emotion recognition precision.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A speech emotion recognition and classification method, characterized by comprising the following steps:
placing the voice data into an emotion recognition model;
extracting WP-log-Mel spectrogram features from the voice data, and inputting the spectrogram features into an encoder to obtain deep features; segmenting the voice data, and extracting paralinguistic features from the segmented voice data;
fusing the paralinguistic features and the deep features to obtain key features from the fused features; and performing emotion prediction classification with the emotion recognition model according to the key features, and outputting a prediction classification result.
2. The method of claim 1, wherein the step of extracting WP-log-Mel spectrogram features from the voice data by the first extraction network is specifically: extracting WP-log-Mel spectrogram features from the speech signal of the voice data using the wavelet packet transform and audio processing.
3. The speech emotion recognition classification method of claim 1, wherein the output dimension of the fused features is (B, T, 64), where B denotes the batch and T denotes the frame length.
4. The speech emotion recognition classification method of claim 1, 2 or 3, wherein the emotion recognition model comprises a first extraction network, a second extraction network, a fusion network, a self-attention mechanism network and a softmax layer;
the first extraction network comprises a Transformer encoder, the Transformer encoder comprises more than two identical blocks, and the number of heads in the multi-head attention mechanism of each block is more than two.
5. The speech emotion recognition classification method of claim 4, wherein the second extraction network comprises, in order, a segmentation layer, a sinc-based convolution structure, and a multi-pooling-based convolution structure.
6. The speech emotion recognition classification method of claim 5, wherein the multi-pooling-based convolution structure comprises a convolution layer followed by an average pooling layer and a maximum pooling layer arranged in parallel.
7. The speech emotion recognition classification method of claim 6, wherein the convolution layers are normalized, and the dropout value of the convolution layers is between 0.2 and 0.5.
8. The speech emotion recognition classification method of claim 1, 2, 3, 5, 6 or 7, wherein the step of placing the voice data into the emotion recognition model specifically comprises: performing classification preprocessing on the voice data, the classification preprocessing comprising silence segment removal and denoising, and placing the processed voice data into the emotion recognition model.
9. The speech emotion recognition classification method of claim 1, 2, 3, 5, 6 or 7, wherein the emotion recognition model is trained by the following steps:
acquiring voice data and performing training preprocessing on the voice data, the training preprocessing comprising silence segment removal, denoising, data enhancement and data labeling;
placing the voice data into an initial model for training, the initial model being an untrained emotion recognition model; extracting WP-log-Mel spectrogram features from the voice data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the voice data and extracting paralinguistic features from the segmented voice data;
fusing the paralinguistic features and the deep features to obtain key features from the fused features; performing emotion prediction classification with the emotion recognition model according to the key features and outputting a prediction classification result;
and computing the loss between the prediction classification result and the data labels based on a loss function until the model converges, generating the emotion recognition model.
10. A method of speech emotion recognition classification as claimed in claim 8 or 9, wherein the training preprocessing step further comprises generating a plurality of speech copies from the speech data.
CN202211516305.9A 2022-11-29 2022-11-29 Speech emotion recognition and classification method Pending CN116230020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211516305.9A CN116230020A (en) 2022-11-29 2022-11-29 Speech emotion recognition and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211516305.9A CN116230020A (en) 2022-11-29 2022-11-29 Speech emotion recognition and classification method

Publications (1)

Publication Number Publication Date
CN116230020A true CN116230020A (en) 2023-06-06

Family

ID=86588018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211516305.9A Pending CN116230020A (en) 2022-11-29 2022-11-29 Speech emotion recognition and classification method

Country Status (1)

Country Link
CN (1) CN116230020A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504259A (en) * 2023-06-30 2023-07-28 中汇丰(北京)科技有限公司 Semantic recognition method based on natural language processing
CN116504259B (en) * 2023-06-30 2023-08-29 中汇丰(北京)科技有限公司 Semantic recognition method based on natural language processing

Similar Documents

Publication Publication Date Title
Ma et al. Emotion recognition from variable-length speech segments using deep learning on spectrograms.
CN110808033B (en) Audio classification method based on dual data enhancement strategy
Kim et al. Emotion Recognition from Human Speech Using Temporal Information and Deep Learning.
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
NL2029780B1 (en) Speech separation method based on time-frequency cross-domain feature selection
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
CN118248177B (en) Speech emotion recognition system and method based on approximate nearest neighbor search algorithm
CN112183582A (en) Multi-feature fusion underwater target identification method
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN112927723A (en) High-performance anti-noise speech emotion recognition method based on deep neural network
CN116230020A (en) Speech emotion recognition and classification method
Gao et al. Mixed-bandwidth cross-channel speech recognition via joint optimization of DNN-based bandwidth expansion and acoustic modeling
Kamal et al. An innovative approach utilizing binary-view transformer for speech recognition task
CN113611286B (en) Cross-language speech emotion recognition method and system based on common feature extraction
CN118280371A (en) Voice interaction method and system based on artificial intelligence
Singh et al. E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks
CN109346104A (en) A kind of audio frequency characteristics dimension reduction method based on spectral clustering
Poojary et al. Speech Emotion Recognition Using MLP Classifier
CN106228984A (en) Voice recognition information acquisition methods
Zhou et al. Environmental sound classification of western black-crowned gibbon habitat based on spectral subtraction and VGG16
CN114626424A (en) Data enhancement-based silent speech recognition method and device
CN114792518A (en) Voice recognition system based on scheduling domain technology, method thereof and storage medium
Thomas et al. Language identification using deep neural network for Indian languages
CN111312215A (en) Natural speech emotion recognition method based on convolutional neural network and binaural representation
CN113327604B (en) Method for identifying super-phrase voice and language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination