CN111145787B - Voice emotion feature fusion method and system based on main and auxiliary networks - Google Patents

Voice emotion feature fusion method and system based on main and auxiliary networks

Info

Publication number
CN111145787B
CN111145787B (application CN201911368375.2A)
Authority
CN
China
Prior art keywords
voice
features
emotion data
parameters
auxiliary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911368375.2A
Other languages
Chinese (zh)
Other versions
CN111145787A (en)
Inventor
张雪英
胡德生
张静
黄丽霞
牛溥华
李凤莲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201911368375.2A
Publication of CN111145787A
Application granted
Publication of CN111145787B
Legal status: Active
Anticipated expiration

Classifications

    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique


Abstract

The invention provides a speech emotion feature fusion method and system based on a main network and an auxiliary network. The method comprises: respectively inputting a plurality of first features and the second features corresponding to each piece of speech emotion data in the test set into the lower half of a main network model with parameters and into an auxiliary network model with parameters, to obtain the main network high-level features and the auxiliary network high-level features corresponding to each piece of speech emotion data; performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features, and determining the main and auxiliary network fusion features corresponding to each piece of speech emotion data; and inputting the main and auxiliary network fusion features corresponding to each piece of speech emotion data into the upper half of the main network model with parameters to obtain the fusion features. The invention fuses multiple types of features effectively and improves the accuracy of speech emotion recognition.

Description

Voice emotion feature fusion method and system based on main and auxiliary networks
Technical Field
The invention relates to the technical field of emotion feature fusion, and in particular to a voice emotion feature fusion method and system based on a main network and an auxiliary network.
Background
Emotional state is an important factor in communication between people and in effective human-machine communication. Enabling natural communication between people and machines, so that machines simultaneously possess the human capabilities of speaking, thinking and feeling, has always been a goal pursued in the field of artificial intelligence. Research on speech emotion recognition (SER) will promote the realization of this goal, and its results can be widely applied in fields such as human-computer interaction, telemedicine, e-learning, criminal investigation and emotional counseling, so research on speech emotion recognition has important significance and practical value.
Existing speech emotion recognition systems mainly fuse different types of features and adopt various hybrid network structures in order to further improve recognition accuracy. However, these methods have two major problems:
First, there is no effective mechanism for fusing different types of speech emotion features. The current mainstream practice is simply to concatenate different types of features as the input of the recognition network. However, because different features differ in dimensionality, scale and actual physical meaning, the feature types interfere with each other and the resulting accuracy falls short of the ideal.
Second, the correspondence between the network output and the actual annotation is unreasonable. More specifically, if the temporal unrolling width of a recognizer built around an LSTM recurrent neural network is T, the LSTM produces one output at each time step; associating the output at every time step with a single emotion type is obviously unreasonable.
Disclosure of Invention
Based on this, the invention aims to provide a voice emotion feature fusion method and system based on a main network and an auxiliary network, so as to improve the accuracy of speech emotion recognition.
In order to achieve the above object, the present invention provides a method for fusing speech emotion characteristics based on a primary and secondary network, the method comprising:
step S1: determining a training set and a test set;
step S2: determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
step S3: inputting a plurality of first features corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level features corresponding to the voice emotion data;
step S4: inputting the second characteristics corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level characteristics corresponding to the voice emotion data;
step S5: performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features in a main and auxiliary network mode, and determining main and auxiliary network fusion features corresponding to the voice emotion data;
step S6: inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
Optionally, determining the training set and the test set specifically includes:
step S11: determining a speech emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
step S12: determining a standard database according to the voice emotion database;
step S13: performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features;
step S14: respectively carrying out standardization processing on a plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic;
step S15: selecting the first features and second features corresponding to 243 pieces of the voice emotion data as a training set; and using the first features and second features corresponding to the remaining 120 pieces of speech emotion data as a test set.
Optionally, performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features, specifically including:
step S131: performing MFCC frame feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC frame features;
step S132: after averaging the plurality of voice MFCC frame characteristics, obtaining a plurality of voice MFCC segment characteristics;
step S133: and carrying out global feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech global features.
Optionally, the determining the standard database according to the speech emotion database specifically includes:
step S121: judging whether each piece of voice emotion data in the voice emotion database is larger than a set voice frame length; if the speech emotion data are larger than the set speech frame length, adopting truncation operation to enable the speech emotion data to be equal to the set speech frame length, and placing the processed speech emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted to enable the voice emotion data to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each voice emotion data is equal to the length of a set voice frame, directly putting each voice emotion data into the standard database.
Optionally, determining a main network model with parameters, an auxiliary network model with parameters, and auxiliary parameters by using the training set specifically includes:
step S21: determining main network model parameters by using a plurality of first characteristics corresponding to each piece of speech emotion data in the training set;
step S22: and determining auxiliary network model parameters and auxiliary parameters by using a plurality of second characteristics corresponding to each piece of speech emotion data in the training set.
Optionally, determining a main network model parameter by using a plurality of first features corresponding to each piece of speech emotion data in the training set specifically includes:
step S211: keeping the auxiliary parameters and the auxiliary network model parameters unchanged, inputting a plurality of first characteristics corresponding to each voice emotion data in the training set into a main network, training by adopting a gradient descent algorithm to obtain main network model parameters, and determining a first cost function value by utilizing a cost function formula;
step S212: judging whether the first cost function value is larger than or equal to a first set value or not; if the first cost function value is larger than or equal to the first set value, returning to the step S211; and if the first cost function value is smaller than a first set value, outputting the main network model parameters.
Optionally, the determining an auxiliary network model parameter and an auxiliary parameter by using a plurality of second features corresponding to each piece of speech emotion data in the training set specifically includes:
step S221: keeping the parameters of the main network model unchanged, inputting a plurality of second characteristics corresponding to the speech emotion data in the training set into an auxiliary network, training by adopting a gradient descent algorithm to obtain auxiliary network model parameters and auxiliary parameters, and determining a second cost function value by utilizing a cost function formula;
step S222: judging whether the second cost function value is larger than or equal to a second set value; if the second cost function value is greater than or equal to the second set value, returning to step S221; and if the second cost function value is smaller than the second set value, outputting the auxiliary network model parameters and the auxiliary parameter.
The invention also provides a voice emotion feature fusion system based on the main and auxiliary networks, which comprises the following components:
the set determining module is used for determining a training set and a testing set;
the model determining module is used for determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
a main network high-level feature determination module, configured to input the multiple first features corresponding to the voice emotion data in the test set into a lower half of the main network model with parameters, and obtain a main network high-level feature corresponding to the voice emotion data;
the auxiliary network high-level feature determining module is used for inputting the second features corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain the auxiliary network high-level features corresponding to the voice emotion data;
the main network fusion characteristic determining module is used for performing characteristic fusion on the main network high-level characteristics, the auxiliary parameters and the auxiliary network high-level characteristics in a main network mode and an auxiliary network mode, and determining main network fusion characteristics corresponding to the voice emotion data;
and the fusion characteristic determining module is used for inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
Optionally, the set determining module specifically includes:
the voice emotion database determining unit is used for determining a voice emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
the standard database determining unit is used for determining a standard database according to the voice emotion database;
the feature extraction unit is used for performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features;
the normalization processing unit is used for respectively normalizing the plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic;
the set determining unit is used for selecting 243 first features and second features corresponding to the voice emotion data as a training set; and using the first characteristics and the second characteristics corresponding to the remaining 120 pieces of speech emotion data as a test set.
Optionally, the feature extraction unit specifically includes:
the first extraction subunit is used for performing MFCC frame feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC frame features;
the average processing subunit is configured to perform average processing on the multiple voice MFCC frame features to obtain multiple voice MFCC segment features;
and the second extraction subunit is used for carrying out global feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech global features.
Optionally, the standard database determining unit specifically includes:
the judging subunit is used for judging whether each piece of speech emotion data in the speech emotion database is larger than a set speech frame length; if the speech emotion data are larger than the set speech frame length, adopting truncation operation to enable the speech emotion data to be equal to the set speech frame length, and placing the processed speech emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted to enable the voice emotion data to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each voice emotion data is equal to the length of a set voice frame, directly putting each voice emotion data into the standard database.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a voice emotion feature fusion method and system based on a main network and an auxiliary network, wherein the method comprises the following steps: respectively inputting a plurality of first features and second features corresponding to the voice emotion data in the test set into a lower half part of the main network model and an auxiliary network model with parameters to obtain a main network high-level feature and an auxiliary network high-level feature corresponding to the voice emotion data; performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features in a main and auxiliary network mode, and determining main and auxiliary network fusion features corresponding to the voice emotion data; and inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics. The invention effectively fuses various features and improves the accuracy of speech emotion fusion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a speech emotion feature fusion method based on a main network and an auxiliary network in an embodiment of the present invention;
FIG. 2 is a structural diagram of a voice emotion feature fusion system based on a main network and an auxiliary network in the embodiment of the present invention;
FIG. 3 is a diagram of a speech emotion feature fusion network structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a voice emotion feature fusion method and system based on a main network and an auxiliary network, so as to improve the accuracy of speech emotion recognition.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Long short-term memory (LSTM): a recurrent neural network designed to overcome the long-term dependency problem of ordinary recurrent neural networks (RNNs).
Bidirectional long short-term memory (BLSTM): a bidirectional LSTM that can exploit past and future context at the same time.
Mel-frequency cepstral coefficients (MFCCs): a linear transform of the log energy spectrum based on the nonlinear mel scale of sound frequency.
Support vector machine (SVM): a generalized linear classifier that performs binary classification of data in a supervised learning manner; its decision boundary is the maximum-margin hyperplane of the learning samples.
Deep neural network (DNN): a perceptron has an input layer, an output layer and a hidden layer; input feature vectors are transformed by the hidden layer and reach the output layer, where the classification result is obtained. A deep neural network is a perceptron-style network with multiple hidden layers.
Hidden Markov model (HMM): a statistical model that describes a Markov process with hidden, unknown parameters.
Convolutional neural network (CNN): a class of feed-forward neural networks with a deep structure that include convolution computations; they have representation-learning ability and can perform translation-invariant classification of input information according to their hierarchical structure.
Attention mechanism: in cognitive science, because of bottlenecks in information processing, a human selectively attends to part of the available information while ignoring the rest.
Short-time Fourier transform (STFT): a Fourier-related transform used to determine the frequency and phase of local sections of a time-varying signal.
Spectrogram: a speech spectrogram whose abscissa is time, whose ordinate is frequency, and whose point values are the energy of the speech data.
Softmax: in probability theory and related fields, the normalized exponential function, a generalization of the logistic function; in effect it is a gradient-log normalization of a finite discrete probability distribution.
Principal component analysis (PCA): a statistical method that converts a group of possibly correlated variables into a group of linearly uncorrelated variables through an orthogonal transformation; the converted variables are called principal components.
Dropout: a method for optimizing artificial neural networks with a deep structure; during learning, part of the weights or outputs of the hidden layers are randomly zeroed, which reduces the interdependence among nodes, regularizes the neural network and lowers its structural risk.
Fig. 1 is a flowchart of a speech emotion feature fusion method based on a primary network and a secondary network in an embodiment of the present invention, and as shown in fig. 1, the present invention provides a speech emotion feature fusion method based on a primary network and a secondary network, where the method includes:
step S1: determining a training set and a test set;
step S2: determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
step S3: inputting a plurality of first features corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level features corresponding to the voice emotion data;
step S4: inputting the second characteristics corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level characteristics corresponding to the voice emotion data;
step S5: performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features in a main and auxiliary network mode, and determining main and auxiliary network fusion features corresponding to the voice emotion data;
step S6: inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
The following is a detailed discussion of the various steps:
step S1: determining a training set and a test set, specifically comprising:
step S11: determining a speech emotion database, which specifically comprises:
selecting five emotions from the Emo-DB data set to obtain a voice emotion database; the voice emotion database comprises 363 pieces of voice emotion data; the five emotions include happy, sad, angry, surprised, and neutral.
The Emo-DB data set consists of 535 utterances performed by 10 professional actors, covering seven emotions of neutrality, fear, joy, anger, sadness, disgust and boredom, and the sampling frequency of the voice emotion database is reduced from 44.1kHz to 16kHz.
Step S12: determining a standard database according to the voice emotion database, which specifically comprises the following steps: judging whether each piece of voice emotion data in the voice emotion database is larger than a set voice frame length; if the speech emotion data are larger than the set speech frame length, adopting truncation operation to enable the speech emotion data to be equal to the set speech frame length, and placing the processed speech emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted, the voice emotion data are enabled to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each voice emotion data is equal to the length of a set voice frame, directly putting each voice emotion data into the standard database.
Because different utterances contain different numbers of frames (the longest is 2441 frames and the shortest is 121 frames), truncation and zero-filling operations are adopted here to align the voice frame lengths. By observation, most utterances are below 1600 frames, so the set voice frame length is 1600: voices that are too short are zero-filled and voices that are too long are truncated.
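To make the alignment concrete, the sketch below pads or truncates a per-utterance frame matrix to the set length of 1600 frames. It is a minimal illustration only: the function name, the use of NumPy and the (frames, dimensions) layout are assumptions, not part of the original disclosure.

```python
import numpy as np

def align_frame_length(frames: np.ndarray, target_len: int = 1600) -> np.ndarray:
    """Truncate or zero-pad a (num_frames, feat_dim) matrix to exactly target_len frames."""
    num_frames, feat_dim = frames.shape
    if num_frames > target_len:
        # Truncation: keep only the first target_len frames.
        return frames[:target_len]
    if num_frames < target_len:
        # Zero filling: append all-zero frames up to the set voice frame length.
        padding = np.zeros((target_len - num_frames, feat_dim), dtype=frames.dtype)
        return np.vstack([frames, padding])
    return frames
```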
Step S13: performing feature extraction on each voice emotion data in the standard database to obtain a plurality of voice MFCC segment features and a plurality of voice global features, and specifically comprising:
step S131: performing MFCC frame feature extraction on each voice emotion data in the standard database to obtain a plurality of voice MFCC frame features;
step S132: after averaging the plurality of voice MFCC frame characteristics, obtaining a plurality of voice MFCC segment characteristics;
step S133: performing global feature extraction on each piece of voice emotion data in the standard database to obtain a plurality of voice global features. The global features comprise prosodic features (energy, speaking rate and zero-crossing rate), voice quality features represented by formants, and spectral features represented by MFCC; there are 98 global features in total, and they are reduced to 70 dimensions by PCA.
Step S14: respectively carrying out standardization processing on a plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic; the first feature is 40 x 60 dimensions and the second feature is 70 dimensions.
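The two feature streams described in steps S13 and S14 can be sketched roughly as follows. The use of librosa and scikit-learn, the grouping of frames into 40 equal segments, and the 60 MFCC coefficients per segment are illustrative assumptions chosen to match the stated 40 x 60 and 70 dimensions; the patent does not specify the toolchain.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def mfcc_segment_features(y, sr=16000, n_segments=40, n_mfcc=60):
    """Frame-level MFCCs averaged within equal segments -> (n_segments, n_mfcc) matrix."""
    frames = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T     # (num_frames, n_mfcc)
    segments = np.array_split(frames, n_segments, axis=0)          # group frames into segments
    return np.stack([seg.mean(axis=0) for seg in segments])        # average the frames per segment

def standardize_features(segment_feats, global_feats_98d):
    """Produce the first (40 x 60) and second (70-d) features for each utterance.

    segment_feats: (N, 40, 60) MFCC segment features; global_feats_98d: (N, 98) global features.
    """
    flat = segment_feats.reshape(len(segment_feats), -1)            # (N, 2400)
    first = StandardScaler().fit_transform(flat).reshape(segment_feats.shape)
    reduced = PCA(n_components=70).fit_transform(global_feats_98d)  # PCA: 98 -> 70 dimensions
    second = StandardScaler().fit_transform(reduced)
    return first, second
```

Averaging the frames inside each segment (step S132) keeps a 40-step temporal structure for the BLSTM while smoothing frame-level variation.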
Step S15: selecting the first features and second features corresponding to 243 pieces of the speech emotion data as the training set, and using the first features and second features corresponding to the remaining 120 pieces of speech emotion data as the test set.
Step S2: determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set, which specifically comprises the following steps:
step S21: determining a main network model parameter by using a plurality of first features corresponding to each piece of speech emotion data in the training set, specifically including:
step S211: keeping the auxiliary parameters and the auxiliary network model parameters unchanged, inputting a plurality of first features corresponding to each piece of speech emotion data in the training set into a main network, training by adopting a gradient descent algorithm to obtain main network model parameters, and determining a first cost function value by using a cost function formula;
step S212: judging whether the first cost function value is larger than or equal to a first set value or not; if the first cost function value is greater than or equal to the first set value, returning to step S211; and if the first cost function value is smaller than a first set value, outputting the main network model parameters.
Step S22: determining an auxiliary network model parameter and an auxiliary parameter by using a plurality of second features corresponding to each piece of speech emotion data in the training set, specifically comprising:
step S221: keeping the parameters of the main network model unchanged, inputting a plurality of second characteristics corresponding to the speech emotion data in the training set into an auxiliary network, training by adopting a gradient descent algorithm to obtain auxiliary network model parameters and auxiliary parameters, and determining a second cost function value by utilizing a cost function formula;
step S222: judging whether the second cost function value is larger than or equal to a second set value; if the second cost function value is greater than or equal to the second set value, returning to step S221; and if the second cost function value is smaller than the second set value, outputting the auxiliary network model parameters and the auxiliary parameters.
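Steps S21 and S22 thus alternate between the two sub-networks: first the main network parameters are trained with the auxiliary parameters frozen, then the auxiliary network parameters and the auxiliary parameter W are trained with the main network frozen, each phase repeating until its cost function falls below the set value. A minimal PyTorch-style sketch of this loop is given below; the model objects, the cost thresholds, plain SGD and the data loader are assumptions made for illustration.

```python
import torch

def train_alternating(main_net, aux_net, W, loader, criterion,
                      first_set_value=0.1, second_set_value=0.1, lr=1e-3):
    """Alternating training of step S21 (main network) and step S22 (auxiliary network and W)."""

    def run_epoch(trainable_params):
        # Plain gradient descent over one pass through the training set.
        opt = torch.optim.SGD(trainable_params, lr=lr)
        total = 0.0
        for mfcc_segments, global_feats, labels in loader:
            opt.zero_grad()
            out = main_net(mfcc_segments, aux_net(global_feats), W)
            loss = criterion(out, labels)
            loss.backward()
            opt.step()
            total += loss.item()
        return total / len(loader)                     # cost function value for this pass

    # Step S21: keep the auxiliary parameters and auxiliary network parameters unchanged.
    for p in aux_net.parameters():
        p.requires_grad_(False)
    W.requires_grad_(False)
    while run_epoch(list(main_net.parameters())) >= first_set_value:
        pass                                           # repeat until the first cost < set value

    # Step S22: keep the main network model parameters unchanged.
    for p in main_net.parameters():
        p.requires_grad_(False)
    for p in aux_net.parameters():
        p.requires_grad_(True)
    W.requires_grad_(True)
    while run_epoch(list(aux_net.parameters()) + [W]) >= second_set_value:
        pass                                           # repeat until the second cost < set value
```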
Step S3: inputting a plurality of first features corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level features corresponding to the voice emotion data; the main network high-level feature is 160-dimensional.
As shown in fig. 3, the network structure is divided into two parts: a main network model and an auxiliary network model. The main network model is itself divided into two parts: M_D denotes the lower half of the main network model, which consists of two bidirectional long short-term memory (BLSTM) layers with an attention mechanism, each layer having 160 hidden neurons; M_U denotes the upper half of the main network model, which consists of two fully connected (DNN) layers, each with 100 neurons. e_0 denotes the normalized speech MFCC segment features input to the main network model, and h_l denotes the output of the last hidden layer of the lower half of the main network model after the adaptive time pooling algorithm. The auxiliary network model consists of two fully connected (DNN) layers, with 200 neurons in the first layer and 100 neurons in the second layer. v_0 denotes the normalized speech global features input to the auxiliary network model, and v_m denotes the output of the last hidden layer of the auxiliary network model. h_l and W·v_m are spliced together and input into the upper half M_U of the main network model. The auxiliary parameter W makes it possible to train the main and auxiliary networks separately, and optimizing W by the gradient descent algorithm adjusts the weighting assigned to the main network features and the auxiliary network features.
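A compact PyTorch sketch of this layout is shown below. PyTorch itself, the ReLU activations, treating W as a single scalar weight, and a per-direction hidden size of 80 (so that the pooled main network feature is 160-dimensional) are assumptions layered on the description above; the pooling argument is meant to be the adaptive time pooling module sketched after step S32.

```python
import torch
import torch.nn as nn

class AuxiliaryNetwork(nn.Module):
    """Auxiliary network: two fully connected layers with 200 and 100 neurons."""
    def __init__(self, in_dim=70):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 200), nn.ReLU(),
                                 nn.Linear(200, 100), nn.ReLU())

    def forward(self, v0):                  # v0: (batch, 70) normalized global features
        return self.net(v0)                 # v_m: (batch, 100) auxiliary high-level feature

class MainNetwork(nn.Module):
    """Main network: M_D = two BLSTM layers plus attention pooling, M_U = two DNN layers."""
    def __init__(self, pooling: nn.Module, in_dim=60, hidden=80, aux_dim=100, n_classes=5):
        super().__init__()
        # Two bidirectional LSTM layers; each time step outputs 2 * hidden values (160 here).
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.pool = pooling                 # adaptive time pooling, sketched after step S32
        self.upper = nn.Sequential(nn.Linear(2 * hidden + aux_dim, 100), nn.ReLU(),
                                   nn.Linear(100, n_classes))

    def forward(self, e0, v_m, W):          # e0: (batch, 40, 60) MFCC segment features
        h, _ = self.blstm(e0)               # (batch, T, 2 * hidden)
        h_l = self.pool(h)                  # main network high-level feature
        fused = torch.cat([h_l, W * v_m], dim=-1)   # main-auxiliary network fusion feature
        return self.upper(fused)            # output of the upper half M_U
```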
Step S31: inputting the 40 × 60-dimensional first feature into the lower half M_D of the main network model to obtain an output result;
Step S32: applying the adaptive time pooling algorithm to the output result for training, and finally outputting the 160-dimensional main network high-level feature. The adaptive time pooling algorithm is calculated as:
h_l = Σ_{t=1}^{T} α_t · h_t
wherein h_t is the output at time t after the second BLSTM layer, h_t ∈ R^D, D denotes the dimension of h_t and is equal to 80, T is the BLSTM network time step and is equal to 40, and α_t is a weighting coefficient obtained through network learning, specifically calculated by the following formula:
α_t = softmax(β_t)
wherein β_t is the attention score computed from h_t, σ is a nonlinear mapping function such as the Sigmoid function, W_β and U_β are respectively coefficient matrices, γ_β is a coefficient vector, γ_β^T denotes the transpose of γ_β, and W_β, U_β and γ_β are network learning parameters, obtained by random initialization from a truncated normal distribution and then trained and optimized together with the recognition network by the gradient descent algorithm.
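The pooling layer can be sketched as follows. Because the score β_t appears only as an image in the original document, the exact way W_β, U_β, γ_β and σ combine is an assumption here; the module does implement h_l = Σ_t α_t h_t with α_t = softmax(β_t) and the truncated-normal initialization described above.

```python
import torch
import torch.nn as nn

class AdaptiveTimePooling(nn.Module):
    """Attention-weighted pooling over the T BLSTM time steps: h_l = sum_t alpha_t * h_t."""
    def __init__(self, dim):
        super().__init__()
        self.W_beta = nn.Linear(dim, dim, bias=False)       # coefficient matrix W_beta
        self.U_beta = nn.Linear(dim, dim, bias=False)       # coefficient matrix U_beta
        self.gamma_beta = nn.Parameter(torch.empty(dim))    # coefficient vector gamma_beta
        # Random initialization from a truncated normal distribution, as described above.
        nn.init.trunc_normal_(self.gamma_beta, std=0.1)

    def forward(self, h):                                   # h: (batch, T, dim)
        # Per-step score; the original beta_t formula is given only as an image, so this
        # particular combination of W_beta, U_beta, gamma_beta and the Sigmoid is assumed.
        scores = torch.sigmoid(self.W_beta(h) + self.U_beta(h))     # (batch, T, dim)
        beta = scores @ self.gamma_beta                              # (batch, T)
        alpha = torch.softmax(beta, dim=1)                           # alpha_t = softmax(beta_t)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)                  # h_l: (batch, dim)
```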
Step S4: inputting the second characteristics corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level characteristics corresponding to the voice emotion data; the second characteristic is 70-dimensional, and the auxiliary network high-level characteristic is 100-dimensional.
Step S5: performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features in a main and auxiliary network mode, and determining the main and auxiliary network fusion features corresponding to the voice emotion data, with the specific formula as follows:
main-auxiliary network fusion feature = [h_l, W·v_M]
wherein [·, ·] denotes splicing (concatenation), h_l is the main network high-level feature, W is the auxiliary parameter, and v_M is the auxiliary network high-level feature.
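Putting the sketches together, a single hedged forward pass over a dummy batch illustrates how the fusion feature [h_l, W·v_M] is formed inside the main network and then classified by its upper half M_U; all class names come from the illustrative sketches above, not from the patent.

```python
import torch

# Dummy batch: 8 utterances, 40 x 60 MFCC segment features and 70-d global features.
e0 = torch.randn(8, 40, 60)
v0 = torch.randn(8, 70)

aux_net = AuxiliaryNetwork()
main_net = MainNetwork(pooling=AdaptiveTimePooling(160))
W = torch.tensor(0.5, requires_grad=True)   # auxiliary parameter, assumed scalar here

v_m = aux_net(v0)                  # auxiliary network high-level feature, 100-d
scores = main_net(e0, v_m, W)      # concatenates [h_l, W * v_m] and applies the upper half
print(scores.shape)                # torch.Size([8, 5]) -> scores over the five emotions
```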
Step S6: and inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
Fig. 2 is a structural diagram of a voice emotion feature fusion system based on a primary network and a secondary network in an embodiment of the present invention, and as shown in fig. 2, the present invention further provides a voice emotion feature fusion system based on a primary network and a secondary network, where the system includes:
the set determining module 1 is used for determining a training set and a testing set;
the model determining module 2 is used for determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
a main network high-level feature determining module 3, configured to input the multiple first features corresponding to the voice emotion data in the test set into a lower half of the main network model with parameters, and obtain a main network high-level feature corresponding to the voice emotion data;
an auxiliary network high-level feature determining module 4, configured to input the second feature corresponding to each piece of speech emotion data in the test set into an auxiliary network model with parameters, so as to obtain an auxiliary network high-level feature corresponding to each piece of speech emotion data;
a main and auxiliary network fusion characteristic determining module 5, configured to perform characteristic fusion on the main network high-level characteristic, the auxiliary parameter, and the auxiliary network high-level characteristic in a main and auxiliary network manner, and determine a main and auxiliary network fusion characteristic corresponding to each piece of speech emotion data;
and the fusion characteristic determining module 6 is used for inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
The various modules are discussed in detail below:
as an embodiment, the set determining module of the present invention specifically includes:
the voice emotion database determining unit is used for determining a voice emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
the standard database determining unit is used for determining a standard database according to the voice emotion database;
the feature extraction unit is used for performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features;
the normalization processing unit is used for respectively normalizing the plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic;
the set determining unit is used for selecting 243 corresponding first features and second features of the voice emotion data as training sets; and taking the first characteristic and the second characteristic corresponding to the remaining 120 pieces of the speech emotion data as a test set.
As an embodiment, the feature extraction unit of the present invention specifically includes:
the first extraction subunit is used for performing MFCC frame feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC frame features;
the average processing subunit is configured to perform average processing on the multiple voice MFCC frame features to obtain multiple voice MFCC segment features;
and the second extraction subunit is used for carrying out global feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech global features.
As an embodiment, the standard database determining unit of the present invention specifically includes:
the judging subunit is used for judging whether each piece of voice emotion data in the voice emotion database is larger than a set voice frame length; if the voice emotion data are larger than the set voice frame length, adopting truncation operation to enable the voice emotion data to be equal to the set voice frame length, and putting the processed voice emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted, the voice emotion data are enabled to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each piece of voice emotion data is equal to the length of a set voice frame, directly putting each piece of voice emotion data into the standard database.
As an embodiment, the model determining module 2 of the present invention specifically includes:
a main network model parameter determining unit, configured to determine a main network model parameter by using a plurality of first features corresponding to each piece of speech emotion data in the training set;
and the auxiliary network model parameter determining unit is used for determining auxiliary network model parameters and auxiliary parameters by utilizing the plurality of second characteristics corresponding to the voice emotion data in the training set.
As an implementation manner, the master network model parameter determining unit of the present invention specifically includes:
a main network model parameter determining subunit, configured to keep the auxiliary parameters and the auxiliary network model parameters unchanged, input the plurality of first features corresponding to each piece of speech emotion data in the training set into a main network, and perform training by using a gradient descent algorithm to obtain a main network model parameter;
a first cost function value stator unit for determining a first cost function value using a cost function formula;
the first judgment subunit is used for judging whether the first cost function value is greater than or equal to a first set value or not; if the first cost function value is larger than or equal to a first set value, returning to a main network model parameter determination subunit; and if the first cost function value is smaller than a first set value, outputting the main network model parameter.
As an embodiment, the auxiliary network model parameter determining unit of the present invention specifically includes:
an auxiliary network model parameter determining subunit, configured to keep the main network model parameter unchanged, input the plurality of second features corresponding to each piece of speech emotion data in the training set into an auxiliary network, and perform training by using a gradient descent algorithm to obtain an auxiliary network model parameter and an auxiliary parameter;
a second cost function value stator unit for determining a second cost function value using the cost function formula;
a second judging subunit, configured to judge whether the second cost function value is greater than or equal to a second set value; if the second cost function value is larger than or equal to a second set value, returning to an auxiliary network model parameter determination subunit; and if the second cost function value is smaller than a second set value, outputting the auxiliary network model parameter and the auxiliary parameter.
Simulation verification
In order to verify the effectiveness of the speech emotion feature fusion method provided by the invention, comparative experiments were performed on the Emo_DB data set, with the results shown in Table 1.
Table 1: recognition accuracy of different network structures on the Emo_DB data set
(Table 1 compares the recognition accuracy of BLSTM+ATP with MFCC speech segment features, DNN with global features, the direct concatenation of BLSTM+ATP and DNN, and the proposed main-auxiliary network structure.)
The BLSTM + ATP (MFCC speech segment feature) network structure is a BLSTM network added with an adaptive time pooling algorithm, and takes the speech MFCC segment features after standardized processing as input features.
The DNN (global feature) network structure is a multi-layer fully-connected network, and takes the global features after the normalization process as input features.
The BLSTM+ATP and DNN concatenation network structure directly splices together the high-level features obtained by feeding the normalized speech MFCC segment features into the BLSTM+ATP network and the high-level features obtained by feeding the normalized global features into the DNN network, and inputs the result into the classifier; the two networks are trained simultaneously, with no distinction between primary and secondary.
The last entry is the network structure of the method proposed by the present invention, whose accuracy is 89.84%. As can be seen from Table 1, the accuracy of the direct-splicing feature fusion method is higher than that of recognition with a single type of feature, but not significantly so; the recognition accuracy of feature fusion through the main and auxiliary networks is markedly higher than that of direct splicing, which verifies that the feature fusion method provided by the invention can effectively improve the accuracy of the speech emotion recognition system compared with direct-splicing feature fusion.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A voice emotion feature fusion method based on a main network and an auxiliary network is characterized by comprising the following steps:
step S1: determining a training set and a test set;
step S2: determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
the main network model comprises an upper half part of the main network model and a lower half part of the main network model, and the lower half part of the main network model comprises two layers of bidirectional long-time and short-time memory unit BLSTM networks added with an attention mechanism and an adaptive time pooling algorithm; the upper half part of the main network model comprises two layers of multi-layer fully connected layers DNN;
the auxiliary network model comprises two layers of fully connected layers DNN;
the calculation formula of the self-adaptive time pooling algorithm is as follows:
h_l = Σ_{t=1}^{T} α_t · h_t
wherein h_t is the output at time t after the second BLSTM layer, h_t ∈ R^D, D denotes the dimension of h_t and is equal to 80, T is the BLSTM network time step and is equal to 40, and α_t is a weighting coefficient obtained through network learning, specifically calculated by the following formula:
α_t = softmax(β_t)
wherein β_t is the attention score computed from h_t, σ is the Sigmoid function, W_β and U_β are respectively coefficient matrices, γ_β is a coefficient vector, γ_β^T denotes the transpose of γ_β, and W_β, U_β and γ_β are all network learning parameters, which are obtained by random initialization through a truncated normal distribution and then trained and optimized together with the recognition network through a gradient descent algorithm;
and step S3: inputting a plurality of first characteristics corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level characteristics corresponding to the voice emotion data;
and step S4: inputting the second characteristics corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level characteristics corresponding to the voice emotion data;
step S5: performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features, and determining main and auxiliary network fusion features corresponding to the voice emotion data;
step S6: inputting the main network and auxiliary network fusion characteristics corresponding to each piece of voice emotion data into the upper half part of a main network model with parameters to obtain fusion characteristics;
determining a training set and a test set, specifically comprising:
step S11: determining a speech emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
step S12: determining a standard database according to the voice emotion database;
step S13: performing feature extraction on each voice emotion data in the standard database to obtain a plurality of voice MFCC segment features and a plurality of voice global features;
step S14: respectively carrying out standardization processing on a plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic;
step S15: selecting 243 corresponding first features and second features of the voice emotion data as a training set; using the first characteristics and the second characteristics corresponding to the remaining 120 pieces of speech emotion data as a test set;
the global features of the speech comprise prosodic features including energy, speech speed and zero-crossing rate, voice quality features represented by formants and spectral features represented by MFCC, the total number of the global features of the speech is 98, and the global features of the speech are converted into 70 dimensions after PCA processing.
2. The method as claimed in claim 1, wherein the method for fusing speech emotion features based on primary and secondary networks, which is used for performing feature extraction on each speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features, specifically comprises:
step S131: performing MFCC frame feature extraction on each voice emotion data in the standard database to obtain a plurality of voice MFCC frame features;
step S132: after averaging the plurality of voice MFCC frame characteristics, obtaining a plurality of voice MFCC segment characteristics;
step S133: and carrying out global feature extraction on each voice emotion data in the standard database to obtain a plurality of voice global features.
3. The main-auxiliary network-based speech emotion feature fusion method according to claim 1, wherein the determining a standard database according to the speech emotion database specifically comprises:
step S121: judging whether each piece of voice emotion data in the voice emotion database is larger than a set voice frame length; if the voice emotion data are larger than the set voice frame length, adopting truncation operation to enable the voice emotion data to be equal to the set voice frame length, and putting the processed voice emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted to enable the voice emotion data to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each voice emotion data is equal to the length of a set voice frame, directly putting each voice emotion data into the standard database.
4. The method for fusing speech emotion characteristics based on primary and secondary networks as claimed in claim 1, wherein the determining of the primary network model with parameters, the secondary network model with parameters and the secondary parameters by using the training set specifically comprises:
step S21: determining a main network model parameter by using a plurality of first characteristics corresponding to each piece of speech emotion data in the training set;
step S22: and determining auxiliary network model parameters and auxiliary parameters by using a plurality of second characteristics corresponding to each piece of speech emotion data in the training set.
5. The main-auxiliary network-based voice emotion feature fusion method of claim 4, wherein the determining of main network model parameters by using the plurality of first features corresponding to each piece of voice emotion data in the training set specifically comprises:
step S211: keeping the auxiliary parameters and the auxiliary network model parameters unchanged, inputting a plurality of first characteristics corresponding to each voice emotion data in the training set into a main network, training by adopting a gradient descent algorithm to obtain main network model parameters, and determining a first cost function value by utilizing a cost function formula;
step S212: judging whether the first cost function value is larger than or equal to a first set value or not; if the first cost function value is greater than or equal to the first set value, returning to step S211; and if the first cost function value is smaller than a first set value, outputting the main network model parameter.
6. The method as claimed in claim 4, wherein the determining of the auxiliary network model parameters and the auxiliary parameters by using the plurality of second features corresponding to each piece of speech emotion data in the training set specifically comprises:
step S221: keeping the main network model parameters unchanged, inputting a plurality of second characteristics corresponding to each piece of speech emotion data in the training set into an auxiliary network, training by adopting a gradient descent algorithm to obtain auxiliary network model parameters and auxiliary parameters, and determining a second cost function value by using a cost function formula;
step S222: judging whether the second cost function value is larger than or equal to a second set value; if the second cost function value is greater than or equal to the second set value, returning to step S221; and if the second cost function value is smaller than the second set value, outputting the auxiliary network model parameters and the auxiliary parameters.
7. A voice emotion feature fusion system based on a main network and an auxiliary network is characterized by comprising:
the set determining module is used for determining a training set and a testing set;
the model determining module is used for determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by utilizing the training set; the main network high-level feature determination module is used for inputting a plurality of first features corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level features corresponding to the voice emotion data;
the main network model comprises an upper half part and a lower half part; the lower half part of the main network model comprises two bidirectional long short-term memory (BLSTM) layers with an attention mechanism and an adaptive time pooling algorithm, and the upper half part of the main network model comprises two fully connected (DNN) layers;
the auxiliary network model comprises two fully connected (DNN) layers;
the adaptive time pooling algorithm pools the outputs of the second BLSTM layer over time: the output of the second BLSTM layer at each time step, of dimension D equal to 80, is weighted by a learned coefficient, and the weighted outputs are combined over the BLSTM time steps t, the number of time steps being equal to 40, to give the pooled feature; each weighting coefficient is obtained through network learning from the corresponding BLSTM output by means of a Sigmoid function, two coefficient matrices and a coefficient vector applied with a transposition operation; the coefficient matrices and the coefficient vector are all network learning parameters, randomly initialized from a truncated normal distribution and then trained and optimized together with the recognition network through a gradient descent algorithm (a sketch of this pooling is given after this claim);
the auxiliary network high-level feature determining module is used for inputting second features corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level features corresponding to the voice emotion data;
the main and auxiliary network fusion feature determining module is used for performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features, and determining a main and auxiliary network fusion feature corresponding to each piece of voice emotion data (a fusion sketch is given after this claim);
the fusion feature determining module is used for inputting the main and auxiliary network fusion feature corresponding to each piece of voice emotion data into the upper half part of the main network model with parameters to obtain fusion features;
the set determining module specifically includes:
the voice emotion database determining unit is used for determining a voice emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
the standard database determining unit is used for determining a standard database according to the voice emotion database;
the feature extraction unit is used for performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features;
the normalization processing unit is used for respectively normalizing the plurality of voice MFCC segment features and the voice global features corresponding to each piece of voice emotion data to obtain the first features and the second features;
the set determining unit is used for selecting the first features and the second features corresponding to 243 pieces of voice emotion data as the training set, and taking the first features and the second features corresponding to the remaining 120 pieces of voice emotion data as the test set;
the voice global features comprise prosodic features (energy, speech rate and zero-crossing rate), voice quality features represented by formants, and spectral features represented by MFCC; the voice global features total 98 dimensions and are reduced to 70 dimensions after PCA processing.
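The claim describes the adaptive time pooling only in terms of its ingredients (a Sigmoid function, two coefficient matrices, a coefficient vector and a transposition), so the sketch below is one plausible reading of it: the T = 40 outputs of the second BLSTM layer (dimension D = 80) are combined with learned Sigmoid weighting coefficients. The mean-pooled context query, the tanh nonlinearity, the 39-dimensional MFCC input and the hidden size are assumptions, not the patent's literal equations.

import torch
import torch.nn as nn

class AdaptiveTimePooling(nn.Module):
    """Weighted pooling over BLSTM time steps with learned Sigmoid coefficients
    (one plausible reading of the claim, not its literal formula)."""
    def __init__(self, dim: int = 80):
        super().__init__()
        # Two coefficient matrices and one coefficient vector, randomly initialized
        # from a truncated normal distribution as stated in the claim.
        self.W1 = nn.Parameter(nn.init.trunc_normal_(torch.empty(dim, dim), std=0.02))
        self.W2 = nn.Parameter(nn.init.trunc_normal_(torch.empty(dim, dim), std=0.02))
        self.v = nn.Parameter(nn.init.trunc_normal_(torch.empty(dim), std=0.02))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, D) outputs of the second BLSTM layer, T = 40, D = 80
        context = h.mean(dim=1, keepdim=True)                           # assumed query term
        scores = torch.tanh(h @ self.W1 + context @ self.W2) @ self.v   # (batch, T)
        alpha = torch.sigmoid(scores).unsqueeze(-1)                     # Sigmoid weighting coefficients
        return (alpha * h).sum(dim=1)                                   # weighted sum over time

class MainNetworkLowerHalf(nn.Module):
    """Two BLSTM layers followed by the pooling above; outputs the main network
    high-level feature used by the fusion module."""
    def __init__(self, in_dim: int = 39, hidden: int = 40):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.pool = AdaptiveTimePooling(dim=2 * hidden)                 # 2 * 40 = 80, matching D

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.blstm(x)                                            # (batch, T, 80)
        return self.pool(h)

main_high = MainNetworkLowerHalf()(torch.randn(8, 40, 39))              # (batch 8, feature dim 80)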
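The fusion and upper-half modules of the same claim combine the main network high-level features, the auxiliary parameters and the auxiliary network high-level features, but the fusion operator itself is not spelled out here. A minimal sketch under the assumption that the auxiliary parameter scales the auxiliary high-level feature before concatenation with the main one; the layer widths, the scalar auxiliary parameter and the four emotion classes are likewise assumptions.

import torch
import torch.nn as nn

class AuxiliaryNetwork(nn.Module):
    """Two fully connected (DNN) layers over the 70-dimensional global (second) features."""
    def __init__(self, in_dim: int = 70, hidden: int = 64, out_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MainNetworkUpperHalf(nn.Module):
    """Two fully connected (DNN) layers mapping the fused feature to emotion classes."""
    def __init__(self, in_dim: int = 160, hidden: int = 64, n_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def fuse(main_high, aux_high, aux_param):
    """Assumed fusion: weight the auxiliary high-level feature by the auxiliary
    parameter, then concatenate it with the main network high-level feature."""
    return torch.cat([main_high, aux_param * aux_high], dim=-1)

aux_param = nn.Parameter(torch.tensor(0.5))        # trainable auxiliary parameter (scalar by assumption)
main_high = torch.randn(8, 80)                     # e.g. output of MainNetworkLowerHalf above
aux_high = AuxiliaryNetwork()(torch.randn(8, 70))  # auxiliary network high-level features
fused = fuse(main_high, aux_high, aux_param)       # (8, 160) main and auxiliary network fusion feature
print(MainNetworkUpperHalf()(fused).shape)         # torch.Size([8, 4]) fusion features / logits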
8. The voice emotion feature fusion system based on main and auxiliary networks as claimed in claim 7, wherein the feature extraction unit specifically comprises:
the first extraction subunit is used for performing MFCC frame feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC frame features;
the average processing subunit is used for averaging the plurality of voice MFCC frame features to obtain a plurality of voice MFCC segment features;
and the second extraction subunit is used for carrying out global feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech global features.
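A sketch of the frame-to-segment extraction described by the subunits above, assuming librosa for the MFCC frame features and simple block averaging of consecutive frames into segment features; the 16 kHz sampling rate, the 13 coefficients and the 10-frame segment length are assumptions, as the claim fixes none of them.

import numpy as np
import librosa

def mfcc_segment_features(signal: np.ndarray, sr: int = 16000,
                          n_mfcc: int = 13, frames_per_segment: int = 10) -> np.ndarray:
    """Voice MFCC frame features -> averaged voice MFCC segment features."""
    # Frame-level MFCC features, shape (n_mfcc, n_frames)
    frames = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Keep only whole segments so the frame count divides evenly
    n_segments = frames.shape[1] // frames_per_segment
    frames = frames[:, :n_segments * frames_per_segment]
    # Average each run of consecutive frames into one segment feature: (n_segments, n_mfcc)
    return frames.reshape(n_mfcc, n_segments, frames_per_segment).mean(axis=2).T

segments = mfcc_segment_features(np.random.randn(64000).astype(np.float32))
print(segments.shape)   # (number of segments, 13)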

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant