CN115240649B - Voice recognition method and system based on deep learning - Google Patents

Voice recognition method and system based on deep learning

Info

Publication number
CN115240649B
CN115240649B (application CN202210871932.8A)
Authority
CN
China
Prior art keywords
emotion
features
domain
feature
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210871932.8A
Other languages
Chinese (zh)
Other versions
CN115240649A (en)
Inventor
于振华 (Yu Zhenhua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202210871932.8A
Publication of CN115240649A
Application granted
Publication of CN115240649B
Legal status: Active (current)

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for measuring the quality of voice signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides a speech recognition method based on deep learning. The method first acquires a speech signal and preprocesses it to obtain a speech spectrum feature representation; extracts blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performs unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, convolves and pools the whole speech spectrum input with these kernels, and stacks the pooled features of different sizes to obtain rough features; inputs the rough features obtained by unsupervised feature learning into a semi-supervised learning framework to learn emotion-related features and emotion-irrelevant features; performs emotion label prediction and domain label prediction on the emotion-related features; and trains a classifier with the high-level emotion features of the source domain and the corresponding emotion labels to obtain a trained classifier. The method can reduce the influence of emotion-irrelevant factors in feature learning and can alleviate the problem of insufficient or missing target-domain samples.

Description

Voice recognition method and system based on deep learning
Technical Field
The invention relates to the field of voice recognition, in particular to a voice recognition method and system based on deep learning.
Background
Speech emotion recognition refers to recognizing the emotional state of a speaker from speech, and is one of the most challenging tasks in the field of speech technology. With the wide application of voice interaction technology, speech emotion recognition, which can make robots more human-like, has broad application prospects and commercial value.
With the development of deep learning in recent years, many achievements have appeared in the field of speech emotion recognition. Even so, current speech emotion recognition technology still faces many difficulties: high-dimensional emotion features are difficult to extract manually, the amount of emotional speech data is small, labeling is laborious, and the training data and the test data have different distributions, all of which result in poor model recognition accuracy.
Disclosure of Invention
The invention mainly aims to overcome the defects in the prior art, and provides a voice recognition method and system based on deep learning.
The technical scheme of the invention is as follows:
a speech recognition method based on deep learning comprises the following steps:
acquiring a voice signal, and preprocessing the voice signal to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction;
extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction;
and training the classifier by using the high-level emotion features of the source domain and the corresponding emotion labels to obtain the trained classifier.
Specifically, the kernels are the weights and biases of the encoder.
Specifically, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function;
the reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction;
the orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature, f_{j}^{o}(y) is the j-th emotion-irrelevant feature, and the partial derivatives are their sensitivity vectors with respect to the input;
the discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label;
the verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
Specifically, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are carried out; the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term.
An embodiment of the present invention further provides a speech recognition system based on deep learning, including:
a voice preprocessing unit: acquiring a voice signal, and preprocessing the voice signal to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction;
a coarse feature acquisition unit: extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
an emotion-related feature acquisition unit: inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
an emotion label and domain label prediction unit: carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction;
a classifier training unit: training the classifier by using the high-level emotion features of the source domain and the corresponding emotion labels to obtain the trained classifier.
Specifically, in the coarse feature acquisition unit, the kernels are the weights and biases of the encoder.
Specifically, in the emotion-related feature acquisition unit, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function;
the reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction;
the orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature, f_{j}^{o}(y) is the j-th emotion-irrelevant feature, and the partial derivatives are their sensitivity vectors with respect to the input;
the discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label;
the verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
Specifically, in the emotion label and domain label prediction unit, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are carried out; the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term.
Yet another embodiment of the present invention provides an electronic device, including: the system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the speech recognition method based on deep learning when executing the computer program.
Yet another embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above-mentioned voice recognition method based on deep learning.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
the invention provides a speech recognition method based on deep learning, which comprises the steps of firstly acquiring a speech signal, preprocessing the speech signal to obtain a speech spectrum feature representation, wherein the preprocessing comprises but is not limited to the following steps: pre-emphasis, framing, windowing, fourier change and PCA dimension reduction; extracting blocks with different sizes from the speech spectrum feature representation of each label-free training sample, performing unsupervised pre-training by adopting CAE (computer aided engineering) to obtain kernels with different sizes, performing convolution and pooling on the whole speech spectrum input by the kernels with different sizes, and stacking the pooled features with different sizes to obtain rough features; inputting rough features obtained by unsupervised feature learning into a semi-supervised learning frame, setting the rough features as emotion related features and emotion irrelevant features, reconstructing the jointly input rough features, orthogonalizing sensitivity vectors of the emotion related features and sensitivity vectors of the emotion irrelevant features, and performing category prediction on the emotion related features through sigmoid mapping so as to learn the emotion related features and the emotion irrelevant features; carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction; training a classifier by using high-level emotion features of the source domain and corresponding emotion labels to obtain a trained classifier; according to the method provided by the invention, the voice features are divided into the emotion related features and the emotion unrelated features, so that the influence of emotion unrelated factors in feature learning is reduced, emotion label prediction and domain label prediction are carried out, and the problem of insufficient or lack of target domain samples is solved.
Drawings
FIG. 1 is a flowchart of a method for speech recognition based on deep learning according to an embodiment of the present invention;
FIG. 2 is a diagram of a semi-supervised feature learning framework provided by embodiments of the present invention;
FIG. 3 is a diagram of a classification and domain invariant feature learning model provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a deep learning based speech recognition system according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the present invention;
fig. 6 is a schematic diagram of an embodiment of a computer-readable storage medium according to an embodiment of the present invention.
The invention is described in further detail below with reference to the figures and specific examples.
Detailed Description
According to the voice recognition method based on deep learning, the voice features are divided into the emotion-related features and the emotion-unrelated features, so that the influence of emotion-unrelated factors in feature learning is reduced, emotion label prediction and domain label prediction are performed, and the problem of insufficient or lack of target domain samples is solved.
It is noted that, in this document, relational terms such as "first" and "second", and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The technical scheme of the invention is as follows:
a speech recognition method based on deep learning comprises the following steps:
s101: acquiring a voice signal, and preprocessing the voice signal to obtain a voice spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, fourier change and PCA dimension reduction;
for the acquired audio signal, firstly, the audio signal is converted into a digital signal, then, the speech spectrum characteristic is obtained through pre-emphasis, framing, windowing and Fourier change, and then, the dimensionality reduction is carried out on the frequency domain through PCA, so that the speech spectrum characteristic representation of an audio sample is extracted.
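The following Python sketch illustrates one way such a spectro-temporal representation can be produced; it is an illustration rather than the patent's reference implementation, and the frame length, hop size, window type and number of PCA components are assumed values.

```python
# Illustrative sketch: spectrogram extraction with pre-emphasis, framing,
# Hamming windowing, Fourier transform and PCA dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA

def spectrogram_features(signal, frame_len=400, hop=160,
                         pre_emphasis=0.97, n_components=40):
    # Pre-emphasis: boost high frequencies to balance the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Framing with a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum of each frame (Fourier transform).
    spec = np.abs(np.fft.rfft(frames, axis=1))      # (n_frames, frame_len//2 + 1)
    log_spec = np.log1p(spec)
    # PCA along the frequency axis to reduce dimensionality.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(log_spec)              # (n_frames, n_components)

# Example: one second of synthetic audio at 16 kHz gives a (frames, 40) representation.
feats = spectrogram_features(np.random.randn(16000))
```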
S102: extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
Specifically, the kernels are the weights and biases of the encoder.
Unsupervised feature learning can learn features hidden in the data from unlabeled samples; here, unsupervised pre-training is performed with a contractive auto-encoder (CAE).
The pre-training of the CAE yields the weights and biases of the encoder, namely the kernel (U, c). For an input x, the kernel (U, c) produces the hidden-layer feature f(x) ∈ R^K, where K is the number of hidden-layer features. For the speech spectrum input, the kernel is applied to many small patches to obtain a new feature representation of the spectrum: patches of different sizes are extracted from the input data and pre-trained to obtain kernels of different sizes, the whole speech spectrum input is convolved and pooled with the kernels of different sizes, and finally the pooled features of different sizes are stacked to obtain the rough features.
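As an illustrative sketch only (the patch sizes, number of hidden units, pooling operator and PyTorch framing are assumptions; the patent fixes the procedure only at the level described above), a contractive auto-encoder can be pre-trained on spectrogram patches and its encoder weights reused as convolution kernels:

```python
# Illustrative sketch: CAE pre-training on patches, then multi-scale
# convolution and pooling over the whole spectrogram to form rough features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAE(nn.Module):
    def __init__(self, patch_dim, n_hidden):
        super().__init__()
        self.enc = nn.Linear(patch_dim, n_hidden)
        self.dec = nn.Linear(n_hidden, patch_dim)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))
        return h, self.dec(h)

    def loss(self, x, lam=0.1):
        h, x_hat = self(x)
        recon = F.mse_loss(x_hat, x)
        # Contractive penalty: squared Frobenius norm of dh/dx for a sigmoid encoder.
        dh = h * (1 - h)                                  # (batch, n_hidden)
        w_sq = (self.enc.weight ** 2).sum(dim=1)          # (n_hidden,)
        contractive = ((dh ** 2) * w_sq).sum(dim=1).mean()
        return recon + lam * contractive

def coarse_features(spectrogram, caes, patch_sizes, pool=8):
    """Convolve the spectrogram with kernels of several patch sizes,
    pool each feature map, and stack the pooled features (rough features)."""
    x = spectrogram.unsqueeze(0).unsqueeze(0)             # (1, 1, F, T)
    pooled = []
    for cae, p in zip(caes, patch_sizes):
        k = cae.enc.weight.view(-1, 1, p, p)              # kernel = encoder weights
        fmap = torch.sigmoid(F.conv2d(x, k, bias=cae.enc.bias))
        pooled.append(F.adaptive_max_pool2d(fmap, pool).flatten())
    return torch.cat(pooled)                              # stacked rough feature

# Example: two patch sizes (6x6 and 10x10), 16 kernels each; each CAE would first
# be pre-trained on flattened patches of its size by minimising cae.loss(patches).
caes = [CAE(6 * 6, 16), CAE(10 * 10, 16)]
rough = coarse_features(torch.randn(128, 200), caes, patch_sizes=[6, 10])
```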
S103: inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
Fig. 2 is a schematic diagram of the semi-supervised learning framework, in which the factors related to the emotion in the speech are separated from the other factors as much as possible to obtain the emotion-related features.
Specifically, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function.
The reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction.
The orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature and f_{j}^{o}(y) is the j-th emotion-irrelevant feature.
By orthogonalizing the sensitivity vector \partial f_{i}^{e}(y)/\partial y of the i-th emotion-related feature and the sensitivity vector \partial f_{j}^{o}(y)/\partial y of the j-th emotion-irrelevant feature, the two kinds of features can be effectively separated, so the orthogonal loss function gives a preliminary division into emotion-related and emotion-irrelevant features.
The discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label.
The category labels are used to widen the gap between the emotion-related features of different emotion categories so that emotion classification can be performed better; rich information between emotion categories can therefore be mined by minimizing this cross-entropy loss function.
The verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
The verification loss function reduces the distance between emotion-related features of the same emotion category.
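A minimal sketch of these four loss terms follows, assuming a one-layer sigmoid encoder for each branch; the class name, feature dimensions and pair-sampling scheme are illustrative assumptions rather than details taken from the patent.

```python
# Illustrative sketch: splitting the rough feature y into emotion-related f_e and
# emotion-irrelevant f_o parts, with the four losses of the semi-supervised framework.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemiSupervisedSplit(nn.Module):
    def __init__(self, in_dim, n_rel, n_irr, n_classes):
        super().__init__()
        self.w_e = nn.Linear(in_dim, n_rel)          # emotion-related branch
        self.w_o = nn.Linear(in_dim, n_irr)          # emotion-irrelevant branch
        self.dec = nn.Linear(n_rel + n_irr, in_dim)  # joint reconstruction (V, d)
        self.cls = nn.Linear(n_rel, n_classes)       # category prediction head

    def forward(self, y):
        f_e = torch.sigmoid(self.w_e(y))
        f_o = torch.sigmoid(self.w_o(y))
        y_hat = torch.sigmoid(self.dec(torch.cat([f_e, f_o], dim=1)))
        return f_e, f_o, y_hat

def semi_supervised_losses(model, y, labels, y1, y2, same, m=1.0):
    f_e, f_o, y_hat = model(y)
    l_recon = F.mse_loss(y_hat, y)
    # Orthogonal loss: squared dot products between sensitivity vectors
    # (rows of d f / d y) of the two branches.
    s_e = (f_e * (1 - f_e)).unsqueeze(2) * model.w_e.weight   # (B, n_rel, in_dim)
    s_o = (f_o * (1 - f_o)).unsqueeze(2) * model.w_o.weight   # (B, n_irr, in_dim)
    l_ort = (torch.einsum('bid,bjd->bij', s_e, s_o) ** 2).mean()
    # Discrimination loss: cross-entropy over emotion categories.
    l_disc = F.cross_entropy(model.cls(f_e), labels)
    # Verification loss on a labelled pair, following the patent's formula
    # (same = 1 for a same-category pair, 0 otherwise).
    d_w = torch.norm(model(y1)[0] - model(y2)[0], dim=1)
    l_verif = ((1 - same) * d_w + same * 0.5 * torch.clamp(m - d_w, min=0) ** 2).mean()
    return l_recon, l_ort, l_disc, l_verif

# Example with random stand-ins for rough features, labels and sample pairs.
model = SemiSupervisedSplit(in_dim=2048, n_rel=64, n_irr=64, n_classes=4)
y, labels = torch.randn(8, 2048), torch.randint(0, 4, (8,))
pair_a, pair_b = torch.randn(4, 2048), torch.randn(4, 2048)
same = torch.randint(0, 2, (4,)).float()
print(semi_supervised_losses(model, y, labels, pair_a, pair_b, same))
```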
S104: carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction;
Specifically, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are carried out; the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term. FIG. 3 is a diagram of the classification and domain invariant feature learning model framework.
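The following sketch assumes a DANN-style realization of the classification and domain invariant feature learning model, with a gradient reversal layer carrying the weight alpha; the patent specifies only the objective, so the architecture and hyper-parameters below are assumptions.

```python
# Illustrative sketch: hierarchical mapping h, emotion head G_y and domain head G_d,
# with a gradient reversal layer (an assumed mechanism for domain invariance).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        # Reverse and scale the gradient flowing back into the feature extractor.
        return -ctx.alpha * grad, None

class ClassifyAndDomainModel(nn.Module):
    def __init__(self, in_dim, hid, n_emotions, n_domains):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                               nn.Linear(hid, hid), nn.ReLU())   # hierarchical mapping h
        self.g_y = nn.Linear(hid, n_emotions)                    # emotion-label head G_y
        self.g_d = nn.Linear(hid, n_domains)                     # domain-label head G_d

    def forward(self, f_e, alpha=0.1):
        high = self.h(f_e)                        # high-level emotion features
        y_logits = self.g_y(high)
        d_logits = self.g_d(GradReverse.apply(high, alpha))
        return y_logits, d_logits

def training_loss(model, f_e, emo_labels, dom_labels, alpha=0.1):
    y_logits, d_logits = model(f_e, alpha)
    return F.cross_entropy(y_logits, emo_labels) + F.cross_entropy(d_logits, dom_labels)
```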
S105: and training the classifier by utilizing the high-level emotional characteristics of the source domain and the corresponding emotional labels to obtain the trained classifier.
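A minimal sketch of this step, assuming an SVM back-end (the patent does not name the classifier type), is:

```python
# Illustrative sketch: train a classifier on source-domain high-level emotion
# features and their emotion labels; the SVM choice is an assumption.
import numpy as np
from sklearn.svm import SVC

def train_emotion_classifier(source_features, source_labels):
    clf = SVC(kernel='rbf', C=1.0)
    clf.fit(source_features, source_labels)
    return clf

# Example with placeholder arrays standing in for high-level emotion features.
clf = train_emotion_classifier(np.random.randn(200, 128), np.random.randint(0, 4, 200))
```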
As shown in fig. 4, an embodiment of the present invention further provides a speech recognition system based on deep learning, including:
the voice preprocessing unit 401: acquiring a voice signal, and preprocessing the voice signal to obtain a voice spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, fourier transformation and PCA dimension reduction;
for the acquired audio signal, firstly, the audio signal is converted into a digital signal, then, speech spectrum characteristics are obtained through pre-emphasis, framing, windowing and Fourier change, and then, the dimensionality reduction is carried out on a frequency domain through PCA, so that the speech spectrum characteristics of an audio sample are extracted to be represented.
Coarse feature acquisition unit 402: extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
Specifically, the kernels are the weights and biases of the encoder.
Unsupervised feature learning can learn features hidden in the data from unlabeled samples; a contractive auto-encoder (CAE) is used for the unsupervised pre-training.
The pre-training of the CAE yields the weights and biases of the encoder, namely the kernel (U, c). For an input x, the kernel (U, c) produces the hidden-layer feature f(x) ∈ R^K, where K is the number of hidden-layer features. For the speech spectrum input, the kernel is applied to many small patches to obtain a new feature representation of the spectrum: patches of different sizes are extracted from the input data and pre-trained to obtain kernels of different sizes, the whole speech spectrum input is convolved and pooled with the kernels of different sizes, and finally the pooled features of different sizes are stacked to obtain the rough features.
Emotion-related feature acquisition unit 403: inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
Fig. 2 is a schematic diagram of the semi-supervised learning framework, in which the factors related to the emotion in the speech are separated from the other factors as much as possible to obtain the emotion-related features.
Specifically, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function.
The reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction.
The orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature and f_{j}^{o}(y) is the j-th emotion-irrelevant feature.
By orthogonalizing the sensitivity vector \partial f_{i}^{e}(y)/\partial y of the i-th emotion-related feature and the sensitivity vector \partial f_{j}^{o}(y)/\partial y of the j-th emotion-irrelevant feature, the two kinds of features can be effectively separated, so the orthogonal loss function gives a preliminary division into emotion-related and emotion-irrelevant features.
The discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label.
The category labels are used to widen the gap between the emotion-related features of different emotion categories so that emotion classification can be performed better; rich information between emotion categories can therefore be mined by minimizing this cross-entropy loss function.
The verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
The verification loss function reduces the distance between emotion-related features of the same emotion category.
Emotion tag and domain tag prediction unit 404: performing hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and performing emotion label prediction and domain label prediction;
Specifically, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are carried out; the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term. FIG. 3 is a diagram of the classification and domain invariant feature learning model framework.
Classifier training unit 405: and training the classifier by utilizing the high-level emotional characteristics of the source domain and the corresponding emotional labels to obtain the trained classifier.
As shown in fig. 5, an electronic device 500 according to an embodiment of the present invention includes a memory 510, a processor 520, and a computer program 511 stored in the memory 510 and executable on the processor 520, where the processor 520 executes the computer program 511 to implement a deep learning based speech recognition method according to an embodiment of the present invention.
In a specific implementation, when the processor 520 executes the computer program 511, any of the embodiments corresponding to fig. 1 may be implemented.
Since the electronic device described in this embodiment is the device used to implement the data processing apparatus in the embodiment of the present invention, a person skilled in the art can, based on the method described in this embodiment, understand the specific implementation of the electronic device and its various variations; how the electronic device implements the method of this embodiment is therefore not described in detail here. Any device used by a person skilled in the art to implement the method of this embodiment of the present invention falls within the protection scope of the present invention.
Referring to fig. 6, fig. 6 is a schematic diagram illustrating an embodiment of a computer-readable storage medium according to an embodiment of the present invention.
As shown in fig. 6, the present embodiment provides a computer-readable storage medium 600, on which a computer program 601 is stored, and when executed by a processor, the computer program 601 implements a deep learning based speech recognition method provided by the present embodiment;
in a specific implementation, the computer program 601 may implement any of the embodiments corresponding to fig. 1 when executed by a processor.
It should be noted that, in the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The invention provides a speech recognition method based on deep learning. The method first acquires a speech signal and preprocesses it to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction; extracts blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performs unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, convolves and pools the whole speech spectrum input with these kernels, and stacks the pooled features of different sizes to obtain rough features; inputs the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, sets them as emotion-related features and emotion-irrelevant features, reconstructs the jointly input rough features, orthogonalizes the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performs category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features; carries out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputs them into a classification and domain invariant feature learning model, and carries out emotion label prediction and domain label prediction; and trains a classifier with the high-level emotion features of the source domain and the corresponding emotion labels to obtain a trained classifier. The method reduces the influence of emotion-irrelevant factors in feature learning by dividing the speech features into emotion-related features and emotion-irrelevant features, and alleviates the problem of insufficient or missing target-domain samples by performing emotion label prediction and domain label prediction.
The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification made using this design concept constitutes an infringement of the protection scope of the present invention.

Claims (10)

1. A speech recognition method based on deep learning is characterized by comprising the following steps:
acquiring a voice signal, and preprocessing the voice signal to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction;
extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
performing hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and performing emotion label prediction and domain label prediction;
and training the classifier by utilizing the high-level emotional characteristics of the source domain and the corresponding emotional labels to obtain the trained classifier.
2. The deep learning based speech recognition method of claim 1, wherein the kernel is a weight and a bias of an encoder.
3. The method according to claim 1, wherein the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function;
the reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction;
the orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature, f_{j}^{o}(y) is the j-th emotion-irrelevant feature, and the partial derivatives are their sensitivity vectors with respect to the input;
the discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label;
the verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
4. The speech recognition method based on deep learning of claim 1, wherein the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, which are input into the classification and domain invariant feature learning model for emotion label prediction and domain label prediction, and the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term.
5. A deep learning based speech recognition system comprising:
a voice preprocessing unit: acquiring a voice signal, and preprocessing the voice signal to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction;
a coarse feature acquisition unit: extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
an emotion-related feature acquisition unit: inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
emotion tag and domain tag prediction unit: carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction;
a classifier training unit: and training the classifier by utilizing the high-level emotional characteristics of the source domain and the corresponding emotional labels to obtain the trained classifier.
6. The deep learning based speech recognition system of claim 5, wherein, in the coarse feature acquisition unit, the kernels are the weights and biases of the encoder.
7. The deep learning based speech recognition system of claim 5, wherein, in the emotion-related feature acquisition unit, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function;
the reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction;
the orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature, f_{j}^{o}(y) is the j-th emotion-irrelevant feature, and the partial derivatives are their sensitivity vectors with respect to the input;
the discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label;
the verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
8. The deep learning-based speech recognition system of claim 5, wherein, in the emotion label and domain label prediction unit, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are performed, the objective function of the classification and domain invariant feature learning model being:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, wherein the processor implements the method steps of any of claims 1-4 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 4.
CN202210871932.8A 2022-07-19 2022-07-19 Voice recognition method and system based on deep learning Active CN115240649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210871932.8A CN115240649B (en) 2022-07-19 2022-07-19 Voice recognition method and system based on deep learning


Publications (2)

Publication Number Publication Date
CN115240649A (en) 2022-10-25
CN115240649B (en) 2023-04-18

Family

ID=83676279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210871932.8A Active CN115240649B (en) 2022-07-19 2022-07-19 Voice recognition method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN115240649B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021373B (en) * 2014-05-27 2017-02-15 江苏大学 Semi-supervised speech feature variable factor decomposition method
US10127927B2 (en) * 2014-07-28 2018-11-13 Sony Interactive Entertainment Inc. Emotional speech processing
CN106469560B (en) * 2016-07-27 2020-01-24 江苏大学 Voice emotion recognition method based on unsupervised domain adaptation
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
CN110516696B (en) * 2019-07-12 2023-07-25 东南大学 Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression
CN112397092A (en) * 2020-11-02 2021-02-23 天津理工大学 Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN112863494B (en) * 2021-01-19 2023-01-06 湖南大学 Voice emotion recognition method and system based on semi-supervised adversity variation self-coding
CN113555038B (en) * 2021-07-05 2023-12-29 东南大学 Speaker-independent voice emotion recognition method and system based on unsupervised domain countermeasure learning

Also Published As

Publication number Publication date
CN115240649A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
Hatami et al. Classification of time-series images using deep convolutional neural networks
CN107832663B (en) Multi-modal emotion analysis method based on quantum theory
CN106469560B (en) Voice emotion recognition method based on unsupervised domain adaptation
Daneshfar et al. Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm
CN111339913A (en) Method and device for recognizing emotion of character in video
Kaya et al. Introducing Weighted Kernel Classifiers for Handling Imbalanced Paralinguistic Corpora: Snoring, Addressee and Cold.
Carbonneau et al. Feature learning from spectrograms for assessment of personality traits
CN110853656B (en) Audio tampering identification method based on improved neural network
Yogesh et al. Bispectral features and mean shift clustering for stress and emotion recognition from natural speech
CN112562741A (en) Singing voice detection method based on dot product self-attention convolution neural network
Kumar et al. Binary-classifiers-enabled filters for semi-supervised learning
Elleuch et al. Towards unsupervised learning for Arabic handwritten recognition using deep architectures
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
Rajeswari et al. Dysarthric speech recognition using variational mode decomposition and convolutional neural networks
CN111563373A (en) Attribute-level emotion classification method for focused attribute-related text
Alharbi et al. Inpainting forgery detection using hybrid generative/discriminative approach based on bounded generalized Gaussian mixture model
CN115240649B (en) Voice recognition method and system based on deep learning
Fedele et al. Explaining siamese networks in few-shot learning for audio data
Ruiz-Muñoz et al. Enhancing the dissimilarity-based classification of birdsong recordings
Mohammed et al. Speech Emotion Recognition Using MELBP Variants of Spectrogram Image.
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
Rajasekhar et al. A novel speech emotion recognition model using mean update of particle swarm and whale optimization-based deep belief network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant