CN115240649B - Voice recognition method and system based on deep learning - Google Patents

Voice recognition method and system based on deep learning

Info

Publication number
CN115240649B
CN115240649B (application CN202210871932.8A)
Authority
CN
China
Prior art keywords
emotion
features
domain
feature
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210871932.8A
Other languages
Chinese (zh)
Other versions
CN115240649A (en)
Inventor
于振华 (Yu Zhenhua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202210871932.8A
Publication of CN115240649A
Application granted
Publication of CN115240649B
Legal status: Active (current)

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for measuring the quality of voice signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides a speech recognition method based on deep learning. The method first acquires a speech signal and preprocesses it to obtain a speech spectrum feature representation; extracts blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performs unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, convolves and pools the whole speech spectrum input with these kernels, and stacks the pooled features of different sizes to obtain rough features; inputs the rough features obtained by unsupervised feature learning into a semi-supervised learning framework to learn emotion-related features and emotion-irrelevant features; performs emotion label prediction and domain label prediction on the emotion-related features; and trains a classifier with the high-level emotion features of the source domain and the corresponding emotion labels to obtain a trained classifier. The method can reduce the influence of emotion-irrelevant factors in feature learning and can alleviate the problem of insufficient or missing target-domain samples.

Description

Voice recognition method and system based on deep learning
Technical Field
The invention relates to the field of voice recognition, in particular to a voice recognition method and system based on deep learning.
Background
Speech emotion recognition refers to recognizing the emotional state of a speaker from speech, and is one of the most challenging tasks in the field of speech technology. With the wide application of voice interaction technology, speech emotion recognition, which can make robots more human-like, has broad application prospects and commercial value.
With the development of deep learning in recent years, many achievements have appeared in the field of speech emotion recognition. Even so, current speech emotion recognition technology still faces many difficulties: high-dimensional emotion features are difficult to extract manually, the amount of emotional speech data is small, labeling is laborious, and the training data and the test data have different distributions, all of which result in poor model recognition accuracy.
Disclosure of Invention
The invention mainly aims to overcome the defects in the prior art, and provides a voice recognition method and system based on deep learning.
The technical scheme of the invention is as follows:
a speech recognition method based on deep learning comprises the following steps:
acquiring a voice signal, and preprocessing the voice signal to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction;
extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction;
and training the classifier by using the high-level emotion features of the source domain and the corresponding emotion labels to obtain the trained classifier.
Specifically, the kernels are the weights and biases of the encoder.
Specifically, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function;
the reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction;
the orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature, f_{j}^{o}(y) is the j-th emotion-irrelevant feature, and the partial derivatives are their sensitivity vectors with respect to the input;
the discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label;
the verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
Specifically, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are carried out; the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term.
An embodiment of the present invention further provides a speech recognition system based on deep learning, including:
a voice preprocessing unit: acquiring a voice signal, and preprocessing the voice signal to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction;
a coarse feature acquisition unit: extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
an emotion-related feature acquisition unit: inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
an emotion label and domain label prediction unit: carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction;
a classifier training unit: training the classifier by using the high-level emotion features of the source domain and the corresponding emotion labels to obtain the trained classifier.
Specifically, in the coarse feature acquisition unit, the kernels are the weights and biases of the encoder.
Specifically, in the emotion-related feature acquisition unit, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function;
the reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction;
the orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature, f_{j}^{o}(y) is the j-th emotion-irrelevant feature, and the partial derivatives are their sensitivity vectors with respect to the input;
the discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label;
the verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
Specifically, in the emotion label and domain label prediction unit, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are carried out; the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term.
Yet another embodiment of the present invention provides an electronic device, including: the system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the speech recognition method based on deep learning when executing the computer program.
Yet another embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above-mentioned voice recognition method based on deep learning.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
the invention provides a speech recognition method based on deep learning, which comprises the steps of firstly acquiring a speech signal, preprocessing the speech signal to obtain a speech spectrum feature representation, wherein the preprocessing comprises but is not limited to the following steps: pre-emphasis, framing, windowing, fourier change and PCA dimension reduction; extracting blocks with different sizes from the speech spectrum feature representation of each label-free training sample, performing unsupervised pre-training by adopting CAE (computer aided engineering) to obtain kernels with different sizes, performing convolution and pooling on the whole speech spectrum input by the kernels with different sizes, and stacking the pooled features with different sizes to obtain rough features; inputting rough features obtained by unsupervised feature learning into a semi-supervised learning frame, setting the rough features as emotion related features and emotion irrelevant features, reconstructing the jointly input rough features, orthogonalizing sensitivity vectors of the emotion related features and sensitivity vectors of the emotion irrelevant features, and performing category prediction on the emotion related features through sigmoid mapping so as to learn the emotion related features and the emotion irrelevant features; carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction; training a classifier by using high-level emotion features of the source domain and corresponding emotion labels to obtain a trained classifier; according to the method provided by the invention, the voice features are divided into the emotion related features and the emotion unrelated features, so that the influence of emotion unrelated factors in feature learning is reduced, emotion label prediction and domain label prediction are carried out, and the problem of insufficient or lack of target domain samples is solved.
Drawings
FIG. 1 is a flowchart of a method for speech recognition based on deep learning according to an embodiment of the present invention;
FIG. 2 is a diagram of a semi-supervised feature learning framework provided by embodiments of the present invention;
FIG. 3 is a diagram of a classification and domain invariant feature learning model provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a deep learning based speech recognition system according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the present invention;
fig. 6 is a schematic diagram of an embodiment of a computer-readable storage medium according to an embodiment of the present invention.
The invention is described in further detail below with reference to the figures and specific examples.
Detailed Description
According to the voice recognition method based on deep learning, the voice features are divided into the emotion-related features and the emotion-unrelated features, so that the influence of emotion-unrelated factors in feature learning is reduced, emotion label prediction and domain label prediction are performed, and the problem of insufficient or lack of target domain samples is solved.
It is noted that, in this document, relational terms such as "first" and "second", and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The technical scheme of the invention is as follows:
a speech recognition method based on deep learning comprises the following steps:
s101: acquiring a voice signal, and preprocessing the voice signal to obtain a voice spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, fourier change and PCA dimension reduction;
for the acquired audio signal, firstly, the audio signal is converted into a digital signal, then, the speech spectrum characteristic is obtained through pre-emphasis, framing, windowing and Fourier change, and then, the dimensionality reduction is carried out on the frequency domain through PCA, so that the speech spectrum characteristic representation of an audio sample is extracted.
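The following Python sketch illustrates one way such a spectro-temporal representation can be produced; it is an illustration rather than the patent's reference implementation, and the frame length, hop size, window type and number of PCA components are assumed values.

```python
# Illustrative sketch: spectrogram extraction with pre-emphasis, framing,
# Hamming windowing, Fourier transform and PCA dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA

def spectrogram_features(signal, frame_len=400, hop=160,
                         pre_emphasis=0.97, n_components=40):
    # Pre-emphasis: boost high frequencies to balance the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Framing with a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum of each frame (Fourier transform).
    spec = np.abs(np.fft.rfft(frames, axis=1))      # (n_frames, frame_len//2 + 1)
    log_spec = np.log1p(spec)
    # PCA along the frequency axis to reduce dimensionality.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(log_spec)              # (n_frames, n_components)

# Example: one second of synthetic audio at 16 kHz gives a (frames, 40) representation.
feats = spectrogram_features(np.random.randn(16000))
```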
S102: extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
Specifically, the kernels are the weights and biases of the encoder.
Unsupervised feature learning can learn features hidden in the data from unlabeled samples; here, unsupervised pre-training is performed with a contractive auto-encoder (CAE).
The pre-training of the CAE yields the weights and biases of the encoder, namely the kernel (U, c). For an input x, the kernel (U, c) produces the hidden-layer feature f(x) ∈ R^K, where K is the number of hidden-layer features. For the speech spectrum input, the kernel is applied to many small patches to obtain a new feature representation of the spectrum: patches of different sizes are extracted from the input data and pre-trained to obtain kernels of different sizes, the whole speech spectrum input is convolved and pooled with the kernels of different sizes, and finally the pooled features of different sizes are stacked to obtain the rough features.
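As an illustrative sketch only (the patch sizes, number of hidden units, pooling operator and PyTorch framing are assumptions; the patent fixes the procedure only at the level described above), a contractive auto-encoder can be pre-trained on spectrogram patches and its encoder weights reused as convolution kernels:

```python
# Illustrative sketch: CAE pre-training on patches, then multi-scale
# convolution and pooling over the whole spectrogram to form rough features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAE(nn.Module):
    def __init__(self, patch_dim, n_hidden):
        super().__init__()
        self.enc = nn.Linear(patch_dim, n_hidden)
        self.dec = nn.Linear(n_hidden, patch_dim)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))
        return h, self.dec(h)

    def loss(self, x, lam=0.1):
        h, x_hat = self(x)
        recon = F.mse_loss(x_hat, x)
        # Contractive penalty: squared Frobenius norm of dh/dx for a sigmoid encoder.
        dh = h * (1 - h)                                  # (batch, n_hidden)
        w_sq = (self.enc.weight ** 2).sum(dim=1)          # (n_hidden,)
        contractive = ((dh ** 2) * w_sq).sum(dim=1).mean()
        return recon + lam * contractive

def coarse_features(spectrogram, caes, patch_sizes, pool=8):
    """Convolve the spectrogram with kernels of several patch sizes,
    pool each feature map, and stack the pooled features (rough features)."""
    x = spectrogram.unsqueeze(0).unsqueeze(0)             # (1, 1, F, T)
    pooled = []
    for cae, p in zip(caes, patch_sizes):
        k = cae.enc.weight.view(-1, 1, p, p)              # kernel = encoder weights
        fmap = torch.sigmoid(F.conv2d(x, k, bias=cae.enc.bias))
        pooled.append(F.adaptive_max_pool2d(fmap, pool).flatten())
    return torch.cat(pooled)                              # stacked rough feature

# Example: two patch sizes (6x6 and 10x10), 16 kernels each; each CAE would first
# be pre-trained on flattened patches of its size by minimising cae.loss(patches).
caes = [CAE(6 * 6, 16), CAE(10 * 10, 16)]
rough = coarse_features(torch.randn(128, 200), caes, patch_sizes=[6, 10])
```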
S103: inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
Fig. 2 is a schematic diagram of the semi-supervised learning framework, in which the factors related to the emotion in the speech are separated from the other factors as much as possible to obtain the emotion-related features.
Specifically, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function.
The reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction.
The orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature and f_{j}^{o}(y) is the j-th emotion-irrelevant feature.
By orthogonalizing the sensitivity vector \partial f_{i}^{e}(y)/\partial y of the i-th emotion-related feature and the sensitivity vector \partial f_{j}^{o}(y)/\partial y of the j-th emotion-irrelevant feature, the two kinds of features can be effectively separated, so the orthogonal loss function gives a preliminary division into emotion-related and emotion-irrelevant features.
The discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label.
The category labels are used to widen the gap between the emotion-related features of different emotion categories so that emotion classification can be performed better; rich information between emotion categories can therefore be mined by minimizing this cross-entropy loss function.
The verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
The verification loss function reduces the distance between emotion-related features of the same emotion category.
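A minimal sketch of these four loss terms follows, assuming a one-layer sigmoid encoder for each branch; the class name, feature dimensions and pair-sampling scheme are illustrative assumptions rather than details taken from the patent.

```python
# Illustrative sketch: splitting the rough feature y into emotion-related f_e and
# emotion-irrelevant f_o parts, with the four losses of the semi-supervised framework.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemiSupervisedSplit(nn.Module):
    def __init__(self, in_dim, n_rel, n_irr, n_classes):
        super().__init__()
        self.w_e = nn.Linear(in_dim, n_rel)          # emotion-related branch
        self.w_o = nn.Linear(in_dim, n_irr)          # emotion-irrelevant branch
        self.dec = nn.Linear(n_rel + n_irr, in_dim)  # joint reconstruction (V, d)
        self.cls = nn.Linear(n_rel, n_classes)       # category prediction head

    def forward(self, y):
        f_e = torch.sigmoid(self.w_e(y))
        f_o = torch.sigmoid(self.w_o(y))
        y_hat = torch.sigmoid(self.dec(torch.cat([f_e, f_o], dim=1)))
        return f_e, f_o, y_hat

def semi_supervised_losses(model, y, labels, y1, y2, same, m=1.0):
    f_e, f_o, y_hat = model(y)
    l_recon = F.mse_loss(y_hat, y)
    # Orthogonal loss: squared dot products between sensitivity vectors
    # (rows of d f / d y) of the two branches.
    s_e = (f_e * (1 - f_e)).unsqueeze(2) * model.w_e.weight   # (B, n_rel, in_dim)
    s_o = (f_o * (1 - f_o)).unsqueeze(2) * model.w_o.weight   # (B, n_irr, in_dim)
    l_ort = (torch.einsum('bid,bjd->bij', s_e, s_o) ** 2).mean()
    # Discrimination loss: cross-entropy over emotion categories.
    l_disc = F.cross_entropy(model.cls(f_e), labels)
    # Verification loss on a labelled pair, following the patent's formula
    # (same = 1 for a same-category pair, 0 otherwise).
    d_w = torch.norm(model(y1)[0] - model(y2)[0], dim=1)
    l_verif = ((1 - same) * d_w + same * 0.5 * torch.clamp(m - d_w, min=0) ** 2).mean()
    return l_recon, l_ort, l_disc, l_verif

# Example with random stand-ins for rough features, labels and sample pairs.
model = SemiSupervisedSplit(in_dim=2048, n_rel=64, n_irr=64, n_classes=4)
y, labels = torch.randn(8, 2048), torch.randint(0, 4, (8,))
pair_a, pair_b = torch.randn(4, 2048), torch.randn(4, 2048)
same = torch.randint(0, 2, (4,)).float()
print(semi_supervised_losses(model, y, labels, pair_a, pair_b, same))
```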
S104: carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction;
Specifically, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are carried out; the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term. FIG. 3 is a diagram of the classification and domain invariant feature learning model framework.
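The following sketch assumes a DANN-style realization of the classification and domain invariant feature learning model, with a gradient reversal layer carrying the weight alpha; the patent specifies only the objective, so the architecture and hyper-parameters below are assumptions.

```python
# Illustrative sketch: hierarchical mapping h, emotion head G_y and domain head G_d,
# with a gradient reversal layer (an assumed mechanism for domain invariance).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        # Reverse and scale the gradient flowing back into the feature extractor.
        return -ctx.alpha * grad, None

class ClassifyAndDomainModel(nn.Module):
    def __init__(self, in_dim, hid, n_emotions, n_domains):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                               nn.Linear(hid, hid), nn.ReLU())   # hierarchical mapping h
        self.g_y = nn.Linear(hid, n_emotions)                    # emotion-label head G_y
        self.g_d = nn.Linear(hid, n_domains)                     # domain-label head G_d

    def forward(self, f_e, alpha=0.1):
        high = self.h(f_e)                        # high-level emotion features
        y_logits = self.g_y(high)
        d_logits = self.g_d(GradReverse.apply(high, alpha))
        return y_logits, d_logits

def training_loss(model, f_e, emo_labels, dom_labels, alpha=0.1):
    y_logits, d_logits = model(f_e, alpha)
    return F.cross_entropy(y_logits, emo_labels) + F.cross_entropy(d_logits, dom_labels)
```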
S105: and training the classifier by utilizing the high-level emotional characteristics of the source domain and the corresponding emotional labels to obtain the trained classifier.
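A minimal sketch of this step, assuming an SVM back-end (the patent does not name the classifier type), is:

```python
# Illustrative sketch: train a classifier on source-domain high-level emotion
# features and their emotion labels; the SVM choice is an assumption.
import numpy as np
from sklearn.svm import SVC

def train_emotion_classifier(source_features, source_labels):
    clf = SVC(kernel='rbf', C=1.0)
    clf.fit(source_features, source_labels)
    return clf

# Example with placeholder arrays standing in for high-level emotion features.
clf = train_emotion_classifier(np.random.randn(200, 128), np.random.randint(0, 4, 200))
```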
As shown in fig. 4, an embodiment of the present invention further provides a speech recognition system based on deep learning, including:
the voice preprocessing unit 401: acquiring a voice signal, and preprocessing the voice signal to obtain a voice spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, fourier transformation and PCA dimension reduction;
for the acquired audio signal, firstly, the audio signal is converted into a digital signal, then, speech spectrum characteristics are obtained through pre-emphasis, framing, windowing and Fourier change, and then, the dimensionality reduction is carried out on a frequency domain through PCA, so that the speech spectrum characteristics of an audio sample are extracted to be represented.
Coarse feature acquisition unit 402: extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
Specifically, the kernels are the weights and biases of the encoder.
Unsupervised feature learning can learn features hidden in the data from unlabeled samples; a contractive auto-encoder (CAE) is used for the unsupervised pre-training.
The pre-training of the CAE yields the weights and biases of the encoder, namely the kernel (U, c). For an input x, the kernel (U, c) produces the hidden-layer feature f(x) ∈ R^K, where K is the number of hidden-layer features. For the speech spectrum input, the kernel is applied to many small patches to obtain a new feature representation of the spectrum: patches of different sizes are extracted from the input data and pre-trained to obtain kernels of different sizes, the whole speech spectrum input is convolved and pooled with the kernels of different sizes, and finally the pooled features of different sizes are stacked to obtain the rough features.
Emotion-related feature acquisition unit 403: inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
Fig. 2 is a schematic diagram of the semi-supervised learning framework, in which the factors related to the emotion in the speech are separated from the other factors as much as possible to obtain the emotion-related features.
Specifically, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function.
The reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction.
The orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature and f_{j}^{o}(y) is the j-th emotion-irrelevant feature.
By orthogonalizing the sensitivity vector \partial f_{i}^{e}(y)/\partial y of the i-th emotion-related feature and the sensitivity vector \partial f_{j}^{o}(y)/\partial y of the j-th emotion-irrelevant feature, the two kinds of features can be effectively separated, so the orthogonal loss function gives a preliminary division into emotion-related and emotion-irrelevant features.
The discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label.
The category labels are used to widen the gap between the emotion-related features of different emotion categories so that emotion classification can be performed better; rich information between emotion categories can therefore be mined by minimizing this cross-entropy loss function.
The verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
The verification loss function reduces the distance between emotion-related features of the same emotion category.
Emotion tag and domain tag prediction unit 404: performing hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and performing emotion label prediction and domain label prediction;
Specifically, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are carried out; the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term. FIG. 3 is a diagram of the classification and domain invariant feature learning model framework.
Classifier training unit 405: and training the classifier by utilizing the high-level emotional characteristics of the source domain and the corresponding emotional labels to obtain the trained classifier.
As shown in fig. 5, an electronic device 500 according to an embodiment of the present invention includes a memory 510, a processor 520, and a computer program 511 stored in the memory 510 and executable on the processor 520, where the processor 520 executes the computer program 511 to implement a deep learning based speech recognition method according to an embodiment of the present invention.
In a specific implementation, when the processor 520 executes the computer program 511, any of the embodiments corresponding to fig. 1 may be implemented.
Since the electronic device described in this embodiment is the device used to implement the data processing apparatus in the embodiment of the present invention, a person skilled in the art can, based on the method described in this embodiment, understand the specific implementation of the electronic device and its various variations; how the electronic device implements the method of this embodiment is therefore not described in detail here. Any device used by a person skilled in the art to implement the method of this embodiment of the present invention falls within the protection scope of the present invention.
Referring to fig. 6, fig. 6 is a schematic diagram illustrating an embodiment of a computer-readable storage medium according to an embodiment of the present invention.
As shown in fig. 6, the present embodiment provides a computer-readable storage medium 600, on which a computer program 601 is stored, and when executed by a processor, the computer program 601 implements a deep learning based speech recognition method provided by the present embodiment;
in a specific implementation, the computer program 601 may implement any of the embodiments corresponding to fig. 1 when executed by a processor.
It should be noted that, in the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The invention provides a speech recognition method based on deep learning. The method first acquires a speech signal and preprocesses it to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction; extracts blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performs unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, convolves and pools the whole speech spectrum input with these kernels, and stacks the pooled features of different sizes to obtain rough features; inputs the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, sets them as emotion-related features and emotion-irrelevant features, reconstructs the jointly input rough features, orthogonalizes the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performs category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features; carries out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputs them into a classification and domain invariant feature learning model, and carries out emotion label prediction and domain label prediction; and trains a classifier with the high-level emotion features of the source domain and the corresponding emotion labels to obtain a trained classifier. The method reduces the influence of emotion-irrelevant factors in feature learning by dividing the speech features into emotion-related features and emotion-irrelevant features, and alleviates the problem of insufficient or missing target-domain samples by performing emotion label prediction and domain label prediction.
The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification made using this design concept constitutes an infringement of the protection scope of the present invention.

Claims (10)

1. A speech recognition method based on deep learning is characterized by comprising the following steps:
acquiring a voice signal, and preprocessing the voice signal to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction;
extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
performing hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and performing emotion label prediction and domain label prediction;
and training the classifier by utilizing the high-level emotional characteristics of the source domain and the corresponding emotional labels to obtain the trained classifier.
2. The deep learning based speech recognition method of claim 1, wherein the kernel is a weight and a bias of an encoder.
3. The method according to claim 1, wherein the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function;
the reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction;
the orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature, f_{j}^{o}(y) is the j-th emotion-irrelevant feature, and the partial derivatives are their sensitivity vectors with respect to the input;
the discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label;
the verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
4. The speech recognition method based on deep learning of claim 1, wherein the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, which are input into the classification and domain invariant feature learning model for emotion label prediction and domain label prediction, and the objective function of the classification and domain invariant feature learning model is:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term.
5. A deep learning based speech recognition system comprising:
a voice preprocessing unit: acquiring a voice signal, and preprocessing the voice signal to obtain a speech spectrum feature representation, wherein the preprocessing includes but is not limited to: pre-emphasis, framing, windowing, Fourier transform and PCA dimensionality reduction;
a coarse feature acquisition unit: extracting blocks of different sizes from the speech spectrum feature representation of each unlabeled training sample, performing unsupervised pre-training with a contractive auto-encoder (CAE) to obtain kernels of different sizes, performing convolution and pooling on the whole speech spectrum input with the kernels of different sizes, and stacking the pooled features of different sizes to obtain rough features;
an emotion-related feature acquisition unit: inputting the rough features obtained by unsupervised feature learning into a semi-supervised learning framework, setting them as emotion-related features and emotion-irrelevant features, reconstructing the jointly input rough features, orthogonalizing the sensitivity vectors of the emotion-related features and the sensitivity vectors of the emotion-irrelevant features, and performing category prediction on the emotion-related features through a sigmoid mapping, so as to learn the emotion-related features and the emotion-irrelevant features;
emotion tag and domain tag prediction unit: carrying out hierarchical nonlinear conversion on the emotion-related features to obtain high-level emotion features, inputting the high-level emotion features into a classification and domain invariant feature learning model, and carrying out emotion label prediction and domain label prediction;
a classifier training unit: and training the classifier by utilizing the high-level emotional characteristics of the source domain and the corresponding emotional labels to obtain the trained classifier.
6. The deep learning based speech recognition system of claim 5, wherein, in the coarse feature acquisition unit, the kernels are the weights and biases of the encoder.
7. The deep learning based speech recognition system of claim 5, wherein, in the emotion-related feature acquisition unit, the loss functions in the semi-supervised learning framework comprise: a reconstruction loss function, an orthogonal loss function, a discrimination loss function and a verification loss function;
the reconstruction loss function is:
\hat{y} = s(V[f^{e}(y); f^{o}(y)] + d)
L_{RECON}(y, \hat{y}) = \lVert y - \hat{y} \rVert_2^2
wherein s is the sigmoid function, W and V are weight matrices of the semi-supervised learning framework, d is a weight in the semi-supervised learning framework, and \eta is a hyper-parameter controlling the strength of the constraint term; f^{e}(y) is the emotion-related feature, f^{o}(y) is the emotion-irrelevant feature; y is the rough feature and \hat{y} is its reconstruction;
the orthogonal loss function is:
L_{ORT} = \sum_{i}\sum_{j}\left(\frac{\partial f_{i}^{e}(y)}{\partial y} \cdot \frac{\partial f_{j}^{o}(y)}{\partial y}\right)^{2}
wherein f_{i}^{e}(y) is the i-th emotion-related feature, f_{j}^{o}(y) is the j-th emotion-irrelevant feature, and the partial derivatives are their sensitivity vectors with respect to the input;
the discrimination loss function is:
L_{DISC} = -\frac{1}{C}\sum_{c=1}^{C}\sum_{k} z_{k}^{(c)} \log \hat{z}_{k}^{(c)}
wherein C is the total number of samples, z and \hat{z} are the original emotion label and the predicted emotion label respectively, and k is the category label;
the verification loss function is:
L_{VERIF}(W, Y, y_{1}, y_{2}) = (1-Y) D_{W} + Y \cdot \tfrac{1}{2}\{\max(0, m - D_{W})\}^{2}
wherein D_{W} is the distance between the emotion-related features of the two samples:
D_{W}(y_{1}, y_{2}) = \lVert f^{e}(y_{1}) - f^{e}(y_{2}) \rVert_{2}
Y = 1 indicates that y_{1} and y_{2} come from the same emotion category, Y = 0 indicates that y_{1} and y_{2} come from different emotion categories, and m is a set threshold.
8. The deep learning-based speech recognition system of claim 5, wherein, in the emotion label and domain label prediction unit, the emotion-related features are subjected to hierarchical nonlinear conversion to obtain high-level emotion features, the high-level emotion features are input into the classification and domain invariant feature learning model, and emotion label prediction and domain label prediction are performed, the objective function of the classification and domain invariant feature learning model being:
E(\theta_{y}, \theta_{d}) = L_{y}(G_{y}(h; \theta_{y})) - \alpha L_{d}(G_{d}(h; \theta_{d}))
wherein h denotes the high-level emotion features obtained by mapping the emotion-related features through hierarchical nonlinear conversion, G_{y} and G_{d} denote the mappings from high-level emotion features to emotion labels and domain labels respectively, L_{y} and L_{d} denote the loss functions of emotion label prediction and domain label prediction respectively, \theta_{y} and \theta_{d} denote the parameters of emotion label prediction and domain label prediction respectively, and \alpha is the contribution degree of the domain label prediction term.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, wherein the processor implements the method steps of any of claims 1-4 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 4.
CN202210871932.8A 2022-07-19 2022-07-19 Voice recognition method and system based on deep learning Active CN115240649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210871932.8A CN115240649B (en) 2022-07-19 2022-07-19 Voice recognition method and system based on deep learning


Publications (2)

Publication Number Publication Date
CN115240649A (en) 2022-10-25
CN115240649B (en) 2023-04-18

Family

ID=83676279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210871932.8A Active CN115240649B (en) 2022-07-19 2022-07-19 Voice recognition method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN115240649B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021373B (en) * 2014-05-27 2017-02-15 江苏大学 Semi-supervised speech feature variable factor decomposition method
US10127927B2 (en) * 2014-07-28 2018-11-13 Sony Interactive Entertainment Inc. Emotional speech processing
CN106469560B (en) * 2016-07-27 2020-01-24 江苏大学 Voice emotion recognition method based on unsupervised domain adaptation
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
CN110516696B (en) * 2019-07-12 2023-07-25 东南大学 Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression
CN112397092A (en) * 2020-11-02 2021-02-23 天津理工大学 Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN112863494B (en) * 2021-01-19 2023-01-06 湖南大学 Voice emotion recognition method and system based on semi-supervised adversity variation self-coding
CN113555038B (en) * 2021-07-05 2023-12-29 东南大学 Speaker-independent voice emotion recognition method and system based on unsupervised domain countermeasure learning

Also Published As

Publication number Publication date
CN115240649A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
Hatami et al. Classification of time-series images using deep convolutional neural networks
CN107832663B (en) Multi-modal emotion analysis method based on quantum theory
CN106469560B (en) Voice emotion recognition method based on unsupervised domain adaptation
Daneshfar et al. Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm
CN111339913A (en) Method and device for recognizing emotion of character in video
Kaya et al. Introducing Weighted Kernel Classifiers for Handling Imbalanced Paralinguistic Corpora: Snoring, Addressee and Cold.
Carbonneau et al. Feature learning from spectrograms for assessment of personality traits
CN110853656B (en) Audio tampering identification method based on improved neural network
Yogesh et al. Bispectral features and mean shift clustering for stress and emotion recognition from natural speech
CN112562741A (en) Singing voice detection method based on dot product self-attention convolution neural network
Kumar et al. Binary-classifiers-enabled filters for semi-supervised learning
Elleuch et al. Towards unsupervised learning for Arabic handwritten recognition using deep architectures
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
Rajeswari et al. Dysarthric speech recognition using variational mode decomposition and convolutional neural networks
CN111563373A (en) Attribute-level emotion classification method for focused attribute-related text
Alharbi et al. Inpainting forgery detection using hybrid generative/discriminative approach based on bounded generalized Gaussian mixture model
CN115240649B (en) Voice recognition method and system based on deep learning
Fedele et al. Explaining siamese networks in few-shot learning for audio data
Ruiz-Muñoz et al. Enhancing the dissimilarity-based classification of birdsong recordings
Mohammed et al. Speech Emotion Recognition Using MELBP Variants of Spectrogram Image.
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
Rajasekhar et al. A novel speech emotion recognition model using mean update of particle swarm and whale optimization-based deep belief network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant