CN114566155B - Feature reduction method for continuous speech recognition - Google Patents
Feature reduction method for continuous speech recognition
- Publication number
- CN114566155B (application CN202210243971.3A, publication CN202210243971A)
- Authority
- CN
- China
- Prior art keywords
- feature
- training
- data
- features
- mean
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
Abstract
A feature reduction method for continuous speech recognition comprises the following steps: step 1, preparing a training set and calculating the per-dimension mean of the speech features; step 2, using cepstral mean normalization to subtract the cepstral-coefficient mean from the features of each speech sample, and eliminating outlier data; step 3, after the data with outlying feature distributions have been eliminated, extracting the speech features of all training samples in the training set with a global-feature-mean affine-transformation scaling method; and step 4, performing the feature-normalization calculation and replacing the feature frames in the training samples, to obtain a feature-reduced training set. The invention provides a training method that reduces the feature representation. For acoustic-model training with large data volumes, the recognition accuracy is improved, misrecognition is reduced, and the training time is shortened.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a feature reduction method for continuous speech recognition.
Background
Traditional acoustic-model modeling is based on the parameters of a deep neural network model: the acoustic feature vector of each frame of speech data is propagated along the connections between neural-network nodes, and the network outputs the frame's posterior probability vector, in which each dimension is the classification probability of the corresponding acoustic state.
Deep-neural-network training greatly improves overall recognition performance, but at the same time it increases the demand for training data, and training on large data volumes is time-consuming. Continuous-speech-recognition decoding depends heavily on acoustic-model training, which determines the overall performance of speech-recognition decoding. Training on a large speech corpus can improve the overall performance, but the cost of assembling the data and the total duration of the training runs are both very large.
Disclosure of Invention
In order to overcome the above technical defects of the prior art, the invention discloses a feature reduction method for continuous speech recognition.
The feature reduction method for continuous speech recognition comprises the following steps:
step 1, preparing a training set, wherein the training set comprises a plurality of training samples and each training sample comprises speech and the corresponding text; calculating the per-dimension mean of the speech features of a single training sample in the training set with equation 1:

$$\bar{o}_i = \frac{1}{T}\sum_{t=1}^{T} o_i^t \qquad (1)$$

where $\bar{o}_i$ is the mean of feature dimension $i$ of the speech features, $\sum_{t=1}^{T} o_i^t$ is the summation of the frame features of the speech data, $T$ is the total number of feature frames in the speech data, and $o_i^t$ is the $i$-th dimension of the feature frame at time $t$;
step 2, using cepstral mean normalization to subtract the cepstral-coefficient mean from the features of a single speech sample, specifically as follows:

$$\hat{o}_i^t = o_i^t - \bar{o}_i \qquad (2)$$

where $o_i^t$ is the value of feature dimension $i$ of a feature frame; equation 2 subtracts the per-dimension mean of the utterance from the feature-dimension values of each feature frame to obtain the mean-normalized feature frames;

marking outlier data according to the feature-value mean and removing the outlier data from the training set;
step 3, after the data with outlying feature distributions have been eliminated, extracting the speech features of all training samples in the training set with the global-feature-mean affine-transformation scaling method;

the mean $\mu_i$ and standard deviation $\sigma_i$ of all training-sample speech data in dimension $i$ are calculated as follows:

$$\mu_i = \frac{1}{M}\sum_{m=1}^{M} x_i^m \qquad (3)$$

$$\sigma_i = \sqrt{\frac{1}{M}\sum_{m=1}^{M}\left(x_i^m - \mu_i\right)^2} \qquad (4)$$

where $x_i^m$ is the feature frame of dimension $i$ in training sample $m$, $m$ indexes the different training samples, and $M$ is the total number of training samples after the outlier data have been removed;

the training samples in step 3 have already been processed with the per-sample feature-normalized data, with the outlier data removed;
step 4, performing the feature-normalization calculation by combining equations 3 and 4, specifically as in equation 5:

$$\hat{x}_i^m = \frac{x_i^m - \mu_i}{\sigma_i} \qquad (5)$$

obtaining by equation 5 the normalized value $\hat{x}_i^m$ of every datum in the feature-reduced training data, and replacing the feature frame $x_i^m$ of each dimension $i$ in the training samples with the normalized value $\hat{x}_i^m$, to obtain the feature-reduced training set.
Preferably, in step 2, the outlier data are marked and rejected as follows: a deviation threshold is set, and training samples whose deviation from the feature-frame mean exceeds the threshold are marked as outlier data and eliminated.
The invention provides a training method based on feature reduction: by reducing the feature representation, acoustic-model training with large data volumes achieves higher recognition accuracy, fewer misrecognitions, and a shorter training time.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
Detailed Description
Using the feature reduction method during continuous-speech-recognition acoustic-model training eliminates the contribution of noise-signal features. In the training of a large speech corpus, unnecessary feature structure inevitably exists, so the data features other than the target signal are noise-signal features. These noise signals need to be removed as far as possible before or during training, because noise is an interference term that affects the iteration of the training parameters and degrades the model accuracy and the robustness of signal processing. The retained signal features then fit the actual speech signals and the specific speech environment better, or fit the speech content of evaluation corpora such as the test and validation sets.
Secondly, the feature reduction method eliminates unnecessary correlation between features during continuous-speech-recognition acoustic-model training. Analyzed from the feature-space representation used in acoustic-model decoding, model parameters trained on the same type of speech corpus tend to share the same speech-signal features: some feature parameters are dense in feature space, while others are sparse because they occur rarely in the corpus or correspond to low-frequency speech signals. Such a feature structure yields a model with poor robustness after deep-neural-network training; the model is not suited to the specific speech-signal environment, and the result is poor generalization, a poor user experience, and in particular frequent misrecognition.
To improve upon the above-described drawbacks, embodiments of the present invention are described in further detail with reference to the flow chart shown in fig. 1.
The feature reduction method for continuous speech recognition operates on a training set for speech recognition, wherein the training set comprises M training samples and each training sample comprises speech and the corresponding text.
The method specifically comprises the following steps:
Step 1, calculating the per-dimension mean of the speech features of a single training sample with equation 1:

$$\bar{o}_i = \frac{1}{T}\sum_{t=1}^{T} o_i^t \qquad (1)$$

where $\bar{o}_i$ is the mean of feature dimension $i$ of the speech features, $\sum_{t=1}^{T} o_i^t$ is the summation of the frame features of the speech data, $T$ is the total number of feature frames in the speech data, and $o_i^t$ is the $i$-th dimension of the feature frame at time $t$.
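As a minimal sketch of equation 1 (assuming each utterance is stored as a numpy array of shape (T, D), e.g., D-dimensional cepstral features per frame; the array and function names are illustrative, not from the patent):

```python
import numpy as np

def dimension_mean(frames: np.ndarray) -> np.ndarray:
    """Equation 1: per-dimension mean over all T feature frames.

    frames: array of shape (T, D), one row per feature frame
    returns: array of shape (D,), the mean of each feature dimension
    """
    return frames.mean(axis=0)
```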
Step 2.

Cepstral mean normalization is then used to subtract the cepstral-coefficient mean from the features of a single speech sample, specifically as follows:

$$\hat{o}_i^t = o_i^t - \bar{o}_i \qquad (2)$$

where $o_i^t$ in equation 2 is feature dimension $i$ of frame $t$; equation 2 subtracts the per-dimension mean calculated in equation 1 from the feature-dimension values of all feature frames, to obtain the mean-normalized feature frames.
The outlier data are then marked according to the feature-value mean and removed from the training set.

To select the outlier data, a deviation threshold may be set, and training samples whose deviation from the feature-frame mean exceeds the threshold are marked as outlier data, as in the sketch below.
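A sketch of cepstral mean normalization plus threshold-based outlier rejection, assuming a per-sample mean-absolute-deviation criterion and an illustrative threshold value (the patent fixes neither):

```python
import numpy as np

def cmn(frames: np.ndarray) -> np.ndarray:
    """Equation 2: subtract the per-dimension mean from every frame."""
    return frames - frames.mean(axis=0)

def reject_outliers(samples: list[np.ndarray], threshold: float = 3.0) -> list[np.ndarray]:
    """Keep only the samples whose mean absolute normalized deviation
    stays below the threshold (illustrative outlier criterion)."""
    kept = []
    for frames in samples:
        normalized = cmn(frames)
        deviation = float(np.abs(normalized).mean())  # one scalar per sample
        if deviation <= threshold:
            kept.append(normalized)
    return kept
```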
Step 3: after the data with outlying feature distributions have been eliminated, the speech features of all training samples in the training set are extracted with the global-feature-mean affine-transformation scaling method, and the mean and standard deviation are calculated.

The mean $\mu_i$ and standard deviation $\sigma_i$ of all training-sample speech data in dimension $i$ are calculated as follows:

$$\mu_i = \frac{1}{M}\sum_{m=1}^{M} x_i^m \qquad (3)$$

$$\sigma_i = \sqrt{\frac{1}{M}\sum_{m=1}^{M}\left(x_i^m - \mu_i\right)^2} \qquad (4)$$

where $x_i^m$ is the feature frame of dimension $i$ in training sample $m$, $m$ indexes the different training samples, and $M$ is the total number of training samples after the outlier data have been removed.

The training samples in equations 3 and 4 have already been processed with the per-sample feature-normalized data, with the outlier data rejected.
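A sketch of the global statistics of equations 3 and 4; pooling all frames of all retained samples into one array before taking the mean and standard deviation is an implementation choice, not something the patent specifies:

```python
import numpy as np

def global_stats(samples: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
    """Equations 3 and 4: per-dimension mean and standard deviation
    over all M retained training samples."""
    pooled = np.concatenate(samples, axis=0)  # shape (sum of all T_m, D)
    mu = pooled.mean(axis=0)                  # equation 3
    sigma = pooled.std(axis=0)                # equation 4
    return mu, sigma
```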
Step 4, performing the feature-normalization calculation by combining equations 3 and 4, specifically as in equation 5:

$$\hat{x}_i^m = \frac{x_i^m - \mu_i}{\sigma_i} \qquad (5)$$

obtaining by equation 5 the normalized value $\hat{x}_i^m$ of every datum in the feature-reduced training data, and replacing the feature frame $x_i^m$ of each dimension $i$ in the training samples with the normalized value $\hat{x}_i^m$, to obtain the feature-reduced training set.
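A sketch of the equation-5 replacement step; the small epsilon guarding against a zero standard deviation is an added safeguard, not part of the patent:

```python
import numpy as np

def normalize_set(samples: list[np.ndarray], mu: np.ndarray,
                  sigma: np.ndarray, eps: float = 1e-8) -> list[np.ndarray]:
    """Equation 5: replace every feature frame with its globally
    mean/variance-normalized value."""
    return [(frames - mu) / (sigma + eps) for frames in samples]
```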
When every feature dimension is reduced to a standardized numerical range, the subsequent training and processing generally achieve better performance. After the feature normalization is calculated, the global-feature-mean affine transformation scales the feature data of every dimension into a fixed numerical range. Because the samples are preprocessed on the basis of prior speech-signal processing, the subsequent training covers a larger training space and more feature representations; the feature dimensions of all the data can iteratively update the deep-neural-network parameters during model-parameter adjustment, which strengthens the generalization ability of the trained model, improves the decoding results, and reduces misrecognition.
The data set after feature reduction can be processed as follows.
For model initialization in feature reduction, random values are drawn from an initial distribution defined by a mean and a standard deviation; these values and their numerical ranges are used to initialize the model parameters before training.

The initial value of the mean is set to 0, and the standard deviation is given by:

$$\sigma_w = \frac{1}{\sqrt{N_{\text{out}}}} \qquad (6)$$

Equation 6 gives the standard deviation $\sigma_w$ of the random-value distribution used for initialization, where $N_{\text{out}}$ is the number of output nodes connected by weights in the deep neural network and $w$ denotes the hidden-layer weights of the neural network. Regularized training constrains and adjusts the coefficient estimates toward the zero vector, shrinking them as close to zero as possible; regularization reduces the model complexity and removes instability from the learning process, thereby avoiding overfitting.
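A sketch of this initialization, under the stated assumption that equation 6 scales the standard deviation as $1/\sqrt{N_{\text{out}}}$ (the exact symbols of equation 6 are not recoverable from the text):

```python
import numpy as np

def init_hidden_weights(n_in: int, n_out: int, seed: int | None = None) -> np.ndarray:
    """Zero-mean Gaussian initialization of a hidden-layer weight matrix;
    standard deviation from equation 6 (assumed 1/sqrt(N_out))."""
    rng = np.random.default_rng(seed)
    sigma_w = 1.0 / np.sqrt(n_out)
    return rng.normal(loc=0.0, scale=sigma_w, size=(n_in, n_out))
```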
The norm calculation used in the regularized training process is shown in equation 7-1:

$$\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p} \qquad (7\text{-}1)$$

where $\|x\|_p$ is a non-negative norm, $n$ is the dimension of the vector, $x_i$ is the $i$-th component of the vector, $\sum_{i=1}^{n}|x_i|^p$ accumulates over all components, and $p$ is the order of the norm; when $p=2$, the exponent $1/p$ is a square root.

The core role of the norm is to manage the smoothed bias and variance when the model has too many parameters or an overly complex structure, and to handle the fit between the model's predictions and the variability of the true model parameters. When $p=2$, the norm is the square root of the sum of squares of all components of the vector; the result is the L2 norm, i.e., the Euclidean distance.
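A sketch of equation 7-1; for p = 2 the result agrees with numpy's built-in Euclidean norm:

```python
import numpy as np

def p_norm(x: np.ndarray, p: float = 2.0) -> float:
    """Equation 7-1: the order-p norm of a vector."""
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

# Example: p_norm(np.array([3.0, 4.0])) == 5.0 == np.linalg.norm([3.0, 4.0])
```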
Using the L2 norm to scale the coefficients of the features in a vector suppresses overly large training features, and if overfitting occurs, the norm formula constrains the feature coefficients.
The calculation uses equation 7-2:

$$\tilde{J}(W) = J(W) + \lambda\, R(W), \qquad R(W) = \left\|\mathrm{vec}(W)\right\|_2^2 \qquad (7\text{-}2)$$

In equation 7-2, $J(W)$ is the loss function and $R(W)$ is the regularization term, the squared L2 norm of the weight vector $\mathrm{vec}(W)$. Continuing to decompose from equation 7-2, the regularization term $R(W)$ of the matrix $W$ on the left takes the form of equation 7-3:

$$R(W) = \sum_{l=1}^{L} \left\|\mathrm{vec}\!\left(W^{(l)}\right)\right\|_2^2 \qquad (7\text{-}3)$$

In equation 7-3, $\mathrm{vec}(W^{(l)})$ is the vector representation formed by combining the weights of hidden layer $l$ of the deep neural network; decomposing this vector further yields the accumulated representation over the values in row $i$ and column $j$, as in equation 7-4:

$$\left\|\mathrm{vec}\!\left(W^{(l)}\right)\right\|_2^2 = \sum_{i=1}^{N_{l-1}} \sum_{j=1}^{N_l} \left(w_{ij}^{(l)}\right)^2 \qquad (7\text{-}4)$$

The leftmost $R(W)$ of equation 7-2, the regularization-term calculation of the matrix $W$, is expanded in equation 7-3; in equation 7-4, $w_{ij}^{(l)}$ is the value in row $i$ and column $j$ of the weight matrix of hidden layer $l$, and the accumulated term adds the magnitudes of all weight parameters on top of the loss function.

$N_{l-1}$ and $N_l$ are the numbers of nodes in hidden layers $l-1$ and $l$ respectively, and $L$ is the number of hidden layers; the sum over the layers is the vector representation formed by the weights of all hidden layers, accumulated component by component down to the value in row $i$ and column $j$.
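A sketch of the layered L2 regularization term of equations 7-2 to 7-4; the value of lambda is illustrative, not taken from the patent:

```python
import numpy as np

def l2_regularizer(layer_weights: list[np.ndarray]) -> float:
    """Equations 7-3 and 7-4: sum of squared weights over all hidden layers."""
    return float(sum(np.sum(W ** 2) for W in layer_weights))

def regularized_loss(loss: float, layer_weights: list[np.ndarray],
                     lam: float = 1e-4) -> float:
    """Equation 7-2: training loss plus the weighted L2 regularization term."""
    return loss + lam * l2_regularizer(layer_weights)
```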
Specific examples:
The recognition content is four short sentences: ① hello, intelligent housekeeper; ② the weather is good today; ③ turn on the air conditioner and raise the temperature; ④ turn off the heating and lower the temperature. Each of the four short sentences has 2000 samples with a total duration of 2.78 hours; their content is also contained in 120 hours of assorted 2-to-4-word household phrases, for a total duration of about 135 hours.
The four sentences are used to compare and evaluate the recognition results of the traditional training method and the feature reduction training method proposed in this application, with the word error rate as the judgment criterion.
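For reference, a minimal sketch of the word-error-rate calculation by word-level edit distance (the standard definition, not code from the patent):

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via the Levenshtein edit distance over words."""
    R, H = len(reference), len(hypothesis)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[R][H] / max(R, 1)
```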
The training uses the error back-propagation algorithm of deep neural networks with the conventional training method, namely the deep-neural-network multi-hidden-layer perceptron training method, with the conventional activation functions and training criteria of deep neural networks.
In the specific training process, a data-processing algorithm that fully and randomly shuffles the training data is adopted, and several groups of acoustic models are trained by extracting 160, 270, and 205 hours of home-environment training corpus under different parameter configurations. The test and validation results of the 160-hour acoustic model are found to be clearly better than those of a conventional acoustic model trained on a 160-hour corpus of the same level, and the test decoding results of the 270-hour and 205-hour acoustic models exceed the conventional neural-network training baseline of more than 600 hours: the new acoustic-model training method greatly reduces the word error rate of continuous speech recognition, from 11.21% to 8.96%.
The training speed also improves slightly in terms of training-iteration time, because feature reduction lowers the number of training parameters. Comparison shows that training on the 160-, 270-, and 205-hour home-environment corpora, which conventionally takes about 15 hours, 25.5 hours, and 21 hours respectively, is shortened by the new training method to 14 hours, 23 hours, and 19.5 hours. The new method of this patent therefore reduces the total training duration while offering a solution that improves recognition accuracy, providing a training method that reduces the feature representation.
Training with the traditional method resembles the repeated teaching of learning words from pictures, with the same material learned again and again. In the traditional training method, however, differences in the labeling accuracy of the training corpus and in the reliability of data cleaning introduce feature noise. In the repeated imitation-learning process, personifying the feature expression of the speech signal inside the deep neural network during training, different pronunciations of individual Chinese characters occur, such as "hello intelligent manager", "hello intelligent gateway", and the like; differences between similar-sounding feature frames, the real pronunciation, and the labeled text enter the error back-propagation of model-parameter training, producing unexpected results such as overfitting or underfitting.
The invention repeatedly emphasizes the correct content, i.e., the content of the correct text. When the speech features enter the deep neural network, a feature-reduction term is added to the loss function, i.e., to the optimization target, so that the influence of the feature-reduction coefficient is fully taken into account while the model parameters are trained: feature coefficients with little influence are shrunk to a negligible level, only the feature structure suited to the training parameters is retained, the correct speech signals still enter the subsequent network iterations at full strength, the unnecessary correlation of noisy speech signals is gradually weakened, and the feature parameters become uniformly distributed in feature space; the deep neural network is trained repeatedly in this way.
With the feature-reduction acoustic-model training method of the invention, as described by the above example, optimal model parameters can be trained under real conditions to achieve the best recognition results. The comparison data show that the new feature-reduction training method brings a large performance improvement in speech recognition rate over the traditional acoustic-model training method and reduces the word error rate of continuous speech recognition.
In a large number of trials, the feature reduction method is found to be more effective when the size of the training data set is smaller than the number of parameters in the deep-neural-network model.
Unless there is an obvious contradiction or a statement to the contrary in a given preferred embodiment, all the preferred embodiments described above may be used in any overlapping combination; the embodiments and the specific parameters in them only serve to clearly describe the inventors' verification process and are not intended to limit the scope of the invention, which remains defined by the claims; all equivalent structural changes made using the contents of the specification and drawings of the present invention are likewise included within the scope of the invention.
Claims (2)
1. A feature reduction method for continuous speech recognition, comprising the steps of:
step 1, preparing a training set, wherein the training set comprises a plurality of training samples and each training sample comprises speech and the corresponding text; calculating the per-dimension mean of the speech features of a single training sample in the training set with equation 1:

$$\bar{o}_i = \frac{1}{T}\sum_{t=1}^{T} o_i^t \qquad (1)$$

where $\bar{o}_i$ is the mean of feature dimension $i$ of the speech features, $\sum_{t=1}^{T} o_i^t$ is the summation of the frame features of the speech data, $T$ is the total number of feature frames in the speech data, and $o_i^t$ is the $i$-th dimension of the feature frame at time $t$;
step 2, using cepstral mean normalization to subtract the cepstral-coefficient mean from the features of a single speech sample, specifically as follows:

$$\hat{o}_i^t = o_i^t - \bar{o}_i \qquad (2)$$

where $o_i^t$ is the value of feature dimension $i$ of a feature frame; equation 2 subtracts the per-dimension mean of the utterance from the feature-dimension values of each feature frame to obtain the mean-normalized feature frames;

marking outlier data according to the feature-value mean and removing the outlier data from the training set;
step 3, after the data with outlying feature distributions have been eliminated, extracting the speech features of all training samples in the training set with the global-feature-mean affine-transformation scaling method;

the mean $\mu_i$ and standard deviation $\sigma_i$ of all training-sample speech data in dimension $i$ being calculated as follows:

$$\mu_i = \frac{1}{M}\sum_{m=1}^{M} x_i^m \qquad (3)$$

$$\sigma_i = \sqrt{\frac{1}{M}\sum_{m=1}^{M}\left(x_i^m - \mu_i\right)^2} \qquad (4)$$

where $x_i^m$ is the feature frame of dimension $i$ in training sample $m$, $m$ indexes the different training samples, and $M$ is the total number of training samples after the outlier data have been removed;

the training samples in step 3 having already been processed with the per-sample feature-normalized data, with the outlier data removed;
step 4, performing the feature-normalization calculation by combining equations 3 and 4, specifically as in equation 5:

$$\hat{x}_i^m = \frac{x_i^m - \mu_i}{\sigma_i} \qquad (5)$$

obtaining by equation 5 the normalized value $\hat{x}_i^m$ of every datum in the feature-reduced training data, and replacing the feature frame $x_i^m$ of each dimension $i$ in the training samples with the normalized value $\hat{x}_i^m$, to obtain the feature-reduced training set.
2. The feature reduction method for continuous speech recognition according to claim 1, wherein in step 2 the outlier data are marked and rejected as follows: a deviation threshold is set, and training samples whose deviation from the feature-frame mean exceeds the threshold are marked as outlier data and eliminated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210243971.3A CN114566155B (en) | 2022-03-14 | 2022-03-14 | Feature reduction method for continuous speech recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210243971.3A CN114566155B (en) | 2022-03-14 | 2022-03-14 | Feature reduction method for continuous speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114566155A CN114566155A (en) | 2022-05-31 |
CN114566155B true CN114566155B (en) | 2024-07-12 |
Family
ID=81720454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210243971.3A Active CN114566155B (en) | 2022-03-14 | 2022-03-14 | Feature reduction method for continuous speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114566155B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113470622A (en) * | 2021-09-06 | 2021-10-01 | 成都启英泰伦科技有限公司 | Conversion method and device capable of converting any voice into multiple voices |
CN113707135A (en) * | 2021-10-27 | 2021-11-26 | 成都启英泰伦科技有限公司 | Acoustic model training method for high-precision continuous speech recognition |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5473116B2 (en) * | 2009-08-18 | 2014-04-16 | Kddi株式会社 | Speech recognition apparatus and feature amount normalization method thereof |
KR101236539B1 (en) * | 2010-12-30 | 2013-02-25 | 부산대학교 산학협력단 | Apparatus and Method For Feature Compensation Using Weighted Auto-Regressive Moving Average Filter and Global Cepstral Mean and Variance Normalization |
CN102982799A (en) * | 2012-12-20 | 2013-03-20 | 中国科学院自动化研究所 | Speech recognition optimization decoding method integrating guide probability |
CN107633842B (en) * | 2017-06-12 | 2018-08-31 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN108831443B (en) * | 2018-06-25 | 2020-07-21 | 华中师范大学 | Mobile recording equipment source identification method based on stacked self-coding network |
CN109801621B (en) * | 2019-03-15 | 2020-09-29 | 三峡大学 | Voice recognition method based on residual error gating cyclic unit |
- 2022-03-14: application CN202210243971.3A filed; granted as patent CN114566155B (status: active)
Also Published As
Publication number | Publication date |
---|---|
CN114566155A (en) | 2022-05-31 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |