WO2019212375A1

WO2019212375A1 - Method for obtaining speaker-dependent small high-level acoustic speech attributes

Info

Publication number: WO2019212375A1
Application number: PCT/RU2018/000286
Authority: WO
Inventors: Алексей Александрович ПРУДНИКОВ; Максим Львович КОРЕНЕВСКИЙ; Иван Павлович МЕДЕННИКОВ
Original assignee: Общество с ограниченной ответственностью "Центр речевых технологий"
Priority date: 2018-05-03
Filing date: 2018-05-03
Publication date: 2019-11-07
Also published as: EA202092400A1

Abstract

The invention relates to the field of speech recognition, specifically to the obtaining of high-level acoustic speech attributes for the purpose of speech recognition in conditions of acoustic variability. A method is proposed for obtaining small high-level acoustic speech attributes, according to which method low-level speech attributes and speaker-specific information corresponding to said attributes are made available, then a neural network is trained using the low-level speech attributes, after which training of the neural network is completed using the low-level speech attributes, supplemented by the speaker-specific information. A small layer is introduced into the neural network and training of the neural network with the small layer is completed using low-level speech attributes, supplemented by speaker-specific information, then the small high-level acoustic speech attributes are extracted from the output of the small layer of the neural network.

Description

METHOD FOR OBTAINING SPEAKER-DEPENDENT LOW-DIMENSIONAL HIGH-LEVEL ACOUSTIC SPEECH FEATURES

FIELD OF TECHNOLOGY

The invention relates to the field of speech recognition, in particular to obtaining high-level acoustic speech features for speech recognition under conditions of acoustic variability.

BACKGROUND

A method is known for obtaining a speaker-adaptive acoustic model by means of a neural network using an i-vector (US2015149165). According to this method, an acoustic model based on a deep neural network is provided; audio data containing one or more utterances of a speaker is received; a plurality of speech recognition features is extracted from said utterances; a speaker identification vector for this speaker is created from the extracted speech recognition features; and the deep neural network acoustic model is adapted for automatic speech recognition using the extracted speech recognition features and the speaker identification vector.

A method is known for adapting a neural-network-based acoustic model (US20170169815). In one embodiment of the method, a trained acoustic-model neural network can be adapted to a speaker by using speech data corresponding to a plurality of utterances by that speaker. The trained acoustic-model neural network may have an input layer, one or more hidden layers and an output layer, and may be a deep neural network. The input layer may include one set of input nodes that receive speech features derived from a speech utterance, and another set of input nodes that receive speaker information values derived from the utterance. The speech features may include values that capture information about the content of the utterance, including, without limitation, mel-frequency cepstral coefficients (MFCCs) for one or more speech frames, first-order derivatives between MFCCs of consecutive speech frames (delta MFCCs), and second-order derivatives between MFCCs of consecutive frames (delta-delta MFCCs). In addition, the speaker information values may include a speaker identification vector (i-vector).

A common disadvantage of the known methods (US2015149165 and US20170169815) is that they do not provide an acoustic model that would be highly resistant to distortion of the input data and would allow speech recognition with high accuracy under conditions of acoustic variability.

A multilingual neural-network acoustic model is known (US9460711). That document describes a multi-task learning system. An acoustic model with a multilingual deep neural network may comprise a feed-forward neural network having several layers with one or more nodes each. Each node of a given layer is connected by respective weights to each node of the subsequent layer, and the layers may comprise one or more shared hidden layers of nodes and a language-dependent output layer of nodes for each of two or more languages.

The disadvantage of this known invention is that the method disclosed therein does not provide a multilingual acoustic model that would be highly resistant to distortion of the input data and would allow speech recognition with high accuracy under conditions of acoustic variability.

A method is known for speech recognition using a neural network with speaker adaptation (US9721561), according to which an acoustic model is trained on the basis of a deep neural network with a bottleneck (a small layer), whose input receives acoustic features and additional speaker information, thereby providing speaker-informed training. According to one embodiment of the method, the deep neural network has several hidden layers, a bottleneck layer and an output layer; its first hidden layer includes a first set of nodes that process the acoustic features and a second set of nodes that process the additional speaker information; the input acoustic features are multiplied by a first weight matrix, and the additional speaker information is multiplied by a second weight matrix. The outputs of the bottleneck layer are connected to the next layer of the network.

The disadvantage of this method is that the way the bottleneck neural network is trained does not yield high-quality low-dimensional features and, as a result, cannot provide an acoustic model that allows high-accuracy speech recognition under conditions of acoustic variability.

A data augmentation method based on stochastic feature mapping for automatic speech recognition is known (US9721559), according to which a speaker-dependent acoustic model of a target speaker is trained for subsequent recognition of that speaker's speech, i.e. the model is designed to recognize one specific speaker with the best possible quality. Because the target speaker's data is insufficient for training the neural network, the existing data is supplemented with data of other speakers from the training set, transformed by stochastic feature mapping, as well as by vocal tract length perturbation. The parameters of these transformations are estimated from a first acoustic model built only on the target speaker's data. After the data set has been augmented, two-stage training is carried out: in the first stage, a deep neural network with a bottleneck (a small layer) is trained to produce features, which are extracted from the bottleneck layer and used in the second stage to train a neural network that yields the resulting speaker-dependent model.

The disadvantage of this method is that the low-dimensional features are built for a specific speaker and are used to train an acoustic model intended exclusively for recognizing that speaker's speech. Features obtained in this way cannot be used to train an acoustic model for recognizing the speech of arbitrary speakers. In addition, the proposed way of training the bottleneck neural network does not yield high-quality low-dimensional features.

Thus, the known methods do not provide acoustic features and acoustic models of sufficiently high quality for subsequent speech recognition under conditions of acoustic variability across different speakers. In addition, the possibility of obtaining multilingual acoustic features and/or a multilingual acoustic model that is highly resistant to input data distortions and of sufficiently high quality for subsequent speech recognition has not been sufficiently developed.

In view of the disadvantages of the known methods for obtaining acoustic features and/or acoustic models, the technical problem addressed by the present invention is to provide a method for obtaining high-level acoustic features that can be used to train an acoustic model characterized by low sensitivity to the acoustic variability of the speech signal and providing high speech recognition accuracy.

SUMMARY OF THE INVENTION

The stated problem is solved in that, according to the proposed method for obtaining small (low-dimensional) high-level acoustic speech features, low-level speech features and the corresponding speaker information are provided; a neural network is then trained using the low-level speech features, after which the neural network is retrained using the low-level speech features supplemented by the speaker information. Next, a small (bottleneck) layer is introduced into the neural network, and the neural network with the bottleneck layer is retrained using the low-level speech features supplemented by the speaker information; the low-dimensional high-level acoustic speech features are then extracted from the output of the bottleneck layer of the neural network.

The proposed method achieves the technical result of increasing the informativeness of the high-level acoustic features, which in turn improves the accuracy of systems that recognize the speech of different (arbitrary) speakers under conditions of acoustic variability.

According to the proposed method, in addition to the low-level speech features, additional speaker information is used (for example, in the form of i-vectors) that captures information about the speaker and/or the channel and/or the environment. This makes it possible to obtain so-called speaker-dependent acoustic features that support recognition of the speech of different (arbitrary) speakers under various conditions. The proposed method is implemented with neural networks, which improves the quality of the resulting acoustic features. The method uses a neural network with a bottleneck, i.e. a small layer is introduced into the neural network, which reduces the dimensionality of the input data. After the neural network has been trained, the outputs of this layer are low-dimensional high-level features that are not only resistant to distortion of the input acoustic features but also accumulate information about the speaker and/or the channel and/or the environment. It is worth noting that the quality of the neural network training directly affects the quality of the resulting speaker-dependent low-dimensional high-level acoustic features.

In the proposed method, the neural network is first trained using only low-level speech features, and then using low-level speech features supplemented by speaker information, in both cases without the bottleneck layer. This brings the weights of the remaining layers to values sufficiently close to optimal, which improves the quality of training and facilitates retraining of the network after the bottleneck layer is introduced. Retraining the neural network using low-level speech features supplemented by speaker information then compensates for the changes in the weight matrix of the last layer caused by inserting the bottleneck layer, which improves the quality of training and, as a result, the quality of the acoustic features obtained after training. Using the speaker-dependent low-dimensional high-level acoustic features obtained by the proposed method to train a neural network for speech recognition yields significant gains in recognition accuracy.

According to a particular embodiment, after the neural network has been trained using low-level speech features, its input layer is expanded by appending zero columns to the weight matrix of the layer. The expansion is necessary to enable training with low-level speech features supplemented by speaker information; otherwise the input vector, consisting of the low-level speech features and the corresponding speaker information, would be larger than the input layer of the neural network can accept. In addition, because the appended columns are zero, the expanded network initially behaves exactly as the network trained on low-level features alone, which improves the quality of training.
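The effect of appending zero columns can be checked numerically. The following is a minimal sketch, not the patented implementation: it assumes a first layer that computes W·x + b, and the sizes (23 features, a 50-dimensional i-vector, 8 hidden units) and all variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; hidden_dim is kept tiny for readability.
feat_dim, ivec_dim, hidden_dim = 23, 50, 8

# First-layer parameters after training on low-level features only
# (random stand-ins here, since no trained model is available).
W = rng.standard_normal((hidden_dim, feat_dim))
b = rng.standard_normal(hidden_dim)

# Expand the input layer: one zero column per i-vector component.
W_expanded = np.hstack([W, np.zeros((hidden_dim, ivec_dim))])

x = rng.standard_normal(feat_dim)      # low-level features of one frame
ivec = rng.standard_normal(ivec_dim)   # i-vector of the phonogram

before = W @ x + b
after = W_expanded @ np.concatenate([x, ivec]) + b

# The zero columns cancel the i-vector, so the layer output is unchanged.
same_output = np.allclose(before, after)
print(same_output)
```

Because the new weights start at zero rather than at random values, subsequent training can move them away from zero gradually while starting from the already trained behavior.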

According to a particular embodiment, the low-level speech features take the form of mel-frequency cepstral coefficients or logarithms of the energy in mel-frequency bands. Representing the low-level speech features in these forms makes it possible to obtain high-quality high-level acoustic features.

According to a particular embodiment, the speaker information takes the form of a low-dimensional i-vector. An i-vector is a low-dimensional vector (on the order of 100 elements) that encodes the deviation of the distribution of a phonogram's acoustic features from the distribution estimated over the entire training set, and that accumulates information about the speaker and, to some extent, about the channel and the acoustic environment. Thus, using a low-dimensional i-vector together with low-level speech features increases the accuracy of neural network training and, as a result, the quality of the high-level acoustic features obtained from it.

According to a particular embodiment, the neural network is trained using low-level speech features according to the minimum cross-entropy criterion. Cross-entropy shows how well the probability distribution at the output of the neural network matches the senone actually observed in a given frame. Thus, the use of this criterion increases the accuracy of neural network training.
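As an illustration of the criterion, the frame-level cross-entropy with one observed senone per frame can be sketched as follows. The function name and the toy posteriors are assumptions made for this example; they are not taken from the application.

```python
import numpy as np

def frame_cross_entropy(posteriors, senone_ids):
    """Mean cross-entropy between network posteriors and the observed senones.

    posteriors: (n_frames, n_senones) rows of probabilities summing to 1.
    senone_ids: (n_frames,) index of the senone observed in each frame.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    senone_ids = np.asarray(senone_ids)
    # Probability the network assigns to the senone that was actually observed.
    picked = posteriors[np.arange(len(senone_ids)), senone_ids]
    return float(-np.mean(np.log(picked)))

targets = [0, 2]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]]

ce_confident = frame_cross_entropy(confident, targets)
ce_uncertain = frame_cross_entropy(uncertain, targets)
print(ce_confident < ce_uncertain)
```

The closer the posterior of the observed senone is to one, the smaller the value, which is the behavior the criterion rewards.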

According to a particular embodiment, the neural network is retrained using low-level speech features supplemented by speaker information according to the criterion of minimizing the sum of the cross-entropy and an additional regularization term. The additional regularization term prevents the weights from deviating strongly from the previously trained ones, which increases the quality (accuracy) of neural network training.
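The text does not state the exact form of the regularization term. One common choice, assumed here purely for illustration, is a squared L2 distance between the current weights and the previously trained ones, scaled by a hypothetical coefficient `lam`:

```python
import numpy as np

def regularized_loss(cross_entropy, weights, ref_weights, lam):
    """Cross-entropy plus an ASSUMED squared-L2 penalty on weight drift.

    weights / ref_weights: lists of arrays (current and previously trained).
    lam: hypothetical regularization coefficient, not given in the source.
    """
    penalty = sum(np.sum((w - w0) ** 2) for w, w0 in zip(weights, ref_weights))
    return cross_entropy + lam * penalty

w_ref = [np.ones((2, 2))]
w_new = [np.ones((2, 2)) + 0.5]   # every weight moved by 0.5

loss_same = regularized_loss(1.25, w_ref, w_ref, lam=0.1)
loss_moved = regularized_loss(1.25, w_new, w_ref, lam=0.1)
print(loss_same, loss_moved)
```

With this form, weights that stay at their previous values incur no penalty, and the penalty grows with the distance from them, so minimizing the sum trades classification accuracy against drift from the earlier solution.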

According to a particular embodiment, the neural network that has been trained using low-level speech features supplemented by speaker information, by minimizing the sum of the cross-entropy and an additional regularization term, is then retrained using a sequence-discriminative criterion. This criterion improves recognition accuracy.

According to a particular embodiment, the bottleneck layer is introduced by low-rank factorization of the weight matrix of the last hidden layer, in particular by singular value decomposition. Singular value decomposition makes it possible to reduce the rank of the weight matrix of the last hidden layer of the neural network by discarding the smallest singular values, thereby introducing a small layer (a small linear layer) into the neural network.
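A minimal sketch of this factorization with NumPy is given below. The matrix sizes, the bottleneck width `k` and the variable names are illustrative assumptions, and a random matrix stands in for trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, k = 64, 200, 16   # illustrative sizes; k is the bottleneck width

W = rng.standard_normal((out_dim, in_dim))   # weight matrix to be factorized
h = rng.standard_normal(in_dim)              # activations entering that layer

# Thin SVD: W = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep the k largest singular values and discard the rest.
W_bottleneck = Vt[:k, :]       # (k, in_dim): new linear bottleneck layer
W_after = U[:, :k] * s[:k]     # (out_dim, k): layer that follows the bottleneck

bottleneck_features = W_bottleneck @ h   # k-dimensional output of the small layer
approx = W_after @ bottleneck_features   # approximates W @ h

# Squared Frobenius error of the rank-k approximation equals the sum of the
# squared discarded singular values.
err = np.linalg.norm(W - W_after @ W_bottleneck) ** 2
print(bottleneck_features.shape, np.isclose(err, np.sum(s[k:] ** 2)))
```

One matrix is thus replaced by two smaller ones with a linear k-dimensional layer between them. The approximation is not exact, which is why the network is retrained after the split.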

According to a particular embodiment, after retraining of the neural network with the bottleneck layer is complete, the layers located after the bottleneck layer are removed. Removing all layers after the bottleneck layer makes it possible to treat the trained neural network as an extractor of low-dimensional high-level features.

According to a particular embodiment, low-level speech features of at least two different languages and the corresponding speaker information are fed to the input of the neural network, and multilingual low-dimensional high-level acoustic speech features are extracted from the output of the bottleneck layer. After the neural network has been trained by the method described above using various languages from the training set, the bottleneck layer contains high-level features that apply to all languages of the training set at once. Acoustic features obtained in this way are highly informative and can increase the robustness of speech recognition systems to a change of input language.

According to a particular embodiment, the number of output layers of the neural network is equal to the number of languages; the weights of each output layer are adjusted using only the data of the corresponding language, while the weights of all hidden layers are adjusted using the data of all of said at least two languages. The proposed architecture makes multilingual training of the neural network possible.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is explained in more detail by non-limiting examples of its implementation with reference to the accompanying drawings, in which:

FIG. 1 is the architecture of the neural network being trained, without a bottleneck layer, according to one embodiment of the invention;

FIG. 2 is the architecture of the neural network being trained, with a bottleneck layer, according to one embodiment of the invention;

FIG. 3 is a diagram of the training of a neural network for speech recognition according to one embodiment of the invention.

DETAILED DESCRIPTION OF THE INVENTION

One of the most difficult tasks in automatic speech recognition is recognizing the spontaneous conversational speech of different (arbitrary) speakers. The difficulty stems from the characteristics of such speech: high channel and speaker variability, additive and non-linear distortions, accented and emotional speech, diverse manners of pronunciation, variable speech rate, reduction, and drawn-out articulation. One way to improve the recognition of spontaneous speech is to reduce the sensitivity of the recognition system to the acoustic variability of the speech signal. This can be done by adapting acoustic models based on deep neural networks using speaker information that captures information about the speaker and/or the channel and/or the environment.

The method for obtaining low-dimensional high-level acoustic speech features proposed in the various embodiments yields acoustic features that can be used for adaptive training of an acoustic model characterized by low sensitivity to the acoustic variability of the speech signal and providing high speech recognition accuracy.

A detailed sequence of operations of the method for obtaining speaker-dependent low-dimensional high-level speech features, according to one embodiment of the invention, is disclosed below.

In the present description, the term "retraining" refers to training that starts from the parameters obtained during previous training.

The method for obtaining low-dimensional high-level acoustic speech features in accordance with the present invention can be carried out using, for example, known computer or multiprocessor systems. In other embodiments, the claimed method can be implemented using specialized hardware and software.

To obtain speaker-dependent high-level features, a deep feed-forward neural network is used. In other implementations, other suitable architectures can be used, for example convolutional neural networks, time-delay neural networks, etc. The basic deep feed-forward neural network is initialized with random weights, after which a training example is fed to its input and the network's activations are computed. An error is then formed, that is, the difference between what should appear at the output layer and what the network actually produced. The weights are then adjusted so as to reduce this error.
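The cycle of random initialization, forward computation, error formation and weight adjustment can be sketched on a toy problem. For brevity the sketch below uses a single softmax layer as a stand-in for the deep network, random data, and a hypothetical learning rate; none of these values come from the application.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 5))        # 32 toy training examples, 5 inputs each
labels = rng.integers(0, 3, size=32)    # 3 toy classes
Y = np.eye(3)[labels]                   # desired outputs (one-hot)

W = 0.01 * rng.standard_normal((5, 3))  # random initial weights
lr = 0.5                                # hypothetical learning rate
losses = []
for _ in range(50):
    P = softmax(X @ W)                                   # network output
    losses.append(float(-np.mean(np.log(P[np.arange(32), labels]))))
    error = P - Y                                        # actual minus desired output
    W -= lr * X.T @ error / len(X)                       # adjust weights to reduce the error

print(losses[0] > losses[-1])
```

In a deep network the same error signal is propagated backward through the hidden layers, but the principle of adjusting weights against the error gradient is the same.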

FIG. 1 depicts a deep feed-forward neural network without a bottleneck (without a small layer). The network contains an input layer 1, to which the low-level speech features and the i-vector are fed. The network also contains several hidden layers 2, which process the features received from the input layer, and an output layer 3, which outputs the result. Each layer contains neurons that receive information, perform computations and pass the result on. Neurons are joined by connections (synapses), each of which has a parameter, the weight, that modifies the information as it passes from one neuron to another; the weights of the neural network together form a weight matrix. During training the weights change in response to the information arriving at the neurons. Training is first carried out on the deep neural network without a bottleneck (without a small layer); once the neural network has been trained to the required degree, a small layer 2A is added (FIG. 2).

To obtain speaker-dependent high-level features, a deep feed-forward neural network trained to classify speech units is used. For each short segment of speech (a frame; frames usually follow one another at a rate of 100 Hz), the classification makes it possible to estimate which pronounced speech "sound" most likely generated the observed vector of acoustic features. Speech units can be understood as phonemes. In the present description, the term "phoneme" means the minimal unit of the sound system of a language that has no independent lexical or grammatical meaning. For example, according to various phonological schools, the Russian language contains from 39 to 43 phonemes. Speech units can also be understood as allophones or their parts. In the present description, the term "allophone" refers to a specific realization of a phoneme in speech, determined by its phonetic environment. An allophone that takes into account one phoneme before and one phoneme after the given one is called a triphone. As a rule, phonemes or triphones are modeled by a hidden Markov model with 1 to 3 states (state 1: onset of the sound, transition from the previous one; state 2: stable part; state 3: end of the sound, transition to the next one), and some triphone states are "tied" together to provide enough data for training rare triphones. Such tied states are called "senones", and it is to them that the outputs of the neural network correspond, i.e. the neural network classifies speech feature vectors into senone classes and estimates the probability of each senone given the observed feature vector.

Experiments showed that, in one embodiment, the optimal configuration of the deep neural network has 6 hidden layers of 1536 neurons each with sigmoid activations and an output softmax layer with 13000 neurons corresponding to the senones of a Gaussian-mixture-based acoustic model. The optimal configuration depends on the amount of training data.
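To give a sense of scale, the number of trainable parameters in such a configuration can be estimated. The sketch below assumes fully connected layers with biases and an input of 23 + 50 = 73 values (one frame of log mel energies plus an i-vector, with no splicing of neighboring frames); the input size is an assumption for illustration only, and the helper names are hypothetical.

```python
def layer_sizes(input_dim, hidden_dim=1536, n_hidden=6, n_senones=13000):
    """Layer widths from input to output for the configuration in the text."""
    return [input_dim] + [hidden_dim] * n_hidden + [n_senones]

def count_parameters(sizes):
    """Weights plus biases of fully connected layers between consecutive sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

sizes = layer_sizes(input_dim=23 + 50)   # ASSUMED input size, see lead-in
total = count_parameters(sizes)
output_share = (1536 * 13000 + 13000) / total

print(total, round(output_share, 3))
```

Under these assumptions most of the parameters sit in the output layer, because the number of senones is much larger than the hidden layer width.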

The training set is formed from phonograms (recordings) of various speakers. Phonograms can be obtained by any known method, for example by recording telephone conversations. In this embodiment, the speakers speak the same language. For each phonogram in the training set, low-level acoustic features are pre-computed (mel-frequency cepstral coefficients, for example of dimension 12, or logarithms of the energy in mel-frequency bands, for example of dimension 23). Low-level acoustic features are features extracted directly from the speech signal or its spectrum by digital signal processing methods. They carry important information about the signal but are difficult to interpret in terms of classifying speech units. In other embodiments, other low-level acoustic features, such as perceptual linear prediction (PLP) coefficients, the output energies of a gammatone filter bank (GTFB), etc., can be fed to the input of the neural network. These low-level features differ little in informativeness and can be used either individually or in combination without impairing the quality of neural network training.

In addition, a low-dimensional representation of the speaker information contained in each phonogram is extracted from it, in particular an i-vector, for example of dimension 50. The i-vectors are extracted, for example, using a universal background model (UBM) trained in advance. The i-vector accumulates speaker information and, in some embodiments, is a low-dimensional vector encoding the deviation of the distribution of the phonogram's acoustic features from the distribution estimated over the entire training set. In other implementations, where somewhat less accurate neural network training is acceptable, the speaker information can be extracted in the form of feature-space maximum likelihood linear regression (fMLLR) coefficients.

In the first stage, the deep neural network is trained to predict the probabilities of the senone states corresponding to an individual speech frame, using only the low-level acoustic features and the minimum cross-entropy criterion.

Cross-entropy shows how well the probability distribution at the output of the neural network matches the senone actually observed in a given frame. The closer the probability of that senone is to one and the probabilities of the remaining senones are to zero, the lower the cross-entropy in this frame. Thus, cross-entropy is a measure of the average accuracy of classifying individual speech frames over the entire training set: the smaller it is, the more accurately the neural network predicts senones. In other words, minimizing cross-entropy is equivalent to lowering the average frame-level classification error.

After training by the minimum cross-entropy criterion has converged, the original low-level acoustic features, supplemented by an i-vector, are fed to the input of the deep neural network. Beforehand, the input layer of the deep neural network is expanded by the dimension of the additional features by appending zeros to the layer's weight matrix; this preserves the behavior of the network, because the components of the i-vector are multiplied by zeros. Thus, in each frame the input vector consists of two parts: the first part (the low-level acoustic features) differs from frame to frame, while the second (the i-vector) is the same for all vectors of the same phonogram.

Each speaker's voice is characterized by a set of traits that allow it to be perceived as the voice of that particular speaker. These traits can be interpreted as coordinates in a space, so each voice can be regarded as a point in a voice space; if two voices are close in some parameters, the corresponding points, and hence the corresponding i-vectors, will also be close in the voice space. Thus, extending the input feature vectors with an i-vector that characterizes the "location of the speaker's voice in the voice space" makes it possible to recognize the speech of different (arbitrary) speakers. This is because the training set usually contains many speakers, so the network learns to use information about which region of the voice space the input i-vector comes from. During recognition of an arbitrary speaker, that speaker's i-vector will fall in a region of the space populated by the i-vectors of speakers from the training set, so the neural network can take this information into account with maximum efficiency; in other words, the neural network will already "know" how this information should be processed.
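The structure of the input (a part that varies per frame plus a part that is constant per phonogram) can be sketched as follows. The helper name `append_ivector`, the frame count and the random data are assumptions made for illustration; the dimensions 23 and 50 follow the examples given in the description.

```python
import numpy as np

def append_ivector(frames, ivector):
    """Append the same phonogram-level i-vector to every frame's feature vector.

    frames:  (n_frames, feat_dim) low-level features, one row per frame.
    ivector: (ivec_dim,) i-vector extracted from the whole phonogram.
    """
    frames = np.asarray(frames, dtype=float)
    tiled = np.tile(np.asarray(ivector, dtype=float), (frames.shape[0], 1))
    return np.hstack([frames, tiled])

rng = np.random.default_rng(0)
frames = rng.standard_normal((300, 23))   # e.g. 3 s of speech at 100 frames per second
ivector = rng.standard_normal(50)

inputs = append_ivector(frames, ivector)
print(inputs.shape)
```

Each row of `inputs` is one network input: the first 23 values change from frame to frame, and the last 50 are identical across all rows of the phonogram.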

The deep neural network trained using only low-level acoustic features is then retrained by minimizing the sum of the cross-entropy and an additional regularization term; summing the two terms allows both to be reduced simultaneously. The regularization term controls the deviation of the weights of the network being retrained from the weights of the network trained using only low-level acoustic features, which prevents the weights from changing strongly relative to the good initial approximation.

It is important to note that minimizing cross-entropy is equivalent to lowering the average frame-level classification error (frame error rate, FER), whereas the goal of speech recognition is not to classify individual frames, as under the minimum cross-entropy criterion, but to obtain the sequence of spoken words. The measure of a recognition system's error is the word error rate (WER). Of course, the word error rate and the frame error rate are strongly correlated, and reducing the frame error rate to zero almost inevitably leads to perfectly accurate recognition (provided that a high-quality lexicon and language model are used). In practice, however, a zero frame error rate is unattainable. Using the word error rate directly as a training criterion is extremely difficult: it is not differentiable with respect to the network parameters and is hard to compute during training. For this reason, other training criteria are used, in particular sequence-discriminative criteria, which aim indirectly at reducing the word error rate but are computationally more tractable. These criteria consider the best hypothesis for the recognized word sequence in the decoder and adjust the parameters of the neural network so as to bring it closer to the true word sequence and keep it as far as possible from all "competing" hypotheses. The state-level minimum Bayes risk (sMBR) criterion is only one of a number of well-known criteria of this class. It achieves accuracy comparable to that of other similar criteria but is computationally simpler. Thus, after the deep neural network has been retrained by minimizing the sum of the cross-entropy and an additional regularization term, it is retrained by the sMBR criterion, which gives a significant increase in the accuracy of neural network training.
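For reference, the word error rate is conventionally computed as the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The following sketch uses that standard definition; the function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,               # deletion of a reference word
                          curr[j - 1] + 1,           # insertion of a hypothesis word
                          prev[j - 1] + (r != h))    # substitution or match
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(wer)
```

The result is a ratio of integer edit counts, so small changes in the network weights usually leave it unchanged. This is the practical reason it cannot serve directly as a gradient-based training criterion.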

After training has converged, the weight matrix of the last hidden layer of the trained network is subjected to singular value decomposition, and its rank is reduced by discarding the smallest singular values. As a result of this operation, the last layer of the original network is replaced by two layers, one of which is linear and contains fewer neurons than the input layer. This layer is called the bottleneck, or small layer. Part of the information is irreversibly lost when passing through the bottleneck layer, but its most significant components are preserved. Initial training without a bottleneck layer brings the weights of the remaining layers to values sufficiently close to optimal, which facilitates retraining of the network after the bottleneck layer is introduced; that is, training the network first without a bottleneck layer and then with it allows the parameters (weights) to be tuned through successive improvements. It was found experimentally that training a neural network that has a bottleneck layer from the outset lowers the quality of training and makes it more difficult.

As a result of the previous training, the outputs of the deep neural network provide good probability distributions over senones, already tuned according to the sequence-discriminative criterion. Because the singular value decomposition has changed the weight matrix of the last layer, the resulting deep neural network is no longer optimal with respect to the criterion of the previous training stage. Therefore, the deep neural network, now with a bottleneck layer, is retrained once more, using the distributions from the previous training as target distributions. In this case, the neural network is retrained to convergence by the minimum cross-entropy criterion already used earlier, which improves the quality of the low-dimensional high-level features extracted from the bottleneck layer. The high level of the features is due to the fact that a deep neural network with a bottleneck layer trained by the minimum cross-entropy criterion can achieve cross-entropy values almost as low as those of a deep neural network without a bottleneck layer trained by the same criterion. Thus, the features extracted from the outputs of the bottleneck layer contain all the essential information from the speech signal that is contained in the original low-level acoustic features and the i-vector.
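Using the previous network's output distributions as targets amounts to cross-entropy with soft targets rather than one-hot labels. A minimal sketch follows; the labels "teacher" (network before the bottleneck was inserted) and "student" (network with the bottleneck) and the toy distributions are introduced here only for illustration.

```python
import numpy as np

def soft_target_cross_entropy(student_post, teacher_post, eps=1e-12):
    """Mean cross-entropy of student posteriors against teacher distributions."""
    student_post = np.asarray(student_post, dtype=float)
    teacher_post = np.asarray(teacher_post, dtype=float)
    # eps guards against log(0); it is an implementation detail of this sketch.
    return float(-np.mean(np.sum(teacher_post * np.log(student_post + eps), axis=1)))

teacher = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.1, 0.8]])     # distributions from the previous stage
matching = teacher.copy()                 # student reproduces the teacher exactly
drifted = np.array([[0.4, 0.4, 0.2],
                    [0.3, 0.3, 0.4]])     # student after the weights were perturbed

loss_matching = soft_target_cross_entropy(matching, teacher)
loss_drifted = soft_target_cross_entropy(drifted, teacher)
print(loss_matching < loss_drifted)
```

The loss is smallest when the student's distributions coincide with the teacher's, so minimizing it pulls the network with the bottleneck back toward the behavior it had before the factorization.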

In addition, after the deep neural network has been trained to convergence, the layers of the neural network located after the small-sized layer can be removed, which turns the trained deep neural network into an "extractor" of new speaker-dependent small-sized high-level features. That is, when a vector of low-level features extended (supplemented) by an i-vector is fed to the input of the neural network, as described previously, the activation values of the small-sized layer (bottleneck layer) are obtained at the output; these values constitute a small-sized, speaker-dependent, high-level representation.
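A minimal sketch of such an extractor is given below, assuming a plain feed-forward network stored as a list of (weights, bias) pairs. The sigmoid activation of the ordinary hidden layers, the layer sizes, and the names `make_extractor` and `extract` are illustrative assumptions; only the linearity of the small-sized layer is taken from the description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_extractor(layers, bottleneck_index):
    """Drop every layer after the bottleneck and return a feature extractor."""
    kept = layers[: bottleneck_index + 1]

    def extract(low_level, ivector):
        # Input is the low-level feature vector supplemented by the i-vector.
        x = np.concatenate([low_level, ivector])
        for i, (w, b) in enumerate(kept):
            x = w @ x + b
            if i < bottleneck_index:  # the bottleneck layer itself stays linear
                x = sigmoid(x)
        return x

    return extract

rng = np.random.default_rng(0)
# Toy sizes (assumed): 40 low-level features + 50-dim i-vector at the input,
# a 10-unit bottleneck as the third layer, then two layers to be discarded.
dims = [90, 64, 64, 10, 64, 30]
layers = [(rng.standard_normal((o, i)) * 0.1, np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]

extract = make_extractor(layers, bottleneck_index=2)
features = extract(rng.standard_normal(40), rng.standard_normal(50))
```

The layers after index 2 are never evaluated, so the extractor costs less per frame than the full network.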

The proposed method can be applied to obtain multilingual speaker-dependent small-sized high-level acoustic features of speech. To this end, low-level speech features of at least two different languages and the corresponding speaker information (i-vector) are supplied to the input of the neural network, with data from the different languages fed to the input in random order. In this case, the architecture of the neural network should be designed for multitask training; that is, the neural network must have several hidden layers, whose weights are shared across the training data in all languages (low-level speech features and speaker information), and several output layers, each of which processes data in one of the at least two languages. For example, when training on two languages, if data belonging to the first language is supplied to the input of the neural network, then after passing through the hidden layers the data goes to the first output layer, which relates directly to the first language; there the error is calculated, and backpropagation of this error adjusts the weights of the hidden layers shared by the two languages. Likewise, if data belonging to the second language is supplied to the input of the neural network, it passes in the same way to the corresponding second output layer, where the error is also calculated and used to adjust the weights of the hidden layers shared by the two languages. Thus, the neural network is trained on data in all available languages.
The training process for the neural network is otherwise similar to that described above for one language. Upon completion of training, multilingual speaker-dependent small-sized features are extracted from the output of the small-sized layer; these are high-level features that contain information related to all languages of the training set and, as a result, are robust to language changes in speech recognition. Moreover, training one multilingual acoustic model of a neural network may require less computation than training several monolingual acoustic models, one for each language. In addition, if the data for a particular language is limited, because the corresponding training data is unavailable or expensive to obtain, a multilingual acoustic model can offer better accuracy than monolingual acoustic models obtained using the limited data of the corresponding language.
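The routing of each training example to its language-specific output layer can be sketched with a toy network that has one shared hidden layer and two output layers. Everything concrete here is an assumption for illustration: the layer sizes, the tanh activation, plain stochastic gradient descent, and the names `params` and `train_step`.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, HID_DIM = 6, 5
NUM_CLASSES = {"lang_a": 4, "lang_b": 3}  # hypothetical per-language target counts

params = {
    "shared": rng.standard_normal((HID_DIM, IN_DIM)) * 0.1,  # common hidden layer
    "heads": {lang: rng.standard_normal((n, HID_DIM)) * 0.1
              for lang, n in NUM_CLASSES.items()},           # one output layer per language
}

def train_step(x, target, lang, lr=0.1):
    """One SGD step: the error is computed at the output layer of `lang` only."""
    head = params["heads"][lang]
    h = np.tanh(params["shared"] @ x)
    logits = head @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    d_logits = probs.copy()
    d_logits[target] -= 1.0                  # gradient of cross-entropy w.r.t. logits
    d_pre = (head.T @ d_logits) * (1.0 - h ** 2)
    head -= lr * np.outer(d_logits, h)       # in-place update of this language's head
    params["shared"] -= lr * np.outer(d_pre, x)  # shared weights see every language
    return -np.log(probs[target])

# Examples from both languages are presented in random order.
examples = [(rng.standard_normal(IN_DIM), int(rng.integers(NUM_CLASSES[lang])), lang)
            for lang in ("lang_a", "lang_b") for _ in range(20)]
order = rng.permutation(len(examples))
losses = [train_step(*examples[i]) for i in order]
```

A step on a `lang_a` example leaves the `lang_b` output layer untouched while still changing the shared hidden weights, which is the behavior the description attributes to multitask training.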

It was found experimentally that the proposed procedure for training a deep neural network is the most suitable for producing speaker-dependent small-sized high-level features that are highly informative and allow the acoustic model to be adapted to the acoustic variability of the speech signal, resulting in high speech recognition accuracy for such a model. The high-level features extracted from the output of the small-sized layer of the trained neural network can subsequently be used to train another neural network for speech recognition.

FIG. 3 shows the training of another neural network B for speech recognition, designated as block B (the left side of the diagram), whose input layer 4 receives high-level features from the small-sized layer 2a of neural network A, which has been trained by the proposed method and is designated as block A (the left side of the diagram). The input of neural network B receives a vector that is the concatenation of the feature vectors from the current frame (delay 0) and from the frames located 5, 10, and 15 frames before and 5, 10, and 15 frames after the current frame. As a result, with small-sized features of dimension 100, for example, a vector of dimension 700 arrives at the input of the second network B. Neural network B, which is trained for speech recognition, contains an input layer 4 that receives this vector, hidden layers 5, the number of which is selected experimentally, and an output layer 6, which is the output of neural network B.
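The construction of the input vector for network B can be sketched as below. The function name `splice` is hypothetical, and the handling of utterance boundaries (repeating the first or last frame) is an assumption, since the description does not say what is done when a required frame lies outside the utterance.

```python
def splice(frames, t, offsets=(-15, -10, -5, 0, 5, 10, 15)):
    """Concatenate the feature vectors at frame t and at the given offsets.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Frames outside the utterance are replaced by the nearest edge frame
    (an assumption made for this sketch).
    """
    last = len(frames) - 1
    spliced = []
    for offset in offsets:
        index = min(max(t + offset, 0), last)
        spliced.extend(frames[index])
    return spliced

# Toy utterance: 40 frames of 100-dimensional bottleneck features, where
# every component of frame i equals i so the order is easy to inspect.
utterance = [[float(i)] * 100 for i in range(40)]
vector = splice(utterance, t=20)
```

With seven offsets and 100-dimensional features, the spliced vector has 7 × 100 = 700 components, matching the dimension given in the description.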

Table 1 compares the word error rate (WER) of deep neural networks trained on the speaker-dependent small-sized high-level features obtained by the proposed method (speaker-dependent bottleneck features, Deep Neural Network: SDBN-DNN) with that of deep neural networks trained by a speaker-adaptive method using i-vectors (Deep Neural Network with i-vector: DNN-ivec). The table shows that the use of SDBN features reduces the recognition error. In addition, training a deep neural network by the state-level minimum Bayes risk (sMBR) criterion gives a lower recognition error than training a deep neural network by the minimum cross-entropy (CE) criterion alone.

Table 1 - Speech Recognition Results.

The present invention is not limited to the specific embodiments disclosed in the description for illustrative purposes, and covers all possible modifications and alternatives included in the scope of the present invention defined by the claims.

Claims

1. A method for obtaining small-sized high-level acoustic features of speech, according to which:

low-level speech features and the corresponding speaker information are provided;

a neural network is trained using the low-level speech features;

the neural network is further trained using the low-level speech features supplemented by the speaker information;

a small-sized layer is introduced into the neural network;

the neural network with the small-sized layer is further trained using the low-level speech features supplemented by the speaker information;

small-sized high-level acoustic features of speech are extracted from the output of the small-sized layer of the neural network.

2. The method according to claim 1, according to which, after the neural network has been trained using the low-level speech features, its input layer is expanded by supplementing the weight matrix of the input layer with zero columns.

3. The method according to any one of claims 1-2, according to which the low-level speech features are mel-frequency cepstral coefficients.

4. The method according to any one of claims 1-2, according to which the low-level speech features are logarithms of the energy in mel-frequency bands.

5. The method according to any one of claims 1-4, according to which the speaker information has the form of a small-sized i-vector.

6. The method according to any one of claims 1-5, according to which the training of the neural network using the low-level speech features is carried out according to the minimum cross-entropy criterion.

7. The method according to any one of claims 1-6, according to which the neural network is further trained using the low-level speech features supplemented by the speaker information according to the criterion of minimizing the sum of the cross-entropy and an additional regularizing term.

8. The method according to claim 7, according to which the neural network is trained using the low-level speech features supplemented by the speaker information according to a sequence-discriminative criterion.

9. The method according to any one of claims 1-8, according to which the small-sized layer is introduced by low-rank factorization of the weight matrix of the last hidden layer.

10. The method according to claim 9, according to which the low-rank factorization of the weight matrix of the last hidden layer is performed by singular value decomposition.

11. The method according to any one of claims 1-10, according to which, after the training of the neural network with the small-sized layer has been completed, the layers of the neural network located after the small-sized layer are removed.

12. The method according to any one of claims 1-11, according to which low-level speech features of at least two different languages and the corresponding speaker information are supplied to the input of the neural network, and multilingual small-sized high-level acoustic features of speech are extracted from the output of the small-sized layer of the neural network.

13. The method according to claim 12, according to which the number of output layers of the neural network is equal to the number of languages, the weights of each output layer are adjusted only on the data of the corresponding language, and the weights of all hidden layers are adjusted on the data of all of said at least two languages.