WO2019212375A1 - Method for obtaining small-sized, speaker-dependent acoustic features for speech recognition - Google Patents

Method for obtaining small-sized, speaker-dependent acoustic features for speech recognition

Info

Publication number
WO2019212375A1
WO2019212375A1 (application PCT/RU2018/000286, RU2018000286W)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
level
speech
low
layer
Prior art date
Application number
PCT/RU2018/000286
Other languages
English (en)
Russian (ru)
Inventor
Алексей Александрович ПРУДНИКОВ
Максим Львович КОРЕНЕВСКИЙ
Иван Павлович МЕДЕННИКОВ
Original Assignee
Общество с ограниченной ответственностью "Центр речевых технологий"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Общество с ограниченной ответственностью "Центр речевых технологий" filed Critical Общество с ограниченной ответственностью "Центр речевых технологий"
Priority to EA202092400A priority Critical patent/EA202092400A1/ru
Priority to PCT/RU2018/000286 priority patent/WO2019212375A1/fr
Publication of WO2019212375A1 publication Critical patent/WO2019212375A1/fr

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches

Definitions

  • The invention relates to the field of speech recognition, in particular to obtaining high-level acoustic features of speech for speech recognition under conditions of acoustic variability.
  • A known method obtains a speaker-adaptive acoustic model based on a neural network using an i-vector (US2015149165).
  • According to that method, an acoustic model based on a deep neural network is provided; audio data including one or more utterances of a speaker is received; a plurality of speech recognition features is extracted from said utterances; a speaker identification vector for this speaker is created from the extracted features; and the deep neural network acoustic model for automatic speech recognition is adapted using the extracted speech recognition features and the speaker identification vector.
  • A known method adapts an acoustic model based on a neural network (US20170169815).
  • In that method, a trained acoustic-model neural network can be adapted to a speaker using speech data corresponding to a variety of utterances made by that speaker.
  • The trained acoustic-model neural network may have an input layer, one or more hidden layers, and an output layer, and may be a deep neural network.
  • The input layer may include a set of input nodes that receive speech features derived from a speech utterance, and another set of input nodes that receive speaker information values derived from the same utterance.
  • Speech features may include values used to capture information about the content of the utterance, including, without limitation, mel-frequency cepstral coefficients (MFCCs) for one or more speech frames, first-order derivatives between MFCCs of consecutive speech frames (delta MFCCs), and second-order derivatives between MFCCs of consecutive frames (delta-delta MFCCs).
  • MFCCs: mel-frequency cepstral coefficients
  • delta MFCCs: first-order derivatives between MFCCs of consecutive speech frames
  • delta-delta MFCCs: second-order derivatives between MFCCs of consecutive frames
  • the speaker information values may include a speaker identification vector (i-vector).
  • An acoustic model with a multilingual deep neural network may have a feedforward neural network with several layers of one or more nodes, where each node of a layer is connected by corresponding weights to each node of the subsequent layer; the layers may include one or more hidden layers of nodes shared across languages and a language-dependent output layer of nodes for each of two or more languages.
  • The disadvantage of this known invention is that the method disclosed therein does not provide a multilingual acoustic model that would be highly resistant to distortion of the input data and would allow speech recognition with high accuracy under conditions of acoustic variability.
  • In another known method, a deep neural network has several hidden layers, a small-sized layer, and an output layer; its first hidden layer includes a first set of nodes processing acoustic features and a second set of nodes processing additional speaker information, so that the input acoustic features are multiplied by a first weight matrix and the additional speaker information by a second weight matrix.
  • The outputs of the small-sized layer are connected to the next layer of the network.
  • The disadvantage of this method is that such training of a neural network with a bottleneck does not yield high-quality small-sized features and, as a result, cannot provide an acoustic model that would allow high-accuracy speech recognition under conditions of acoustic variability.
  • A known method of data augmentation based on stochastic feature transformation for automatic speech recognition (US9721559) trains a speaker-dependent acoustic model of a target speaker for subsequent recognition of that speaker's speech; that is, the model is designed to recognize one specific speaker with the best possible quality.
  • The disadvantage of this method is that the small-sized features are built for a specific speaker and are used to train an acoustic model designed exclusively for recognizing that speaker's speech. Features obtained in this way do not allow training an acoustic model that could be used for speech recognition of arbitrary speakers.
  • Moreover, the training of a neural network with a bottleneck proposed there does not yield high-quality small-sized features.
  • Thus, the known methods do not provide acoustic features and acoustic models of a quality sufficient for subsequent speech recognition under conditions of the acoustic variability of various speakers.
  • Likewise, the possibility of obtaining multilingual acoustic features and/or a multilingual acoustic model that is highly resistant to input data distortions and meets a high level of quality for subsequent speech recognition has not been sufficiently developed.
  • The technical problem addressed by the present invention is to provide a method for obtaining high-level acoustic features that can be used to train an acoustic model characterized by low sensitivity to the acoustic variability of a speech signal and providing high accuracy in speech recognition.
  • The posed problem is solved as follows: according to the proposed method for obtaining small-sized high-level acoustic features of speech, low-level speech features and the corresponding speaker information are provided; a neural network is first trained using the low-level speech features alone, after which the neural network is further trained using the low-level speech features supplemented with the speaker information.
  • A small-sized layer is then introduced into the neural network, and the neural network with the small-sized layer is further trained using the low-level speech features supplemented with the speaker information; finally, small-sized high-level acoustic features of speech are extracted at the output of the small-sized layer of the neural network.
  • The proposed method achieves a technical result in the form of increased informativeness of the high-level acoustic features, which in turn improves the accuracy of systems for recognizing the speech of various (arbitrary) speakers under conditions of acoustic variability.
  • Additional speaker information is used (for example, i-vectors), taking into account information about the speaker, and/or the channel, and/or the surroundings, which makes it possible to obtain so-called speaker-dependent acoustic features that support recognition of the speech of various (arbitrary) speakers under various conditions.
  • The implementation of the proposed method is based on neural networks, which improves the quality of the obtained acoustic features.
  • The proposed method uses a neural network with a bottleneck, i.e. a small-sized layer is introduced into the neural network, which reduces the dimension of the input data.
  • The outputs of this layer are small-sized high-level features that are not only resistant to distortion of the input acoustic features, but also accumulate information about the speaker, and/or the channel, and/or the environment. It is worth noting that the quality of training the neural network directly affects the quality of the resulting speaker-dependent small-sized high-level acoustic features.
  • The initial training of the neural network is performed using only the low-level speech features, and then using the low-level speech features supplemented with speaker information, still without the small-sized layer. This brings the weights of the remaining layers to values close enough to optimal, which improves the quality of training and facilitates retraining of the network after the small-sized layer is introduced.
  • Further training of the neural network using the low-level speech features supplemented with speaker information compensates for changes in the weight matrix of the last layer after the small-sized layer is inserted into the neural network, which improves the quality of training and, as a result, the quality of the acoustic features obtained after training.
  • The use of the speaker-dependent small-sized high-level acoustic features obtained by the proposed method for training a speech recognition neural network yields significant gains in recognition accuracy.
  • Before the network is trained on the supplemented features, its input layer is expanded by padding the layer's weight matrix with zero columns. This expansion is necessary to enable training with the low-level speech features supplemented with speaker information; otherwise the dimension of the input vector, consisting of the low-level speech features and the corresponding speaker information, would be too large for the input layer of the neural network.
  • Expanding the input layer matrix with zero columns after the network has been trained on the low-level speech features preserves the behavior of the network, which improves the quality of subsequent training, as illustrated by the sketch below.
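A minimal PyTorch sketch of the zero-column expansion described above. The function and variable names are illustrative assumptions, not taken from the patent; only the idea (pad the first weight matrix with zero columns so the extended inputs initially contribute nothing) follows the text.

```python
import torch
import torch.nn as nn

def expand_input_layer(layer: nn.Linear, extra_dims: int) -> nn.Linear:
    """Return a new Linear layer accepting `extra_dims` additional inputs."""
    expanded = nn.Linear(layer.in_features + extra_dims, layer.out_features)
    with torch.no_grad():
        # Copy the trained weights for the original low-level feature inputs.
        expanded.weight[:, :layer.in_features] = layer.weight
        # Zero columns for the appended speaker-information inputs, so that
        # multiplying them by the i-vector components changes nothing at first.
        expanded.weight[:, layer.in_features:] = 0.0
        expanded.bias.copy_(layer.bias)
    return expanded
```

Because the new columns are zero, the expanded network computes exactly the same outputs as before on the original features, and training then adjusts those columns gradually.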
  • In some embodiments, the low-level speech features have the form of mel-frequency cepstral coefficients or logarithms of energy in mel-frequency bands.
  • Representing the low-level speech features in the proposed forms makes it possible to obtain high-quality high-level acoustic features.
  • In some embodiments, the speaker information has the form of a small-sized i-vector.
  • The i-vector is a small-sized vector (of the order of 100 elements) that encodes the deviation of the distribution of the acoustic features of a phonogram from the distribution estimated over the entire training sample, and accumulates information about the speaker as well as, to some extent, the channel and the acoustic environment.
  • The use of a small-sized i-vector together with the low-level speech features increases the accuracy of training the neural network and, as a result, the quality of the high-level acoustic features resulting from that training.
  • In some embodiments, training the neural network using the low-level speech features is carried out according to the minimum cross-entropy criterion.
  • Cross-entropy shows how well the probability distribution at the output of the neural network matches the senone actually observed in a given frame.
  • In some embodiments, the neural network is retrained using the low-level speech features supplemented with speaker information, according to the criterion of minimizing the sum of the cross-entropy and an additional regularizing term.
  • The additional regularizing term prevents a strong deviation of the weights from the previously trained ones, which increases the quality (accuracy) of training the neural network.
  • In some embodiments, the neural network trained using the low-level speech features supplemented with speaker information and this regularized criterion is then retrained using a sequence-discriminative criterion, which further improves recognition accuracy. A sketch of the regularized criterion follows.
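The patent does not give the exact form of the regularizing term; an L2 penalty tying the weights to the previously trained model is one common realization, sketched below under that assumption. `ref_params` is a snapshot of the earlier weights, e.g. `[p.detach().clone() for p in model.parameters()]`; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def regularized_loss(model, ref_params, features, senone_targets, rho=1e-3):
    """Cross-entropy plus an L2 penalty toward the previously trained weights."""
    logits = model(features)
    ce = F.cross_entropy(logits, senone_targets)
    # The penalty discourages a strong deviation from the earlier solution,
    # preserving the good initial approximation mentioned in the text.
    penalty = sum(((p - p0) ** 2).sum()
                  for p, p0 in zip(model.parameters(), ref_params))
    return ce + rho * penalty
```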
  • In some embodiments, the small-sized layer is introduced by low-rank factorization of the weight matrix of the last hidden layer, in particular by singular value decomposition (SVD).
  • The singular value decomposition reduces the rank of the weight matrix of the last hidden layer of the neural network by discarding the smallest singular values, thereby introducing a small-sized (small-sized linear) layer into the neural network.
  • In some embodiments, the layers located after the small-sized layer of the neural network are removed. Removing all layers after the small-sized layer makes it possible to treat the trained neural network as an extractor of small-sized high-level features.
  • In some embodiments, low-level speech features of at least two different languages and the corresponding speaker information are supplied to the input of the neural network, and multilingual small-sized high-level acoustic features of speech are extracted at the output of the small-sized layer of the neural network.
  • In this case the small-sized layer contains high-level features that apply to all languages of the training set at once. Acoustic features obtained in this way are highly informative and can increase robustness to a change of input language in speech recognition systems.
  • In such embodiments, the number of output layers of the neural network is equal to the number of languages; the weights of each output layer are adjusted only on the data of the corresponding language, while the weights of all hidden layers are adjusted on the data of all of the indicated at least two languages.
  • The proposed architecture provides the possibility of multilingual training of the neural network.
  • FIG. 1 shows the architecture of a trained neural network without a small-sized layer, according to one embodiment of the invention.
  • FIG. 2 shows the architecture of a trained neural network with a small-sized layer, according to one embodiment of the invention.
  • FIG. 3 is a training diagram of a speech recognition neural network, according to one embodiment of the invention.
  • One of the most difficult tasks in the field of automatic speech recognition is the recognition of spontaneous spoken speech of various (arbitrary) speakers.
  • The complexity of the task is due to the peculiarities of the spontaneous speech of various (arbitrary) speakers: high channel and speaker variability, the presence of additive and non-linear distortions, accented and emotional speech, diverse manners of pronunciation, variability of speech tempo, and reduced or drawn-out articulation.
  • One way to improve the quality of recognition of spontaneous speech is to reduce the sensitivity of the recognition system to the acoustic variability of the speech signal.
  • This can be achieved by adapting acoustic models based on deep neural networks using speaker information that takes into account information about the speaker and/or the channel and/or the environment.
  • The proposed method of obtaining small-sized high-level acoustic features of speech yields acoustic features that can be used for adaptive training of an acoustic model characterized by low sensitivity to the acoustic variability of the speech signal and providing high accuracy in speech recognition.
  • Here, retraining refers to training that starts from the parameter values obtained during a previous training stage.
  • The method of obtaining small-sized high-level acoustic features of speech in accordance with the present invention can be carried out using, for example, known computer or multiprocessor systems.
  • The claimed method can also be implemented using specialized software and hardware.
  • In one embodiment, a deep feedforward neural network is used.
  • Other suitable architectures can also be used to train the neural network, for example convolutional neural networks, time-delay neural networks, etc.
  • The basic deep feedforward neural network is initialized with random weights, after which a training example is fed to its input and the network's activations are computed; an error is then formed, that is, the difference between what should appear on the output layer and what the network actually produced. The weights are then adjusted so as to reduce this error.
  • FIG. 1 depicts a deep feedforward neural network without a small-sized layer (without a bottleneck).
  • The proposed neural network contains an input layer 1, which receives the low-level speech features and the i-vector.
  • The neural network also contains several hidden layers 2, which process the features received from the input layer, and an output layer 3, which outputs the result.
  • Each layer contains neurons that receive information, perform calculations, and pass the result on.
  • During training the neurons' weights change; in other words, the weights of the neurons vary with the information coming into each neuron.
  • Training is first carried out on the deep neural network without a bottleneck (without the small-sized layer); after the neural network has been trained to the required degree, the small-sized layer 2A is added (FIG. 2).
  • A deep feedforward neural network trained to classify speech units is used. On each short-term segment of speech (a frame; frames usually follow at a rate of 100 Hz), the classification estimates which pronounced "sounds" of speech most likely generated the observed vector of acoustic features.
  • Speech units can be understood as phonemes.
  • A phoneme is the minimal unit of the sound system of a language that has no independent lexical or grammatical meaning. For example, according to various phonological schools, the Russian language contains from 39 to 43 phonemes. Speech units can also be understood as allophones or parts of allophones.
  • The term "allophone" refers to a specific realization of a phoneme in speech, determined by its phonetic environment.
  • An allophone that takes into account one phoneme before and one phoneme after the given one is called a triphone.
  • Phonemes or triphones are modeled by a hidden Markov model with three states (state 1: entry into the sound, transition from the previous one; state 2: the stable part; state 3: exit from the sound, transition to the next one), and some triphone states are "tied" together to provide enough data for training rare triphones.
  • Such tied states are called "senones", and it is to them that the outputs of the neural network correspond: the neural network classifies speech feature vectors into senone classes, estimating the probability of each senone given the observed feature vector.
  • In one embodiment, the optimal configuration of the deep neural network comprises 6 hidden layers of 1536 neurons each with sigmoid activation and a softmax output layer with 13000 neurons corresponding to the senones of an acoustic model based on Gaussian mixtures; a sketch of such a network follows.
  • The optimal configuration depends on the amount of training data.
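A minimal PyTorch sketch of the configuration just described. The input dimension depends on the chosen low-level features and i-vector size, so it is left as a parameter; all names and the example sizes in the usage comment are assumptions.

```python
import torch.nn as nn

def build_acoustic_dnn(input_dim: int, hidden: int = 1536,
                       n_layers: int = 6, n_senones: int = 13000) -> nn.Sequential:
    """6 sigmoid hidden layers of 1536 neurons and a 13000-way senone output."""
    layers, dim = [], input_dim
    for _ in range(n_layers):
        layers += [nn.Linear(dim, hidden), nn.Sigmoid()]
        dim = hidden
    layers.append(nn.Linear(dim, n_senones))  # logits; softmax applied in the loss
    return nn.Sequential(*layers)

# Example (assumed sizes): 12 MFCCs per frame plus a 50-dimensional i-vector.
# model = build_acoustic_dnn(input_dim=12 + 50)
```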
  • the training sample is formed from the phonograms of various speakers. Phonograms can be obtained by any known method, for example, by recording telephone conversations. In this embodiment, the speakers speak the same language.
  • From the phonograms, low-level acoustic features are extracted: mel-frequency cepstral coefficients (for example, of dimension 12) or logarithms of energy in mel-frequency bands.
  • Low-level acoustic features are features extracted directly from a speech signal or its spectrum by digital signal processing methods. They carry important information about the signal but are difficult to interpret in terms of classifying speech units.
  • Other low-level acoustic features can also be used, such as perceptual linear prediction (PLP) coefficients, the output energies of a gammatone filter bank (GTFB), etc.
  • PLP: perceptual linear prediction
  • GTFB: gammatone filter bank
  • From each phonogram a small-sized representation of the speaker information contained in it is also extracted; in particular, i-vectors, for example of dimension 50, are extracted.
  • The extraction of i-vectors is carried out, for example, using a Universal Background Model (UBM) trained in advance.
  • UBM: Universal Background Model
  • The i-vector accumulates speaker information; in some embodiments it is a small-sized vector encoding the deviation of the distribution of the acoustic features of the phonogram from the distribution estimated over the entire training sample.
  • In other embodiments, the speaker information has the form of feature-space Maximum Likelihood Linear Regression (fMLLR) coefficients.
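True i-vector extraction trains a total-variability matrix on UBM sufficient statistics, which the patent does not detail. The sketch below is only a crude stand-in under stated assumptions: a diagonal-covariance GMM as the UBM, posterior-weighted mean offsets as the "deviation from the training-sample distribution", and PCA to reduce them to an assumed 50 dimensions. All names and sizes are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

def train_ubm(all_frames: np.ndarray, n_components: int = 64) -> GaussianMixture:
    """Fit a diagonal-covariance GMM on frames pooled over the training sample."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=50, random_state=0)
    ubm.fit(all_frames)
    return ubm

def mean_offset_supervector(ubm: GaussianMixture, frames: np.ndarray) -> np.ndarray:
    """Posterior-weighted deviation of one phonogram's means from the UBM means."""
    post = ubm.predict_proba(frames)            # (T, C) component posteriors
    counts = post.sum(axis=0) + 1e-8            # (C,)  soft occupation counts
    first_order = post.T @ frames               # (C, D) first-order statistics
    offsets = first_order / counts[:, None] - ubm.means_
    return offsets.ravel()                      # (C*D,) supervector offset

# PCA over the offsets of all training phonograms then yields a compact
# per-phonogram vector playing the role of the i-vector in this sketch:
# pca = PCA(n_components=50).fit(np.stack(
#     [mean_offset_supervector(ubm, f) for f in training_phonograms]))
```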
  • A deep neural network is trained to predict the probabilities of the senone states corresponding to an individual speech frame, using only the low-level acoustic features, according to the minimum cross-entropy criterion.
  • Cross-entropy shows how well the probability distribution at the output of the neural network matches the senone actually observed in a given frame: the closer the probability of the observed senone is to one, and the probabilities of the remaining senones to zero, the lower the cross-entropy on this frame.
  • Cross-entropy is thus a measure of the average classification accuracy of individual speech frames over the entire training sample; the smaller it is, the more accurately the neural network predicts senones. In other words, minimizing cross-entropy is equivalent to lowering the average frame-by-frame classification error, as the training loop sketched below illustrates.
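A minimal PyTorch training loop for this frame-level criterion. The data loader and its batch layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer):
    """One epoch of frame-level senone classification with cross-entropy.

    `loader` is assumed to yield (features, senone_ids) minibatches of frames.
    """
    model.train()
    for features, senone_ids in loader:
        optimizer.zero_grad()
        logits = model(features)
        # F.cross_entropy combines log-softmax with negative log-likelihood,
        # i.e. it minimizes the cross-entropy between the predicted senone
        # distribution and the senone actually observed in each frame.
        loss = F.cross_entropy(logits, senone_ids)
        loss.backward()
        optimizer.step()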
  • Next, the initial low-level acoustic features supplemented with speaker information are fed to the input of the deep neural network, the input layer having first been expanded by the dimension of the additional features through adding zero columns to the layer matrix; this preserves the network's behavior, since the zeros are multiplied by the components of the i-vector.
  • The input vector thus consists of two parts: the first part (the low-level acoustic features) differs from frame to frame, while the second (the i-vector) is the same for all vectors of the same phonogram, as in the sketch below.
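A small NumPy sketch of assembling such input vectors; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def build_inputs(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Append the phonogram-level i-vector to every frame's features.

    frames:  (T, D) low-level features, different in every frame;
    ivector: (K,)  the same for all frames of one phonogram.
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))   # (T, K)
    return np.concatenate([frames, tiled], axis=1)   # (T, D + K)
```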
  • Each speaker's voice is characterized by a set of features that allow it to be perceived as the voice of that particular speaker. These features can be interpreted as coordinates in a space, so each voice can be considered a point in the voice space; if two voices are close in some parameters, the corresponding points, and hence the corresponding i-vectors, will also be close in the voice space.
  • This provides speech recognition of various (arbitrary) speakers: since there are usually many speakers in the training sample, the network gains the ability to use information about which region of the voice space the input i-vector came from.
  • The deep neural network is then retrained according to the criterion of minimizing the sum of the cross-entropy and an additional regularizing term. The regularizing term controls the deviation of the weights of the network being trained from the weights of the network trained using only the low-level acoustic features, which avoids a strong change of the weights in comparison with a good (high-quality) initial approximation.
  • Word error rate, as a criterion for training a neural network, is not differentiable (with respect to the network parameters) and is difficult to compute during training. For this reason other training criteria are used, in particular sequence-discriminative criteria, which are indirectly aimed at reducing the word error but are more tractable computationally. These criteria consider the best hypotheses about the sequence of recognized words in the decoder and adjust the parameters of the neural network so as to bring its output closer to the true word sequence and keep it as far as possible from all "competing" hypotheses.
  • The criterion of minimum average risk calculated over states (state-level Minimum Bayes Risk, sMBR) is one of a number of well-known criteria of this class.
  • Next, the weight matrix of the last hidden layer of the trained network is subjected to singular value decomposition, and its rank is reduced by discarding the smallest singular values.
  • As a result, the last hidden layer of the original network is replaced by two layers, one of which is linear and contains fewer neurons than the original layer. This layer is called the bottleneck, or small-sized, layer.
  • Part of the information is irreversibly lost when passing through the small-sized layer, but its most significant components are preserved, as the sketch below shows.
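A PyTorch sketch of this low-rank replacement. Absorbing the singular values into the first factor is one design choice (splitting their square roots between both factors is another); all names are illustrative.

```python
import torch
import torch.nn as nn

def insert_bottleneck(layer: nn.Linear, rank: int):
    """Replace one Linear layer by a low-rank pair via truncated SVD.

    The first resulting layer is linear (no activation) and has `rank`
    outputs: this is the small-sized (bottleneck) layer.
    """
    W, b = layer.weight.data, layer.bias.data             # W: (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]        # drop smallest singular values
    bottleneck = nn.Linear(layer.in_features, rank, bias=False)
    bottleneck.weight.data = torch.diag(S) @ Vh           # (rank, in)
    expand = nn.Linear(rank, layer.out_features)
    expand.weight.data = U                                # (out, rank)
    expand.bias.data = b
    return bottleneck, expand
```

Since W ≈ U·S·Vh, the composition expand(bottleneck(x)) approximates the original layer, with the approximation error determined by the discarded singular values.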
  • The initial training without the small-sized layer brings the weights of the remaining layers to values close enough to optimal, which facilitates retraining of the network after the small-sized layer is introduced.
  • At this point the outputs of the deep neural network give good (high-quality) probability distributions of the senones, already tuned according to the sequence-discriminative criterion. Since the singular value decomposition has changed the weight matrix of the last hidden layer, the resulting network is no longer optimal from the point of view of the criterion of the previous training stage. Therefore the deep neural network, now with the small-sized layer, is retrained once again, using the distributions from the previous training as target distributions.
  • The neural network is retrained to convergence according to the minimum cross-entropy criterion already used earlier, which improves the quality of the high-level small-sized features extracted from the small-sized layer.
  • The high level of the features is due to the fact that a deep neural network with a small-sized layer, trained by the minimum cross-entropy criterion, is able to provide almost as low values of cross-entropy as a deep neural network without a small-sized layer trained by the same criterion.
  • Consequently, the features extracted from the outputs of the small-sized layer contain all the essential information from the speech signal contained in the initial low-level acoustic features and the i-vector.
  • The layers of the neural network located after the small-sized layer can then be removed, which turns the trained deep neural network into an "extractor" of new speaker-dependent small-sized high-level features: when a vector of low-level features extended (supplemented) with an i-vector is fed to the input of the neural network, as described previously, the activation values of the small-sized (bottleneck) layer obtained at its output constitute a small-sized, speaker-dependent, high-level representation, as in the sketch below.
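A one-function PyTorch sketch of the truncation step. It assumes the model is an `nn.Sequential` and that `bottleneck_index` marks the position of the small-sized linear layer within it (both assumptions for illustration).

```python
import torch.nn as nn

def make_extractor(model: nn.Sequential, bottleneck_index: int) -> nn.Sequential:
    """Drop every layer above the bottleneck, keeping the feature extractor."""
    return nn.Sequential(*list(model.children())[:bottleneck_index + 1])

# extractor = make_extractor(model, bottleneck_index)
# features = extractor(build_inputs(frames, ivector_tensor))  # bottleneck activations
```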
  • The proposed method can also be applied to obtain multilingual speaker-dependent small-sized high-level acoustic features of speech.
  • In that case, low-level speech features of at least two different languages and the corresponding speaker information (i-vectors) are supplied to the input of the neural network, the data of the different languages being fed to the input in random order.
  • The architecture of the neural network must be designed for multitask training: the neural network must have several hidden layers, whose weights are shared by the data of the training set in all languages (containing the low-level speech features and speaker information), and several output layers, each of which processes the data of one of the at least two languages, as in the sketch below.
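A PyTorch sketch of such a multitask architecture: shared hidden layers plus one senone output head per language. The sizes and number of languages are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultilingualDNN(nn.Module):
    """Shared hidden layers with one senone output layer per language."""

    def __init__(self, input_dim: int, hidden: int = 1536, n_layers: int = 6,
                 senones_per_language=(13000, 13000)):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden), nn.Sigmoid()]
            dim = hidden
        self.shared = nn.Sequential(*layers)   # weights updated on all languages
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n) for n in senones_per_language])

    def forward(self, x, language_id: int):
        # Each output head receives gradients only from minibatches of its
        # own language; the shared layers are trained on data of every language.
        return self.heads[language_id](self.shared(x))
```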
  • The neural network is trained on data in all available languages.
  • The process of training the neural network is similar to that described above for one language; upon completion of training, multilingual speaker-dependent small-sized features are extracted from the output of the small-sized layer. These are high-level features containing information related to all languages of the training sample and, as a result, robust to language changes in speech recognition.
  • Training one multilingual acoustic model of a neural network may require less computation than training several monolingual acoustic models, one for each language.
  • Moreover, a multilingual acoustic model can offer better accuracy than monolingual acoustic models obtained using the limited data of the corresponding language.
  • FIG. 3 shows the training of another neural network B for speech recognition, designated as block B, whose input layer 4 receives the high-level features from the small-sized layer 2a of neural network A, trained by the proposed method and designated as block A.
  • The input is a vector that is the union of the vectors from the current frame (delay 0) and from the frames located 5, 10, and 15 frames before the current one and 5, 10, and 15 frames after it, as sketched below.
  • A vector of dimension 700 thus arrives at the input of the second network B.
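A NumPy sketch of this frame splicing; with 100-dimensional bottleneck vectors, the 7 listed offsets give the 700-dimensional input mentioned above (the bottleneck dimension and the edge handling by clipping are assumptions).

```python
import numpy as np

def splice(features: np.ndarray,
           offsets=(-15, -10, -5, 0, 5, 10, 15)) -> np.ndarray:
    """Concatenate bottleneck vectors from the listed frame offsets.

    features: (T, D) bottleneck outputs; returns (T, len(offsets) * D).
    """
    T = features.shape[0]
    # Indices outside the phonogram are clipped to its first/last frame.
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets), 0, T - 1)
    return features[idx].reshape(T, -1)
```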
  • The neural network B, which is trained for speech recognition, contains the input layer 4 that receives this vector, hidden layers 5, the number of which is selected experimentally, and the output layer 6, which is the output of neural network B.
  • Table 1 compares the word error rate (WER) values of deep neural networks trained on the speaker-dependent small-sized high-level features obtained by the proposed method (speaker-dependent bottleneck features: SDBN-DNN) and of deep neural networks trained in a speaker-adaptive manner using i-vectors (DNN-ivec).
  • WER: word error rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention relates to the field of speech recognition, and in particular to obtaining characteristic acoustic features of speech for speech recognition under conditions of acoustic variability. A method is proposed for obtaining small-sized high-level acoustic features of speech, according to which low-level speech features and the speaker information corresponding to them are provided, and a neural network is trained using the low-level speech features, after which training of the neural network is completed using the low-level speech features supplemented with speaker information. A small-sized layer is introduced into the neural network and the network is further trained using the low-level speech features supplemented with speaker information; small-sized high-level acoustic features are then extracted at the output of the small-sized layer of the neural network.
PCT/RU2018/000286 2018-05-03 2018-05-03 Method for obtaining small-sized, speaker-dependent acoustic features for speech recognition WO2019212375A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EA202092400A EA202092400A1 (ru) 2018-05-03 2018-05-03 Method for obtaining speaker-dependent small-sized high-level acoustic features of speech
PCT/RU2018/000286 WO2019212375A1 (fr) 2018-05-03 2018-05-03 Method for obtaining small-sized, speaker-dependent acoustic features for speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2018/000286 WO2019212375A1 (fr) 2018-05-03 2018-05-03 Method for obtaining small-sized, speaker-dependent acoustic features for speech recognition

Publications (1)

Publication Number Publication Date
WO2019212375A1 (fr) 2019-11-07

Family

ID=68386452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2018/000286 WO2019212375A1 (fr) 2018-05-03 2018-05-03 Method for obtaining small-sized, speaker-dependent acoustic features for speech recognition

Country Status (2)

Country Link
EA (1) EA202092400A1 (fr)
WO (1) WO2019212375A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613204A (zh) * 2020-04-29 2020-09-01 云知声智能科技股份有限公司 Fast-response neural speech synthesis system and method
CN113035177A (zh) * 2021-03-11 2021-06-25 平安科技(深圳)有限公司 Acoustic model training method and apparatus
CN113808581A (zh) * 2021-08-17 2021-12-17 山东大学 Chinese speech recognition method with acoustic and language model training and joint optimization

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017099936A1 (fr) * 2015-12-10 2017-06-15 Nuance Communications, Inc. System and methods for adapting neural network acoustic models
US9858919B2 (en) * 2013-11-27 2018-01-02 International Business Machines Corporation Speaker adaptation of neural network acoustic models using I-vectors

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858919B2 (en) * 2013-11-27 2018-01-02 International Business Machines Corporation Speaker adaptation of neural network acoustic models using I-vectors
WO2017099936A1 (fr) * 2015-12-10 2017-06-15 Nuance Communications, Inc. System and methods for adapting neural network acoustic models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MEDENNIKOV, I.P.: "Metody, algoritmy i programmnye sredstva raspoznavaniya russkoi telefonnoi spontannoi rechi" [Methods, algorithms and software for recognition of spontaneous Russian telephone speech], Candidate of Technical Sciences dissertation, Saint Petersburg, 2016 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613204A (zh) * 2020-04-29 2020-09-01 云知声智能科技股份有限公司 Fast-response neural speech synthesis system and method
CN113035177A (zh) * 2021-03-11 2021-06-25 平安科技(深圳)有限公司 Acoustic model training method and apparatus
CN113035177B (zh) * 2021-03-11 2024-02-09 平安科技(深圳)有限公司 Acoustic model training method and apparatus
CN113808581A (zh) * 2021-08-17 2021-12-17 山东大学 Chinese speech recognition method with acoustic and language model training and joint optimization
CN113808581B (zh) * 2021-08-17 2024-03-12 山东大学 Chinese speech recognition method with acoustic and language model training and joint optimization

Also Published As

Publication number Publication date
EA202092400A1 (ru) 2021-03-03

Similar Documents

Publication Publication Date Title
Wang et al. A joint training framework for robust automatic speech recognition
US11972753B2 (en) System and method for performing automatic speech recognition system parameter adjustment via machine learning
Ghai et al. Literature review on automatic speech recognition
US11183171B2 (en) Method and system for robust language identification
US8762142B2 (en) Multi-stage speech recognition apparatus and method
Liu et al. Towards unsupervised speech recognition and synthesis with quantized speech representation learning
Cai et al. From speaker verification to multispeaker speech synthesis, deep transfer with feedback constraint
Stolcke et al. Speaker recognition with session variability normalization based on MLLR adaptation transforms
EP1647970A1 Hidden conditional random fields for phoneme classification and speech recognition
Ma et al. Incremental text-to-speech synthesis with prefix-to-prefix framework
Kumar et al. Improvements in the detection of vowel onset and offset points in a speech sequence
WO2005096271A1 (fr) Speech recognition device and speech recognition method
Karafiát et al. BUT neural network features for spontaneous Vietnamese in BABEL
WO2019212375A1 (fr) Method for obtaining small-sized, speaker-dependent acoustic features for speech recognition
Georgescu et al. SpeeD's DNN approach to Romanian speech recognition
Müller et al. Towards improving low-resource speech recognition using articulatory and language features
Tóth et al. Cross-lingual Portability of MLP-Based Tandem Features--A Case Study for English and Hungarian
CN112216270A (zh) Speech phoneme recognition method and system, electronic device and storage medium
Kurian A review on technological development of automatic speech recognition
Dimitriadis et al. Use of micro-modulation features in large vocabulary continuous speech recognition tasks
Gehring et al. DNN acoustic modeling with modular multi-lingual feature extraction networks
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
JP5300000B2 (ja) Articulatory feature extraction device, articulatory feature extraction method, and articulatory feature extraction program
Chakroun et al. An improved approach for text-independent speaker recognition
Li et al. DNN online adaptation for automatic speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18917168

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18917168

Country of ref document: EP

Kind code of ref document: A1