WO2022024202A1

WO2022024202A1 - Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program

Info

Publication number: WO2022024202A1
Application number: PCT/JP2020/028766
Authority: WO
Inventors: 崇史森谷; 雄介篠原
Original assignee: 日本電信電話株式会社
Priority date: 2020-07-27
Filing date: 2020-07-27
Publication date: 2022-02-03
Also published as: JP7452661B2; JPWO2022024202A1

Abstract

A learning device (1) acquires an intermediate feature quantity sequence corresponding to an acoustic feature quantity sequence, estimates a first output probability distribution corresponding to the intermediate feature quantity sequence when a first estimation model parameter is given, estimates a second output probability distribution corresponding to the intermediate feature quantity sequence when a second estimation model parameter is given on the basis of the intermediate feature quantity sequence, a character feature quantity obtained by converting a correct answer symbol sequence, and an attention weight having an element indicating the degree of relevance of each frame of the acoustic feature quantity sequence to a timing at which a symbol appears, calculates a CTC loss of the first output probability distribution with respect to the correct answer symbol sequence, calculates a KLD loss of the first output probability distribution with respect to a probability matrix that is a total sum of products of the second output probability distribution and the attention weight of all symbols, calculates a CE loss of the second output probability distribution with respect to the correct answer symbol sequence, and updates the model parameters on the basis of an integrated loss obtained by integrating the CTC loss, the KLD loss, and the CE loss.

Description

Learning device, voice recognition device, learning method, voice recognition method, learning program and voice recognition program

The present invention relates to a learning device, a voice recognition device, a learning method, a voice recognition method, a learning program, and a voice recognition program.

In recent speech recognition systems using neural networks, word sequences can be output directly from acoustic feature sequences. For example, Non-Patent Document 1 describes a method of learning a neural network (NN) for speech recognition using a learning method by Connectionist Temporal Classification (CTC) (Non-Patent Document 1 "3. Connectionist Temporal Classification". And "4. Training the Network" section). In the learning of this CTC model, if the phoneme, character, subword, and word sequence (≠ frame-by-frame) corresponding to the content of the voice are prepared by introducing the "blank" symbol representing the redundancy, the learning is performed. It is possible to dynamically learn the correspondence between voice and output series from data.

However, learning a CTC model that dynamically assigns phonemes, letters, subwords, words, and blank symbols to each speech frame is more difficult than the acoustic model of a conventional speech recognition system.

On the other hand, in recent years, a speech recognition model "Attention-based model" (see Non-Patent Document 2 and Non-Patent Document 3), which has better performance than CTC and can directly output characters, subwords, and words, has been proposed.

In the Attention-based model, a pair of the feature amount (real number vector) extracted from each sample of the training data in advance and the correct answer unit number corresponding to each feature amount, and an appropriate initial model are prepared. As an initial model, a neural network in which random numbers are assigned to each parameter or a neural network that has already been trained with other training data can be used. In the speech recognition device to which the Attention-based model is applied, the intermediate feature amount corresponding to the input dimension is extracted from the input feature amount, characters are input to create a one-hot-vector, and these outputs are used as the basis. The next label is predicted in consideration of the label series up to the previous one. Then, in this voice recognition device, parameters used in each process are calculated from the learning data in order to make it easier to identify the label in the label estimation process.

Here, in the CTC model, since the sequence lengths of the input symbol sequence c and the output label sequence w are the same, learning can be performed in a relatively short time by an error function based on cross entropy, whereas in the Attention-based model, learning is possible. , There is a problem that it takes time to learn.

Therefore, in recent years, a method for improving the learning speed and performance of the Attention-based model has been proposed by using the Hybrid CTC / Attention model (see Non-Patent Document 4) that combines the above models. This Hybrid CTC / Attention model shares the voice distributed expression series conversion function, and its intermediate output is used for output calculation between the CTC model and Attention model. Each loss function is then combined by a weighted sum and used for the integrated loss value to train the entire model. For each pair of the feature amount of the training data and the correct unit number, each process of extraction of the intermediate feature amount, output probability calculation, and model update is repeated, and a predetermined number of times (usually tens of millions to hundreds of millions of times). The model at the time when the repetition of is completed is used as the trained model.

However, the Hybrid CTC / Attention model improves early due to CTC loss, but when focusing on CTC loss, there is no framework for stabilizing learning, and there is a possibility that sufficient recognition performance will not be obtained.

The present invention has been made in view of the above, and is a learning device, a voice recognition device, a learning method, a voice recognition method, a learning program, and a voice recognition program capable of improving the estimation accuracy of voice recognition by stable learning. The purpose is to provide.

In order to solve the above-mentioned problems and achieve the object, the learning device according to the present invention acquires an intermediate feature quantity sequence corresponding to the first acoustic feature quantity sequence when a transformation model parameter is given. Converts the part, the first estimation part that estimates the first output probability distribution corresponding to the intermediate feature quantity series, the intermediate feature quantity series, and the correct symbol sequence when the first estimation model parameter is given. When a second estimated model parameter is given based on the character feature amount and the attention weight having an element indicating the high degree of relevance of each frame of the first acoustic feature amount series to the timing at which the symbol appears. Calculates a probability matrix that is the sum of all symbols of the product of the second output probability distribution, the second output probability distribution, and the attention weight, and the second estimation part that estimates the second output probability distribution corresponding to the intermediate feature quantity series. Based on the probability matrix calculation unit, the correct symbol series corresponding to the first acoustic feature quantity series, and the first output probability distribution, the CTC (Connectionist Temporal Classification) loss of the first output probability distribution for the correct symbol series is calculated. The CTC loss calculation unit to be calculated, the KLD loss calculation unit to calculate the KLD (Kullback-Leibler Divergence) loss of the first output probability distribution with respect to the probability matrix based on the probability matrix and the first output probability distribution, and the first CE loss calculation unit that calculates the CE (Cross Entropy) loss of the second output probability distribution for the correct symbol series and the CTC loss based on the correct symbol series and the second output probability distribution corresponding to the acoustic feature quantity series of The conversion model parameter, the first estimation model parameter, and the second estimation model parameter are updated based on the integration loss that integrates the KLD loss and the CE loss, and the conversion unit, the first estimation unit, and the second estimation unit are updated. It is characterized by having a control unit that repeats the processing of the probability matrix calculation unit, the CTC loss calculation unit, the KLD loss calculation unit, and the CE loss calculation unit until the end condition is satisfied.

Further, the speech recognition device according to the present invention estimates the output probability distribution corresponding to the second acoustic feature quantity series when each model parameter satisfying the end condition is given in the learning device described above. It is characterized by outputting.

According to the present invention, it is possible to improve the estimation accuracy of speech recognition by stable learning.

FIG. 1 is a diagram showing an example of the functional configuration of the learning device according to the first embodiment. FIG. 2 is a flowchart showing a processing procedure of the learning process according to the first embodiment. FIG. 3 is a diagram showing an example of the functional configuration of the learning device according to the modified example of the first embodiment. FIG. 4 is a flowchart showing a processing procedure of the learning process according to the modified example of the first embodiment. FIG. 5 is a diagram showing an example of the functional configuration of the voice recognition device according to the second embodiment. FIG. 6 is a flowchart showing a processing procedure of the voice recognition process according to the second embodiment. FIG. 7 is a diagram showing an example of a computer in which a learning device and a voice recognition device are realized by executing a program.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to this embodiment. Further, in the description of the drawings, the same parts are indicated by the same reference numerals. In the following, when "^ A" is described for A which is a vector, a matrix, or a scalar, it is assumed to be equivalent to "a symbol in which" ^ "is written immediately above" A "".

[Embodiment 1]
First, as the first embodiment, a learning device for learning a speech recognition model will be described.

[Learning device]
FIG. 1 is a diagram showing an example of the functional configuration of the learning device according to the first embodiment. In the first embodiment, the learning device 1 according to the embodiment adopts a CTC / Attention model.

In the first embodiment, attention is paid to the Attention weight of the Attention-based model. This weight is calculated depending on the symbol output immediately before, and is an intermediate feature encoded by the voice distribution expression series conversion unit 101 (described later) as to which frame should be focused on when the output timing of the next label should be focused. It is obtained from the quantity by the softmax function. The number of dimensions of this Attention weight is 1 × T (number of frames), and if it is learned to have high performance, the value of the element of the frame to be noted is very high, and the value of the other frames is low. Take. In the first embodiment, the behavior of this Attention weight is utilized for learning the output of the CTC branch. Then, in the first embodiment, by improving the performance of the CTC branch, the expressive power of the voice distributed expression series conversion unit 101 is also improved, and as a result, the output of the Attention-based model is also improved.

The learning device 1 learns the CTC model to be adopted and the Attention-based model at the same time. Specifically, the learning device 1 proposes a learning method for learning the output of the Attention-based model at the same time as learning the CTC loss. Then, the learning device 1 stabilizes the learning of the CTC model by using a loss function that can learn the CTC frame-by-frame.

In the learning device 1, for example, a predetermined program is read into a computer or the like including a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), etc., and the CPU executes the predetermined program. It is realized by. Further, the learning device 1 has a communication interface for transmitting and receiving various information to and from other devices connected via a network or the like. For example, the learning device 1 has a NIC (Network Interface Card) or the like, and communicates with other devices via a telecommunication line such as a LAN (Local Area Network) or the Internet. The learning device 1 has a touch panel, a voice input device, an input device such as a keyboard and a mouse, and a display device such as a liquid crystal display, and inputs and outputs information.

The learning device 1 inputs the acoustic feature sequence X and the corresponding symbol sequence c {c ₁ , c ₂ , ..., C _N } (correct symbol sequence), and corresponds to the acoustic feature sequence X. It is a device that generates and outputs a label sequence (word sequence {^ w ₁ , ^ w ₂ , ..., ^ W _N }) (output probability distribution). However, N is a positive integer and represents the number of symbols included in the symbol series c. The acoustic feature quantity series X is a series of time-series acoustic features extracted from a time-series acoustic signal such as voice. An example of the acoustic feature sequence X is a vector. The symbol sequence c is a sequence of correct answer symbols represented by a time-series acoustic signal corresponding to the acoustic feature quantity sequence X. Examples of correct symbols are phonemes, letters, subwords, words, and so on. An example of a symbol sequence is a vector. The correct symbol corresponds to the acoustic feature sequence X, but it is not specified which frame (time point) of the acoustic feature sequence X each symbol included in the symbol sequence corresponds to. Each part will be described below. It should be noted that each of the above-mentioned parts may be held by a plurality of devices in a dispersed manner.

The learning device 1 includes a voice distributed expression sequence conversion unit 101 (conversion unit), a label estimation unit 102 (first estimation unit), a CTC loss calculation unit 103, a symbol distribution expression conversion unit 104, an attention weight calculation unit 105, and a label estimation. Unit 106 (second estimation unit), CE (Cross Entropy) loss calculation unit 204, probability matrix calculation unit 108, KLD (Kullback-Leibler Divergence) loss calculation unit 402, correct answer accuracy calculation unit 110, loss integration unit 111 and control. It has a portion 112.

The acoustic feature quantity sequence X (first acoustic feature quantity sequence) is input to the voice distributed expression sequence conversion unit 101. The voice distribution expression sequence conversion unit 101 obtains and outputs an intermediate feature quantity sequence H corresponding to the acoustic feature quantity sequence X when the conversion model parameter γ ₁ is given. The voice distribution expression sequence conversion unit 101 is, for example, a multi-stage neural network, and functions as an encoder that converts the input acoustic feature amount into the intermediate feature amount sequence H by the multi-stage neural network.

The voice distribution expression sequence conversion unit 101 calculates the intermediate feature quantity sequence H by using, for example, the equation (17) of Non-Patent Document 4. Alternatively, the voice distributed expression sequence conversion unit 101 may obtain an intermediate feature quantity sequence H by applying LSTM (Long short-term memory) to the acoustic feature quantity sequence X instead of the equation (17) of Non-Patent Document 4. Good (Reference 1).
Reference 1: Sepp Hochreiter, Jurgen Schmidhuber, “LONG SHORT-TERM MEMORY”, Computer Science, MedicinePublished in Neural Computation 1997.

The intermediate feature quantity sequence H encoded by the speech distribution expression sequence conversion unit 101 is input to the label estimation unit 102. The label estimation unit 102 receives the next output probability distribution Y (label sequence in the figure) of the CTC model corresponding to the intermediate feature quantity sequence H when the label estimation model parameter γ ₂ (first estimation model parameter) is given. {^ L ₁ , ^ l ₂ , ..., ^ L _T }) (first output probability distribution) is calculated and output. The output probability distribution Y is calculated using, for example, the equation (16) of Non-Patent Document 4. The number of dimensions of the output probability distribution Y is equal to (K + 1) × T. (K + 1) is the number of output symbols (phonemes, letters, subwords, words, etc.) plus the redundant symbol “blank”. T is the number of frames.

The CTC loss calculation unit 103 calculates and outputs the CTC loss L _CTC of the output probability distribution Y with respect to the symbol series c based on the output probability distribution Y output from the label estimation unit 102 and the symbol series c. The CTC loss calculation unit 103 creates a trellis of the symbol series c with a redundant label on the vertical axis and a trellis of time on the horizontal axis, and calculates the path of the optimum transition probability based on the forward backward algorithm (detailed). For the calculation process, refer to “4. Training the Network” in Non-Patent Document 1). CTC loss L _CTC is calculated using, for example, the equation (14) of Non-Patent Document 1.

The symbol distribution expression conversion unit 104 inputs the output probability distribution Z (second output probability distribution) and the symbol sequence c (for example, phonemes) output from the label estimation unit 106 (described later). The symbol distribution representation conversion unit 104 uses the symbol sequence c as a character feature amount C which is a continuous value feature amount corresponding to the output probability distribution Z when the character feature amount estimation model parameter β ₃ which is a model parameter is given. Convert to and output. The character feature amount C is a one-hot vector representation. The calculation of the character feature amount C using the output probability distribution Z is performed, for example, by the equation (4) of Non-Patent Document 2.

Attention The weight calculation unit 105 is used for the intermediate feature quantity sequence H output from the speech distribution expression sequence conversion unit 101, the recursive intermediate output S of the neural network of the label estimation unit 106 (described later), and the label sequence estimation immediately before. Using the attention weight α _n-1 at the time of occurrence, the attention weight α _n (vector) used when estimating the next label is calculated (the detailed calculation process is described in “2 Attention-Based Model” of Non-Patent Document 2. See “2.1 General Framework” in “for Speech Recognition”). The attention weight α _n has an element representing the high degree of relevance of each frame of the acoustic feature sequence X to the timing at which the symbol appears. n represents the order of the output probability distributions Z arranged in chronological order. Generally, the number of dimensions of α is 1 × T (number of frames). Attention There is also a model in which a plurality (× J) of weights α _n can be obtained.

In the label estimation unit 106, the intermediate feature amount series H encoded by the voice distribution expression sequence conversion unit 101, the character feature amount C output from the symbol distribution expression conversion unit 104, and the attention output from the attention weight calculation unit 105. The weight α _n is input. The label estimation unit 106 uses the intermediate feature sequence H, the character feature amount C, and the attention weight α _n to give the label estimation model parameter β ₂ (second estimation model parameter), and the label estimation unit 106 uses the intermediate feature amount series H, the character feature amount C, and the attention weight α n. The next output probability distribution Z of the Attention-based model corresponding to the sequence H (label sequence {^ l ₁ , ^ l ₂ , ..., ^ L _N } in the figure) is calculated. At the time of learning, the number of input symbols is known because it is the same as the dimension of the symbol series c {c ₁ , c ₂ , ..., C _N }, and the number of dimensions of the output probability distribution Z is K (of the symbol). Number of entries) x N (number of symbols). The generation of the output probability distribution Z is performed, for example, according to the equations (2) and (3) of Non-Patent Document 2.

The CE loss calculation unit 107 calculates the CE loss _LS2S of the output probability distribution Z with respect to the symbol series c based on the symbol series c and the output probability distribution Z output from the label estimation unit 106. Since the output probability distribution Z and the symbol sequence c are the same with respect to the dimension N of the output label, the loss can be calculated using the error function by cross entropy.

The learning device 1 is a Kullback-Leibler Divergence (KLD) based on the CTC loss and the probability matrix P created based on the attention weight α which is the output of the Attention-based model and the output probability distribution Z in the training of the CTC model. Based on the loss due to the above and the CE loss, the speech distributed expression sequence conversion unit 101, the label estimation units 102 and 106, the symbol distribution expression conversion unit 104, and the attention weight calculation unit 105 are learned. Therefore, the calculation of the KLD loss L _KLD , which is the loss of the output probability distribution Y with respect to the probability matrix P and the probability matrix P, will be described.

The probability matrix calculation unit 108 calculates the probability matrix P using the output probability distribution Z output from the label estimation unit 106 and the attention weight α _n output from the attention weight calculation unit 105. As described above, the number of dimensions of the output probability distribution Z (z _n ) is K (number of symbol entries) × N (number of symbols), and the attention weight α _n is a vector of 1 × T (number of frames). There are as many as N (number of symbols). Therefore, the probability matrix calculation unit 108 calculates the probability matrix P by the following equation (1).

As shown in equation (1), the probability matrix P is the sum of all the symbols of the product of the output probability distribution Z and the attention weight α _n . The probability matrix P is the equation (2), the output probability distribution z _n is the equation (3), and the attention weight α _n is the equation (4).

The number of dimensions of the stochastic matrix P is K (number of symbol entries) x T (number of frames). In order to make this probability matrix P perform KLD with the output probability distribution Y of the label estimation unit 102, the probability matrix calculation unit 108 also adds a redundant symbol to the number of dimensions of the output probability distribution Z of the Attention-based model ( It is the dimension of K + 1). Further, since it is desirable that the sum of the stochastic matrix P is 1 in each frame, the following equation (5) is normalized.

Further, since there is a model in which a plurality of attention weights can be obtained as described in Non-Patent Document 3, a plurality of (× J) stochastic matrices P may be obtained.

The KLD loss calculation unit 109 calculates the KLD loss L _KLD of the output probability distribution Y with respect to the probability matrix P based on the probability matrix P (× J) and the output probability distribution Y using the following equation (6). KLD loss L _KLD indicates how much the output probability distribution Y of the model to be learned and the probability matrix P (× J) are.

In the equation (6), the KLD loss L _KLD can be calculated including the case where a plurality of stochastic matrices P (× J) are obtained depending on the case where a plurality of attention weights are obtained. When there is one stochastic matrix P (one attention weight), J = 1.

The correct answer accuracy calculation unit 110 calculates the correct answer accuracy Acc by inputting the output probability distribution Z of the Attention-based model and the symbol sequence c which is the correct answer symbol sequence. The correct answer accuracy calculation unit 110 includes a sequence Z'obtained by multiplying the output probability distribution Z of the Attention-based model to be learned by argmax (obtaining the element ID of the maximum value with respect to the class axis), and the symbol sequence c which is the correct answer. The match rate of is returned to the loss integration unit 111 as the correct answer accuracy Acc. Therefore, the range that the correct answer accuracy Acc can take is 0 ≦ Acc ≦ 1.

The loss integration unit 111 calculates the loss of the CTC / Attention model. The loss integration unit 111 includes a CTC loss L _CTC output from the CTC loss calculation unit 103, a CE loss L _S2S output from the CE loss calculation unit 107, a KLD loss L _KLD output from the KLD loss calculation unit 109, and correct accuracy. By applying the correct answer accuracy Acc, coefficients λ, ρ, ζ (0 ≦ λ ≦ 1), (0 ≦ ζ ≦ 1-λ) output from the calculation unit 110 to the following equation (7), the loss can be reduced. Integrate to obtain the loss _{LS2S + CTC + KLD} of the CTC / Attention model.

The ρ of the equation (7) is the equation (8).

Since the learning device 1 learns the CTC model and the Attention-based model at the same time, the attention weights obtained at the initial stage of learning and the accuracy of the stochastic matrix are very low. Therefore, we introduce hyperparameters such as ρ. ρ consists of ζ and Acc (*). ζ is a hyperparameter (0 ≦ ζ ≦ 1-λ) determined manually, and Acc (*) is the correct answer accuracy when the output probability distribution Z of the Attention-based model and the correct answer symbol are referred to. Therefore, in the progress of learning, the influence of the term of ρL _KLD becomes large, and the accuracy becomes better.

Using the loss _{LS2S + CTC + KLD} of the CTC / Attention model, the control unit 112 uses the conversion model parameter γ ₁ of the voice distribution expression sequence conversion unit 101, the label estimation model parameter γ ₂ of the label estimation unit 102, and the symbol distribution expression conversion unit 104. The character feature amount estimation model parameter β ₃ , the model parameter of the attention weight calculation unit 105, and the label estimation model parameter β ₂ of the label estimation unit 106 are updated.

The control unit 112 includes processing by the voice distribution expression sequence conversion unit 101, processing by the label estimation unit 102, processing by the CTC loss calculation unit 103, processing by the symbol distribution expression conversion unit 104, processing by the attention weight calculation unit 105, and label estimation unit. Predetermined termination conditions are processing by 106, processing by CE loss calculation unit 107, processing by probability matrix calculation unit 108, processing by KLD loss calculation unit 109, processing by correct answer accuracy calculation unit 110, and processing by loss integration unit 111. Repeat until is satisfied.

This end condition is not limited, and may be, for example, that the number of repetitions has reached the threshold value, or that the amount of change in the integration loss has become less than or equal to the threshold value before and after the repetition. It may be that the amount of change of the conversion model parameter γ ₁ and the label estimation model parameter γ ₂ becomes equal to or less than the threshold value before and after. When the end condition is satisfied, the voice distribution expression sequence conversion unit 101 outputs the conversion model parameter γ ₁ , and the label estimation unit 12 outputs the label estimation model parameter γ ₂ .

[Processing procedure of learning process]
FIG. 2 is a flowchart showing a processing procedure of the learning process according to the first embodiment. As shown in FIG. 2, when the input of the acoustic feature quantity sequence X is received, the voice distributed expression sequence conversion unit 101 converts the acoustic feature quantity sequence X into the corresponding intermediate feature quantity sequence H. (Step S1).

Subsequently, the label estimation unit 102 performs the first estimation process for calculating the next output probability distribution Y of the CTC model corresponding to the intermediate feature quantity series H (step S2). The CTC loss calculation unit 103 inputs the output probability distribution Y and the symbol series c, and performs a CTC loss calculation process for calculating the CTC loss L _CTC of the output probability distribution Y with respect to the symbol series c (step S3).

On the other hand, the symbol distributed expression conversion unit 104 receives the input of the output probability distribution Z and the symbol series c output from the label estimation unit 106, and performs the symbol distributed expression conversion process of converting to the character feature amount C (step S4). The attention weight calculation unit 105 uses the intermediate feature quantity sequence H, the recursive intermediate output S of the neural network of the label estimation unit 106, and the attention weight α _n-1 used for the previous label sequence estimation, and then uses the following. Attention weight calculation processing for calculating α _n used when estimating the label of is performed (step S5).

Then, the label estimation unit 106 performs a second estimation process for calculating the next output probability distribution Z of the Attention-based model using the intermediate feature amount series H, the character feature amount C, and the attention weight α _n (step). S6).

The CE loss calculation unit 107 performs a CE loss calculation process for calculating the CE loss _LS2S of the output probability distribution Z with respect to the symbol series c based on the output probability distribution Z and the symbol series c (step S7).

Further, the probability matrix calculation unit 108 calculates the probability matrix P based on the output probability distribution Z output from the label estimation unit 106 and the attention weight α _n output from the attention weight calculation unit 105. (Step S8). The KLD loss calculation unit 109 performs a KLD loss calculation process for calculating the KLD loss L _KLD , which is the loss of the output probability distribution Y with respect to the probability matrix P (step S9). Then, the correct answer accuracy calculation unit 110 performs a correct answer accuracy calculation process for calculating the correct answer accuracy Acc based on the output probability distribution Z and the symbol series c (step S10).

The loss integration unit 111 is based on the CTC loss L _CTC , CE loss L _S2S , KLD loss L _KLD , correct answer accuracy Acc, coefficients λ, ρ, ζ (0 ≦ λ ≦ 1), (0 ≦ ζ ≦ 1-λ). In addition, a loss integration process is performed to integrate the losses and obtain the integrated losses _{LS2S + CTC + KLD} (step S11). The control unit 112 updates the model parameters of the voice distribution expression sequence conversion unit 101, the label estimation unit 102, the symbol distribution expression conversion unit 104, the attention weight calculation unit 105, and the label estimation unit 106 using the loss _{LS2S + CTC + KLD} (step). S12). The control unit 112 repeats each of the above processes until a predetermined end condition is satisfied.

[Effect of Embodiment 1]
As described above, in the learning device 1, the loss L _{S2S + CTC + KLD} of the CTC / Attention model is obtained by integrating the CTC loss L _CTC , the CE loss L _S2S , and the KLD loss L _KLD , and the label is used by using this loss L _{S2S + CTC + KLD} . Together with the estimation units 102 and 203, the model parameters of the voice distribution expression sequence conversion unit 101 shared by the label estimation units 102 and 203 are updated. That is, the shared speech distributed expression sequence conversion unit 101 is learned by the loss related to the label estimation units 102 and 203. In other words, the speech distributed expression sequence conversion unit 101 is learned to output the intermediate feature quantity sequence H that enhances the accuracy of the estimation results of the label estimation units 102 and 203, so that the overall estimation accuracy is improved. Can be done.

Then, the learning device 1 can stabilize the learning of the CTC model and improve the estimation accuracy of speech recognition by using a loss function (see equation (7)) in which the CTC can be learned frame-by-frame. can.

[Modified Example of Embodiment 1]
FIG. 3 is a diagram showing an example of the functional configuration of the learning device according to the modified example of the first embodiment. The probability matrix generated by the attention-based model may contain errors. In order to avoid this problem, as shown in FIG. 3, the learning device 1A according to the modified example of the first embodiment has a label estimation unit 102A (third estimation unit) separately from the label estimation unit 102 (third estimation unit). It is also possible to update the parameter for each loss by providing the estimation unit (4). The CTC loss calculation unit 103 calculates the CTC loss L CTC based on the output probability distribution Y estimated by the label estimation unit 102, and the KLD loss calculation unit 109 calculates the CTC loss L _CTC based on the output probability distribution Y estimated by the label estimation unit 102A. KLD loss L Calculate _KLD .

FIG. 4 is a flowchart showing a processing procedure of the learning process according to the modified example of the first embodiment. Steps S21 to S28 shown in FIG. 4 are the same processes as steps S1 to S8 shown in FIG.

The label estimation unit 102A calculates the next output probability distribution Y of the CTC model corresponding to the intermediate feature quantity series H, and performs a third label estimation process to output to the KLD loss calculation unit 109 (step S29). The KLD loss calculation unit 109 calculates the KLD loss L _KLD using the output probability distribution Y output from the label estimation unit 102A (step S30). Steps S31 to S33 shown in FIG. 4 are the same processes as steps S10 to S12 shown in FIG.

[Embodiment 2]
Next, the second embodiment will be described. In the second embodiment, the conversion model parameter γ ₁ and the label estimation model parameter γ ₂ satisfying the end condition are given in the learning device 1 according to the first embodiment or the learning device 1A according to the modified example of the first embodiment. The voice recognition device constructed in is described. FIG. 5 is a diagram showing an example of the functional configuration of the voice recognition device according to the second embodiment. FIG. 6 is a flowchart showing a processing procedure of the voice recognition process according to the second embodiment.

As illustrated in FIG. 5, the voice recognition device 3 according to the embodiment has a voice distribution expression sequence conversion unit 301 and a label estimation unit 302. The voice distributed expression sequence conversion unit 301 is the same as the above-mentioned voice distributed expression sequence conversion unit 101 except that the conversion model parameter γ ₁ output from the learning device 1 or the learning device 1A is input and set. .. The label estimation unit 302 is the same as the label estimation unit 102 described above, except that the label estimation model parameter γ ₂ output from the learning device 1 or the learning device 1A is input and set.

The acoustic feature quantity sequence X "(second acoustic feature quantity sequence) to be recognized by the voice is input to the voice distribution expression sequence conversion unit 301. The voice distribution expression sequence conversion unit 301 is given the conversion model parameter γ ₁ . In this case, the intermediate feature quantity sequence H "corresponding to the acoustic feature quantity sequence X" is obtained and output (step S41).

An intermediate feature quantity series H ”output from the speech distribution expression sequence conversion unit 301 is input to the label estimation unit 302. The label estimation unit 302 receives an intermediate feature when the label estimation model parameter γ ₂ is given. The label sequence {^ l ₁ , ^ l ₂ , ..., ^ L _F } (output probability distribution) corresponding to the quantity sequence H is obtained and output as a speech recognition result (step S42).

As described above, in the speech recognition device 3, model parameters optimized by the learning device 1 or the learning device 1A using the loss _{LS2S + CTC + KLD} are set in the label estimation unit 302 and the speech distribution expression sequence conversion unit 301. Therefore, the voice recognition process can be performed with high accuracy.

[About the system configuration of the embodiment]
Each component of the learning device 1 and the voice recognition device 3 is a functional concept, and does not necessarily have to be physically configured as shown in the figure. That is, the specific form of the distribution and integration of the functions of the display processing systems 10 and 210 is not limited to the one shown in the figure, and all or part of them may be functional in any unit according to various loads and usage conditions. Or it can be physically distributed or integrated.

Further, each process performed in the learning device 1 and the speech recognition device 3 is realized by a program in which all or any part thereof is analyzed and executed by the CPU, GPU (Graphics Processing Unit), and CPU, GPU. May be good. Further, each process performed by the learning device 1 and the voice recognition device 3 may be realized as hardware by wired logic.

It is also possible to manually perform all or part of the processes described as being automatically performed among the processes described in the embodiment. Alternatively, all or part of the process described as being performed manually can be automatically performed by a known method. In addition, the above-mentioned and illustrated processing procedures, control procedures, specific names, and information including various data and parameters can be appropriately changed unless otherwise specified.

[program]
FIG. 7 is a diagram showing an example of a computer in which the learning device 1 and the voice recognition device 3 are realized by executing the program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.

Memory 1010 includes ROM 1011 and RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, the display 1130.

The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the learning device 1 and the voice recognition device 3 is implemented as a program module 1093 in which a code that can be executed by the computer 1000 is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing the same processing as the functional configuration in the learning device 1 and the voice recognition device 3 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

Further, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.

The program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the description and the drawings which form a part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art based on the present embodiment are all included in the scope of the present invention.

1 Learning device 3

Speech recognition device

101, 301 Speech distribution expression

series conversion unit

102, 102A, 106, 302 Label estimation unit 103 CTC loss calculation unit 104 Symbol distribution expression conversion unit 105 Attention weight calculation unit 107 CE loss calculation unit 108 Probability matrix Calculation unit 109 KLD loss calculation unit 110 Correct answer accuracy calculation unit 111 Loss integration unit 112 Control unit

Claims

A conversion unit that acquires an intermediate feature sequence corresponding to the first acoustic feature sequence when the transformation model parameters are given, and a conversion unit.
A first estimation unit that estimates the first output probability distribution corresponding to the intermediate feature series when the first estimation model parameter is given, and a first estimation unit.
Based on the intermediate feature quantity series, the character feature quantity obtained by converting the correct symbol sequence, and the attention weight having an element indicating the high degree of relevance of each frame of the first acoustic feature quantity series to the timing at which the symbol appears. In addition to the second estimation unit that estimates the second output probability distribution corresponding to the intermediate feature series when the second estimation model parameter is given.
A stochastic matrix calculation unit that calculates a stochastic matrix that is the sum of all symbols of the product of the second output probability distribution and the attention weight.
The CTC (Connectionist Temporal Classification) loss of the first output probability distribution with respect to the correct symbol series is calculated based on the correct symbol series corresponding to the first acoustic feature series and the first output probability distribution. CTC loss calculation unit and
A KLD loss calculation unit that calculates the KLD (Kullback-Leibler Divergence) loss of the first output probability distribution with respect to the probability matrix based on the probability matrix and the first output probability distribution.
CE for calculating the CE (Cross Entropy) loss of the second output probability distribution with respect to the correct symbol series based on the correct symbol series corresponding to the first acoustic feature series and the second output probability distribution. Loss calculation unit and
The conversion model parameter, the first estimation model parameter, and the second estimation model parameter are updated based on the integrated loss in which the CTC loss, the KLD loss, and the CE loss are integrated, and the conversion unit and the first estimation model parameter are updated. A control unit that repeats the processing of the estimation unit 1, the second estimation unit, the probability matrix calculation unit, the CTC loss calculation unit, the KLD loss calculation unit, and the CE loss calculation unit until the end condition is satisfied.
A learning device characterized by having.
The first estimation unit has a third estimation unit and a fourth estimation unit, each of which estimates the first output probability distribution corresponding to the intermediate feature quantity series.
The CTC loss calculation unit calculates the CTC loss based on the first output probability distribution estimated by the third estimation unit.
The learning device according to claim 1, wherein the KLD loss calculation unit calculates the KLD loss based on the first output probability distribution estimated by the fourth estimation unit.
It is characterized in that the output probability distribution corresponding to the second acoustic feature quantity series is estimated and output when each model parameter satisfying the end condition is given in the learning device according to claim 1 or 2. Speech recognition device.
It is a learning method executed by the learning device.
A conversion step for acquiring an intermediate feature sequence corresponding to the first acoustic feature sequence when the transformation model parameters are given, and a conversion step.
A first estimation step for estimating a first output probability distribution corresponding to the intermediate feature series when the first estimation model parameter is given, and a first estimation step.
Based on the intermediate feature quantity series, the character feature quantity obtained by converting the correct symbol sequence, and the attention weight having an element indicating the high degree of relevance of each frame of the first acoustic feature quantity series to the timing at which the symbol appears. In the second estimation step of estimating the second output probability distribution corresponding to the intermediate feature quantity series when the second estimation model parameter is given.
A stochastic matrix calculation step for calculating a stochastic matrix, which is the sum of all symbols of the product of the second output probability distribution and the attention weight.
The CTC (Connectionist Temporal Classification) loss of the first output probability distribution with respect to the correct symbol series is calculated based on the correct symbol series corresponding to the first acoustic feature series and the first output probability distribution. CTC loss calculation process and
A KLD loss calculation step for calculating the KLD (Kullback-Leibler Divergence) loss of the first output probability distribution with respect to the probability matrix based on the probability matrix and the first output probability distribution.
CE for calculating the CE (Cross Entropy) loss of the second output probability distribution with respect to the correct symbol series based on the correct symbol series corresponding to the first acoustic feature series and the second output probability distribution. Loss calculation process and
Including
The conversion model parameter, the first estimated model parameter, and the second estimated model parameter are updated based on the integrated loss in which the CTC loss, the KLD loss, and the CE loss are integrated, and the conversion step and the first estimation model parameter are updated. The feature is that the processing of the estimation step 1, the second estimation step, the probability matrix calculation step, the CTC loss calculation step, the KLD loss calculation step, and the CE loss calculation step is repeated until the end condition is satisfied. How to learn.
A voice recognition method that estimates and outputs an output probability distribution corresponding to a second acoustic feature quantity series when each model parameter satisfying the end condition is given in the learning device according to claim 1 or 2.
A learning program for making a computer function as the learning device according to claim 1 or 2.
A voice recognition program for making a computer function as the voice recognition device according to claim 3.