WO2022024202A1 - Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program - Google Patents

Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program

Info

Publication number
WO2022024202A1
Authority
WO
WIPO (PCT)
Prior art keywords
loss
probability distribution
output probability
series
estimation
Prior art date
Application number
PCT/JP2020/028766
Other languages
French (fr)
Japanese (ja)
Inventor
崇史 森谷
雄介 篠原
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to JP2022539819A priority Critical patent/JP7452661B2/en
Priority to PCT/JP2020/028766 priority patent/WO2022024202A1/en
Publication of WO2022024202A1 publication Critical patent/WO2022024202A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Definitions

  • the present invention relates to a learning device, a voice recognition device, a learning method, a voice recognition method, a learning program, and a voice recognition program.
  • Non-Patent Document 1 describes a method of training a neural network (NN) for speech recognition using Connectionist Temporal Classification (CTC) (see the sections "3. Connectionist Temporal Classification" and "4. Training the Network" of Non-Patent Document 1).
  • In training this CTC model, if a phoneme, character, subword, or word sequence (≠ frame-by-frame) corresponding to the content of the speech is prepared, introducing the "blank" symbol that represents redundancy makes it possible to learn the correspondence between the speech and the output sequence dynamically from the training data.
  • For the Attention-based model, pairs of a feature quantity (a real-valued vector) extracted in advance from each sample of the training data and the correct answer unit number corresponding to each feature quantity, together with an appropriate initial model, are prepared.
  • As the initial model, a neural network whose parameters are initialized with random numbers, or a neural network that has already been trained on other training data, can be used.
  • A speech recognition device to which the Attention-based model is applied extracts an intermediate feature quantity corresponding to the input dimension from the input feature quantity, converts input characters into one-hot vectors, and, on the basis of these outputs, predicts the next label while taking into account the label sequence up to the immediately preceding label. In this speech recognition device, the parameters used in each process are computed from the training data so that labels become easier to identify in the label estimation process.
  • The Hybrid CTC/Attention model improves quickly in early training owing to the CTC loss, but the CTC loss itself has no framework for stabilizing learning, so sufficient recognition performance may not be obtained.
  • The present invention has been made in view of the above, and an object of the present invention is to provide a learning device, a speech recognition device, a learning method, a speech recognition method, a learning program, and a speech recognition program capable of improving the estimation accuracy of speech recognition through stable learning.
  • The learning device according to the present invention includes: a conversion unit that acquires an intermediate feature quantity sequence corresponding to a first acoustic feature quantity sequence when a conversion model parameter is given; a first estimation unit that estimates a first output probability distribution corresponding to the intermediate feature quantity sequence when a first estimation model parameter is given; a second estimation unit that estimates a second output probability distribution corresponding to the intermediate feature quantity sequence when a second estimation model parameter is given, on the basis of the intermediate feature quantity sequence, a character feature quantity obtained by converting the correct symbol sequence, and an attention weight having elements representing how strongly each frame of the first acoustic feature quantity sequence is related to the timing at which a symbol appears; a probability matrix calculation unit that calculates a probability matrix that is the sum over all symbols of the product of the second output probability distribution and the attention weight; a CTC loss calculation unit that calculates a CTC (Connectionist Temporal Classification) loss of the first output probability distribution with respect to the correct symbol sequence on the basis of the correct symbol sequence corresponding to the first acoustic feature quantity sequence and the first output probability distribution; a KLD loss calculation unit that calculates a KLD (Kullback-Leibler Divergence) loss of the first output probability distribution with respect to the probability matrix on the basis of the probability matrix and the first output probability distribution; a CE loss calculation unit that calculates a CE (Cross Entropy) loss of the second output probability distribution with respect to the correct symbol sequence on the basis of the correct symbol sequence corresponding to the first acoustic feature quantity sequence and the second output probability distribution; and a control unit that updates the conversion model parameter, the first estimation model parameter, and the second estimation model parameter on the basis of an integrated loss obtained by integrating the CTC loss, the KLD loss, and the CE loss, and repeats the processing of the conversion unit, the first estimation unit, the second estimation unit, the probability matrix calculation unit, the CTC loss calculation unit, the KLD loss calculation unit, and the CE loss calculation unit until an end condition is satisfied.
  • The speech recognition device according to the present invention estimates and outputs an output probability distribution corresponding to a second acoustic feature quantity sequence when the model parameters that satisfy the end condition in the learning device described above are given.
  • FIG. 1 is a diagram showing an example of the functional configuration of the learning device according to the first embodiment.
  • FIG. 2 is a flowchart showing a processing procedure of the learning process according to the first embodiment.
  • FIG. 3 is a diagram showing an example of the functional configuration of the learning device according to the modified example of the first embodiment.
  • FIG. 4 is a flowchart showing a processing procedure of the learning process according to the modified example of the first embodiment.
  • FIG. 5 is a diagram showing an example of the functional configuration of the voice recognition device according to the second embodiment.
  • FIG. 6 is a flowchart showing a processing procedure of the voice recognition process according to the second embodiment.
  • FIG. 7 is a diagram showing an example of a computer in which a learning device and a voice recognition device are realized by executing a program.
  • FIG. 1 is a diagram showing an example of the functional configuration of the learning device according to the first embodiment.
  • the learning device 1 according to the embodiment adopts a CTC / Attention model.
  • In the first embodiment, attention is focused on the attention weight of the Attention-based model.
  • This weight is calculated depending on the symbol output immediately before; it indicates which frame should be attended to for the output timing of the next label, and it is obtained by applying a softmax function to the intermediate feature quantities encoded by the voice distributed expression sequence conversion unit 101 (described later).
  • The attention weight has 1 × T (number of frames) dimensions, and if the model has been trained to high performance, the element of the frame to be attended to takes a very high value while the other frames take low values.
  • In the first embodiment, this behavior of the attention weight is exploited for learning the output of the CTC branch.
  • By improving the performance of the CTC branch, the expressive power of the voice distributed expression sequence conversion unit 101 is also improved, and as a result the output of the Attention-based model improves as well.
  • The learning device 1 trains the adopted CTC model and the Attention-based model at the same time. Specifically, the learning device 1 uses a training method in which the output of the Attention-based model is also learned at the same time as learning with the CTC loss. The learning device 1 then stabilizes training of the CTC model by using a loss function with which the CTC can be trained frame-by-frame.
  • the learning device 1 for example, a predetermined program is read into a computer or the like including a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), etc., and the CPU executes the predetermined program. It is realized by. Further, the learning device 1 has a communication interface for transmitting and receiving various information to and from other devices connected via a network or the like. For example, the learning device 1 has a NIC (Network Interface Card) or the like, and communicates with other devices via a telecommunication line such as a LAN (Local Area Network) or the Internet. The learning device 1 has a touch panel, a voice input device, an input device such as a keyboard and a mouse, and a display device such as a liquid crystal display, and inputs and outputs information.
  • The learning device 1 is a device that receives an acoustic feature quantity sequence X and the corresponding symbol sequence c = {c1, c2, ..., cN} (correct symbol sequence) as input, and generates and outputs a label sequence (word sequence {^w1, ^w2, ..., ^wN}) (output probability distribution) corresponding to the acoustic feature quantity sequence X.
  • N is a positive integer and represents the number of symbols included in the symbol series c.
  • the acoustic feature quantity series X is a series of time-series acoustic features extracted from a time-series acoustic signal such as voice.
  • the acoustic feature sequence X is a vector.
  • the symbol sequence c is a sequence of correct answer symbols represented by a time-series acoustic signal corresponding to the acoustic feature quantity sequence X. Examples of correct symbols are phonemes, letters, subwords, words, and so on.
  • An example of a symbol sequence is a vector.
  • the correct symbol corresponds to the acoustic feature sequence X, but it is not specified which frame (time point) of the acoustic feature sequence X each symbol included in the symbol sequence corresponds to.
  • Each unit is described below. Note that the units may be held by a plurality of devices in a distributed manner.
  • The learning device 1 includes a voice distributed expression sequence conversion unit 101 (conversion unit), a label estimation unit 102 (first estimation unit), a CTC loss calculation unit 103, a symbol distribution expression conversion unit 104, an attention weight calculation unit 105, a label estimation unit 106 (second estimation unit), a CE (Cross Entropy) loss calculation unit 107, a probability matrix calculation unit 108, a KLD (Kullback-Leibler Divergence) loss calculation unit 109, a correct answer accuracy calculation unit 110, a loss integration unit 111, and a control unit 112.
  • the acoustic feature quantity sequence X (first acoustic feature quantity sequence) is input to the voice distributed expression sequence conversion unit 101.
  • The voice distribution expression sequence conversion unit 101 obtains and outputs an intermediate feature quantity sequence H corresponding to the acoustic feature quantity sequence X when the conversion model parameter γ1 is given.
  • the voice distribution expression sequence conversion unit 101 is, for example, a multi-stage neural network, and functions as an encoder that converts the input acoustic feature amount into the intermediate feature amount sequence H by the multi-stage neural network.
  • the voice distribution expression sequence conversion unit 101 calculates the intermediate feature quantity sequence H by using, for example, the equation (17) of Non-Patent Document 4.
  • the voice distributed expression sequence conversion unit 101 may obtain an intermediate feature quantity sequence H by applying LSTM (Long short-term memory) to the acoustic feature quantity sequence X instead of the equation (17) of Non-Patent Document 4.
  • Reference 1: Sepp Hochreiter, Jurgen Schmidhuber, "Long Short-Term Memory", Neural Computation, 1997.
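  • As a concrete illustration of the role of unit 101, the following is a minimal sketch (not the patent's implementation) of an encoder that maps an acoustic feature quantity sequence X to an intermediate feature quantity sequence H with a stacked LSTM; all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Toy stand-in for the conversion unit 101:
    acoustic features X -> intermediate feature sequence H."""

    def __init__(self, feat_dim=80, hidden_dim=320, num_layers=4):
        super().__init__()
        # Multi-stage recurrent network (an LSTM here, as in Reference 1).
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)

    def forward(self, x):          # x: (batch, T, feat_dim)
        h, _ = self.lstm(x)        # h: (batch, T, hidden_dim)
        return h

# Example: a 100-frame utterance with 80-dimensional acoustic features.
X = torch.randn(1, 100, 80)
H = SpeechEncoder()(X)             # H is the intermediate feature quantity sequence
```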
  • the intermediate feature quantity sequence H encoded by the speech distribution expression sequence conversion unit 101 is input to the label estimation unit 102.
  • When the label estimation model parameter γ2 (first estimation model parameter) is given, the label estimation unit 102 calculates and outputs the output probability distribution Y of the CTC model corresponding to the intermediate feature quantity sequence H (the label sequence {^l1, ^l2, ..., ^lT} in the figure) (first output probability distribution).
  • the output probability distribution Y is calculated using, for example, the equation (16) of Non-Patent Document 4.
  • The number of dimensions of the output probability distribution Y is equal to (K + 1) × T.
  • (K + 1) is the number of output symbols (phonemes, letters, subwords, words, etc.) plus the redundant symbol “blank”.
  • T is the number of frames.
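  • A hedged sketch of this CTC branch: a single linear layer followed by a softmax produces, for every frame, a distribution over the K output symbols plus the blank, giving an output probability distribution Y of size (K + 1) × T. The use of one linear layer and the sizes below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 50                                   # number of output symbols (assumed)
hidden_dim = 320                         # must match the encoder output size

ctc_head = nn.Linear(hidden_dim, K + 1)  # +1 for the redundant "blank" symbol

def ctc_branch(H):
    """H: (batch, T, hidden_dim) -> per-frame distribution Y: (batch, T, K+1)."""
    logits = ctc_head(H)
    return F.softmax(logits, dim=-1)     # each frame sums to 1 over the K+1 classes

Y = ctc_branch(torch.randn(1, 100, hidden_dim))   # Y.shape == (1, 100, K + 1)
```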
  • the CTC loss calculation unit 103 calculates and outputs the CTC loss L CTC of the output probability distribution Y with respect to the symbol series c based on the output probability distribution Y output from the label estimation unit 102 and the symbol series c.
  • The CTC loss calculation unit 103 creates a trellis whose vertical axis is the symbols of the symbol sequence c interleaved with the redundant label and whose horizontal axis is time, and calculates the path of the optimal transition probability based on the forward-backward algorithm (for the detailed calculation process, see "4. Training the Network" of Non-Patent Document 1).
  • CTC loss L CTC is calculated using, for example, the equation (14) of Non-Patent Document 1.
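  • The CTC loss (equation (14) of Non-Patent Document 1) marginalizes over all blank-augmented alignments with the forward-backward algorithm; as a sketch, an off-the-shelf implementation such as torch.nn.CTCLoss can stand in for the CTC loss calculation unit 103. The symbol indices, lengths, and choice of blank index below are made up for illustration.

```python
import torch
import torch.nn as nn

K = 50
T, N = 100, 12                      # number of frames and of correct symbols
ctc_loss_fn = nn.CTCLoss(blank=K)   # treat index K as the redundant "blank"

log_Y = torch.randn(T, 1, K + 1).log_softmax(-1)  # (T, batch, K+1) frame-wise log-probs
c = torch.randint(0, K, (1, N))                   # correct symbol sequence c (no blanks)

L_ctc = ctc_loss_fn(log_Y,                # output probability distribution Y (log domain)
                    c,                    # target symbols
                    torch.tensor([T]),    # input length per utterance
                    torch.tensor([N]))    # target length per utterance
```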
  • The output probability distribution Z (second output probability distribution) output from the label estimation unit 106 (described later) and the symbol sequence c (for example, phonemes) are input to the symbol distribution expression conversion unit 104.
  • When the character feature amount estimation model parameter γ3 is given, the symbol distribution expression conversion unit 104 converts the symbol sequence c into a character feature amount C, which is a continuous-valued feature amount corresponding to the output probability distribution Z, and outputs it.
  • An example of the character feature amount C is a one-hot vector representation.
  • the calculation of the character feature amount C using the output probability distribution Z is performed, for example, by the equation (4) of Non-Patent Document 2.
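  • As an illustrative sketch of unit 104 (not the formulation of Non-Patent Document 2): a symbol can be turned into a one-hot vector and then projected to a continuous character feature amount C by a learnable matrix, which here plays the role of the parameter γ3; the embedding size is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 50            # number of symbol entries (assumed)
embed_dim = 128   # dimensionality of the character feature C (assumed)

# Learnable projection standing in for the character feature estimation
# model parameter gamma_3.
char_proj = nn.Linear(K, embed_dim, bias=False)

def symbol_to_char_feature(symbol_ids):
    """symbol_ids: (N,) integer symbols -> C: (N, embed_dim) continuous features."""
    one_hot = F.one_hot(symbol_ids, num_classes=K).float()  # one-hot representation
    return char_proj(one_hot)                               # continuous character feature C

C = symbol_to_char_feature(torch.tensor([3, 17, 8]))
```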
  • The attention weight calculation unit 105 uses the intermediate feature quantity sequence H output from the speech distribution expression sequence conversion unit 101, the recursive intermediate output S of the neural network of the label estimation unit 106 (described later), and the attention weight α_{n-1} used in the immediately preceding label sequence estimation to calculate the attention weight α_n (a vector) used when estimating the next label (for the detailed calculation process, see "2.1 General Framework" in "2 Attention-Based Model for Speech Recognition" of Non-Patent Document 2).
  • The attention weight α_n has elements representing how strongly each frame of the acoustic feature quantity sequence X is related to the timing at which a symbol appears. Here, n denotes the position of the output probability distribution Z in chronological order. In general, the number of dimensions of α is 1 × T (number of frames). There are also models from which a plurality (J) of attention weights α_n can be obtained.
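  • A minimal sketch of a content-based attention weight in the spirit of unit 105: a score is computed for every frame of H from H and the decoder's recursive intermediate output S (the previous weight α_{n-1} is omitted in this simplified sketch), and a softmax over the T frames yields α_n with shape 1 × T. The additive scoring function and the sizes are assumptions; the patent refers to Non-Patent Document 2 for the exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, dec_dim, att_dim = 320, 256, 128   # assumed sizes

W_h = nn.Linear(hidden_dim, att_dim, bias=False)
W_s = nn.Linear(dec_dim, att_dim, bias=False)
v = nn.Linear(att_dim, 1, bias=False)

def attention_weight(H, s):
    """H: (T, hidden_dim) encoder outputs, s: (dec_dim,) decoder state
    -> alpha_n: (T,) weights that sum to 1 over the T frames."""
    scores = v(torch.tanh(W_h(H) + W_s(s))).squeeze(-1)  # one score per frame
    return F.softmax(scores, dim=-1)                     # peaked at the relevant frame

alpha_n = attention_weight(torch.randn(100, hidden_dim), torch.randn(dec_dim))
```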
  • The intermediate feature quantity sequence H, the character feature amount C, and the attention weight α_n are input to the label estimation unit 106. When the label estimation model parameter (second estimation model parameter) is given, the label estimation unit 106 uses the intermediate feature quantity sequence H, the character feature amount C, and the attention weight α_n to calculate the output probability distribution Z of the Attention-based model corresponding to the intermediate feature quantity sequence H (the label sequence {^l1, ^l2, ..., ^lN} in the figure).
  • The number of input symbols is known because it is the same as the dimension of the symbol sequence c = {c1, c2, ..., cN}, and the number of dimensions of the output probability distribution Z is K (number of symbol entries) × N (number of symbols).
  • the generation of the output probability distribution Z is performed, for example, according to the equations (2) and (3) of Non-Patent Document 2.
  • Based on the symbol sequence c and the output probability distribution Z output from the label estimation unit 106, the CE loss calculation unit 107 calculates the CE loss L_S2S of the output probability distribution Z with respect to the symbol sequence c. Since the output probability distribution Z and the symbol sequence c have the same output-label dimension N, the loss can be calculated with a cross-entropy error function.
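  • Because Z and c share the output-label dimension N, the CE loss L_S2S can be sketched as a standard cross-entropy between the N per-symbol distributions and the N correct symbols; shapes and values below are assumed for illustration.

```python
import torch
import torch.nn.functional as F

K, N = 50, 12                       # symbol entries and number of symbols (assumed)
logits_Z = torch.randn(N, K)        # unnormalized scores behind Z, one row per symbol
c = torch.randint(0, K, (N,))       # correct symbol sequence

# Cross-entropy error function between the Attention-based model output and c.
L_s2s = F.cross_entropy(logits_Z, c)
```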
  • In training the CTC model, the learning device 1 uses, in addition to the CTC loss, a KLD (Kullback-Leibler Divergence) loss computed against a probability matrix P that is created from the attention weight α, which is an output of the Attention-based model, and the output probability distribution Z, and thereby trains the speech distributed expression sequence conversion unit 101, the label estimation units 102 and 106, the symbol distribution expression conversion unit 104, and the attention weight calculation unit 105. The calculation of the probability matrix P and of the KLD loss L_KLD, which is the loss of the output probability distribution Y with respect to the probability matrix P, is described below.
  • The probability matrix calculation unit 108 calculates the probability matrix P using the output probability distribution Z output from the label estimation unit 106 and the attention weight α_n output from the attention weight calculation unit 105.
  • The number of dimensions of the output probability distribution Z (z_n) is K (number of symbol entries) × N (number of symbols), and the attention weight α_n is a 1 × T (number of frames) vector, of which there are N (number of symbols). Therefore, the probability matrix calculation unit 108 calculates the probability matrix P by the following equation (1). The probability matrix P is the sum over all symbols of the product of the output probability distribution Z and the attention weight α_n.
  • Here, the probability matrix P is given by equation (2), the output probability distribution z_n by equation (3), and the attention weight α_n by equation (4).
  • The number of dimensions of the probability matrix P is K (number of symbol entries) × T (number of frames). The probability matrix calculation unit 108 also adds the redundant symbol to the dimension of the output probability distribution Z of the Attention-based model (so that it has K + 1 dimensions). Further, since it is desirable that the entries of the probability matrix P sum to 1 in each frame, normalization is performed as in the following equation (5).
  • Since there are models from which a plurality of attention weights are obtained, as described in Non-Patent Document 3, a plurality (J) of probability matrices P may be obtained.
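  • A sketch of the probability matrix calculation described above: each z_n (a K-dimensional distribution) is combined with its attention weight α_n (a T-dimensional vector) by an outer product, the results are summed over the N symbols, a row for the blank symbol is appended to reach K + 1 dimensions, and each frame (column) is normalized to sum to 1. Filling the blank row with zeros before normalization is an assumption; the patent's equations (1)–(5) define the exact operations.

```python
import torch

K, N, T = 50, 12, 100                         # assumed sizes
Z = torch.softmax(torch.randn(N, K), -1)      # z_n: one K-dim distribution per symbol
alpha = torch.softmax(torch.randn(N, T), -1)  # alpha_n: one 1xT weight per symbol

# Equation (1): sum over all symbols n of the outer product of z_n (K) and alpha_n (T).
P = torch.einsum('nk,nt->kt', Z, alpha)       # (K, T)

# Add a row for the redundant "blank" symbol so P has (K + 1) x T dimensions
# (filled with zeros here -- an assumption about how the blank row is handled).
P = torch.cat([P, torch.zeros(1, T)], dim=0)

# Equation (5): normalize so that each frame (column) of P sums to 1.
P = P / P.sum(dim=0, keepdim=True).clamp_min(1e-8)
```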
  • Based on the probability matrices P (J matrices) and the output probability distribution Y, the KLD loss calculation unit 109 calculates the KLD loss L_KLD of the output probability distribution Y with respect to the probability matrix P using the following equation (6).
  • The KLD loss L_KLD indicates how far apart the output probability distribution Y of the model being trained and the probability matrices P (J matrices) are.
  • The KLD loss L_KLD can be calculated even in the case where a plurality (J) of probability matrices P are obtained because a plurality of attention weights are obtained; when only a single attention weight is obtained, J = 1.
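  • Given P and the per-frame CTC output Y, the KLD loss of equation (6) can be sketched as the Kullback-Leibler divergence between the two frame-wise distributions, averaged over frames (and over the J probability matrices when several attention heads are used). Treating P as the target distribution and averaging this way are assumptions about the exact form of equation (6).

```python
import torch
import torch.nn.functional as F

K, T = 50, 100
Y = torch.softmax(torch.randn(T, K + 1), -1)   # CTC output, one distribution per frame
P = torch.softmax(torch.randn(T, K + 1), -1)   # probability matrix, transposed to (T, K+1)

# KL(P || Y) averaged over frames: F.kl_div expects log-probabilities as input
# and probabilities as target.
L_kld = F.kl_div(Y.log(), P, reduction='batchmean')

# With J probability matrices P_1 .. P_J, the same term could be averaged over j.
```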
  • the correct answer accuracy calculation unit 110 calculates the correct answer accuracy Acc by inputting the output probability distribution Z of the Attention-based model and the symbol sequence c which is the correct answer symbol sequence.
  • The correct answer accuracy calculation unit 110 returns to the loss integration unit 111, as the correct answer accuracy Acc, the match rate between the sequence Z' obtained by applying argmax to the output probability distribution Z of the Attention-based model being trained (taking the element ID of the maximum value along the class axis) and the correct symbol sequence c. Therefore, the range that the correct answer accuracy Acc can take is 0 ≤ Acc ≤ 1.
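  • A sketch of this computation: argmax along the class axis of Z gives Z', and Acc is the fraction of positions where Z' matches the correct symbol sequence c, so it always lies in [0, 1]. Shapes are assumed.

```python
import torch

K, N = 50, 12
Z = torch.softmax(torch.randn(N, K), -1)   # Attention-based model output
c = torch.randint(0, K, (N,))              # correct symbol sequence

Z_prime = Z.argmax(dim=-1)                 # element ID of the maximum over the class axis
acc = (Z_prime == c).float().mean()        # match rate, in [0, 1]
```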
  • the loss integration unit 111 calculates the loss of the CTC / Attention model.
  • The loss integration unit 111 applies the CTC loss L_CTC output from the CTC loss calculation unit 103, the CE loss L_S2S output from the CE loss calculation unit 107, the KLD loss L_KLD output from the KLD loss calculation unit 109, the correct answer accuracy Acc output from the correct answer accuracy calculation unit 110, and the coefficients λ, η, ξ (0 ≤ λ ≤ 1, 0 ≤ η ≤ 1 − λ) to the following equation (7), integrating the losses to obtain the loss L_S2S+CTC+KLD of the CTC/Attention model.
  • The coefficient ξ in equation (7) is given by equation (8) and consists of η and Acc(*). Here, η is a hyperparameter (0 ≤ η ≤ 1 − λ) determined manually, and Acc(*) is the correct answer accuracy obtained by referring to the output probability distribution Z of the Attention-based model and the correct answer symbols. Therefore, as learning progresses and the accuracy improves, the influence of the ξ·L_KLD term becomes larger.
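  • The exact equations (7) and (8) are not reproduced in this text, so the following is only an assumed weighted-sum form of the integrated loss that is consistent with the description: λ trades off the CTC loss against the CE loss, and the KLD term is scaled by ξ = η · Acc, so its influence grows as the accuracy improves.

```python
def integrated_loss(L_ctc, L_s2s, L_kld, acc, lam=0.3, eta=0.5):
    """Hypothetical form of equations (7)/(8): a weighted sum whose KLD
    coefficient xi = eta * acc grows as the attention branch gets more accurate."""
    assert 0.0 <= lam <= 1.0 and 0.0 <= eta <= 1.0 - lam
    xi = eta * acc                                            # assumed equation (8)
    return lam * L_ctc + (1.0 - lam) * L_s2s + xi * L_kld     # assumed equation (7)
```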
  • Using the loss L_S2S+CTC+KLD, the control unit 112 updates the conversion model parameter γ1 of the voice distribution expression sequence conversion unit 101, the label estimation model parameter γ2 of the label estimation unit 102, the character feature amount estimation model parameter γ3 of the symbol distribution expression conversion unit 104, the model parameter of the attention weight calculation unit 105, and the label estimation model parameter of the label estimation unit 106.
  • The control unit 112 repeats the processing by the voice distribution expression sequence conversion unit 101, the label estimation unit 102, the CTC loss calculation unit 103, the symbol distribution expression conversion unit 104, the attention weight calculation unit 105, the label estimation unit 106, the CE loss calculation unit 107, the probability matrix calculation unit 108, the KLD loss calculation unit 109, the correct answer accuracy calculation unit 110, and the loss integration unit 111 until a predetermined end condition is satisfied.
  • The end condition is not limited; it may be, for example, that the number of iterations has reached a threshold, that the amount of change in the integrated loss before and after an iteration has become equal to or less than a threshold, or that the amount of change in the conversion model parameter γ1 and the label estimation model parameter γ2 before and after an iteration has become equal to or less than a threshold.
  • When the end condition is satisfied, the voice distribution expression sequence conversion unit 101 outputs the conversion model parameter γ1, and the label estimation unit 102 outputs the label estimation model parameter γ2.
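  • Putting the pieces together, the control unit 112 can be sketched as an ordinary gradient-based training loop that recomputes the losses, updates all model parameters with the integrated loss, and stops when an end condition is met (here simply a fixed number of iterations, one of the examples given above). The optimizer choice and the helper functions are assumptions tying the earlier sketches together, not the patent's implementation.

```python
import torch

# `model` is assumed to bundle units 101, 102, 104, 105, and 106;
# `compute_losses` is a hypothetical helper returning the four quantities above.
def train(model, data_loader, compute_losses, max_iters=100_000):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step, (X, c) in enumerate(data_loader):
        L_ctc, L_s2s, L_kld, acc = compute_losses(model, X, c)
        loss = integrated_loss(L_ctc, L_s2s, L_kld, acc)   # from the sketch above
        optimizer.zero_grad()
        loss.backward()        # gradients w.r.t. gamma_1, gamma_2, gamma_3, ...
        optimizer.step()       # update all model parameters jointly
        if step + 1 >= max_iters:          # example end condition
            break
    return model
```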
  • FIG. 2 is a flowchart showing a processing procedure of the learning process according to the first embodiment.
  • the voice distributed expression sequence conversion unit 101 converts the acoustic feature quantity sequence X into the corresponding intermediate feature quantity sequence H. (Step S1).
  • the label estimation unit 102 performs the first estimation process for calculating the next output probability distribution Y of the CTC model corresponding to the intermediate feature quantity series H (step S2).
  • the CTC loss calculation unit 103 inputs the output probability distribution Y and the symbol series c, and performs a CTC loss calculation process for calculating the CTC loss L CTC of the output probability distribution Y with respect to the symbol series c (step S3).
  • the symbol distributed expression conversion unit 104 receives the input of the output probability distribution Z and the symbol series c output from the label estimation unit 106, and performs the symbol distributed expression conversion process of converting to the character feature amount C (step S4).
  • The attention weight calculation unit 105 performs attention weight calculation processing that uses the intermediate feature quantity sequence H, the recursive intermediate output S of the neural network of the label estimation unit 106, and the attention weight α_{n-1} used for the previous label sequence estimation to calculate the attention weight α_n used when estimating the next label (step S5).
  • The label estimation unit 106 performs a second estimation process for calculating the next output probability distribution Z of the Attention-based model using the intermediate feature quantity sequence H, the character feature amount C, and the attention weight α_n (step S6).
  • the CE loss calculation unit 107 performs a CE loss calculation process for calculating the CE loss LS2S of the output probability distribution Z with respect to the symbol series c based on the output probability distribution Z and the symbol series c (step S7).
  • the probability matrix calculation unit 108 calculates the probability matrix P based on the output probability distribution Z output from the label estimation unit 106 and the attention weight ⁇ n output from the attention weight calculation unit 105. (Step S8).
  • the KLD loss calculation unit 109 performs a KLD loss calculation process for calculating the KLD loss L KLD , which is the loss of the output probability distribution Y with respect to the probability matrix P (step S9).
  • the correct answer accuracy calculation unit 110 performs a correct answer accuracy calculation process for calculating the correct answer accuracy Acc based on the output probability distribution Z and the symbol series c (step S10).
  • Based on the CTC loss L_CTC, the CE loss L_S2S, the KLD loss L_KLD, the correct answer accuracy Acc, and the coefficients λ, η, ξ (0 ≤ λ ≤ 1, 0 ≤ η ≤ 1 − λ), the loss integration unit 111 performs loss integration processing that integrates the losses to obtain the integrated loss L_S2S+CTC+KLD (step S11).
  • The control unit 112 updates the model parameters of the voice distribution expression sequence conversion unit 101, the label estimation unit 102, the symbol distribution expression conversion unit 104, the attention weight calculation unit 105, and the label estimation unit 106 using the loss L_S2S+CTC+KLD (step S12).
  • the control unit 112 repeats each of the above processes until a predetermined end condition is satisfied.
  • In this way, the loss L_S2S+CTC+KLD of the CTC/Attention model is obtained by integrating the CTC loss L_CTC, the CE loss L_S2S, and the KLD loss L_KLD, and this loss L_S2S+CTC+KLD is used to update the model parameters of the label estimation units 102 and 106 and of the speech distributed expression sequence conversion unit 101 that they share. That is, the shared speech distributed expression sequence conversion unit 101 is trained with the losses related to the label estimation units 102 and 106. In other words, the speech distributed expression sequence conversion unit 101 is trained to output an intermediate feature quantity sequence H that raises the accuracy of the estimation results of the label estimation units 102 and 106, so the overall estimation accuracy can be improved.
  • The learning device 1 can thus stabilize the training of the CTC model and improve the estimation accuracy of speech recognition by using a loss function (see equation (7)) with which the CTC can be trained frame-by-frame.
  • FIG. 3 is a diagram showing an example of the functional configuration of the learning device according to the modified example of the first embodiment.
  • the probability matrix generated by the attention-based model may contain errors.
  • Therefore, the learning device 1A according to the modified example of the first embodiment is provided with a label estimation unit 102A (third estimation unit) separately from the label estimation unit 102 (first estimation unit), which also makes it possible to update the parameters for each loss.
  • In the learning device 1A, the CTC loss calculation unit 103 calculates the CTC loss L_CTC based on the output probability distribution Y estimated by the label estimation unit 102, and the KLD loss calculation unit 109 calculates the KLD loss L_KLD based on the output probability distribution Y estimated by the label estimation unit 102A.
  • FIG. 4 is a flowchart showing a processing procedure of the learning process according to the modified example of the first embodiment. Steps S21 to S28 shown in FIG. 4 are the same processes as steps S1 to S8 shown in FIG.
  • the label estimation unit 102A calculates the next output probability distribution Y of the CTC model corresponding to the intermediate feature quantity series H, and performs a third label estimation process to output to the KLD loss calculation unit 109 (step S29).
  • the KLD loss calculation unit 109 calculates the KLD loss L KLD using the output probability distribution Y output from the label estimation unit 102A (step S30).
  • Steps S31 to S33 shown in FIG. 4 are the same processes as steps S10 to S12 shown in FIG.
  • FIG. 5 is a diagram showing an example of the functional configuration of the voice recognition device according to the second embodiment.
  • FIG. 6 is a flowchart showing a processing procedure of the voice recognition process according to the second embodiment.
  • the voice recognition device 3 has a voice distribution expression sequence conversion unit 301 and a label estimation unit 302.
  • the voice distributed expression sequence conversion unit 301 is the same as the above-mentioned voice distributed expression sequence conversion unit 101 except that the conversion model parameter ⁇ 1 output from the learning device 1 or the learning device 1A is input and set.
  • the label estimation unit 302 is the same as the label estimation unit 102 described above, except that the label estimation model parameter ⁇ 2 output from the learning device 1 or the learning device 1A is input and set.
  • The acoustic feature quantity sequence X'' (second acoustic feature quantity sequence) to be recognized is input to the voice distributed expression sequence conversion unit 301. When the conversion model parameter γ1 is given, the voice distributed expression sequence conversion unit 301 obtains and outputs an intermediate feature quantity sequence H'' corresponding to the acoustic feature quantity sequence X'' (step S41).
  • The intermediate feature quantity sequence H'' output from the voice distributed expression sequence conversion unit 301 is input to the label estimation unit 302. When the label estimation model parameter γ2 is given, the label estimation unit 302 obtains the label sequence {^l1, ^l2, ..., ^lF} (output probability distribution) corresponding to the intermediate feature quantity sequence H'' and outputs it as the speech recognition result (step S42).
  • model parameters optimized by the learning device 1 or the learning device 1A using the loss LS2S + CTC + KLD are set in the label estimation unit 302 and the speech distribution expression sequence conversion unit 301. Therefore, the voice recognition process can be performed with high accuracy.
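  • As an illustrative sketch of the recognition device at inference time: the trained encoder (unit 301) and CTC label estimator (unit 302) produce a per-frame distribution, and a simple greedy CTC decoding (argmax per frame, collapse repeats, drop blanks) turns it into a symbol sequence. Greedy decoding is only one possible read-out and is an assumption here, not the patent's prescribed decoder.

```python
import torch

def recognize(encoder, ctc_head, X, blank_id):
    """X: (1, T, feat_dim) acoustic features -> list of recognized symbol IDs."""
    with torch.no_grad():
        H = encoder(X)                          # intermediate feature sequence H''
        Y = ctc_head(H).softmax(-1)             # per-frame output probability distribution
    best = Y[0].argmax(dim=-1).tolist()         # most likely symbol per frame

    # Greedy CTC decoding: merge repeated symbols, then remove blanks.
    result, prev = [], None
    for k in best:
        if k != prev and k != blank_id:
            result.append(k)
        prev = k
    return result
```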
  • Each component of the learning device 1 and the speech recognition device 3 is a functional concept and does not necessarily have to be physically configured as shown in the figures. That is, the specific form of distribution and integration of the functions of the learning device 1 and the speech recognition device 3 is not limited to that shown in the figures, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
  • All or an arbitrary part of each process performed in the learning device 1 and the speech recognition device 3 may be realized by a CPU or a GPU (Graphics Processing Unit) and a program analyzed and executed by the CPU or GPU. Each process performed by the learning device 1 and the speech recognition device 3 may also be realized as hardware by wired logic.
  • FIG. 7 is a diagram showing an example of a computer in which the learning device 1 and the voice recognition device 3 are realized by executing the program.
  • the computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
  • Memory 1010 includes ROM 1011 and RAM 1012.
  • the ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to the hard disk drive 1090.
  • the disk drive interface 1040 is connected to the disk drive 1100.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • the video adapter 1060 is connected to, for example, the display 1130.
  • the hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the learning device 1 and the voice recognition device 3 is implemented as a program module 1093 in which a code that can be executed by the computer 1000 is described.
  • the program module 1093 is stored in, for example, the hard disk drive 1090.
  • the program module 1093 for executing the same processing as the functional configuration in the learning device 1 and the voice recognition device 3 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.
  • the program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A learning device (1) acquires an intermediate feature quantity sequence corresponding to an acoustic feature quantity sequence, estimates a first output probability distribution corresponding to the intermediate feature quantity sequence when a first estimation model parameter is given, estimates a second output probability distribution corresponding to the intermediate feature quantity sequence when a second estimation model parameter is given on the basis of the intermediate feature quantity sequence, a character feature quantity obtained by converting a correct answer symbol sequence, and an attention weight having an element indicating the degree of relevance of each frame of the acoustic feature quantity sequence to a timing at which a symbol appears, calculates a CTC loss of the first output probability distribution with respect to the correct answer symbol sequence, calculates a KLD loss of the first output probability distribution with respect to a probability matrix that is a total sum of products of the second output probability distribution and the attention weight of all symbols, calculates a CE loss of the second output probability distribution with respect to the correct answer symbol sequence, and updates the model parameters on the basis of an integrated loss obtained by integrating the CTC loss, the KLD loss, and the CE loss.

Description

Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program
The present invention relates to a learning device, a speech recognition device, a learning method, a speech recognition method, a learning program, and a speech recognition program.
In recent speech recognition systems using neural networks, a word sequence can be output directly from an acoustic feature quantity sequence. For example, Non-Patent Document 1 describes a method of training a neural network (NN) for speech recognition using Connectionist Temporal Classification (CTC) (see the sections "3. Connectionist Temporal Classification" and "4. Training the Network" of Non-Patent Document 1). In training this CTC model, if a phoneme, character, subword, or word sequence (≠ frame-by-frame) corresponding to the content of the speech is prepared, introducing the "blank" symbol that represents redundancy makes it possible to learn the correspondence between the speech and the output sequence dynamically from the training data.
However, training a CTC model that dynamically assigns phonemes, characters, subwords, words, and the blank symbol to each speech frame is more difficult than training the acoustic model of a conventional speech recognition system.
On the other hand, in recent years a speech recognition model called the Attention-based model (see Non-Patent Documents 2 and 3), which performs better than CTC and can directly output characters, subwords, and words, has been proposed.
For the Attention-based model, pairs of a feature quantity (a real-valued vector) extracted in advance from each sample of the training data and the correct answer unit number corresponding to each feature quantity, together with an appropriate initial model, are prepared. As the initial model, a neural network whose parameters are initialized with random numbers, or a neural network that has already been trained on other training data, can be used. A speech recognition device to which the Attention-based model is applied extracts an intermediate feature quantity corresponding to the input dimension from the input feature quantity, converts input characters into one-hot vectors, and, on the basis of these outputs, predicts the next label while taking into account the label sequence up to the immediately preceding label. In this speech recognition device, the parameters used in each process are computed from the training data so that labels become easier to identify in the label estimation process.
Here, in the CTC model the input symbol sequence c and the output label sequence w have the same sequence length, so training is possible in a relatively short time using a cross-entropy error function, whereas the Attention-based model has the problem that it takes a long time to train.
Therefore, in recent years a method of improving the training speed and performance of the Attention-based model with the Hybrid CTC/Attention model (see Non-Patent Document 4), which combines the above models, has been proposed. The Hybrid CTC/Attention model shares the speech distributed representation sequence conversion function, and its intermediate output is used to compute the outputs of both the CTC model and the Attention model. The loss functions are combined by a weighted sum, and the integrated loss value is used to train the entire model. For each pair of a feature quantity of the training data and the corresponding correct answer unit number, the above-described extraction of intermediate feature quantities, output probability calculation, and model update are repeated, and the model obtained when a predetermined number of iterations (usually tens of millions to hundreds of millions) has been completed is used as the trained model.
However, although the Hybrid CTC/Attention model improves quickly in early training owing to the CTC loss, the CTC loss itself has no framework for stabilizing learning, so sufficient recognition performance may not be obtained.
The present invention has been made in view of the above, and an object of the present invention is to provide a learning device, a speech recognition device, a learning method, a speech recognition method, a learning program, and a speech recognition program capable of improving the estimation accuracy of speech recognition through stable learning.
To solve the above-described problems and achieve the object, the learning device according to the present invention includes: a conversion unit that acquires an intermediate feature quantity sequence corresponding to a first acoustic feature quantity sequence when a conversion model parameter is given; a first estimation unit that estimates a first output probability distribution corresponding to the intermediate feature quantity sequence when a first estimation model parameter is given; a second estimation unit that estimates a second output probability distribution corresponding to the intermediate feature quantity sequence when a second estimation model parameter is given, on the basis of the intermediate feature quantity sequence, a character feature quantity obtained by converting the correct symbol sequence, and an attention weight having elements representing how strongly each frame of the first acoustic feature quantity sequence is related to the timing at which a symbol appears; a probability matrix calculation unit that calculates a probability matrix that is the sum over all symbols of the product of the second output probability distribution and the attention weight; a CTC loss calculation unit that calculates a CTC (Connectionist Temporal Classification) loss of the first output probability distribution with respect to the correct symbol sequence on the basis of the correct symbol sequence corresponding to the first acoustic feature quantity sequence and the first output probability distribution; a KLD loss calculation unit that calculates a KLD (Kullback-Leibler Divergence) loss of the first output probability distribution with respect to the probability matrix on the basis of the probability matrix and the first output probability distribution; a CE loss calculation unit that calculates a CE (Cross Entropy) loss of the second output probability distribution with respect to the correct symbol sequence on the basis of the correct symbol sequence corresponding to the first acoustic feature quantity sequence and the second output probability distribution; and a control unit that updates the conversion model parameter, the first estimation model parameter, and the second estimation model parameter on the basis of an integrated loss obtained by integrating the CTC loss, the KLD loss, and the CE loss, and repeats the processing of the conversion unit, the first estimation unit, the second estimation unit, the probability matrix calculation unit, the CTC loss calculation unit, the KLD loss calculation unit, and the CE loss calculation unit until an end condition is satisfied.
The speech recognition device according to the present invention estimates and outputs an output probability distribution corresponding to a second acoustic feature quantity sequence when the model parameters that satisfy the end condition in the learning device described above are given.
According to the present invention, it is possible to improve the estimation accuracy of speech recognition through stable learning.
FIG. 1 is a diagram showing an example of the functional configuration of the learning device according to the first embodiment. FIG. 2 is a flowchart showing the processing procedure of the learning process according to the first embodiment. FIG. 3 is a diagram showing an example of the functional configuration of the learning device according to a modified example of the first embodiment. FIG. 4 is a flowchart showing the processing procedure of the learning process according to the modified example of the first embodiment. FIG. 5 is a diagram showing an example of the functional configuration of the speech recognition device according to the second embodiment. FIG. 6 is a flowchart showing the processing procedure of the speech recognition process according to the second embodiment. FIG. 7 is a diagram showing an example of a computer that realizes the learning device and the speech recognition device by executing a program.
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals. In the following, for a vector, matrix, or scalar A, the notation "^A" is equivalent to a symbol in which "^" is written immediately above "A".
[実施の形態1]
 まず、実施の形態1として、音声認識モデルの学習を行う学習装置について説明する。
[Embodiment 1]
First, as the first embodiment, a learning device for learning a speech recognition model will be described.
[学習装置]
 図1は、実施の形態1に係る学習装置の機能構成の一例を示す図である。本実施の形態1では、実施の形態に係る学習装置1は、CTC/Attentionモデルを採用する。
[Learning device]
FIG. 1 is a diagram showing an example of the functional configuration of the learning device according to the first embodiment. In the first embodiment, the learning device 1 according to the embodiment adopts a CTC / Attention model.
 本実施の形態1では、Attention-based modelのAttention重みに着目する。この重みは、直前に出力したシンボルに依存して算出されており、次のラベルの出力するタイミングをどこのフレームに着目すべきかを音声分散表現系列変換部101(後述)によりエンコードされた中間特徴量からソフトマックス関数により求めたものである。このAttention重みの次元数は1×T(フレーム数)であり、高性能となるように学習されていれば、着目すべきフレームの要素の値は非常に高く、それ以外のフレームでは低い値をとる。実施の形態1では、このAttention重みの振る舞いをCTC branchの出力の学習に活用する。そして、実施の形態1では、CTC branchの性能改善により、音声分散表現系列変換部101の表現力も改善し、結果としてAttention-based modelの出力も改善する。 In the first embodiment, attention is paid to the Attention weight of the Attention-based model. This weight is calculated depending on the symbol output immediately before, and is an intermediate feature encoded by the voice distribution expression series conversion unit 101 (described later) as to which frame should be focused on when the output timing of the next label should be focused. It is obtained from the quantity by the softmax function. The number of dimensions of this Attention weight is 1 × T (number of frames), and if it is learned to have high performance, the value of the element of the frame to be noted is very high, and the value of the other frames is low. Take. In the first embodiment, the behavior of this Attention weight is utilized for learning the output of the CTC branch. Then, in the first embodiment, by improving the performance of the CTC branch, the expressive power of the voice distributed expression series conversion unit 101 is also improved, and as a result, the output of the Attention-based model is also improved.
 学習装置1は、採用するCTCモデルとAttention-based modelとを同時に学習する。具体的には、学習装置1は、CTC損失の学習にAttention-based modelの出力についても同時に学習する学習方法を提案する。そして、学習装置1は、CTCをframe-by-frameに学習可能な損失関数を用いることで、CTCモデルの学習を安定化させる。 The learning device 1 learns the CTC model to be adopted and the Attention-based model at the same time. Specifically, the learning device 1 proposes a learning method for learning the output of the Attention-based model at the same time as learning the CTC loss. Then, the learning device 1 stabilizes the learning of the CTC model by using a loss function that can learn the CTC frame-by-frame.
 学習装置1は、例えば、ROM(Read Only Memory)、RAM(Random Access Memory)、CPU(Central Processing Unit)等を含むコンピュータ等に所定のプログラムが読み込まれて、CPUが所定のプログラムを実行することで実現される。また、学習装置1は、ネットワーク等を介して接続された他の装置との間で、各種情報を送受信する通信インタフェースを有する。例えば、学習装置1は、NIC(Network Interface Card)等を有し、LAN(Local Area Network)やインターネットなどの電気通信回線を介した他の装置との間の通信を行う。そして、学習装置1は、タッチパネル、音声入力デバイス、キーボードやマウス等の入力デバイス、液晶ディスプレイなどの表示装置を有し、情報の入出力を行う。 In the learning device 1, for example, a predetermined program is read into a computer or the like including a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), etc., and the CPU executes the predetermined program. It is realized by. Further, the learning device 1 has a communication interface for transmitting and receiving various information to and from other devices connected via a network or the like. For example, the learning device 1 has a NIC (Network Interface Card) or the like, and communicates with other devices via a telecommunication line such as a LAN (Local Area Network) or the Internet. The learning device 1 has a touch panel, a voice input device, an input device such as a keyboard and a mouse, and a display device such as a liquid crystal display, and inputs and outputs information.
 学習装置1は、音響特徴量系列Xと、それに対応するシンボル系列c{c,c,・・・,c}(正解シンボル系列)とを入力とし、音響特徴量系列Xに対応するラベル系列(単語系列{^w,^w,・・・,^w})(出力確率分布)を生成して出力する装置である。ただし、Nは正整数であり、シンボル系列cに含まれたシンボルの個数を表す。音響特徴量系列Xは、音声などの時系列音響信号から抽出された時系列の音響特徴量の系列である。音響特徴量系列Xの例はベクトルである。シンボル系列cは、音響特徴量系列Xに対応する時系列の音響信号が表す正解シンボルの系列である。正解シンボルの例は、音素、文字、サブワード、単語などである。シンボル系列の例はベクトルである。正解シンボルは音響特徴量系列Xに対応するが、シンボル系列に含まれる各シンボルが音響特徴量系列Xのどのフレーム(時点)に対応しているのかは特定されていない。以下では、各部について説明する。なお、上述した各部は、複数の装置が分散して保持してもよい。 The learning device 1 inputs the acoustic feature sequence X and the corresponding symbol sequence c {c 1 , c 2 , ..., C N } (correct symbol sequence), and corresponds to the acoustic feature sequence X. It is a device that generates and outputs a label sequence (word sequence {^ w 1 , ^ w 2 , ..., ^ W N }) (output probability distribution). However, N is a positive integer and represents the number of symbols included in the symbol series c. The acoustic feature quantity series X is a series of time-series acoustic features extracted from a time-series acoustic signal such as voice. An example of the acoustic feature sequence X is a vector. The symbol sequence c is a sequence of correct answer symbols represented by a time-series acoustic signal corresponding to the acoustic feature quantity sequence X. Examples of correct symbols are phonemes, letters, subwords, words, and so on. An example of a symbol sequence is a vector. The correct symbol corresponds to the acoustic feature sequence X, but it is not specified which frame (time point) of the acoustic feature sequence X each symbol included in the symbol sequence corresponds to. Each part will be described below. It should be noted that each of the above-mentioned parts may be held by a plurality of devices in a dispersed manner.
 学習装置1は、音声分散表現系列変換部101(変換部)、ラベル推定部102(第1の推定部)、CTC損失計算部103、シンボル分散表現変換部104、注意重み計算部105、ラベル推定部106(第2の推定部)、CE(Cross Entropy)損失計算部204、確率行列計算部108、KLD(Kullback-Leibler Divergence)損失計算部402、正解精度計算部110、損失統合部111及び制御部112を有する。 The learning device 1 includes a voice distributed expression sequence conversion unit 101 (conversion unit), a label estimation unit 102 (first estimation unit), a CTC loss calculation unit 103, a symbol distribution expression conversion unit 104, an attention weight calculation unit 105, and a label estimation. Unit 106 (second estimation unit), CE (Cross Entropy) loss calculation unit 204, probability matrix calculation unit 108, KLD (Kullback-Leibler Divergence) loss calculation unit 402, correct answer accuracy calculation unit 110, loss integration unit 111 and control. It has a portion 112.
 The acoustic feature sequence X (first acoustic feature sequence) is input to the speech distributed representation sequence conversion unit 101. Given a conversion model parameter γ1, the speech distributed representation sequence conversion unit 101 obtains and outputs an intermediate feature sequence H corresponding to the acoustic feature sequence X. The speech distributed representation sequence conversion unit 101 is, for example, a multi-stage neural network, and functions as an encoder that converts the input acoustic features into the intermediate feature sequence H with the multi-stage neural network.
 The speech distributed representation sequence conversion unit 101 calculates the intermediate feature sequence H using, for example, equation (17) of Non-Patent Document 4. Alternatively, instead of equation (17) of Non-Patent Document 4, the speech distributed representation sequence conversion unit 101 may obtain the intermediate feature sequence H by applying an LSTM (Long Short-Term Memory) to the acoustic feature sequence X (Reference 1).
Reference 1: Sepp Hochreiter, Jurgen Schmidhuber, "Long Short-Term Memory", Neural Computation, 1997.
 The intermediate feature sequence H encoded by the speech distributed representation sequence conversion unit 101 is input to the label estimation unit 102. Given a label estimation model parameter γ2 (first estimation model parameter), the label estimation unit 102 calculates and outputs the next output probability distribution Y of the CTC model (the label sequence {^l_1, ^l_2, ..., ^l_T} in the figure) (first output probability distribution) corresponding to the intermediate feature sequence H. The output probability distribution Y is calculated using, for example, equation (16) of Non-Patent Document 4. The number of dimensions of the output probability distribution Y is (K+1) × T, where (K+1) is the number of output symbols (phonemes, characters, subwords, words, etc.) plus the redundant symbol "blank", and T is the number of frames.
 The CTC loss calculation unit 103 calculates and outputs the CTC loss L_CTC of the output probability distribution Y with respect to the symbol sequence c, based on the output probability distribution Y output from the label estimation unit 102 and the symbol sequence c. The CTC loss calculation unit 103 creates a trellis whose vertical axis is the symbols of the symbol sequence c with redundant (blank) labels inserted between them and whose horizontal axis is time, and computes the path with the optimum transition probabilities based on the forward-backward algorithm (for the detailed calculation process, see "4. Training the Network" in Non-Patent Document 1). The CTC loss L_CTC is calculated using, for example, equation (14) of Non-Patent Document 1.
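 For illustration only, the following is a minimal sketch of this step that substitutes PyTorch's built-in CTC loss for the patent's own formulation of equation (14); the tensor names, shapes, and random inputs are assumptions, not part of the disclosed device.

    import torch
    import torch.nn.functional as F

    # Illustrative shapes only: T frames, batch of 1, K+1 classes (index 0 = "blank").
    T, K_plus_1, N = 120, 51, 17
    log_probs = F.log_softmax(torch.randn(T, 1, K_plus_1), dim=-1)  # stands in for log Y
    targets = torch.randint(1, K_plus_1, (1, N))                    # stands in for symbol sequence c
    input_lengths = torch.tensor([T])
    target_lengths = torch.tensor([N])

    # CTC loss over the frame-wise output distribution, analogous to L_CTC in the text.
    loss_ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)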
 The output probability distribution Z (second output probability distribution) output from the label estimation unit 106 (described later) and the symbol sequence c (for example, phonemes) are input to the symbol distributed representation conversion unit 104. Given a character feature estimation model parameter β3, which is a model parameter, the symbol distributed representation conversion unit 104 converts the symbol sequence c into a character feature C, a continuous-valued feature corresponding to the output probability distribution Z, and outputs it. The character feature C is a one-hot vector representation. The character feature C is calculated from the output probability distribution Z using, for example, equation (4) of Non-Patent Document 2.
 The attention weight calculation unit 105 calculates the attention weight α_n (a vector) used for estimating the next label, using the intermediate feature sequence H output from the speech distributed representation sequence conversion unit 101, the recursive intermediate output S of the neural network of the label estimation unit 106 (described later), and the attention weight α_{n-1} used for estimating the previous label (for the detailed calculation process, see "2.1 General Framework" in "2 Attention-Based Model for Speech Recognition" of Non-Patent Document 2). The attention weight α_n has elements representing how strongly each frame of the acoustic feature sequence X is related to the timing at which a symbol appears. Here, n denotes the order of the output probability distributions Z arranged in time series. In general, the number of dimensions of α is 1 × T (the number of frames). There are also models in which a plurality (× J) of attention weights α_n are obtained.
 The intermediate feature sequence H encoded by the speech distributed representation sequence conversion unit 101, the character feature C output from the symbol distributed representation conversion unit 104, and the attention weight α_n output from the attention weight calculation unit 105 are input to the label estimation unit 106. Given a label estimation model parameter β2 (second estimation model parameter), the label estimation unit 106 calculates the next output probability distribution Z of the Attention-based model (the label sequence {^l_1, ^l_2, ..., ^l_N} in the figure) corresponding to the intermediate feature sequence H, using the intermediate feature sequence H, the character feature C, and the attention weight α_n. At training time, the number of input symbols is known because it equals the dimension of the symbol sequence c = {c_1, c_2, ..., c_N}, so the number of dimensions of the output probability distribution Z is K (the number of symbol entries) × N (the number of symbols). The output probability distribution Z is generated, for example, according to equations (2) and (3) of Non-Patent Document 2.
 The CE loss calculation unit 107 calculates the CE loss L_S2S of the output probability distribution Z with respect to the symbol sequence c, based on the symbol sequence c and the output probability distribution Z output from the label estimation unit 106. Since the output probability distribution Z and the symbol sequence c share the same output-label dimension N, the loss can be calculated with a cross-entropy error function.
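 A minimal sketch of this cross-entropy step, again for illustration only; the shapes and random inputs are assumptions standing in for the Attention-based model's scores and the correct symbol sequence.

    import torch
    import torch.nn.functional as F

    # Illustrative shapes only: N symbols, K symbol entries.
    N, K = 17, 50
    logits_z = torch.randn(N, K)           # stands in for the unnormalized scores behind Z
    targets_c = torch.randint(0, K, (N,))  # stands in for the correct symbol sequence c

    # Cross-entropy over the N output labels, analogous to L_S2S in the text.
    loss_s2s = F.cross_entropy(logits_z, targets_c)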
 In training the CTC model, the learning device 1 trains the speech distributed representation sequence conversion unit 101, the label estimation units 102 and 106, the symbol distributed representation conversion unit 104, and the attention weight calculation unit 105 based on the CTC loss, on the Kullback-Leibler Divergence (KLD) loss based on the probability matrix P created from the attention weight α and the output probability distribution Z output by the Attention-based model, and on the CE loss. The calculation of the probability matrix P, and of the KLD loss L_KLD that is the loss of the output probability distribution Y with respect to the probability matrix P, is described next.
 The probability matrix calculation unit 108 calculates the probability matrix P using the output probability distribution Z output from the label estimation unit 106 and the attention weight α_n output from the attention weight calculation unit 105. As described above, the number of dimensions of the output probability distribution Z (z_n) is K (the number of symbol entries) × N (the number of symbols), and the attention weight α_n is a 1 × T (the number of frames) vector, of which there are N (the number of symbols). The probability matrix calculation unit 108 therefore calculates the probability matrix P by the following equation (1).
Figure JPOXMLDOC01-appb-M000001
 As shown in equation (1), the probability matrix P is the sum, over all symbols, of the products of the output probability distribution Z and the attention weights α_n. The probability matrix P is given by equation (2), the output probability distribution z_n by equation (3), and the attention weight α_n by equation (4).
Figure JPOXMLDOC01-appb-M000002
Figure JPOXMLDOC01-appb-M000003
Figure JPOXMLDOC01-appb-M000004
 The number of dimensions of the probability matrix P is K (the number of symbol entries) × T (the number of frames). To allow this probability matrix P to be compared by KLD with the output probability distribution Y of the label estimation unit 102, the probability matrix calculation unit 108 also extends the number of dimensions of the output probability distribution Z of the Attention-based model to (K+1) by adding the redundant symbol. Further, since it is desirable that the probability matrix P sums to 1 in each frame, it is normalized according to the following equation (5).
Figure JPOXMLDOC01-appb-M000005
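 A minimal NumPy sketch of the probability matrix construction and per-frame normalization described above; the array names, shapes, and random inputs are assumptions for illustration, not the patent's notation or implementation.

    import numpy as np

    # Illustrative shapes only: N symbols, K+1 classes (blank included), T frames.
    N, K_plus_1, T = 17, 51, 120
    Z = np.random.dirichlet(np.ones(K_plus_1), size=N)   # z_n: (N, K+1), each row sums to 1
    alpha = np.random.dirichlet(np.ones(T), size=N)      # alpha_n: (N, T), each row sums to 1

    # Equation (1): sum over all symbols n of the outer product of z_n and alpha_n.
    P = np.zeros((K_plus_1, T))
    for n in range(N):
        P += np.outer(Z[n], alpha[n])

    # Equation (5): normalize so that each frame (column) sums to 1.
    P /= P.sum(axis=0, keepdims=True)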
 Further, since there are models in which a plurality of attention weights are obtained, as described in Non-Patent Document 3, a plurality (× J) of probability matrices P may be obtained.
 The KLD loss calculation unit 109 calculates the KLD loss L_KLD of the output probability distribution Y with respect to the probability matrix P, based on the probability matrix P (× J) and the output probability distribution Y, using the following equation (6). The KLD loss L_KLD indicates how far the output probability distribution Y of the model being trained deviates from the probability matrix P (× J).
Figure JPOXMLDOC01-appb-M000006
 Equation (6) is set up so that the KLD loss L_KLD can be calculated even when a plurality (× J) of probability matrices P are obtained, corresponding to the case where a plurality of attention weights are obtained. When there is one probability matrix P (one attention weight), J = 1.
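 The exact form of equation (6) appears only in the formula image, so the following is a hedged sketch of one plausible frame-wise KL divergence between each P and Y, averaged over the J attention heads; it is an assumption consistent with the surrounding text, not the patent's equation.

    import numpy as np

    def kld_loss(P_list, Y, eps=1e-12):
        """Assumed form: average over the J probability matrices of the
        frame-wise KL divergence KL(P_j || Y), summed over classes and frames."""
        total = 0.0
        for P in P_list:                          # each P and Y: (K+1, T), columns sum to 1
            total += np.sum(P * (np.log(P + eps) - np.log(Y + eps)))
        return total / len(P_list)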
 The correct answer accuracy calculation unit 110 receives the output probability distribution Z of the Attention-based model and the symbol sequence c, which is the correct symbol sequence, as input, and calculates the correct answer accuracy Acc. The correct answer accuracy calculation unit 110 returns to the loss integration unit 111, as the correct answer accuracy Acc, the match rate between the correct symbol sequence c and the sequence Z' obtained by applying argmax (taking the element ID of the maximum value along the class axis) to the output probability distribution Z of the Attention-based model being trained. The correct answer accuracy Acc therefore takes values in the range 0 ≤ Acc ≤ 1.
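 A short sketch of this match-rate computation; the shapes (N, K) for Z and (N,) for c are illustrative assumptions.

    import numpy as np

    def correct_answer_accuracy(Z, c):
        """Match rate between argmax of Z along the class axis and the correct symbols c."""
        z_prime = np.argmax(Z, axis=-1)       # sequence Z' in the text
        return float(np.mean(z_prime == c))   # 0 <= Acc <= 1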
 The loss integration unit 111 calculates the loss of the CTC/Attention model. The loss integration unit 111 integrates the losses by applying the CTC loss L_CTC output from the CTC loss calculation unit 103, the CE loss L_S2S output from the CE loss calculation unit 107, the KLD loss L_KLD output from the KLD loss calculation unit 109, the correct answer accuracy Acc output from the correct answer accuracy calculation unit 110, and the coefficients λ, ρ, and ζ (0 ≤ λ ≤ 1, 0 ≤ ζ ≤ 1-λ) to the following equation (7), and thereby obtains the loss L_S2S+CTC+KLD of the CTC/Attention model.
Figure JPOXMLDOC01-appb-M000007
 ρ in equation (7) is given by equation (8).
Figure JPOXMLDOC01-appb-M000008
 Since the learning device 1 trains the CTC model and the Attention-based model simultaneously, the attention weights obtained in the early stage of training, and the accuracy of the probability matrix built from them, are very low. A hyperparameter such as ρ is therefore introduced. ρ consists of ζ and Acc(*), where ζ is a manually chosen hyperparameter (0 ≤ ζ ≤ 1-λ) and Acc(*) is the correct answer accuracy obtained by comparing the output probability distribution Z of the Attention-based model with the correct symbols. As training progresses, the influence of the ρL_KLD term therefore grows and the accuracy improves.
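 Equations (7) and (8) are given only as formula images, so the sketch below assumes one plausible interpolation of the three losses that is consistent with the stated constraints 0 ≤ λ ≤ 1 and 0 ≤ ζ ≤ 1-λ and with ρ being formed from ζ and Acc; the weighting is an assumption for illustration, not the patent's exact equation (7).

    def integrated_loss(l_ctc, l_s2s, l_kld, acc, lam=0.5, zeta=0.3):
        """Assumed reading of equations (7) and (8): rho = zeta * Acc, and the
        three losses are linearly combined so the KLD term gains weight as Acc improves."""
        rho = zeta * acc                                # equation (8), as described in the text
        return lam * l_ctc + (1.0 - lam - rho) * l_s2s + rho * l_kld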
 The control unit 112 uses the loss L_S2S+CTC+KLD of the CTC/Attention model to update the conversion model parameter γ1 of the speech distributed representation sequence conversion unit 101, the label estimation model parameter γ2 of the label estimation unit 102, the character feature estimation model parameter β3 of the symbol distributed representation conversion unit 104, the model parameter of the attention weight calculation unit 105, and the label estimation model parameter β2 of the label estimation unit 106.
 The control unit 112 repeats the processing by the speech distributed representation sequence conversion unit 101, the label estimation unit 102, the CTC loss calculation unit 103, the symbol distributed representation conversion unit 104, the attention weight calculation unit 105, the label estimation unit 106, the CE loss calculation unit 107, the probability matrix calculation unit 108, the KLD loss calculation unit 109, the correct answer accuracy calculation unit 110, and the loss integration unit 111 until a predetermined end condition is satisfied.
 The end condition is not limited; for example, it may be that the number of iterations has reached a threshold, that the amount of change in the integrated loss before and after an iteration has fallen to or below a threshold, or that the amount of change in the conversion model parameter γ1 and the label estimation model parameter γ2 before and after an iteration has fallen to or below a threshold. When the end condition is satisfied, the speech distributed representation sequence conversion unit 101 outputs the conversion model parameter γ1, and the label estimation unit 102 outputs the label estimation model parameter γ2.
[Processing procedure of the learning process]
 FIG. 2 is a flowchart showing the processing procedure of the learning process according to the first embodiment. As shown in FIG. 2, when the input of the acoustic feature sequence X is received, the speech distributed representation sequence conversion unit 101 performs speech distributed representation sequence conversion processing that converts the acoustic feature sequence X into the corresponding intermediate feature sequence H (step S1).
 Subsequently, the label estimation unit 102 performs first estimation processing that calculates the next output probability distribution Y of the CTC model corresponding to the intermediate feature sequence H (step S2). The CTC loss calculation unit 103 receives the output probability distribution Y and the symbol sequence c as input and performs CTC loss calculation processing that calculates the CTC loss L_CTC of the output probability distribution Y with respect to the symbol sequence c (step S3).
 Meanwhile, the symbol distributed representation conversion unit 104 receives the output probability distribution Z output from the label estimation unit 106 and the symbol sequence c as input and performs symbol distributed representation conversion processing that converts them into the character feature C (step S4). The attention weight calculation unit 105 performs attention weight calculation processing that calculates the attention weight α_n used for estimating the next label, using the intermediate feature sequence H, the recursive intermediate output S of the neural network of the label estimation unit 106, and the attention weight α_{n-1} used for the previous label sequence estimation (step S5).
 Then, the label estimation unit 106 performs second estimation processing that calculates the next output probability distribution Z of the Attention-based model using the intermediate feature sequence H, the character feature C, and the attention weight α_n (step S6).
 The CE loss calculation unit 107 performs CE loss calculation processing that calculates the CE loss L_S2S of the output probability distribution Z with respect to the symbol sequence c, based on the output probability distribution Z and the symbol sequence c (step S7).
 The probability matrix calculation unit 108 performs probability matrix calculation processing that calculates the probability matrix P based on the output probability distribution Z output from the label estimation unit 106 and the attention weight α_n output from the attention weight calculation unit 105 (step S8). The KLD loss calculation unit 109 performs KLD loss calculation processing that calculates the KLD loss L_KLD, which is the loss of the output probability distribution Y with respect to the probability matrix P (step S9). Then, the correct answer accuracy calculation unit 110 performs correct answer accuracy calculation processing that calculates the correct answer accuracy Acc based on the output probability distribution Z and the symbol sequence c (step S10).
 The loss integration unit 111 performs loss integration processing that integrates the losses based on the CTC loss L_CTC, the CE loss L_S2S, the KLD loss L_KLD, the correct answer accuracy Acc, and the coefficients λ, ρ, and ζ (0 ≤ λ ≤ 1, 0 ≤ ζ ≤ 1-λ), and obtains the integrated loss L_S2S+CTC+KLD (step S11). The control unit 112 uses the loss L_S2S+CTC+KLD to update the model parameters of the speech distributed representation sequence conversion unit 101, the label estimation unit 102, the symbol distributed representation conversion unit 104, the attention weight calculation unit 105, and the label estimation unit 106 (step S12). The control unit 112 repeats each of the above processes until a predetermined end condition is satisfied.
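 Putting steps S1 to S12 together, a schematic training loop could look as follows; the model methods and the helpers ctc_loss, ce_loss, and probability_matrix are hypothetical placeholders standing in for the units above (kld_loss, correct_answer_accuracy, and integrated_loss are the sketches shown earlier), so this is an outline under those assumptions rather than the patent's implementation.

    def train(batches, model, max_iters=100_000, lam=0.5, zeta=0.3):
        """Schematic outline of steps S1-S12; every helper is a placeholder."""
        for step, (X, c) in enumerate(batches):
            H = model.encode(X)                      # S1: unit 101
            Y = model.ctc_head(H)                    # S2: unit 102
            l_ctc = ctc_loss(Y, c)                   # S3: unit 103
            C = model.symbol_embed(c)                # S4: unit 104
            Z, alpha = model.attention_decode(H, C)  # S5-S6: units 105, 106
            l_s2s = ce_loss(Z, c)                    # S7: unit 107
            P = probability_matrix(Z, alpha)         # S8: unit 108
            l_kld = kld_loss([P], Y)                 # S9: unit 109
            acc = correct_answer_accuracy(Z, c)      # S10: unit 110
            loss = integrated_loss(l_ctc, l_s2s, l_kld, acc, lam, zeta)  # S11: unit 111
            model.update(loss)                       # S12: unit 112
            if step >= max_iters:                    # predetermined end condition
                break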
[Effects of Embodiment 1]
 As described above, the learning device 1 obtains the loss L_S2S+CTC+KLD of the CTC/Attention model by integrating the CTC loss L_CTC, the CE loss L_S2S, and the KLD loss L_KLD, and uses this loss L_S2S+CTC+KLD to update the model parameters of the label estimation units 102 and 106 and of the speech distributed representation sequence conversion unit 101 shared by the label estimation units 102 and 106. That is, the shared speech distributed representation sequence conversion unit 101 is trained with the losses relating to the label estimation units 102 and 106. In other words, because the speech distributed representation sequence conversion unit 101 is trained to output an intermediate feature sequence H that raises the accuracy of the estimation results of the label estimation units 102 and 106, the overall estimation accuracy can be improved.
 Further, by using a loss function (see equation (7)) with which the CTC model can be trained frame-by-frame, the learning device 1 can stabilize the training of the CTC model and improve the estimation accuracy of speech recognition.
[Modification of Embodiment 1]
 FIG. 3 is a diagram showing an example of the functional configuration of a learning device according to a modification of the first embodiment. The probability matrix generated by the Attention-based model may contain errors. To avoid this problem, as shown in FIG. 3, the learning device 1A according to the modification of the first embodiment provides a label estimation unit 102A (fourth estimation unit) separately from the label estimation unit 102 (third estimation unit), which also makes it possible to update the parameters for each loss. The CTC loss calculation unit 103 calculates the CTC loss L_CTC based on the output probability distribution Y estimated by the label estimation unit 102, and the KLD loss calculation unit 109 calculates the KLD loss L_KLD based on the output probability distribution Y estimated by the label estimation unit 102A.
 FIG. 4 is a flowchart showing the processing procedure of the learning process according to the modification of the first embodiment. Steps S21 to S28 shown in FIG. 4 are the same processes as steps S1 to S8 shown in FIG. 2.
 The label estimation unit 102A performs third label estimation processing that calculates the next output probability distribution Y of the CTC model corresponding to the intermediate feature sequence H and outputs it to the KLD loss calculation unit 109 (step S29). The KLD loss calculation unit 109 calculates the KLD loss L_KLD using the output probability distribution Y output from the label estimation unit 102A (step S30). Steps S31 to S33 shown in FIG. 4 are the same processes as steps S10 to S12 shown in FIG. 2.
[Embodiment 2]
 Next, a second embodiment will be described. The second embodiment describes a speech recognition device constructed by providing the conversion model parameter γ1 and the label estimation model parameter γ2 that satisfied the end condition in the learning device 1 according to the first embodiment or the learning device 1A according to the modification of the first embodiment. FIG. 5 is a diagram showing an example of the functional configuration of the speech recognition device according to the second embodiment. FIG. 6 is a flowchart showing the processing procedure of the speech recognition process according to the second embodiment.
 As illustrated in FIG. 5, the speech recognition device 3 according to the embodiment has a speech distributed representation sequence conversion unit 301 and a label estimation unit 302. The speech distributed representation sequence conversion unit 301 is the same as the speech distributed representation sequence conversion unit 101 described above, except that the conversion model parameter γ1 output from the learning device 1 or the learning device 1A is input and set. The label estimation unit 302 is the same as the label estimation unit 102 described above, except that the label estimation model parameter γ2 output from the learning device 1 or the learning device 1A is input and set.
 An acoustic feature sequence X'' (second acoustic feature sequence) to be recognized is input to the speech distributed representation sequence conversion unit 301. Given the conversion model parameter γ1, the speech distributed representation sequence conversion unit 301 obtains and outputs an intermediate feature sequence H'' corresponding to the acoustic feature sequence X'' (step S41).
 The intermediate feature sequence H'' output from the speech distributed representation sequence conversion unit 301 is input to the label estimation unit 302. Given the label estimation model parameter γ2, the label estimation unit 302 obtains and outputs, as the speech recognition result, the label sequence {^l_1, ^l_2, ..., ^l_F} (output probability distribution) corresponding to the intermediate feature sequence H'' (step S42).
 In this way, because the model parameters optimized by the learning device 1 or the learning device 1A using the loss L_S2S+CTC+KLD are set in the label estimation unit 302 and the speech distributed representation sequence conversion unit 301, the speech recognition device 3 can perform speech recognition processing with high accuracy.
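 A schematic sketch of the inference flow of steps S41 and S42; encoder and ctc_head are hypothetical modules standing in for units 301 and 302 with the trained parameters γ1 and γ2, and the greedy decoding at the end is an illustrative simplification.

    import torch

    @torch.no_grad()
    def recognize(X2, encoder, ctc_head):
        """Steps S41-S42 with trained parameters gamma_1 (encoder) and gamma_2 (CTC head)."""
        H2 = encoder(X2)                 # S41: intermediate feature sequence H''
        Y2 = ctc_head(H2).softmax(-1)    # S42: frame-wise output probability distribution
        return Y2.argmax(-1)             # greedy frame-wise labels (blanks still included)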
[System configuration of the embodiments]
 Each component of the learning device 1 and the speech recognition device 3 is a functional concept and does not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of the functions of the learning device 1 and the speech recognition device 3 is not limited to the illustrated one, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
 In addition, all or an arbitrary part of each process performed in the learning device 1 and the speech recognition device 3 may be realized by a CPU, a GPU (Graphics Processing Unit), and a program analyzed and executed by the CPU or GPU. Each process performed in the learning device 1 and the speech recognition device 3 may also be realized as hardware using wired logic.
 Further, among the processes described in the embodiments, all or part of the processes described as being performed automatically can also be performed manually. Conversely, all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters described above and shown in the drawings can be changed as appropriate unless otherwise specified.
[Program]
 FIG. 7 is a diagram showing an example of a computer that realizes the learning device 1 and the speech recognition device 3 by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These parts are connected by a bus 1080.
 The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
 The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the learning device 1 and the speech recognition device 3 is implemented as a program module 1093 in which code executable by the computer 1000 is written. The program module 1093 is stored, for example, in the hard disk drive 1090. For example, a program module 1093 for executing the same processes as the functional configurations of the learning device 1 and the speech recognition device 3 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
 The setting data used in the processing of the embodiments described above is stored as program data 1094, for example, in the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.
 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like), and the program module 1093 and the program data 1094 may be read from the other computer by the CPU 1020 via the network interface 1070.
 Although embodiments to which the invention made by the present inventors is applied have been described above, the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to these embodiments. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art based on these embodiments are included within the scope of the present invention.
 1 Learning device
 3 Speech recognition device
 101, 301 Speech distributed representation sequence conversion unit
 102, 102A, 106, 302 Label estimation unit
 103 CTC loss calculation unit
 104 Symbol distributed representation conversion unit
 105 Attention weight calculation unit
 107 CE loss calculation unit
 108 Probability matrix calculation unit
 109 KLD loss calculation unit
 110 Correct answer accuracy calculation unit
 111 Loss integration unit
 112 Control unit

Claims (7)

  1.  A learning device comprising:
     a conversion unit that acquires an intermediate feature sequence corresponding to a first acoustic feature sequence when a conversion model parameter is given;
     a first estimation unit that estimates a first output probability distribution corresponding to the intermediate feature sequence when a first estimation model parameter is given;
     a second estimation unit that estimates a second output probability distribution corresponding to the intermediate feature sequence when a second estimation model parameter is given, based on the intermediate feature sequence, a character feature obtained by converting a correct symbol sequence, and an attention weight having elements representing how strongly each frame of the first acoustic feature sequence is related to a timing at which a symbol appears;
     a probability matrix calculation unit that calculates a probability matrix that is a sum, over all symbols, of products of the second output probability distribution and the attention weight;
     a CTC loss calculation unit that calculates a CTC (Connectionist Temporal Classification) loss of the first output probability distribution with respect to the correct symbol sequence, based on the correct symbol sequence corresponding to the first acoustic feature sequence and the first output probability distribution;
     a KLD loss calculation unit that calculates a KLD (Kullback-Leibler Divergence) loss of the first output probability distribution with respect to the probability matrix, based on the probability matrix and the first output probability distribution;
     a CE loss calculation unit that calculates a CE (Cross Entropy) loss of the second output probability distribution with respect to the correct symbol sequence, based on the correct symbol sequence corresponding to the first acoustic feature sequence and the second output probability distribution; and
     a control unit that updates the conversion model parameter, the first estimation model parameter, and the second estimation model parameter based on an integrated loss obtained by integrating the CTC loss, the KLD loss, and the CE loss, and repeats the processing of the conversion unit, the first estimation unit, the second estimation unit, the probability matrix calculation unit, the CTC loss calculation unit, the KLD loss calculation unit, and the CE loss calculation unit until an end condition is satisfied.
  2.  The learning device according to claim 1, wherein the first estimation unit has a third estimation unit and a fourth estimation unit each of which estimates the first output probability distribution corresponding to the intermediate feature sequence,
     the CTC loss calculation unit calculates the CTC loss based on the first output probability distribution estimated by the third estimation unit, and
     the KLD loss calculation unit calculates the KLD loss based on the first output probability distribution estimated by the fourth estimation unit.
  3.  A speech recognition device that estimates and outputs an output probability distribution corresponding to a second acoustic feature sequence when each model parameter that satisfied the end condition in the learning device according to claim 1 or 2 is given.
  4.  A learning method executed by a learning device, the learning method comprising:
     a conversion step of acquiring an intermediate feature sequence corresponding to a first acoustic feature sequence when a conversion model parameter is given;
     a first estimation step of estimating a first output probability distribution corresponding to the intermediate feature sequence when a first estimation model parameter is given;
     a second estimation step of estimating a second output probability distribution corresponding to the intermediate feature sequence when a second estimation model parameter is given, based on the intermediate feature sequence, a character feature obtained by converting a correct symbol sequence, and an attention weight having elements representing how strongly each frame of the first acoustic feature sequence is related to a timing at which a symbol appears;
     a probability matrix calculation step of calculating a probability matrix that is a sum, over all symbols, of products of the second output probability distribution and the attention weight;
     a CTC loss calculation step of calculating a CTC (Connectionist Temporal Classification) loss of the first output probability distribution with respect to the correct symbol sequence, based on the correct symbol sequence corresponding to the first acoustic feature sequence and the first output probability distribution;
     a KLD loss calculation step of calculating a KLD (Kullback-Leibler Divergence) loss of the first output probability distribution with respect to the probability matrix, based on the probability matrix and the first output probability distribution; and
     a CE loss calculation step of calculating a CE (Cross Entropy) loss of the second output probability distribution with respect to the correct symbol sequence, based on the correct symbol sequence corresponding to the first acoustic feature sequence and the second output probability distribution,
     wherein the conversion model parameter, the first estimation model parameter, and the second estimation model parameter are updated based on an integrated loss obtained by integrating the CTC loss, the KLD loss, and the CE loss, and the processing of the conversion step, the first estimation step, the second estimation step, the probability matrix calculation step, the CTC loss calculation step, the KLD loss calculation step, and the CE loss calculation step is repeated until an end condition is satisfied.
  5.  A speech recognition method of estimating and outputting an output probability distribution corresponding to a second acoustic feature sequence when each model parameter that satisfied the end condition in the learning device according to claim 1 or 2 is given.
  6.  A learning program for causing a computer to function as the learning device according to claim 1 or 2.
  7.  A speech recognition program for causing a computer to function as the speech recognition device according to claim 3.
PCT/JP2020/028766 2020-07-27 2020-07-27 Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program WO2022024202A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022539819A JP7452661B2 (en) 2020-07-27 2020-07-27 Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program
PCT/JP2020/028766 WO2022024202A1 (en) 2020-07-27 2020-07-27 Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/028766 WO2022024202A1 (en) 2020-07-27 2020-07-27 Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program

Publications (1)

Publication Number Publication Date
WO2022024202A1 true WO2022024202A1 (en) 2022-02-03

Family

ID=80037866

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/028766 WO2022024202A1 (en) 2020-07-27 2020-07-27 Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program

Country Status (2)

Country Link
JP (1) JP7452661B2 (en)
WO (1) WO2022024202A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020505650A (en) * 2017-05-11 2020-02-20 三菱電機株式会社 Voice recognition system and voice recognition method
US20200219486A1 (en) * 2019-01-08 2020-07-09 Baidu Online Network Technology (Beijing) Co., Ltd. Methods, devices and computer-readable storage media for real-time speech recognition

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MORIYA TAKAFUMI, SATO HIROSHI, TANAKA TOMOHIRO, ASHIHARA TAKANORI, MASUMURA RYO, SHINOHARA YUSUKE: "Distilling knowledge of attention-based encoder-decoder into CTC-based automatic speech recognition systems", SPRING AND AUTUMN MEETING OF THE ACOUSTICAL SOCIETY OF JAPAN, 29 February 2020 (2020-02-29) - 3 March 2020 (2020-03-03), JP , pages 883 - 886, XP009534491, ISSN: 1880-7658 *
UENO, SEI ET AL.: "Attention Based End-to-End Speech Recognition of Word Units Complemented with CTC Based Character Unit Model", IPSJ SIG TECHNICAL REPORT, vol. 2018, no. 16, February 2018 (2018-02-01), pages 1 - 8, ISSN: 2118-8663 *
WATANABE, SHINJI ET AL.: "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition", IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, vol. 11, no. 8, December 2017 (2017-12-01), pages 1240 - 1253, XP055494520, ISSN: 1932-4553, DOI: 10.1109/JSTSP.2017.2763455 *
YAN GAO; TITOUAN PARCOLLET; NICHOLAS LANE: "Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 May 2020 (2020-05-19), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081672531 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115910044A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle

Also Published As

Publication number Publication date
JP7452661B2 (en) 2024-03-19
JPWO2022024202A1 (en) 2022-02-03

Similar Documents

Publication Publication Date Title
EP3926623A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
CN108804611B (en) Dialog reply generation method and system based on self comment sequence learning
US20140156575A1 (en) Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization
CN110275939B (en) Method and device for determining conversation generation model, storage medium and electronic equipment
US11887008B2 (en) Contextual text generation for question answering and text summarization with supervised representation disentanglement and mutual information minimization
CN107766319B (en) Sequence conversion method and device
CN112084301B (en) Training method and device for text correction model, text correction method and device
WO2022024202A1 (en) Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program
US6173076B1 (en) Speech recognition pattern adaptation system using tree scheme
WO2019138897A1 (en) Learning device and method, and program
CN114282555A (en) Translation model training method and device, and translation method and device
JP7211103B2 (en) Sequence labeling device, sequence labeling method, and program
US20230108579A1 (en) Dynamic entity representations for sequence generation
JP6633556B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program
WO2021117089A1 (en) Model learning device, voice recognition device, method for same, and program
WO2023017568A1 (en) Learning device, inference device, learning method, and program
WO2020162240A1 (en) Language model score calculation device, language model creation device, methods therefor, program, and recording medium
CN113077785B (en) End-to-end multi-language continuous voice stream voice content identification method and system
WO2022068197A1 (en) Conversation generation method and apparatus, device, and readable storage medium
CN112364602B (en) Multi-style text generation method, device, equipment and readable storage medium
CN117859173A (en) Speech recognition with speech synthesis based model adaptation
JP7359028B2 (en) Learning devices, learning methods, and learning programs
WO2022168162A1 (en) Prior learning method, prior learning device, and prior learning program
CN114730380A (en) Deep parallel training of neural networks
WO2020250279A1 (en) Model learning device, method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20947282

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022539819

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20947282

Country of ref document: EP

Kind code of ref document: A1