WO2022259555A1 - Voice recognition method, voice recognition device, and voice recognition program - Google Patents

Voice recognition method, voice recognition device, and voice recognition program

Info

Publication number
WO2022259555A1
WO2022259555A1 · PCT/JP2021/022414 · JP2021022414W
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
model
spike
learning
sequence
Prior art date
Application number
PCT/JP2021/022414
Other languages
French (fr)
Japanese (ja)
Inventor
孝典 芦原
太一 浅見
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation) filed Critical 日本電信電話株式会社
Priority to PCT/JP2021/022414 priority Critical patent/WO2022259555A1/en
Publication of WO2022259555A1 publication Critical patent/WO2022259555A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Non-Patent Citations

  • Bengio "DYNAMIC FRAME SKIPPING FOR FAST SPEECH RECOGNITION IN RECURRENT NEURAL NETWORK BASED ACOUSTIC MODELS", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary AB, Canada, 2018, pp.4984-4988, doi: 10.1109/ICASSP.2018.8462615 Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber, "Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks", In Proceedings of the 23rd international conference on Machine learning (ICML '06). York, NY, USA, 2006, pp.369-376.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An extraction unit (15b) extracts a sequence of features for each frame of a voice signal. A learning unit (15c) trains a voice recognition model (14a) that uses CTC, using the extracted feature sequence. A generation unit (15d) uses the trained voice recognition model (14a) to generate a spike sequence, which is the sequence of labels that the model outputs as spikes. A prediction learning unit (15e) uses the generated spike sequence to train a spike point prediction model (14b) that predicts the point in time at which a spike will be output.

Description

Speech recognition method, speech recognition device, and speech recognition program
The present invention relates to a speech recognition method, a speech recognition device, and a speech recognition program.

In general, in speech recognition, feature amounts are extracted from the input speech waveform after windowing it into frames of a fixed width, and a feature sequence is generated in order while the frames are shifted by a fixed step; this sequence is used as the input to a speech recognition model (see Patent Document 1).

In recent years, DNN (Deep Neural Network) models, which are large-scale neural networks, have come into use as speech recognition models, so the cost of speech recognition is high and the processing time tends to increase. Techniques that speed up speech recognition by reducing the number of feature frames input to the speech recognition model are therefore anticipated (see Non-Patent Documents 1 and 2).

For example, in a speech recognition model that uses CTC (Connectionist Temporal Classification) as its loss function, the corresponding symbols are output in synchronization with the input frames (see Non-Patent Document 3).

In CTC, the posterior probability sequence for each frame is known to often become a spike sequence. Non-Patent Document 3 also describes hybrid models such as the DNN-HMM (Deep Neural Network-Hidden Markov Model), and Non-Patent Document 4 describes learning a speech recognition task and a speaker recognition task simultaneously.

Patent Document 1: JP 2007-249051 A
However, with conventional techniques it has been difficult to sufficiently speed up speech recognition by a speech recognition model using CTC. In a conventional hybrid model such as the DNN-HMM, labels are assigned not to spikes but to every frame of the corresponding speech segment, so it has been difficult to reduce the number of feature frames input to the speech recognition model.

The present invention has been made in view of the above, and aims to speed up speech recognition by a speech recognition model using CTC.

To solve the above problems and achieve this object, a speech recognition method according to the present invention is a speech recognition method executed by a speech recognition device, and includes: an extraction step of extracting a sequence of feature amounts for each frame of a speech signal; a learning step of training a speech recognition model using CTC (Connectionist Temporal Classification) with the extracted feature sequence; a generation step of using the trained speech recognition model to generate a spike sequence, which is the sequence of labels that the speech recognition model outputs as spikes; and a prediction learning step of using the generated spike sequence to train a spike point prediction model that predicts the point in time at which a spike will be output.

According to the present invention, it is possible to speed up speech recognition by a speech recognition model using CTC.
FIG. 1 is a schematic diagram illustrating the schematic configuration of a speech recognition device. FIG. 2 is a flowchart showing a speech recognition processing procedure. FIG. 3 is a flowchart showing a speech recognition processing procedure. FIG. 4 is a diagram illustrating a computer that executes a speech recognition program.
An embodiment of the present invention will be described in detail below with reference to the drawings. Note that the present invention is not limited by this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals.
[Overview of the speech recognition device]
The speech recognition device of this embodiment focuses on the posterior probability sequence, which becomes a spike sequence for each frame in a speech recognition model using CTC. Here, a spike sequence is a posterior probability sequence in which the target recognition label symbol appears instantaneously, as a spike with an extremely high posterior probability, within a short span of about 1 to 3 frames. The other frames are mostly occupied, with high posterior probability, by blank symbols that carry no recognition label. For example, for 10 frames of input speech for the word "特許" (patent), if the output labels are the two characters "特許", the output looks like "___特___許__", where "_" is the blank symbol.
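As a concrete illustration of this structure, the following toy example (not taken from the patent; the posterior values and the three-symbol label set are invented for illustration) shows how the frame-wise argmax of a CTC posterior matrix yields mostly blank symbols with isolated label spikes.

```python
# Toy illustration of a CTC spike sequence: hypothetical per-frame posteriors
# over the label set {blank, 特, 許}. The numbers are made up for this example.
import numpy as np

labels = ["_", "特", "許"]            # index 0 is the CTC blank symbol "_"
posteriors = np.array([
    [0.97, 0.02, 0.01],
    [0.96, 0.03, 0.01],
    [0.95, 0.04, 0.01],
    [0.05, 0.94, 0.01],               # spike for "特"
    [0.97, 0.02, 0.01],
    [0.96, 0.02, 0.02],
    [0.95, 0.02, 0.03],
    [0.04, 0.01, 0.95],               # spike for "許"
    [0.97, 0.02, 0.01],
    [0.98, 0.01, 0.01],
])

# Frame-wise argmax: blank almost everywhere, labels only at spike frames.
frame_symbols = [labels[i] for i in posteriors.argmax(axis=1)]
print("".join(frame_symbols))         # -> ___特___許__

# Frames whose non-blank posterior is high are the "spike points".
spike_frames = np.where(posteriors[:, 1:].max(axis=1) >= 0.5)[0]
print(spike_frames)                   # -> [3 7]
```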
The speech recognition device limits the input frames for speech recognition to the frames in which spikes appear, which makes it possible to speed up speech recognition. To this end, the speech recognition device learns in advance a model that predicts spike points, that is, the time points at which spikes occur, and uses it in front of the speech recognition model. This spike point prediction model is trained as a binary classification model using teacher labels in which the label of a spike point is 1 and the blank label of every other point is 0.

The speech recognition device then realizes fast speech recognition by inputting into the CTC-based speech recognition model only the frames of spike points for which this binary classification model predicts the label 1. In a speech recognition model using CTC, words or characters are used directly as output label symbols, so restricting the input to the frames corresponding to spikes removes the temporal redundancy of speech and narrows the input down to the minimum needed to output the target labels.

Furthermore, the speech recognition device applies multi-task learning, in which a single model learns multiple tasks simultaneously and represents those tasks in one model. Specifically, the speech recognition device integrates the binary classification model and the speech recognition model and solves the speech recognition task and the binary classification task at the same time. For example, in an encoder-decoder speech recognition model, multi-task learning is realized by sharing the encoder between the speech recognition task and the binary classification task and training a decoder for the binary classification task. This improves accuracy and reduces the model size.
[Configuration of the speech recognition device]
FIG. 1 is a schematic diagram illustrating the schematic configuration of the speech recognition device. As illustrated in FIG. 1, the speech recognition device 10 of this embodiment is realized by a general-purpose computer such as a personal computer and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
The input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator. The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like. The communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication over a network between the control unit 15 and external devices such as a server or a device that acquires acoustic signals.

The storage unit 14 is implemented by a semiconductor memory device such as a RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 14 may also be configured to communicate with the control unit 15 via the communication control unit 13. In this embodiment, the storage unit 14 stores, for example, a speech recognition model 14a and a spike point prediction model 14b used in the speech recognition processing described later.

The control unit 15 is implemented using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. As illustrated in FIG. 1, the control unit 15 thereby functions as an acquisition unit 15a, an extraction unit 15b, a learning unit 15c, a generation unit 15d, a prediction learning unit 15e, and a recognition unit 15f. These functional units may each be implemented on different hardware; for example, the learning unit 15c may be implemented as a learning device and the recognition unit 15f as a recognition device. The control unit 15 may also include other functional units.

The acquisition unit 15a acquires a speech signal. Specifically, the acquisition unit 15a acquires an analog speech signal and performs A/D conversion to obtain a digital speech signal. The acquisition unit 15a may store the A/D-converted digital speech signal in the storage unit 14, or may transfer it immediately to the extraction unit 15b described below without storing it in the storage unit 14.
The extraction unit 15b extracts a sequence of feature amounts for each frame of the speech signal. Specifically, the extraction unit 15b extracts acoustic features from the digital speech signal and obtains a feature sequence for each utterance. For example, the extraction unit 15b extracts features based on short-time frame analysis of the speech signal, such as MFCC and power. For example, it uses as features dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficients), dynamic parameters such as ΔMFCC and ΔΔMFCC, and power, Δpower, and ΔΔpower.

Note that the extraction unit 15b may apply cepstral mean normalization (CMN) to the MFCC. The features are not limited to those based on MFCC and power; parameters used for identifying special utterances, such as autocorrelation peak values and group delay, may also be used.
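A frame-level feature extractor along these lines could be sketched as follows. librosa is used here only as one convenient tool; the sampling rate, window settings, the use of RMS energy as the power term, and the exact dimensionality are illustrative assumptions, not requirements of the patent.

```python
# A minimal sketch of a feature extractor in the spirit of extraction unit 15b:
# MFCC (dims 1-12) + log power, plus delta and delta-delta dynamic features,
# followed by cepstral mean normalization (CMN) over the utterance.
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=12, frame_len=0.025, frame_shift=0.010):
    y, sr = librosa.load(wav_path, sr=16000)
    n_fft = int(frame_len * sr)
    hop = int(frame_shift * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    power = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    base = np.vstack([mfcc, np.log(power + 1e-10)])
    # Dynamic parameters: delta and delta-delta of every static feature.
    feats = np.vstack([base,
                       librosa.feature.delta(base),
                       librosa.feature.delta(base, order=2)])
    # CMN: subtract the per-dimension mean over the utterance.
    feats = feats - feats.mean(axis=1, keepdims=True)
    return feats.T                    # shape (num_frames, feature_dim)
```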
The extraction unit 15b may store the extracted feature sequence in the storage unit 14, or may transfer it immediately to the learning unit 15c described below without storing it in the storage unit 14.
The learning unit 15c trains the speech recognition model 14a, which uses CTC, with the extracted feature sequence. Specifically, the learning unit 15c trains an end-to-end speech recognition model 14a using the feature sequences of the teacher data. This speech recognition model 14a uses the CTC loss function, so that the sequence of output labels becomes a spike sequence.
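A minimal training sketch for such a CTC model, assuming PyTorch, might look like the following; the network architecture, its size, and the optimizer handling are illustrative choices that the patent does not specify.

```python
# A sketch of learning unit 15c: an end-to-end acoustic model trained with
# the CTC loss so that its per-frame outputs form a spike sequence.
import torch
import torch.nn as nn

class CTCSpeechModel(nn.Module):
    def __init__(self, feat_dim, num_labels, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.classifier = nn.Linear(hidden, num_labels)   # index 0 = CTC blank

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        enc, _ = self.encoder(feats)
        return self.classifier(enc)           # per-frame label logits

def train_step(model, optimizer, feats, feat_lens, targets, target_lens):
    logits = model(feats)
    # CTCLoss expects (frames, batch, labels) log-probabilities.
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)
    loss = nn.CTCLoss(blank=0, zero_infinity=True)(
        log_probs, targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```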
The generation unit 15d uses the trained speech recognition model 14a to generate a spike sequence, that is, the sequence of labels that the speech recognition model 14a outputs as spikes. Specifically, the generation unit 15d has the speech recognition model 14a trained by the learning unit 15c perform speech recognition on the feature sequences used for training, and generates the resulting posterior probability sequence as the spike sequence.

The generation unit 15d may store the generated spike sequence in the storage unit 14, or may transfer it immediately to the prediction learning unit 15e described below without storing it in the storage unit 14.
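In code, this generation step reduces to a single forward pass over the training features; the sketch below assumes the PyTorch model from the previous sketch and keeps the per-frame softmax outputs as the spike (posterior) sequence.

```python
# A sketch of generation unit 15d: the trained CTC model re-scores the
# training features, and the per-frame posterior sequence is kept as the
# spike sequence for the later prediction-learning step.
import torch

@torch.no_grad()
def generate_spike_sequence(model, feats):        # feats: (1, frames, feat_dim)
    model.eval()
    posteriors = model(feats).softmax(dim=-1)     # (1, frames, num_labels)
    return posteriors.squeeze(0)                  # (frames, num_labels)
```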
The prediction learning unit 15e uses the generated spike sequence to learn the spike point prediction model 14b, which predicts the point in time at which a spike will be output.

Here, the spike point prediction model 14b is a binary classification model. The prediction learning unit 15e trains the spike point prediction model 14b using teacher data in which spike points in the spike sequence are labeled 1 and all other, non-spike points are labeled 0.

Since the generated spike sequence is a posterior probability sequence, the prediction learning unit 15e uses a predetermined threshold as a parameter and assigns 1 to a point whose posterior is at or above the threshold and 0 to a point whose posterior is below it. The larger the threshold, the fewer spike points are predicted; the smaller the threshold, the more spike points are predicted.
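This label construction could be written, for example, as the small helper below; the default threshold of 0.5 is an illustrative assumption, since the patent only states that the threshold is a tunable parameter.

```python
# A sketch of turning a generated posterior (spike) sequence into 0/1 teacher
# labels for the spike point prediction model: a frame is a spike point (1)
# when some non-blank label's posterior reaches the threshold, otherwise 0.
import torch

def spike_teacher_labels(posteriors, blank_id=0, threshold=0.5):
    # posteriors: (frames, num_labels); drop the blank column before taking max.
    non_blank = torch.cat(
        [posteriors[:, :blank_id], posteriors[:, blank_id + 1:]], dim=1)
    return (non_blank.max(dim=1).values >= threshold).long()   # (frames,)
```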
The prediction learning unit 15e also uses the generated spike sequence and the extracted feature sequence to learn a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a. The multi-task learning model is, for example, an encoder-decoder model, although it is not limited to this.

Specifically, as described above, the prediction learning unit 15e realizes multi-task learning by sharing the encoder between the speech recognition task and the spike point prediction task and training a decoder for the spike point prediction task. This allows the speech recognition device 10 to improve accuracy and reduce the model size.
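A minimal sketch of such a multi-task model, again assuming PyTorch, is shown below: one shared encoder feeds both a CTC recognition head and a binary spike-point head, and the two losses are combined. The interpolation weight alpha, the layer sizes, and the omission of padding masks are assumptions made for the example.

```python
# A sketch of the multi-task model: the encoder is shared by the speech
# recognition (CTC) task and the spike point prediction (binary) task,
# and each task has its own decoder/head.
import torch
import torch.nn as nn

class MultiTaskCTCModel(nn.Module):
    def __init__(self, feat_dim, num_labels, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.asr_head = nn.Linear(hidden, num_labels)    # speech recognition decoder
        self.spike_head = nn.Linear(hidden, 2)           # spike point prediction decoder

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        enc, _ = self.encoder(feats)
        return self.asr_head(enc), self.spike_head(enc)

def multitask_loss(asr_logits, spike_logits, targets, feat_lens, target_lens,
                   spike_labels, alpha=0.5):
    # CTC loss for the recognition task.
    log_probs = asr_logits.log_softmax(dim=-1).transpose(0, 1)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
        log_probs, targets, feat_lens, target_lens)
    # Frame-wise binary classification loss for the spike point task
    # (padding frames are ignored here for simplicity).
    bce = nn.CrossEntropyLoss()(spike_logits.reshape(-1, 2),
                                spike_labels.reshape(-1))
    return ctc + alpha * bce
```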
Here, the output of the spike point prediction model 14b, which is a binary classification model, is the prediction result for time t+1 given the input at time t. When the prediction result is 1, the decoder of the speech recognition model is run on the input at time t+1.

The prediction learning unit 15e therefore predicts, from the input at time t, whether there will be a spike point at time t+1. Accordingly, the target label at the next time step (+1) relative to the input is used for training the speech recognition model 14a.

However, the method is not limited to this. For example, the model may predict the time at which the next spike point appears, such as outputting the label 5 for the input at time t to indicate that the next spike point appears at time t+5. In this case, the prediction learning unit 15e does not use the intermediate frames as model input; it inputs the feature amount of the frame at time t+5 to the speech recognition model 14a and also to the spike point prediction model 14b, where it is used to predict the next spike point.
The recognition unit 15f performs speech recognition by inputting into the trained speech recognition model 14a the feature amounts of the speech signal at the spike points predicted with the spike point prediction model 14b. For example, the recognition unit 15f uses the multi-task learning model trained by the prediction learning unit 15e to perform spike point prediction and speech recognition at the same time.

In doing so, as described above, the recognition unit 15f performs speech recognition by inputting to the speech recognition model 14a the feature amount of the speech signal frame at the predicted time of the next spike point, relative to the time at which a speech signal frame was input. In this way, the speech recognition device 10 can speed up speech recognition by running the speech recognition model 14a only on the feature amounts of the spike points.
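An end-to-end inference sketch along these lines is given below. For simplicity it uses two separate models (a lightweight spike point predictor and the CTC recognizer) and a frame mask in place of the one-step-ahead (t to t+1) bookkeeping described above, so it is an assumption-laden illustration of decoding only spike-point frames, not the patent's exact procedure.

```python
# A simplified sketch of recognition unit 15f: the spike point prediction
# model marks which frames are spike points, and only those frames' features
# are fed to the CTC speech recognition model and greedily decoded.
import torch

@torch.no_grad()
def recognize(spike_model, asr_model, feats, threshold=0.5, blank_id=0):
    # 1) The (lightweight) spike point prediction model scores every frame.
    spike_prob = spike_model(feats).softmax(dim=-1)[0, :, 1]       # (frames,)
    spike_frames = (spike_prob >= threshold).nonzero(as_tuple=True)[0]
    # 2) Only the features of the predicted spike-point frames are given to
    #    the CTC speech recognition model.
    selected = feats[:, spike_frames, :]
    ids = asr_model(selected)[0].argmax(dim=-1)
    # 3) Greedy CTC decoding: collapse repeats and drop blanks.
    hypothesis, prev = [], blank_id
    for i in ids.tolist():
        if i != blank_id and i != prev:
            hypothesis.append(i)
        prev = i
    return hypothesis
```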
[Speech recognition processing]
Next, the speech recognition processing performed by the speech recognition device 10 will be described. FIGS. 2 and 3 are flowcharts showing the speech recognition processing procedure. The speech recognition processing of this embodiment includes a learning process and a recognition process. First, FIG. 2 shows the learning processing procedure. The flowchart of FIG. 2 is started, for example, at the timing when an input instructing the start of the learning process is received.
First, the acquisition unit 15a acquires a speech signal (step S1). The extraction unit 15b then extracts a sequence of feature amounts for each frame of the speech signal (step S2); specifically, it extracts acoustic features from the digital speech signal and obtains a feature sequence for each utterance.

Next, the learning unit 15c trains the end-to-end speech recognition model 14a using CTC with the extracted feature sequence (step S3).

The generation unit 15d then uses the trained speech recognition model 14a to generate a spike sequence, which is the posterior probability sequence of the labels that the speech recognition model 14a outputs as spikes (step S4).

Finally, the prediction learning unit 15e uses the generated spike sequence to learn the spike point prediction model 14b, which predicts when a spike will be output (step S5). For example, the prediction learning unit 15e uses the generated spike sequence and the extracted feature sequence to learn a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a. This completes the series of learning processes.
Next, FIG. 3 shows the recognition processing procedure. The flowchart of FIG. 3 is started, for example, at the timing when an input instructing the start of the recognition process is received.

First, the acquisition unit 15a acquires a speech signal to be recognized (step S11).

Next, the recognition unit 15f performs speech recognition by inputting into the trained speech recognition model 14a the feature amounts of the speech signal at the spike points predicted with the spike point prediction model 14b (step S12). For example, the recognition unit 15f uses the multi-task learning model trained by the prediction learning unit 15e to perform spike point prediction and speech recognition at the same time. This completes the series of recognition processes.
As described above, in the speech recognition device 10 of this embodiment, the extraction unit 15b extracts a sequence of feature amounts for each frame of the speech signal. The learning unit 15c trains the speech recognition model 14a, which uses CTC, with the extracted feature sequence. The generation unit 15d uses the trained speech recognition model 14a to generate a spike sequence, which is the sequence of labels that the model outputs as spikes. The prediction learning unit 15e uses the generated spike sequence to learn the spike point prediction model 14b, which predicts the point in time at which a spike will be output. Specifically, the spike point prediction model 14b is a binary classification model.

In this way, the speech recognition device 10 predicts the frames in which spikes of the speech signal appear and limits the input frames for speech recognition to those frames. As a result, the speech recognition device 10 can speed up speech recognition by a speech recognition model using CTC.

The prediction learning unit 15e also uses the generated spike sequence and the extracted feature sequence to learn a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a; for example, the multi-task learning model is an encoder-decoder model. This allows the speech recognition device 10 to improve the accuracy of speech recognition and reduce the model size.

Furthermore, the recognition unit 15f performs speech recognition by inputting into the trained speech recognition model 14a the feature amounts of the speech signal at the spike points predicted with the spike point prediction model 14b. This makes it possible to keep the processing load down and perform speech recognition at high speed.
[Program]
A program that describes, in a computer-executable language, the processing executed by the speech recognition device 10 according to the above embodiment can also be created. In one embodiment, the speech recognition device 10 can be implemented by installing a speech recognition program that executes the above speech recognition processing on a desired computer as packaged software or online software. For example, an information processing apparatus can be made to function as the speech recognition device 10 by causing it to execute the above speech recognition program. Such information processing apparatuses also include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants). The functions of the speech recognition device 10 may also be implemented on a cloud server.
FIG. 4 is a diagram showing an example of a computer that executes the speech recognition program. The computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041, into which a removable storage medium such as a magnetic disk or an optical disk is inserted. A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example. A display 1061 is connected to the video adapter 1060, for example.

Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored, for example, in the hard disk drive 1031 or the memory 1010.

The speech recognition program is stored in the hard disk drive 1031, for example, as a program module 1093 in which the commands to be executed by the computer 1000 are described. Specifically, a program module 1093 describing each process executed by the speech recognition device 10 of the above embodiment is stored in the hard disk drive 1031.

Data used for information processing by the speech recognition program is stored as program data 1094, for example, in the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary and executes each of the procedures described above.

The program module 1093 and the program data 1094 related to the speech recognition program are not limited to being stored in the hard disk drive 1031; for example, they may be stored on a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, they may be stored on another computer connected via a network such as a LAN (Local Area Network) or WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.
Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the descriptions and drawings that form part of this disclosure. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art on the basis of this embodiment are all included within the scope of the present invention.
 10 Speech recognition device
 11 Input unit
 12 Output unit
 13 Communication control unit
 14 Storage unit
 14a Speech recognition model
 14b Spike point prediction model
 15 Control unit
 15a Acquisition unit
 15b Extraction unit
 15c Learning unit
 15d Generation unit
 15e Prediction learning unit
 15f Recognition unit

Claims (7)

  1.  A speech recognition method executed by a speech recognition device, the method comprising:
      an extraction step of extracting a sequence of feature amounts for each frame of a speech signal;
      a learning step of training a speech recognition model using CTC (Connectionist Temporal Classification) with the extracted feature sequence;
      a generation step of generating, using the trained speech recognition model, a spike sequence that is a sequence of labels the speech recognition model outputs as spikes; and
      a prediction learning step of learning, using the generated spike sequence, a spike point prediction model that predicts the point in time at which a spike will be output.
  2.  The speech recognition method according to claim 1, wherein the spike point prediction model is a binary classification model.
  3.  The speech recognition method according to claim 1, wherein, in the prediction learning step, a multitask learning model that integrates the spike point prediction model and the speech recognition model is learned using the generated spike sequence and the extracted sequence of feature amounts.
  4.  The speech recognition method according to claim 3, wherein the multitask learning model is an encoder-decoder model.
  5.  The speech recognition method according to claim 1, further comprising a recognition step of performing speech recognition by inputting, into the learned speech recognition model, the feature amounts of the speech signal at the spike points predicted using the spike point prediction model.
  6.  A speech recognition device comprising:
      an extraction unit that extracts a sequence of feature amounts for each frame of a speech signal;
      a learning unit that learns a speech recognition model using CTC (Connectionist Temporal Classification) with the extracted sequence of feature amounts;
      a generation unit that generates, using the learned speech recognition model, a spike sequence that is a sequence of labels output as spikes by the speech recognition model; and
      a prediction learning unit that learns, using the generated spike sequence, a spike point prediction model that predicts the time points at which the spikes are output.
  7.  A speech recognition program for causing a computer to execute:
      an extraction step of extracting a sequence of feature amounts for each frame of a speech signal;
      a learning step of learning a speech recognition model using CTC (Connectionist Temporal Classification) with the extracted sequence of feature amounts;
      a generation step of generating, using the learned speech recognition model, a spike sequence that is a sequence of labels output as spikes by the speech recognition model; and
      a prediction learning step of learning, using the generated spike sequence, a spike point prediction model that predicts the time points at which the spikes are output.
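
Claims 1 and 2 can be illustrated with the following minimal PyTorch sketch. The model sizes, layer choices, optimizer settings, and function names are assumptions made for illustration only and are not specified by the claims: a frame-synchronous acoustic model is trained with the CTC loss, a spike sequence is read off from its frame-wise outputs (a frame is a spike point when the most probable label is not the CTC blank), and a per-frame binary classifier is then trained on that spike sequence.

# Minimal sketch of claims 1-2 (assumed sizes and names, toy data for shape illustration only).
import torch
import torch.nn as nn

BLANK = 0  # index of the CTC blank label

class CTCSpeechRecognizer(nn.Module):
    """Frame-synchronous acoustic model trained with CTC (learning step of claim 1)."""
    def __init__(self, feat_dim=80, hidden=256, vocab=50):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        h, _ = self.encoder(feats)
        return self.out(h)                     # logits: (batch, frames, vocab)

class SpikePointPredictor(nn.Module):
    """Binary classifier that predicts, frame by frame, whether a spike is output (claim 2)."""
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, feats):
        h, _ = self.encoder(feats)
        return self.out(h).squeeze(-1)         # logits: (batch, frames)

def spike_sequence(asr, feats):
    """Generation step: a frame is a spike point when the most probable label is not blank."""
    with torch.no_grad():
        labels = asr(feats).argmax(dim=-1)     # (batch, frames)
    return labels, (labels != BLANK).float()   # label sequence and binary spike targets

asr, predictor = CTCSpeechRecognizer(), SpikePointPredictor()
ctc_loss, bce_loss = nn.CTCLoss(blank=BLANK), nn.BCEWithLogitsLoss()
opt_asr = torch.optim.Adam(asr.parameters(), lr=1e-3)
opt_pred = torch.optim.Adam(predictor.parameters(), lr=1e-3)

feats = torch.randn(4, 120, 80)                # extracted per-frame feature sequences
targets = torch.randint(1, 50, (4, 20))        # reference label sequences
in_lens, tgt_lens = torch.full((4,), 120), torch.full((4,), 20)

# Learning step: train the CTC speech recognition model.
log_probs = asr(feats).log_softmax(-1).transpose(0, 1)   # (frames, batch, vocab)
opt_asr.zero_grad()
ctc_loss(log_probs, targets, in_lens, tgt_lens).backward()
opt_asr.step()

# Generation and prediction learning steps: train the spike point prediction model.
_, spike_targets = spike_sequence(asr, feats)
opt_pred.zero_grad()
bce_loss(predictor(feats), spike_targets).backward()
opt_pred.step()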
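
Claims 3 and 4 integrate the two tasks into a single multitask learning model. The sketch below uses a shared encoder with a recognition (CTC) head and a spike point (binary) head trained on a weighted sum of the two losses; the 0.5 loss weight and the layer sizes are assumptions, and the attention decoder implied by the encoder-decoder form of claim 4 is omitted for brevity.

# Minimal sketch of claims 3-4: joint training of the speech recognition task and the
# spike point prediction task with a shared encoder (assumed sizes; decoder omitted).
import torch
import torch.nn as nn

class MultiTaskSpikeCTC(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=50):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab)   # speech recognition task
        self.spike_head = nn.Linear(hidden, 1)     # spike point prediction task

    def forward(self, feats):
        h, _ = self.encoder(feats)
        return self.ctc_head(h), self.spike_head(h).squeeze(-1)

model = MultiTaskSpikeCTC()
ctc_loss, bce_loss = nn.CTCLoss(blank=0), nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.randn(4, 120, 80)                        # extracted feature sequences
targets = torch.randint(1, 50, (4, 20))                # reference label sequences
in_lens, tgt_lens = torch.full((4,), 120), torch.full((4,), 20)
spike_targets = torch.randint(0, 2, (4, 120)).float()  # from the generated spike sequence

asr_logits, spike_logits = model(feats)
loss = ctc_loss(asr_logits.log_softmax(-1).transpose(0, 1), targets, in_lens, tgt_lens)
loss = loss + 0.5 * bce_loss(spike_logits, spike_targets)   # 0.5 weight is an assumption
opt.zero_grad()
loss.backward()
opt.step()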
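
Claim 5 uses the learned spike point prediction model at recognition time so that only the features at predicted spike points are fed to the speech recognition model, which reduces the number of input frames to decode. The sketch below reuses the CTCSpeechRecognizer and SpikePointPredictor classes assumed in the first sketch; the 0.5 decision threshold is also an assumption.

# Minimal sketch of claim 5: decode only the frames at predicted spike points
# (model classes are those assumed in the first sketch above).
import torch

def recognize(asr, spike_predictor, feats, threshold=0.5):
    """feats: (1, frames, feat_dim) feature sequence of one utterance."""
    with torch.no_grad():
        spike_prob = torch.sigmoid(spike_predictor(feats))   # (1, frames)
        keep = spike_prob[0] >= threshold                     # predicted spike points
        reduced = feats[:, keep, :]                           # fewer frames to decode
        hyp = asr(reduced).argmax(dim=-1)[0]                  # greedy frame-wise labels
    # Greedy CTC collapse: merge repeated labels and drop the blank label (index 0).
    out, prev = [], None
    for lab in hyp.tolist():
        if lab != 0 and lab != prev:
            out.append(lab)
        prev = lab
    return out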
PCT/JP2021/022414 2021-06-11 2021-06-11 Voice recognition method, voice recognition device, and voice recognition program WO2022259555A1 (en)

Priority Applications (1)

Application Number: PCT/JP2021/022414 (WO2022259555A1)
Priority Date: 2021-06-11
Filing Date: 2021-06-11
Title: Voice recognition method, voice recognition device, and voice recognition program

Applications Claiming Priority (1)

Application Number: PCT/JP2021/022414 (WO2022259555A1)
Priority Date: 2021-06-11
Filing Date: 2021-06-11
Title: Voice recognition method, voice recognition device, and voice recognition program

Publications (1)

Publication Number: WO2022259555A1 (en)

Family

ID=84424561

Family Applications (1)

Application Number: PCT/JP2021/022414 (WO2022259555A1)
Priority Date: 2021-06-11
Filing Date: 2021-06-11
Title: Voice recognition method, voice recognition device, and voice recognition program

Country Status (1)

Country Link
WO (1) WO2022259555A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020112787A (en) * 2019-01-08 2020-07-27 バイドゥ オンライン ネットワーク テクノロジー (ベイジン) カンパニー リミテッド Real-time voice recognition method based on cutting attention, device, apparatus and computer readable storage medium
JP2021039219A (en) * 2019-09-02 2021-03-11 日本電信電話株式会社 Speech signal processing device, speech signal processing method, speech signal process program, learning device, learning method, and learning program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIROFUMI INAKUMA; MASATO MIMURA; TATSUYA KAWAHARA: "Speech recognition by streaming attention mechanism type sequence-to-sequence model", IPSJ SIG TECHNICAL REPORT, vol. 2020-SLP-131, no. 9, 6 February 2020 (2020-02-06), JP, pages 1 - 7, XP009535113 *

Legal Events

Code Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21945224; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21945224; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)