WO2022259555A1 - Voice recognition method, voice recognition device, and voice recognition program - Google Patents

Voice recognition method, voice recognition device, and voice recognition program

Info

Publication number
WO2022259555A1
WO2022259555A1 (PCT/JP2021/022414, JP2021022414W)
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
model
spike
learning
sequence
Prior art date
Application number
PCT/JP2021/022414
Other languages
French (fr)
Japanese (ja)
Inventor
孝典 芦原
太一 浅見
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2021/022414 priority Critical patent/WO2022259555A1/en
Publication of WO2022259555A1 publication Critical patent/WO2022259555A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Definitions

  • The present invention relates to a speech recognition method, a speech recognition device, and a speech recognition program.
  • In general, for speech recognition, features are extracted by applying a window of fixed width to the input speech waveform; the window is shifted by a fixed step so that a feature sequence is generated frame by frame and used as the input to a speech recognition model (see Patent Document 1).
  • Techniques that speed up speech recognition by reducing the number of feature frames input to the speech recognition model are therefore anticipated (see Non-Patent Documents 1 and 2).
  • For example, in a speech recognition model trained with a CTC (Connectionist Temporal Classification) loss function, the corresponding symbols are output in synchronization with the input frames (see Non-Patent Document 3).
  • With CTC, the posterior probability sequence over the frames is known to often form a spike sequence. Non-Patent Document 3 also describes hybrid models such as the DNN-HMM (Deep Neural Network-Hidden Markov Model), and Non-Patent Document 4 describes learning a speech recognition task and a speaker recognition task simultaneously.
  • …, Bengio, "Dynamic Frame Skipping for Fast Speech Recognition in Recurrent Neural Network Based Acoustic Models", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 2018, pp. 4984-4988, doi: 10.1109/ICASSP.2018.8462615
  • Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber, "Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks", Proceedings of the 23rd International Conference on Machine Learning (ICML '06), New York, NY, USA, 2006, pp. 369-376
  • The present invention has been made in view of the above, and aims to speed up speech recognition by a speech recognition model that uses CTC.
  • To that end, a speech recognition method according to the present invention is executed by a speech recognition device and includes: an extraction step of extracting a sequence of per-frame features from a speech signal; a learning step of training a speech recognition model that uses CTC (Connectionist Temporal Classification) with the extracted feature sequence; a generation step of using the trained speech recognition model to generate a spike sequence, i.e., the sequence of labels the model outputs as spikes; and a prediction learning step of using the generated spike sequence to train a spike point prediction model that predicts the time at which a spike will be output.
  • FIG. 1 is a schematic diagram illustrating a schematic configuration of a speech recognition device.
  • FIG. 2 is a flow chart showing a speech recognition processing procedure.
  • FIG. 3 is a flow chart showing a speech recognition processing procedure.
  • FIG. 4 is a diagram illustrating a computer that executes a speech recognition program.
  • The speech recognition apparatus of this embodiment focuses on the posterior probability sequence, which forms a spike sequence over the frames of a speech recognition model that uses CTC.
  • The spike sequence is a sequence of posterior probabilities in which each target recognition label symbol appears instantaneously, as a spike with an extremely high posterior probability, within a short span of about one to three frames.
  • The remaining frames are mostly occupied, with high posterior probability, by the blank symbol, which carries no recognition label. For example, for a 10-frame input utterance of "特許" ("patent"), if the output label is the two characters "特" and "許", the output looks like "___特___許__", where "_" is the blank symbol.
  • The speech recognition device limits the input frames for speech recognition to the frames in which spikes appear, which makes faster speech recognition possible.
  • To this end, the speech recognition apparatus learns in advance a model that predicts spike points, i.e., the times at which spikes occur, and places it in front of the speech recognition model.
  • This spike point prediction model is trained as a binary classification model, using teacher labels that assign 1 to spike points and 0 to the blank points elsewhere.
  • The speech recognition apparatus then realizes fast speech recognition by inputting to the CTC-based speech recognition model only those frames whose label predicted by the binary classification model is 1.
  • Because a speech recognition model using CTC uses words or characters directly as its output label symbols, restricting the input to the frames corresponding to spikes removes the temporal redundancy of speech and narrows the input down to exactly what is needed to output the target labels.
  • Furthermore, the speech recognition device applies multi-task learning, in which one model learns multiple tasks at the same time and expresses those tasks in a single model.
  • Specifically, the speech recognition apparatus integrates the binary classification model and the speech recognition model, and solves the speech recognition task and the binary classification task simultaneously.
  • For example, in an encoder-decoder speech recognition model, multi-task learning is realized by sharing the encoder between the speech recognition task and the binary classification task and learning a dedicated decoder for the binary classification task.
  • The speech recognition apparatus thereby improves accuracy and reduces the size of the model.
  • FIG. 1 is a schematic diagram illustrating a schematic configuration of a speech recognition device.
  • As illustrated in FIG. 1, the speech recognition apparatus 10 of this embodiment is realized by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
  • The input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator.
  • The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like.
  • The communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication over a network between the control unit 15 and external devices such as a server or a device that acquires acoustic signals.
  • The storage unit 14 is implemented by a semiconductor memory device such as a RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 14 may also be configured to communicate with the control unit 15 via the communication control unit 13. In the present embodiment, the storage unit 14 stores, for example, a speech recognition model 14a and a spike point prediction model 14b used in the speech recognition processing described later.
  • The control unit 15 is implemented using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. The control unit 15 thereby functions as an acquisition unit 15a, an extraction unit 15b, a learning unit 15c, a generation unit 15d, a prediction learning unit 15e, and a recognition unit 15f, as illustrated in FIG. 1.
  • These functional units may each be implemented on different hardware; for example, the learning unit 15c may be implemented as a learning device and the recognition unit 15f as a recognition device. The control unit 15 may also include other functional units.
  • The acquisition unit 15a acquires a speech signal. Specifically, the acquisition unit 15a acquires an analog speech signal and performs A/D conversion to obtain a digital speech signal.
  • The acquisition unit 15a may store the A/D-converted digital speech signal in the storage unit 14, or may immediately transfer it to the extraction unit 15b described below without storing it in the storage unit 14.
  • The extraction unit 15b extracts a sequence of per-frame features from the speech signal. Specifically, the extraction unit 15b extracts acoustic features from the digital speech signal and obtains a feature sequence for each utterance.
  • The extraction unit 15b extracts, for example, features based on MFCC or power computed by short-time frame analysis of the speech signal. For example, the extraction unit 15b uses dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficients), their dynamic parameters such as ΔMFCC and ΔΔMFCC, power, Δpower, ΔΔpower, and the like as features.
  • The extraction unit 15b may apply cepstral mean normalization (CMN) to the MFCC. The features are not limited to those based on MFCC or power; parameters used to identify special utterances, such as autocorrelation peak values and group delay, may also be used.
  • The extraction unit 15b may store the extracted feature sequence in the storage unit 14, or may immediately transfer it to the learning unit 15c described below without storing it in the storage unit 14.
  • The learning unit 15c trains the speech recognition model 14a, which uses CTC, with the extracted feature sequence. Specifically, the learning unit 15c trains the end-to-end speech recognition model 14a using the feature sequences of the teacher data.
  • The speech recognition model 14a uses a CTC loss function, so that the sequence of labels it outputs forms a spike sequence.
  • The generation unit 15d uses the trained speech recognition model 14a to generate a spike sequence, that is, the sequence of labels the speech recognition model 14a outputs as spikes. Specifically, the generation unit 15d has the speech recognition model 14a trained by the learning unit 15c recognize the feature sequences that were used for training, and thereby generates the resulting posterior probability sequence as the spike sequence.
  • The generation unit 15d may store the generated spike sequence in the storage unit 14, or may immediately transfer it to the prediction learning unit 15e described below without storing it in the storage unit 14.
  • The prediction learning unit 15e uses the generated spike sequence to train the spike point prediction model 14b, which predicts the times at which spikes will be output.
  • Here, the spike point prediction model 14b is a binary classification model. The prediction learning unit 15e trains the spike point prediction model 14b using teacher data in which spike points in the spike sequence are labeled 1 and all other, non-spike points are labeled 0.
  • Because the generated spike sequence is a posterior probability sequence, the prediction learning unit 15e uses a predetermined threshold as a parameter and labels a point 1 if its posterior probability is at or above the threshold and 0 if it is below. The larger the threshold, the fewer spike points are predicted; the smaller the threshold, the more spike points are predicted.
  • The prediction learning unit 15e also uses the generated spike sequence and the extracted feature sequence to train a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a. The multi-task learning model is, for example, but not limited to, an encoder-decoder model.
  • Specifically, the prediction learning unit 15e realizes multi-task learning by sharing the encoder between the speech recognition task and the spike point prediction task and training a decoder for the spike point prediction task. The speech recognition apparatus 10 can thereby improve accuracy and reduce the size of the model.
  • The output of the spike point prediction model 14b, which is a binary classification model, is the prediction result for time t+1 given the input at time t. When the prediction result is 1, the decoder of the speech recognition model is run on the input at time t+1.
  • Accordingly, the prediction learning unit 15e predicts whether a spike point occurs at time t+1 for the input at time t, and the target label at the next time (t+1) relative to the input is used for training the speech recognition model 14a.
  • The method is not limited to this; for example, the time at which the next spike point appears may be predicted directly, so that an output label of 5 for the input at time t means that the next spike point appears at time t+5. In this case, the prediction learning unit 15e does not feed the intermediate frames to the models; it inputs the features of the frame at time t+5 to the speech recognition model 14a, and also to the spike point prediction model 14b to predict the following spike point.
  • The recognition unit 15f performs speech recognition by inputting to the trained speech recognition model 14a the features of the speech signal at the spike points predicted by the spike point prediction model 14b.
  • For example, the recognition unit 15f uses the multi-task learning model trained by the prediction learning unit 15e to perform spike point prediction and speech recognition at the same time.
  • In doing so, as described above, the recognition unit 15f inputs to the speech recognition model 14a the features of the speech frame at the predicted time of the next spike point relative to the time at which a speech frame was input, and thereby performs speech recognition.
  • In this way, the speech recognition apparatus 10 can speed up speech recognition by running the speech recognition model 14a only on the features at spike points.
  • FIG. 2 shows the learning processing procedure.
  • The flowchart of FIG. 2 is started, for example, when an input instructing the start of the learning processing is received.
  • The acquisition unit 15a acquires a speech signal (step S1). The extraction unit 15b then extracts a sequence of per-frame features from the speech signal (step S2). Specifically, the extraction unit 15b extracts acoustic features from the digital speech signal and obtains a feature sequence for each utterance.
  • Next, the learning unit 15c trains the end-to-end, CTC-based speech recognition model 14a with the extracted feature sequence (step S3).
  • The generation unit 15d then uses the trained speech recognition model 14a to generate a spike sequence, that is, the posterior probability sequence of the labels the speech recognition model 14a outputs as spikes (step S4).
  • The prediction learning unit 15e uses the generated spike sequence to train the spike point prediction model 14b, which predicts the times at which spikes will be output (step S5).
  • For example, the prediction learning unit 15e trains a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a, using the generated spike sequence and the extracted feature sequence. This completes the learning processing.
  • FIG. 3 shows the recognition processing procedure.
  • The flowchart of FIG. 3 is started, for example, when an input instructing the start of the recognition processing is received.
  • The acquisition unit 15a acquires a speech signal to be recognized (step S11).
  • The recognition unit 15f performs speech recognition by inputting to the trained speech recognition model 14a the features of the speech signal at the spike points predicted by the spike point prediction model 14b (step S12).
  • For example, the recognition unit 15f uses the multi-task learning model trained by the prediction learning unit 15e to perform spike point prediction and speech recognition at the same time. This completes the recognition processing.
  • As described above, in the speech recognition apparatus 10 of this embodiment, the extraction unit 15b extracts a sequence of per-frame features from the speech signal.
  • The learning unit 15c trains the speech recognition model 14a, which uses CTC, with the extracted feature sequence.
  • The generation unit 15d uses the trained speech recognition model 14a to generate a spike sequence, that is, the sequence of labels the speech recognition model outputs as spikes.
  • The prediction learning unit 15e uses the generated spike sequence to train the spike point prediction model 14b, which predicts the times at which spikes will be output. Specifically, the spike point prediction model 14b is a binary classification model.
  • In this way, the speech recognition apparatus 10 predicts the frames in which spikes of the speech signal appear and limits the input frames for speech recognition to those frames. As a result, the speech recognition apparatus 10 can speed up speech recognition with a speech recognition model that uses CTC.
  • The prediction learning unit 15e also trains a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a, using the generated spike sequence and the extracted feature sequence. For example, the multi-task learning model is an encoder-decoder model.
  • The recognition unit 15f performs speech recognition by inputting to the trained speech recognition model 14a the features of the speech signal at the spike points predicted by the spike point prediction model 14b. This makes it possible to perform speech recognition at high speed while keeping the processing load low.
  • The speech recognition apparatus 10 can be implemented by installing, on a desired computer, a speech recognition program that executes the above speech recognition processing as package software or online software.
  • For example, an information processing apparatus can be made to function as the speech recognition apparatus 10 by having it execute the above speech recognition program.
  • Such information processing apparatuses include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).
  • The functions of the speech recognition apparatus 10 may also be implemented in a cloud server.
  • FIG. 4 is a diagram showing an example of a computer that executes the speech recognition program.
  • The computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
  • The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System).
  • The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041, into which a removable storage medium such as a magnetic disk or an optical disk is inserted.
  • A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example, and a display 1061 is connected to the video adapter 1060.
  • The hard disk drive 1031 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. Each piece of information described in the above embodiment is stored, for example, in the hard disk drive 1031 or the memory 1010.
  • The speech recognition program is stored in the hard disk drive 1031, for example, as a program module 1093 in which the commands to be executed by the computer 1000 are described. Specifically, the hard disk drive 1031 stores a program module 1093 describing each process executed by the speech recognition apparatus 10 of the above embodiment.
  • Data used for information processing by the speech recognition program is stored, for example, in the hard disk drive 1031 as program data 1094. The CPU 1020 reads the program module 1093 and the program data 1094 from the hard disk drive 1031 into the RAM 1012 as necessary and executes each of the procedures described above.
  • The program module 1093 and program data 1094 for the speech recognition program are not limited to being stored in the hard disk drive 1031; they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like, or they may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An extraction unit (15b) extracts a sequence of features for each voice signal frame. A learning unit (15c) trains a voice recognition model (14a) which uses CTC, by using the extracted sequence of features. A generation unit (15d) generates a spike sequence, which is a sequence of labels outputted by the voice recognition model as spikes, by using the trained voice recognition model (14a). A prediction learning unit (15e) trains a spike point prediction model (14b) for predicting the point in time at which said spike will be outputted by using the generated spike sequence.

Description

Speech recognition method, speech recognition device, and speech recognition program
 The present invention relates to a speech recognition method, a speech recognition device, and a speech recognition program.
 In general, for speech recognition, features are extracted by applying a window of fixed width to the input speech waveform; the window is then shifted by a fixed step to successively generate a feature sequence, which is used as the input to a speech recognition model (see Patent Document 1).
 In recent years, DNN (Deep Neural Network) models, which are large-scale neural networks, have come to be used as speech recognition models, so the cost of speech recognition is large and the processing time tends to increase. Techniques that speed up speech recognition by reducing the number of feature frames input to the speech recognition model are therefore anticipated (see Non-Patent Documents 1 and 2).
 For example, in a speech recognition model trained with a CTC (Connectionist Temporal Classification) loss function, the corresponding symbols are output in synchronization with the input frames (see Non-Patent Document 3).
 It is known that, with CTC, the posterior probability sequence over the frames often becomes a spike sequence. Non-Patent Document 3 also describes hybrid models such as the DNN-HMM (Deep Neural Network-Hidden Markov Model), and Non-Patent Document 4 describes learning a speech recognition task and a speaker recognition task simultaneously.
 JP 2007-249051 A
 With the conventional techniques, however, it has been difficult to sufficiently speed up speech recognition with a speech recognition model that uses CTC. In a conventional hybrid model such as a DNN-HMM, labels are assigned to every frame of the corresponding speech interval rather than to spikes, so it has been difficult to reduce the number of feature frames input to the speech recognition model.
 The present invention has been made in view of the above, and aims to speed up speech recognition by a speech recognition model that uses CTC.
 To solve the above problem and achieve this aim, a speech recognition method according to the present invention is executed by a speech recognition device and includes: an extraction step of extracting a sequence of per-frame features from a speech signal; a learning step of training a speech recognition model that uses CTC (Connectionist Temporal Classification) with the extracted feature sequence; a generation step of using the trained speech recognition model to generate a spike sequence, i.e., the sequence of labels the model outputs as spikes; and a prediction learning step of using the generated spike sequence to train a spike point prediction model that predicts the time at which a spike will be output.
 According to the present invention, it is possible to speed up speech recognition by a speech recognition model that uses CTC.
 FIG. 1 is a schematic diagram illustrating a schematic configuration of a speech recognition device. FIG. 2 is a flow chart showing a speech recognition processing procedure. FIG. 3 is a flow chart showing a speech recognition processing procedure. FIG. 4 is a diagram illustrating a computer that executes a speech recognition program.
 An embodiment of the present invention will be described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the drawings, the same parts are denoted by the same reference numerals.
[Overview of speech recognition device]
 The speech recognition apparatus of this embodiment focuses on the posterior probability sequence, which forms a spike sequence over the frames of a speech recognition model that uses CTC. The spike sequence is a sequence of posterior probabilities in which each target recognition label symbol appears instantaneously, as a spike with an extremely high posterior probability, within a short span of about one to three frames. The remaining frames are mostly occupied, with high posterior probability, by the blank symbol, which carries no recognition label. For example, for a 10-frame input utterance of "特許" ("patent"), if the output label is the two characters "特" and "許", the output looks like "___特___許__", where "_" is the blank symbol.
 The speech recognition device limits the input frames for speech recognition to the frames in which spikes appear, which makes faster speech recognition possible. To this end, the speech recognition apparatus learns in advance a model that predicts spike points, i.e., the times at which spikes occur, and places it in front of the speech recognition model. This spike point prediction model is trained as a binary classification model, using teacher labels that assign 1 to spike points and 0 to the blank points elsewhere.
 The speech recognition apparatus then realizes fast speech recognition by inputting to the CTC-based speech recognition model only those frames whose label predicted by the binary classification model is 1. Because a speech recognition model using CTC uses words or characters directly as its output label symbols, restricting the input to the frames corresponding to spikes removes the temporal redundancy of speech and narrows the input down to exactly what is needed to output the target labels.
 Furthermore, the speech recognition device applies multi-task learning, in which one model learns multiple tasks at the same time and expresses those tasks in a single model. Specifically, the speech recognition apparatus integrates the binary classification model and the speech recognition model, and solves the speech recognition task and the binary classification task simultaneously. For example, in an encoder-decoder speech recognition model, multi-task learning is realized by sharing the encoder between the speech recognition task and the binary classification task and learning a dedicated decoder for the binary classification task. The speech recognition apparatus thereby improves accuracy and reduces the size of the model.
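 The shared-encoder arrangement described above can be illustrated with a minimal Python/PyTorch sketch. It is not part of the patent disclosure: the encoder type, layer sizes, label inventory, and the loss weight alpha are all hypothetical choices made only for illustration.

```python
import torch.nn as nn

class MultiTaskSpeechModel(nn.Module):
    """Shared encoder with a CTC recognition head and a binary spike-point head."""
    def __init__(self, feat_dim=36, hidden=256, num_labels=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.asr_head = nn.Linear(2 * hidden, num_labels + 1)  # recognition labels + blank
        self.spike_head = nn.Linear(2 * hidden, 2)              # spike / non-spike

    def forward(self, x):                                        # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)
        return self.asr_head(h).log_softmax(-1), self.spike_head(h).log_softmax(-1)

def multitask_loss(asr_out, spike_out, targets, feat_lens, target_lens, spike_labels,
                   ctc_loss=nn.CTCLoss(blank=0, zero_infinity=True), alpha=0.3):
    """Joint loss: CTC for the recognition task plus cross-entropy for the binary task.

    The mixing weight alpha=0.3 is an assumption; the publication gives no ratio.
    """
    l_asr = ctc_loss(asr_out.transpose(0, 1), targets, feat_lens, target_lens)
    l_spk = nn.functional.nll_loss(spike_out.reshape(-1, 2), spike_labels.reshape(-1))
    return l_asr + alpha * l_spk
```

 In this sketch the binary head plays the role of the decoder for the binary classification task, while the CTC head corresponds to the recognition decoder that both tasks share an encoder with.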
[Configuration of speech recognition device]
 FIG. 1 is a schematic diagram illustrating a schematic configuration of the speech recognition device. As illustrated in FIG. 1, the speech recognition apparatus 10 of this embodiment is realized by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
 The input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator. The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like. The communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication over a network between the control unit 15 and external devices such as a server or a device that acquires acoustic signals.
 The storage unit 14 is implemented by a semiconductor memory device such as a RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 14 may also be configured to communicate with the control unit 15 via the communication control unit 13. In the present embodiment, the storage unit 14 stores, for example, a speech recognition model 14a and a spike point prediction model 14b used in the speech recognition processing described later.
 The control unit 15 is implemented using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. The control unit 15 thereby functions as an acquisition unit 15a, an extraction unit 15b, a learning unit 15c, a generation unit 15d, a prediction learning unit 15e, and a recognition unit 15f, as illustrated in FIG. 1. These functional units may each be implemented on different hardware; for example, the learning unit 15c may be implemented as a learning device and the recognition unit 15f as a recognition device. The control unit 15 may also include other functional units.
 The acquisition unit 15a acquires a speech signal. Specifically, the acquisition unit 15a acquires an analog speech signal and performs A/D conversion to obtain a digital speech signal. The acquisition unit 15a may store the A/D-converted digital speech signal in the storage unit 14, or may immediately transfer it to the extraction unit 15b described below without storing it in the storage unit 14.
 The extraction unit 15b extracts a sequence of per-frame features from the speech signal. Specifically, the extraction unit 15b extracts acoustic features from the digital speech signal and obtains a feature sequence for each utterance. The extraction unit 15b extracts, for example, features based on MFCC or power computed by short-time frame analysis of the speech signal. For example, the extraction unit 15b uses dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficients), their dynamic parameters such as ΔMFCC and ΔΔMFCC, power, Δpower, ΔΔpower, and the like as features.
 The extraction unit 15b may apply cepstral mean normalization (CMN) to the MFCC. The features are not limited to those based on MFCC or power; parameters used to identify special utterances, such as autocorrelation peak values and group delay, may also be used.
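 As a rough illustration of this kind of frame-level feature extraction, the following sketch uses the librosa library, which the publication does not mention; the sampling rate, frame width, frame shift, and the choice of 12 MFCC dimensions with Δ and ΔΔ (power features omitted) are assumptions made only for the example.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, frame_len=0.025, frame_shift=0.010):
    """Frame-level MFCC + delta + delta-delta features with cepstral mean normalization."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=12,
        n_fft=int(sr * frame_len), hop_length=int(sr * frame_shift))
    d1 = librosa.feature.delta(mfcc, order=1)       # ΔMFCC
    d2 = librosa.feature.delta(mfcc, order=2)       # ΔΔMFCC
    feats = np.vstack([mfcc, d1, d2]).T             # (num_frames, 36)
    feats -= feats.mean(axis=0, keepdims=True)      # CMN over the utterance
    return feats
```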
 The extraction unit 15b may store the extracted feature sequence in the storage unit 14, or may immediately transfer it to the learning unit 15c described below without storing it in the storage unit 14.
 The learning unit 15c trains the speech recognition model 14a, which uses CTC, with the extracted feature sequence. Specifically, the learning unit 15c trains the end-to-end speech recognition model 14a using the feature sequences of the teacher data. The speech recognition model 14a uses a CTC loss function, so that the sequence of labels it outputs forms a spike sequence.
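 A minimal sketch of such CTC-based training, assuming a PyTorch implementation with a small BLSTM encoder (the publication does not specify an architecture), could look as follows; the blank index, label inventory size, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    """Toy BLSTM acoustic model trained with a CTC loss (architecture is hypothetical)."""
    def __init__(self, feat_dim=36, hidden=256, num_labels=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_labels + 1)   # +1 for the blank symbol

    def forward(self, x):                                   # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)
        return self.out(h).log_softmax(dim=-1)              # frame-wise label log-posteriors

model = CTCAcousticModel()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(feats, feat_lens, labels, label_lens):
    log_probs = model(feats).transpose(0, 1)                # CTCLoss expects (frames, batch, labels)
    loss = ctc_loss(log_probs, labels, feat_lens, label_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```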
 The generation unit 15d uses the trained speech recognition model 14a to generate a spike sequence, that is, the sequence of labels the speech recognition model 14a outputs as spikes. Specifically, the generation unit 15d has the speech recognition model 14a trained by the learning unit 15c recognize the feature sequences that were used for training, and thereby generates the resulting posterior probability sequence as the spike sequence.
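 Reusing the toy model above, generating the spike (posterior probability) sequence for a training utterance might look like the following sketch; treating the non-blank posterior mass as the spike strength is an assumption made only for illustration.

```python
import torch

@torch.no_grad()
def generate_spike_sequence(model, feats):
    """Return the frame-wise posteriors; the non-blank mass spikes at only a few frames."""
    model.eval()
    log_probs = model(feats.unsqueeze(0)).squeeze(0)   # (frames, num_labels + 1)
    posteriors = log_probs.exp()
    non_blank_prob = 1.0 - posteriors[:, 0]            # assumes the blank symbol has index 0
    return posteriors, non_blank_prob
```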
 The generation unit 15d may store the generated spike sequence in the storage unit 14, or may immediately transfer it to the prediction learning unit 15e described below without storing it in the storage unit 14.
 The prediction learning unit 15e uses the generated spike sequence to train the spike point prediction model 14b, which predicts the times at which spikes will be output.
 Here, the spike point prediction model 14b is a binary classification model. The prediction learning unit 15e trains the spike point prediction model 14b using teacher data in which spike points in the spike sequence are labeled 1 and all other, non-spike points are labeled 0.
 Because the generated spike sequence is a posterior probability sequence, the prediction learning unit 15e uses a predetermined threshold as a parameter and labels a point 1 if its posterior probability is at or above the threshold and 0 if it is below. The larger the threshold, the fewer spike points are predicted; the smaller the threshold, the more spike points are predicted.
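 A sketch of this thresholding step, continuing the illustration above, is shown below; the default threshold value of 0.5 is an assumption, since the publication only states that the threshold is a parameter.

```python
def make_spike_teacher_labels(non_blank_prob, threshold=0.5):
    """Label a frame 1 if its non-blank posterior reaches the threshold, else 0.

    A larger threshold yields fewer predicted spike points; a smaller one, more.
    """
    return (non_blank_prob >= threshold).long()

# Example: labels = make_spike_teacher_labels(non_blank_prob, threshold=0.7)
```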
 The prediction learning unit 15e also uses the generated spike sequence and the extracted feature sequence to train a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a. The multi-task learning model is, for example, but not limited to, an encoder-decoder model.
 Specifically, as described above, the prediction learning unit 15e realizes multi-task learning by sharing the encoder between the speech recognition task and the spike point prediction task and training a decoder for the spike point prediction task. The speech recognition apparatus 10 can thereby improve accuracy and reduce the size of the model.
 Here, the output of the spike point prediction model 14b, which is a binary classification model, is the prediction result for time t+1 given the input at time t. When the prediction result is 1, the decoder of the speech recognition model is run on the input at time t+1.
 Accordingly, the prediction learning unit 15e predicts whether a spike point occurs at time t+1 for the input at time t, and the target label at the next time (t+1) relative to the input is used for training the speech recognition model 14a.
 The method is not limited to this; for example, the time at which the next spike point appears may be predicted directly, so that an output label of 5 for the input at time t means that the next spike point appears at time t+5. In this case, the prediction learning unit 15e does not feed the intermediate frames to the models; it inputs the features of the frame at time t+5 to the speech recognition model 14a, and also to the spike point prediction model 14b to predict the following spike point.
 The recognition unit 15f performs speech recognition by inputting to the trained speech recognition model 14a the features of the speech signal at the spike points predicted by the spike point prediction model 14b. For example, the recognition unit 15f uses the multi-task learning model trained by the prediction learning unit 15e to perform spike point prediction and speech recognition at the same time.
 In doing so, as described above, the recognition unit 15f inputs to the speech recognition model 14a the features of the speech frame at the predicted time of the next spike point relative to the time at which a speech frame was input, and thereby performs speech recognition. In this way, the speech recognition apparatus 10 can speed up speech recognition by running the speech recognition model 14a only on the features at spike points.
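 The following sketch, continuing the earlier multi-task illustration, shows a recognition loop in which the recognition head is run only on frames gated by the spike-point prediction for time t+1. The greedy per-frame label picking and the fact that the shared encoder still processes every frame are simplifications for the example, not the procedure of the publication.

```python
import torch

@torch.no_grad()
def recognize_spike_frames_only(model, feats):
    """Decode only the frames that the spike head predicts as spike points."""
    model.eval()
    h, _ = model.encoder(feats.unsqueeze(0))            # (1, frames, 2*hidden)
    h = h.squeeze(0)
    spike_pred = model.spike_head(h).argmax(dim=-1)     # 1 means a spike is expected at t+1
    hypothesis, run_next = [], True                     # always look at the first frame
    for t in range(h.size(0)):
        if run_next:
            label = int(model.asr_head(h[t]).argmax())
            if label != 0:                               # 0 = blank symbol
                hypothesis.append(label)
        run_next = bool(spike_pred[t] == 1)              # gate frame t+1 on this prediction
    return hypothesis
```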
[Speech recognition processing]
 Next, the speech recognition processing performed by the speech recognition device 10 will be described. FIGS. 2 and 3 are flowcharts showing the speech recognition processing procedure. The speech recognition processing of this embodiment includes learning processing and recognition processing. First, FIG. 2 shows the learning processing procedure. The flowchart of FIG. 2 is started, for example, when an input instructing the start of the learning processing is received.
 First, the acquisition unit 15a acquires a speech signal (step S1). The extraction unit 15b then extracts a sequence of per-frame features from the speech signal (step S2). Specifically, the extraction unit 15b extracts acoustic features from the digital speech signal and obtains a feature sequence for each utterance.
 Next, the learning unit 15c trains the end-to-end, CTC-based speech recognition model 14a with the extracted feature sequence (step S3).
 The generation unit 15d then uses the trained speech recognition model 14a to generate a spike sequence, that is, the posterior probability sequence of the labels the speech recognition model 14a outputs as spikes (step S4).
 Finally, the prediction learning unit 15e uses the generated spike sequence to train the spike point prediction model 14b, which predicts the times at which spikes will be output (step S5). For example, the prediction learning unit 15e trains a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a, using the generated spike sequence and the extracted feature sequence. This completes the learning processing.
 Next, FIG. 3 shows the recognition processing procedure. The flowchart of FIG. 3 is started, for example, when an input instructing the start of the recognition processing is received.
 First, the acquisition unit 15a acquires a speech signal to be recognized (step S11).
 Next, the recognition unit 15f performs speech recognition by inputting to the trained speech recognition model 14a the features of the speech signal at the spike points predicted by the spike point prediction model 14b (step S12). For example, the recognition unit 15f uses the multi-task learning model trained by the prediction learning unit 15e to perform spike point prediction and speech recognition at the same time. This completes the recognition processing.
 As described above, in the speech recognition device 10 of this embodiment, the extraction unit 15b extracts a sequence of per-frame features from the speech signal. The learning unit 15c trains the speech recognition model 14a, which uses CTC, with the extracted feature sequence. The generation unit 15d uses the trained speech recognition model 14a to generate a spike sequence, that is, the sequence of labels the model outputs as spikes. The prediction learning unit 15e uses the generated spike sequence to train the spike point prediction model 14b, which predicts the times at which spikes will be output. Specifically, the spike point prediction model 14b is a binary classification model.
 In this way, the speech recognition device 10 predicts the frames in which spikes of the speech signal appear and limits the input frames for speech recognition to those frames. As a result, the speech recognition device 10 can speed up speech recognition with a speech recognition model that uses CTC.
 The prediction learning unit 15e also trains a multi-task learning model that integrates the spike point prediction model 14b and the speech recognition model 14a, using the generated spike sequence and the extracted feature sequence. For example, the multi-task learning model is an encoder-decoder model. This allows the speech recognition device 10 to improve recognition accuracy and reduce the size of the model.
 The recognition unit 15f performs speech recognition by inputting to the trained speech recognition model 14a the features of the speech signal at the spike points predicted by the spike point prediction model 14b. This makes it possible to perform speech recognition at high speed while keeping the processing load low.
[Program]
 A program in which the processing executed by the speech recognition device 10 of the above embodiment is described in a computer-executable language can also be created. In one embodiment, the speech recognition device 10 can be implemented by installing, on a desired computer, a speech recognition program that executes the above speech recognition processing as package software or online software. For example, an information processing apparatus can be made to function as the speech recognition device 10 by having it execute the above speech recognition program. Such information processing apparatuses include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants). The functions of the speech recognition device 10 may also be implemented in a cloud server.
 FIG. 4 is a diagram showing an example of a computer that executes the speech recognition program. The computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041, into which a removable storage medium such as a magnetic disk or an optical disk is inserted. A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example. A display 1061 is connected to the video adapter 1060, for example.
 The hard disk drive 1031 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. Each piece of information described in the above embodiment is stored, for example, in the hard disk drive 1031 or the memory 1010.
 The speech recognition program is stored in the hard disk drive 1031, for example, as a program module 1093 in which the commands to be executed by the computer 1000 are described. Specifically, the hard disk drive 1031 stores a program module 1093 describing each process executed by the speech recognition device 10 of the above embodiment.
 Data used for information processing by the speech recognition program is stored, for example, in the hard disk drive 1031 as program data 1094. The CPU 1020 reads the program module 1093 and the program data 1094 from the hard disk drive 1031 into the RAM 1012 as necessary and executes each of the procedures described above.
 The program module 1093 and program data 1094 for the speech recognition program are not limited to being stored in the hard disk drive 1031; they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like, or they may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.
 Although an embodiment applying the invention made by the present inventors has been described above, the present invention is not limited by the descriptions and drawings that form part of this disclosure. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art on the basis of this embodiment are all included within the scope of the present invention.
 10 Speech recognition device
 11 Input unit
 12 Output unit
 13 Communication control unit
 14 Storage unit
 14a Speech recognition model
 14b Spike point prediction model
 15 Control unit
 15a Acquisition unit
 15b Extraction unit
 15c Learning unit
 15d Generation unit
 15e Prediction learning unit
 15f Recognition unit

Claims (7)

  1.  A speech recognition method executed by a speech recognition device, the method comprising:
     an extraction step of extracting a series of feature amounts for each frame of a speech signal;
     a learning step of learning a speech recognition model using CTC (Connectionist Temporal Classification) with the extracted series of feature amounts;
     a generation step of generating, using the learned speech recognition model, a spike sequence that is a sequence of labels output as spikes by the speech recognition model; and
     a prediction learning step of learning, using the generated spike sequence, a spike point prediction model that predicts a time point at which a spike is output.
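The training flow of claim 1 (feature extraction, CTC training, spike-sequence generation, prediction learning) can be illustrated with a short sketch. The code below is not the patented implementation: it assumes a generic PyTorch-style acoustic model, and the AcousticModel architecture, tensor shapes, and blank-label index are hypothetical choices made only for illustration.

```python
# A minimal sketch of the claimed training flow, assuming PyTorch.
# The AcousticModel architecture, tensor shapes, and blank-label index
# are illustrative assumptions, not the patented implementation.
import torch
import torch.nn as nn

BLANK = 0  # assumed CTC blank label index


class AcousticModel(nn.Module):
    """Frame-synchronous CTC acoustic model (hypothetical architecture)."""

    def __init__(self, feat_dim=80, hidden=256, vocab=100):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h)               # logits: (batch, frames, vocab)


def train_ctc_step(model, feats, targets, feat_lens, target_lens, optim):
    """Learning step: one CTC training update."""
    logits = model(feats)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)  # (frames, batch, vocab)
    loss = nn.functional.ctc_loss(log_probs, targets, feat_lens, target_lens,
                                  blank=BLANK)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()


@torch.no_grad()
def spike_sequence(model, feats):
    """Generation step: mark the frames at which the trained CTC model
    emits a non-blank label (a "spike")."""
    best = model(feats).argmax(dim=-1).squeeze(0)   # (frames,)
    return (best != BLANK).long()                   # 1 = spike frame, 0 = blank
```

Because a CTC model tends to emit each label at a single frame and blanks elsewhere, the spike sequence generated this way provides per-frame supervision for the spike point prediction model addressed in the following claims.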
  2.  The speech recognition method according to claim 1, wherein the spike point prediction model is a binary classification model.
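One reading of claim 2 is that the spike point prediction model classifies each frame as spike or non-spike. A hedged sketch of such a binary classifier, trained on the spike sequences generated above, might look as follows; treating acoustic features as the predictor's input, the layer sizes, and the per-frame BCE loss are assumptions, not details stated in the claim.

```python
import torch
import torch.nn as nn


class SpikePointPredictor(nn.Module):
    """Binary (spike / non-spike) per-frame classifier; sizes are illustrative."""

    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, feats):                # feats: (batch, frames, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h).squeeze(-1)       # spike logits: (batch, frames)


def train_spike_predictor_step(predictor, feats, spike_labels, optim):
    """Prediction learning step: fit the predictor to the spike sequence
    generated by the trained CTC model (1 = spike frame, 0 = blank frame)."""
    logits = predictor(feats)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, spike_labels.float())
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```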
  3.  The speech recognition method according to claim 1, wherein, in the prediction learning step, a multitask learning model that integrates the spike point prediction model and the speech recognition model is learned using the generated spike sequence and the extracted series of feature amounts.
  4.  The speech recognition method according to claim 3, wherein the multitask learning model is an encoder-decoder model.
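Claims 3 and 4 describe training the spike point predictor and the CTC recognizer jointly as a single multitask model. The sketch below illustrates the joint objective of claim 3 with a simple shared encoder feeding a CTC head and a spike-prediction head; it does not reproduce the encoder-decoder architecture of claim 4, and the shared-encoder layout and interpolation weight alpha are assumptions made for illustration.

```python
import torch
import torch.nn as nn

BLANK = 0  # assumed CTC blank label index


class MultiTaskASR(nn.Module):
    """Shared encoder with a CTC head and a spike-prediction head.
    This layout is an assumption; the claim only requires that the
    speech recognition model and the spike point prediction model be
    integrated into one multitask learning model."""

    def __init__(self, feat_dim=80, hidden=256, vocab=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab)
        self.spike_head = nn.Linear(hidden, 1)

    def forward(self, feats):
        h, _ = self.encoder(feats)
        return self.ctc_head(h), self.spike_head(h).squeeze(-1)


def multitask_loss(model, feats, targets, feat_lens, target_lens,
                   spike_labels, alpha=0.5):
    """Joint objective: CTC loss plus a weighted spike-prediction loss
    (the interpolation weight alpha is an assumption)."""
    ctc_logits, spike_logits = model(feats)
    log_probs = ctc_logits.log_softmax(dim=-1).transpose(0, 1)
    ctc = nn.functional.ctc_loss(log_probs, targets, feat_lens, target_lens,
                                 blank=BLANK)
    bce = nn.functional.binary_cross_entropy_with_logits(
        spike_logits, spike_labels.float())
    return ctc + alpha * bce
```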
  5.  The speech recognition method according to claim 1, further comprising a recognition step of performing speech recognition by inputting, into the learned speech recognition model, the feature amounts of the speech signal at spike points predicted using the spike point prediction model.
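Claim 5 applies the spike point predictor at inference time so that only the frames predicted to carry spikes are passed through the recognizer, which is where the speed-up comes from. A hedged sketch of that recognition step is shown below; it reuses the model shapes assumed in the earlier sketches, and the 0.5 decision threshold and greedy CTC decoding are assumptions made for illustration.

```python
import torch


@torch.no_grad()
def recognize_with_spike_points(predictor, asr_model, feats,
                                threshold=0.5, blank=0):
    """Recognition step: keep only the frames that the spike point
    predictor marks as spikes, then decode that reduced sequence with
    the learned CTC speech recognition model."""
    spike_prob = torch.sigmoid(predictor(feats))   # (1, frames)
    keep = spike_prob.squeeze(0) > threshold       # (frames,) boolean mask
    reduced = feats[:, keep, :]                    # far fewer frames to decode
    best = asr_model(reduced).argmax(dim=-1).squeeze(0).tolist()
    # Greedy CTC collapse: drop blanks and merge repeated labels (assumption).
    hyp, prev = [], None
    for label in best:
        if label != blank and label != prev:
            hyp.append(label)
        prev = label
    return hyp
```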
  6.  A speech recognition device comprising:
     an extraction unit that extracts a series of feature amounts for each frame of a speech signal;
     a learning unit that learns a speech recognition model using CTC (Connectionist Temporal Classification) with the extracted series of feature amounts;
     a generation unit that generates, using the learned speech recognition model, a spike sequence that is a sequence of labels output as spikes by the speech recognition model; and
     a prediction learning unit that learns, using the generated spike sequence, a spike point prediction model that predicts a time point at which a spike is output.
  7.  A speech recognition program for causing a computer to execute:
     an extraction step of extracting a series of feature amounts for each frame of a speech signal;
     a learning step of learning a speech recognition model using CTC (Connectionist Temporal Classification) with the extracted series of feature amounts;
     a generation step of generating, using the learned speech recognition model, a spike sequence that is a sequence of labels output as spikes by the speech recognition model; and
     a prediction learning step of learning, using the generated spike sequence, a spike point prediction model that predicts a time point at which a spike is output.
PCT/JP2021/022414 2021-06-11 2021-06-11 Voice recognition method, voice recognition device, and voice recognition program WO2022259555A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/022414 WO2022259555A1 (en) 2021-06-11 2021-06-11 Voice recognition method, voice recognition device, and voice recognition program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/022414 WO2022259555A1 (en) 2021-06-11 2021-06-11 Voice recognition method, voice recognition device, and voice recognition program

Publications (1)

Publication Number Publication Date
WO2022259555A1 true WO2022259555A1 (en) 2022-12-15

Family

ID=84424561

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/022414 WO2022259555A1 (en) 2021-06-11 2021-06-11 Voice recognition method, voice recognition device, and voice recognition program

Country Status (1)

Country Link
WO (1) WO2022259555A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020112787A (en) * 2019-01-08 2020-07-27 バイドゥ オンライン ネットワーク テクノロジー (ベイジン) カンパニー リミテッド Real-time voice recognition method based on cutting attention, device, apparatus and computer readable storage medium
JP2021039219A (en) * 2019-09-02 2021-03-11 日本電信電話株式会社 Speech signal processing device, speech signal processing method, speech signal process program, learning device, learning method, and learning program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIROFUMI INAKUMA; MASATO MIMURA; TATSUYA KAWAHARA: "Speech recognition by streaming attention mechanism type sequence-to-sequence model", IPSJ SIG TECHNICAL REPORT, vol. 2020-SLP-131, no. 9, 6 February 2020 (2020-02-06), JP, pages 1 - 7, XP009535113 *

Similar Documents

Publication Publication Date Title
US10672380B2 (en) Dynamic enrollment of user-defined wake-up key-phrase for speech enabled computer system
CN111613212B (en) Speech recognition method, system, electronic device and storage medium
US20180357998A1 (en) Wake-on-voice keyword detection with integrated language identification
CN113327609B (en) Method and apparatus for speech recognition
KR20110038474A (en) Apparatus and method for detecting sentence boundaries
WO2014107356A1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
JPH09127978A (en) Voice recognition method, device therefor, and computer control device
CN110648691A (en) Emotion recognition method, device and system based on energy value of voice
CN112151003A (en) Parallel speech synthesis method, device, equipment and computer readable storage medium
CN114495904B (en) Speech recognition method and device
CN112331207A (en) Service content monitoring method and device, electronic equipment and storage medium
US11587567B2 (en) User utterance generation for counterfactual analysis and improved conversation flow
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
CN116601648A (en) Alternative soft label generation
WO2022259555A1 (en) Voice recognition method, voice recognition device, and voice recognition program
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
WO2023137175A1 (en) System and method for generating wrap up information
CN115171724A (en) Speech rate analysis method and system
CN113516964A (en) Speech synthesis method, readable storage medium, and computer program product
CN113763939B (en) Mixed voice recognition system and method based on end-to-end model
US20230107475A1 (en) Exploring Heterogeneous Characteristics of Layers In ASR Models For More Efficient Training
US20220310061A1 (en) Regularizing Word Segmentation
US20240185839A1 (en) Modular Training for Flexible Attention Based End-to-End ASR
US20230107493A1 (en) Predicting Word Boundaries for On-Device Batching of End-To-End Speech Recognition Models
CN113506561B (en) Text pinyin conversion method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21945224

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE