WO2021234905A1 - Learning data generation device, model learning device, learning data generation method, and program - Google Patents

Learning data generation device, model learning device, learning data generation method, and program Download PDF

Info

Publication number
WO2021234905A1
WO2021234905A1 (PCT/JP2020/020106)
Authority
WO
WIPO (PCT)
Prior art keywords
learning
audio signal
voice
data generation
signal
Prior art date
Application number
PCT/JP2020/020106
Other languages
French (fr)
Japanese (ja)
Inventor
孝典 芦原
雄介 篠原
義和 山口
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2020/020106 priority Critical patent/WO2021234905A1/en
Publication of WO2021234905A1 publication Critical patent/WO2021234905A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks


Abstract

Provided are a learning data generation device and related devices that generate learning data for learning an acoustic model that simulates the robustness of human speech perception. The learning data generation device generates learning data used when learning an acoustic model used in a speech recognition device. The learning data generation device includes a signal conversion unit that converts a first learning speech signal into a second learning speech signal, which is a speech signal capable of producing an auditory illusion.

Description

Learning data generation device, model learning device, learning data generation method, and program
The present invention relates to a learning data generation device that generates learning data used when learning an acoustic model used in a speech recognition device, a model learning device that uses the learning data, a learning data generation method, and a program.
In speech recognition devices that use an acoustic model, Patent Document 1 describes a technique for adapting the acoustic model to the task to be recognized in order to ensure a practical level of speech recognition performance. In other words, Patent Document 1 adapts an original acoustic model to tasks with different acoustic characteristics, such as speaker, noise type, and speaking style. In general, speech recognition performance rises or falls depending on the amount of learning data available for the target task and on its acoustic coverage. Therefore, the desired learning data is usually collected by gathering a sufficient amount of speech for the target task and transcribing it.
However, this conventional approach has the problem of requiring enormous financial and time costs.
Data augmentation is one solution to this problem. Data augmentation adds some variation to the original learning data to generate new learning data, thereby inflating the learning data. Data augmentation reduces repeated training on identical data and yields better generalization performance.
For example, in Non-Patent Document 1, data for various speakers is generated by changing the speaking rate of the original data, improving generalization performance for a wider range of speakers.
In Non-Patent Document 2, to improve noise robustness and recognition performance for reverberant speech, noise is superimposed on the original learning data, pseudo reverberant speech is generated by convolving the impulse response of a highly reverberant room, and the reverberant speech is superimposed on the original learning data to improve generalization performance.
Japanese Unexamined Patent Application Publication No. 2007-249051
Here, data augmentation for capturing long-term information in a speech recognition device is considered. First, speech recognition devices and long-term information are described. Many reports show that incorporating long-term information into a speech recognition device makes it more robust to various acoustic events and improves recognition accuracy.
For example, recurrent neural network (RNN) models, unlike multi-layer perceptron (MLP) models, are designed to explicitly capture long-term information within the model itself, and they have greatly improved accuracy on tasks that handle time-series information, such as speech recognition.
In Reference 1, speech recognition accuracy is improved by explicitly incorporating long linguistic context into an end-to-end speech recognition model.
(Reference 1) R. Masumura, T. Tanaka, T. Moriya, Y. Shinohara, T. Oba and Y. Aono, "Large Context End-to-end Automatic Speech Recognition via Extension of Hierarchical Recurrent Encoder-decoder Models", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 5661-5665.
As described above, there are various ways to incorporate long-term information into a speech recognition device by modifying the model. However, no existing approach obtains a speech recognition device that incorporates long-term information by modifying the learning data itself.
In the present invention, the learning data itself is modified. By learning an acoustic model with the learning data generated by the learning data generation device according to the present invention, it is possible to obtain the robustness of speech perception that humans acquire innately and through experience. In this embodiment, learning data is generated by performing a data augmentation process that exploits auditory illusions.
An object of the present invention is to provide a learning data generation device that generates learning data for learning an acoustic model that simulates the robustness of human speech perception, a model learning device that learns an acoustic model using that learning data, a learning data generation method, and a program.
To solve the above problem, according to one aspect of the present invention, a learning data generation device generates learning data used when learning an acoustic model used in a speech recognition device. The learning data generation device includes a signal conversion unit that converts a first learning speech signal into a second learning speech signal, which is a speech signal capable of producing an auditory illusion.
To solve the above problem, according to another aspect of the present invention, a learning data generation device generates learning data used when learning an acoustic model used in a speech recognition device. The learning data generation device includes a signal conversion unit that converts the first learning speech signal into a second learning speech signal by deleting a portion of the first learning speech signal along the time axis and embedding, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after the deleted portion.
To solve the above problem, according to yet another aspect of the present invention, a learning data generation device generates learning data used when learning an acoustic model used in a speech recognition device. The learning data generation device includes a signal conversion unit that converts the first learning speech signal into a second learning speech signal by dividing the first learning speech signal into waveform segments of a fixed short time window, reversing each waveform segment on the time axis, and concatenating the reversed waveform segments.
The present invention has the effect of making it possible to learn an acoustic model that simulates the robustness of human speech perception.
FIG. 1 is a functional block diagram of the model learning device according to the first embodiment.
FIG. 2 shows an example of the processing flow of the model learning device according to the first embodiment.
FIG. 3 shows an example of converting a speech signal so that the continuity illusion can be obtained.
FIG. 4 shows an example of converting a speech signal into locally time-reversed speech.
FIG. 5 shows a configuration example of a computer to which the present method is applied.
Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. In the following description, processing performed element by element on a vector or matrix applies to all elements of that vector or matrix unless otherwise specified.
<Points of the first embodiment>
In the present embodiment, a data augmentation process that exploits auditory illusions is performed so that the speech recognition device acquires the robustness of speech perception that humans possess.
An auditory illusion is a phenomenon in which, owing to the characteristics of human hearing, a physically presented sound stimulus is not necessarily perceived as it is; it can be regarded as the auditory counterpart of an optical illusion.
For example, in the continuity illusion, when a portion of a frequency-varying pure tone or of speech is deleted and noise sufficient to mask the original sound is superimposed on the deleted portion, the sound interval that was physically removed is perceived as if it had been restored (see Reference 2).
(Reference 2) R. M. Warren: "Perceptual restoration of missing speech sounds", Science, 167, pp. 392-393 (1970).
Locally time-reversed speech is speech produced by dividing a speech waveform into short time segments of fixed length, reversing the waveform of each segment on the time axis, and then concatenating the reversed segments again (see Reference 3).
(Reference 3) K. Saberi and D. R. Perrott, "Cognitive restoration of reversed speech", Nature, 398, 6730, pp. 760-760 (1999).
When a person listens to such locally time-reversed speech, intelligibility remains sufficiently high as long as the segment length is relatively short, for example around 25 ms. However, it has been shown experimentally that intelligibility decreases in a sigmoid-like manner as the segment length grows, and speech perception becomes almost impossible at around 100 ms. In other words, local disruption of the time series up to a certain extent does not affect human speech perception (perception is robust to it).
If an acoustic model is learned using speech signals that can produce such a continuity illusion or speech signals that constitute locally time-reversed speech, the model is necessarily learned over time intervals longer than the deleted or masked portion or the reversed segment. The acoustic model therefore incorporates long-term information and acquires the robustness of speech perception that humans possess.
In the present embodiment, by using such auditory-illusion speech waveforms as augmented data, it becomes possible to learn, from the learning data, an acoustic model that is robust with respect to long-term information. Among auditory illusions, this embodiment uses the continuity illusion and locally time-reversed speech described above.
With the continuity illusion, the speech recognition device can acquire the human robustness of perceiving speech even when part of it is missing; as a result, a speech recognition device that is robust with respect to long-term information is built.
With locally time-reversed speech, the speech recognition device can acquire the human robustness of perceiving speech even when the time series is locally reversed (disrupted); as a result, a speech recognition device that is robust with respect to long-term information is built.
<First embodiment>
FIG. 1 shows a functional block diagram of the model learning device according to the first embodiment, and FIG. 2 shows its processing flow.
The model learning device 100 includes a speech signal acquisition unit 110, a speech digital signal storage unit 120, a signal conversion unit 125, a speech digital signal storage unit 126, a feature analysis unit 130, a feature storage unit 140, and a learning unit 160.
The model learning device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The model learning device executes each process under the control of the central processing unit, for example. Data input to the model learning device and data obtained in each process are stored, for example, in the main storage device, and data stored in the main storage device is read into the central processing unit as needed and used for other processing. At least some of the processing units of the model learning device may be implemented by hardware such as integrated circuits. Each storage unit of the model learning device can be implemented, for example, by a main storage device such as RAM, by an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.
The model learning device receives an analog learning speech signal x(k) and the corresponding correct label r(j), learns an acoustic model based on this information, and outputs a trained acoustic model f. Here, k is an index indicating time. The correct label is, for example, a phoneme label, and j is an index indicating the order of phonemes. Information indicating which part of the analog speech signal each correct label (phoneme label) corresponds to is assumed to be included in the learning data in advance, and the speech digital signals and features described below are assumed to be linked to the labels in the same way.
The processing of each unit is described below.
<Speech signal acquisition unit 110>
Input: speech signal x(k)
Output: speech digital signal x(t)
Processing: A/D conversion
The speech signal acquisition unit 110 acquires the analog speech signal x(k) and converts it into a digital speech signal x(t) (S110). Here, t is an index indicating the sample number of the speech digital signal.
<Speech digital signal storage unit 120>
Input: speech digital signal x(t)
Processing: storage of the speech digital signal
The speech digital signal storage unit 120 stores the speech digital signal x(t) (S120).
<Signal conversion unit 125>
Input: speech digital signal x(t)
Output: speech digital signal r(t)
Processing: data augmentation
The signal conversion unit 125 performs a data augmentation process on the speech digital signal x(t) and converts x(t) into the speech digital signal r(t) (S125).
The data augmentation process in this embodiment converts the speech digital signal x(t) into the speech digital signal r(t) according to a conversion rule. This conversion generates pseudo, inflated learning data.
In this embodiment, a conversion rule is adopted such that the converted speech digital signal r(t) is a speech signal capable of producing an auditory illusion.
As conversion rules that make the converted speech digital signal r(t) a speech signal capable of producing an auditory illusion, this embodiment adopts rules that convert the signal into a speech signal yielding the continuity illusion described above, or into locally time-reversed speech.
(i) When converting into a speech signal that yields the continuity illusion, the signal conversion unit 125 deletes a portion (of fixed duration) of the speech digital signal x(t) along the time axis and embeds, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after the deleted portion, thereby converting x(t) into the speech digital signal r(t) (see FIG. 3). The lengths of the deleted and embedded portions are lengths that can produce the illusion, and the deletion and embedding are performed at intervals that can produce the illusion. The embedded noise is, for example, white noise, and it is prepared in advance, prior to process S125.
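Purely as an illustration (not part of the patent text), the following Python sketch shows one way such a deletion-and-noise-embedding conversion could be realized. The segment length, spacing, and noise scaling are hypothetical choices, and the per-frequency sound-pressure condition is approximated here by a broadband RMS level.

```python
import numpy as np

def continuity_illusion_augment(x, sr, seg_ms=50, period_ms=300, rng=None):
    """Delete short segments of x and fill them with white noise whose level
    is at or above that of the surrounding samples (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    r = x.astype(np.float64).copy()
    seg = int(sr * seg_ms / 1000)        # length of each deleted segment
    period = int(sr * period_ms / 1000)  # spacing between deletions
    for start in range(period, len(r) - seg, period):
        # broadband RMS of the samples just before and after the gap
        neighbors = np.concatenate([r[max(0, start - seg):start],
                                    r[start + seg:start + 2 * seg]])
        level = np.sqrt(np.mean(neighbors ** 2) + 1e-12)
        # replace the deleted segment with white noise at that level
        r[start:start + seg] = rng.normal(0.0, level, seg)
    return r
```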
(ii) When converting into locally time-reversed speech, the signal conversion unit 125 divides the speech digital signal x(t) into waveform segments of a fixed short time window, reverses each segment on the time axis, and concatenates the reversed segments, thereby converting x(t) into the speech digital signal r(t) (see FIG. 4). The segment length is a length that can produce the illusion.
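A minimal sketch of such local time reversal follows, again only as an illustration; the 25 ms window width is a hypothetical choice motivated by the intelligibility discussion above, not a value prescribed by the patent.

```python
import numpy as np

def locally_time_reverse(x, sr, win_ms=25):
    """Split x into fixed-width windows, reverse each window on the time
    axis, and concatenate the reversed windows (illustrative sketch)."""
    win = max(1, int(sr * win_ms / 1000))
    segments = [x[i:i + win][::-1] for i in range(0, len(x), win)]
    return np.concatenate(segments)
```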
<Speech digital signal storage unit 126>
Input: speech digital signal r(t)
Processing: storage of the speech digital signal
The speech digital signal storage unit 126 stores the speech digital signal r(t) (S126).
<Feature analysis unit 130>
Input: speech digital signals x(t), r(t)
Output: feature sequences X, R
Processing: feature analysis
The feature analysis unit 130 performs feature analysis on the speech digital signals x(t) and r(t) to obtain the feature sequences X and R.
For example, the feature analysis unit 130 retrieves the speech digital signal x(t) from the speech digital signal storage unit 120 and the speech digital signal r(t) from the speech digital signal storage unit 126, divides x(t) and r(t) into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequences X and R.
For example, let N be the total number of frames contained in the speech digital signal x(t), with n = 1, 2, ..., N; let M be the frame length, with m = 1, 2, ..., M; and let D be the shift width. Then the m-th sample of the n-th frame of x(t) can be written as x(D(n-1)+m). For each frame n, the feature analysis unit 130 extracts acoustic features from the samples x(D(n-1)+1), x(D(n-1)+2), ..., x(D(n-1)+M) and obtains the feature X(n). The feature analysis unit 130 processes all frames 1, 2, ..., N and obtains the feature sequence X = {X(1), X(2), ..., X(N)}. The feature analysis unit 130 performs the same processing on the speech digital signal r(t) to obtain the feature sequence R = {R(1), R(2), ..., R(N)}.
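As a hedged illustration of this indexing only (not from the patent), a framing routine with frame length M and shift width D might look like this:

```python
import numpy as np

def frame_signal(x, M, D):
    """Frame n (1-based) covers the samples x(D(n-1)+1), ..., x(D(n-1)+M)
    of the text; code indices are 0-based (illustrative sketch)."""
    if len(x) < M:
        return np.empty((0, M), dtype=x.dtype)
    N = 1 + (len(x) - M) // D   # number of complete frames
    return np.stack([x[(n - 1) * D:(n - 1) * D + M] for n in range(1, N + 1)])
```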
The extracted features include, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficient) based on short-time frame analysis of the speech signal, dynamic parameters such as ΔMFCC and ΔΔMFCC (its dynamic features), and power, Δpower, and ΔΔpower. CMN (cepstral mean normalization) may also be applied to the MFCCs. The features are not limited to MFCCs and power; parameters used for identifying special utterances (for example, autocorrelation peak values and group delay) may also be used.
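As one possible realization only, the sketch below assumes the librosa library (an assumption, not something the patent prescribes) and approximates the feature set described above; the number of coefficients and the normalization are hypothetical choices.

```python
import numpy as np
import librosa

def extract_features(x, sr, n_mfcc=12):
    """MFCCs (with CMN), their delta and delta-delta, and frame power terms
    stacked into a (frames, dims) feature sequence (illustrative sketch)."""
    y = x.astype(float)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc -= mfcc.mean(axis=1, keepdims=True)        # cepstral mean normalization
    d1 = librosa.feature.delta(mfcc)                 # ΔMFCC
    d2 = librosa.feature.delta(mfcc, order=2)        # ΔΔMFCC
    power = librosa.feature.rms(y=y)                 # frame power
    dp1 = librosa.feature.delta(power)               # Δpower
    dp2 = librosa.feature.delta(power, order=2)      # ΔΔpower
    return np.vstack([mfcc, d1, d2, power, dp1, dp2]).T
```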
<Feature storage unit 140>
Input: feature sequences X, R
Processing: storage of the feature sequences
The feature storage unit 140 stores the feature sequences X and R analyzed by the feature analysis unit 130 (S140).
<Learning unit 160>
Input: feature sequence X, feature sequence R, correct label r(j)
Output: trained acoustic model f
Processing: model learning
The learning unit 160 learns the acoustic model f using the feature sequence X, the feature sequence R, and the correct label r(j) (S160). The acoustic model f takes a feature sequence as input and outputs phoneme labels. GMM-HMM and DNN-HMM models are often used as acoustic models in speech recognition, and end-to-end speech recognition models have also been used in recent years. This embodiment places no particular restriction on the speech recognition model to be learned, so it may be a GMM/DNN-HMM or an end-to-end speech recognition model. The correct label r(j) corresponds to the analog learning speech signal x(k), and therefore also corresponds to the feature sequence X obtained from x(k) and to the feature sequence R obtained by converting X.
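As noted above, the patent does not fix the model family. Purely for illustration, the following sketch assumes a small frame-wise neural classifier in PyTorch and shows how both feature sequences can share the labels of the original utterance; the model, loss, and names are assumptions, not the patent's method.

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Frame-wise phoneme classifier used only to illustrate training on
    both the original (X) and augmented (R) feature sequences."""
    def __init__(self, feat_dim, n_phonemes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_phonemes))

    def forward(self, feats):            # feats: (frames, feat_dim)
        return self.net(feats)

def train_step(model, optimizer, X, R, labels):
    """One hypothetical training step: X and R use the same labels,
    as described in the text above."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = loss_fn(model(X), labels) + loss_fn(model(R), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```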
<Effect>
With the above configuration, it is possible to learn an acoustic model that simulates the robustness of human speech perception. Financial and time costs can also be reduced.
<Modification>
A configuration that excludes the feature analysis unit 130, the feature storage unit 140, and the learning unit 160 of the model learning device 100 is also referred to as a learning data generation device. That is, the learning data generation device includes the speech signal acquisition unit 110, the speech digital signal storage unit 120, the signal conversion unit 125, and the speech digital signal storage unit 126. The learning data generation device receives an analog learning speech signal x(k) and a correct label r(j), generates the speech digital signal x(t) and the speech digital signal r(t) from x(k), and outputs the combination of the speech digital signal x(t), the speech digital signal r(t), and the correct label r(j) as learning data. A data-flow sketch is given below.
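As an illustration of this data flow only, the sketch below reuses the two hypothetical conversion functions given earlier; the wiring and names are assumptions, not part of the patent.

```python
def generate_learning_data(x_digital, label, sr):
    """Return (x(t), r(t), correct label) combinations, one per conversion
    rule, reusing the sketches defined above (illustrative wiring)."""
    conversions = (lambda s: continuity_illusion_augment(s, sr),
                   lambda s: locally_time_reverse(s, sr))
    return [{"x": x_digital, "r": convert(x_digital), "label": label}
            for convert in conversions]
```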
The first embodiment assumes that the speech digital signal r(t) is a speech signal capable of producing an auditory illusion; however, experiments have shown that the same effect can be obtained even with speech signals that cannot produce an auditory illusion.
For example, in the first embodiment, to obtain the continuity illusion, the signal conversion unit 125 deletes a portion of the speech digital signal x(t) along the time axis and embeds, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after it, thereby converting x(t) into r(t). In this modification, the lengths of the deleted and embedded portions may be so long that the continuity illusion cannot occur, and the interval between the deletion and embedding operations may be so short that the continuity illusion cannot occur. Even with such data augmentation, an acoustic model with accuracy comparable to that of the first embodiment can be learned.
Also, for example, in the first embodiment, to convert into locally time-reversed speech, the signal conversion unit 125 divides the speech digital signal x(t) into waveform segments of a fixed short time window, reverses each segment on the time axis, and concatenates the reversed segments, thereby converting x(t) into r(t). In this modification, the length of the cut waveform segments may be so long that the auditory illusion cannot occur. Even with such data augmentation, an acoustic model with accuracy comparable to that of the first embodiment can be learned.
<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various processes described above may be executed not only sequentially in the order described but also in parallel or individually, depending on the processing capacity of the executing device or as needed. Other changes can be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processes described above can be carried out by loading a program for executing each step of the above method into the storage unit 2020 of the computer shown in FIG. 5 and operating the control unit 2010, the input unit 2030, the output unit 2040, and the like.
The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.
A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or transferred from a server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from a portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing function only through execution instructions and result acquisition, without transferring the program from the server computer to this computer. The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).
In this embodiment, the present device is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized in hardware.

Claims (8)

  1.  A learning data generation device that generates learning data used when learning an acoustic model used in a speech recognition device, the learning data generation device comprising:
      a signal conversion unit that converts a first learning speech signal into a second learning speech signal, the second learning speech signal being a speech signal capable of producing an auditory illusion.
  2.  A learning data generation device that generates learning data used when learning an acoustic model used in a speech recognition device, the learning data generation device comprising:
      a signal conversion unit that converts a first learning speech signal into a second learning speech signal by deleting a portion of the first learning speech signal along the time axis and embedding, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after the deleted portion.
  3.  A learning data generation device that generates learning data used when learning an acoustic model used in a speech recognition device, the learning data generation device comprising:
      a signal conversion unit that converts a first learning speech signal into a second learning speech signal by dividing the first learning speech signal into waveform segments of a fixed short time window, reversing each waveform segment on the time axis, and concatenating the reversed waveform segments.
  4.  A model learning device comprising the learning data generation device according to any one of claims 1 to 3, the model learning device further comprising:
      a learning unit that learns an acoustic model using a feature sequence obtained from the first learning speech signal, a feature sequence obtained from the second learning speech signal, and a correct label corresponding to the first learning speech signal.
  5.  A learning data generation method for generating learning data used when learning an acoustic model used in a speech recognition device, the method comprising:
      a signal conversion step of converting a first learning speech signal into a second learning speech signal, the second learning speech signal being a speech signal capable of producing an auditory illusion.
  6.  A learning data generation method for generating learning data used when learning an acoustic model used in a speech recognition device, the method comprising:
      a signal conversion step of converting a first learning speech signal into a second learning speech signal by deleting a portion of the first learning speech signal along the time axis and embedding, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after the deleted portion.
  7.  A learning data generation method for generating learning data used when learning an acoustic model used in a speech recognition device, the method comprising:
      a signal conversion step of converting a first learning speech signal into a second learning speech signal by dividing the first learning speech signal into waveform segments of a fixed short time window, reversing each waveform segment on the time axis, and concatenating the reversed waveform segments.
  8.  A program for causing a computer to function as the learning data generation device according to any one of claims 1 to 3 or as the model learning device according to claim 4.
PCT/JP2020/020106 2020-05-21 2020-05-21 Learning data generation device, model learning device, learning data generation method, and program WO2021234905A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/020106 WO2021234905A1 (en) 2020-05-21 2020-05-21 Learning data generation device, model learning device, learning data generation method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/020106 WO2021234905A1 (en) 2020-05-21 2020-05-21 Learning data generation device, model learning device, learning data generation method, and program

Publications (1)

Publication Number Publication Date
WO2021234905A1 true WO2021234905A1 (en) 2021-11-25

Family

ID=78708583

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/020106 WO2021234905A1 (en) 2020-05-21 2020-05-21 Learning data generation device, model learning device, learning data generation method, and program

Country Status (1)

Country Link
WO (1) WO2021234905A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016161823A (en) * 2015-03-03 2016-09-05 株式会社日立製作所 Acoustic model learning support device and acoustic model learning support method

Similar Documents

Publication Publication Date Title
KR102514990B1 (en) Synthesis of speech from text with the speech of the target speaker using neural networks
JP7243760B2 (en) Audio feature compensator, method and program
Wang et al. A multiobjective learning and ensembling approach to high-performance speech enhancement with compact neural network architectures
KR20060044629A (en) Isolating speech signals utilizing neural networks
CN113436609B (en) Voice conversion model, training method thereof, voice conversion method and system
Yuliani et al. Speech enhancement using deep learning methods: A review
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
Dendani et al. Speech enhancement based on deep AutoEncoder for remote Arabic speech recognition
CN112863489A (en) Speech recognition method, apparatus, device and medium
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
Mandel et al. Audio super-resolution using concatenative resynthesis
CN116368563A (en) Real-time packet loss concealment using deep-drawn networks
WO2021234905A1 (en) Learning data generation device, model learning device, learning data generation method, and program
Xu et al. Speaker Recognition Based on Long Short-Term Memory Networks
WO2021234904A1 (en) Training data generation device, model training device, training data generation method, and program
WO2021245771A1 (en) Training data generation device, model training device, training data generation method, model training method, and program
JP2016186516A (en) Pseudo-sound signal generation device, acoustic model application device, pseudo-sound signal generation method, and program
Zhao et al. Time Domain Speech Enhancement using self-attention-based subspace projection
JP7028311B2 (en) Learning audio data generator, its method, and program
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
WO2022082607A1 (en) Vocal track removal by convolutional neural network embedded voice finger printing on standard arm embedded platform
US11335321B2 (en) Building a text-to-speech system from a small amount of speech data
Song et al. Speaker-adaptive neural vocoders for parametric speech synthesis systems
Dahy et al. Dilated Multi-Activation Autoencoder to Improve the Performance of Sound Separation Mechanisms
Bakoria et al. Face Recognition and Speaker Recognition on the basis of their Facial Features and Skin Tones

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936691

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20936691

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP