WO2021245771A1 - Training data generation device, model training device, training data generation method, model training method, and program - Google Patents

Training data generation device, model training device, training data generation method, model training method, and program

Info

Publication number
WO2021245771A1
Authority
WO
WIPO (PCT)
Prior art keywords
learning
feature
signal
series
audio
Prior art date
Application number
PCT/JP2020/021699
Other languages
French (fr)
Japanese (ja)
Inventor
孝典 芦原
雄介 篠原
義和 山口
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2020/021699
Publication of WO2021245771A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/16 Speech classification or search using artificial neural networks

Definitions

  • The present invention relates to a learning data generation device that generates learning data used when training an acoustic model used in a speech recognition device, a model learning device that uses the learning data, a learning data generation method, a model learning method, and a program.
  • Patent Document 1 describes a technique for adapting an acoustic model to the task to be recognized in order to ensure a practical level of speech recognition performance.
  • In other words, Patent Document 1 describes a technique for adapting an original acoustic model to a task whose acoustic characteristics, such as speaker, noise type, and speaking style, differ from the original data.
  • Speech recognition performance generally rises or falls depending on the amount of learning data available for the target task and on its acoustic coverage; therefore, desired learning data are usually collected by gathering a sufficient amount of speech for the target task and transcribing it.
  • Data Augmentation is one solution to this problem.
  • Data augmentation adds some variation to the original learning data to generate new learning data, thereby inflating the training set; this reduces repeated training on identical samples and yields better generalization performance.
  • In Non-Patent Document 1, various speaker data are generated by changing the speaking rate of the original data, improving generalization to a wider range of speakers.
  • In Non-Patent Document 2, to improve noise robustness and recognition performance for reverberant speech, noise is superimposed on the original learning data, pseudo reverberant speech is generated by convolving the impulse response of a highly reverberant room, and the reverberant speech is superimposed on the original learning data to improve generalization performance.
  • Unlike the multi-layer perceptron (MLP) model, the recurrent neural network (RNN) model is designed to explicitly capture long-term information, and it greatly improved accuracy on tasks that handle time-series information, such as speech recognition.
  • Speech recognition accuracy has also been improved by explicitly incorporating long-range linguistic context into an end-to-end speech recognition model.
  • In the present invention, the learning data itself is devised.
  • Learning data are generated by executing a data augmentation process that exploits auditory illusions.
  • The model learning device according to the present invention further improves the performance of the acoustic model through curriculum learning based on the perceptual intensity of the auditory illusion.
  • The learning data generation device according to the present invention generates learning data suitable for curriculum learning based on perceptual intensity.
  • The present invention provides a learning data generation device that generates learning data for training an acoustic model that simulates the robustness of human speech perception, a model learning device that trains an acoustic model using the learning data, a learning data generation method, a model learning method, and a program.
  • The learning data generation device generates learning data used when training the acoustic model used in the speech recognition device.
  • With Q being an integer of 2 or more, the learning data generation device includes a signal conversion unit that converts a first learning audio signal into second learning audio signals, namely Q-1 audio signals with different perceptual intensities.
  • Among the Q-1 second learning audio signals, at least the second learning audio signal with the lowest perceptual intensity is an audio signal that can induce an auditory illusion.
  • According to another aspect, the learning data generation device generates learning data used when training an acoustic model used in a speech recognition device.
  • With Q being an integer of 2 or more, the device includes a feature conversion unit that converts a first feature sequence, which is the acoustic feature sequence obtained from a first learning audio signal, into Q-1 second feature sequences.
  • The Q-1 second learning audio signals corresponding to the Q-1 second feature sequences are Q-1 audio signals with different perceptual intensities, and at least the audio signal with the lowest perceptual intensity among them is an audio signal that can induce an auditory illusion.
  • the present invention has the effect of being able to learn an acoustic model that simulates the robustness of human speech perception.
  • FIG. 1 is a functional block diagram of the model learning device according to the first embodiment; FIG. 2 shows an example of its processing flow.
  • FIG. 5 is a functional block diagram of the model learning device according to the second embodiment; FIG. 6 shows an example of its processing flow.
  • <Points of the first embodiment> In this embodiment, a data augmentation process using auditory illusions is executed so that the speech recognition device acquires the robustness of speech perception that humans possess.
  • An auditory illusion is an illusory phenomenon in which, owing to the characteristics of human hearing, a physically presented sound stimulus is not necessarily perceived as it is; it can be regarded as the auditory counterpart of an optical illusion.
  • If an acoustic model is trained using audio signals that can produce such a continuity illusion, or locally time-reversed audio signals, the model is naturally trained to take into account time intervals longer than the deleted or masked portion, or longer than the reversed segment; the acoustic model thereby incorporates long-term information and acquires the robustness of speech perception that humans possess.
  • By using such auditory-illusion speech waveforms as augmented data, an acoustic model that is robust with respect to long-term information can be trained from the learning data.
  • Among auditory illusions, this embodiment uses the continuity illusion and locally time-reversed speech mentioned above.
  • With time-reversed speech, the speech recognition device can acquire the human robustness of perceiving speech even when the time series is locally reversed (destroyed); as a result, a speech recognition device that is robust to long-term information is constructed.
  • Curriculum learning is a method in which the difficulty of each learning data sample is determined in advance according to some criterion, and training proceeds gradually from easy samples to difficult samples. It has been shown experimentally that curriculum learning accelerates convergence to the optimal solution and leads to a better local optimum (see Reference 4).
  • Curriculum learning can be realized by defining the perceptual intensity as the difficulty of the learning data. Controlling the perceptual intensity means controlling how easy or difficult the speech is to perceive: the lower the perceptual intensity, the easier the speech is to perceive, and the less difficult the learning data sample.
  • FIG. 1 shows a functional block diagram of the model learning device according to the first embodiment
  • FIG. 2 shows a processing flow thereof.
  • the model learning device 100 includes a voice signal acquisition unit 110, a voice digital signal storage unit 120, a signal conversion unit 125, a voice digital signal storage unit 126, a feature amount analysis unit 130, a feature amount storage unit 140, and a learning unit 160.
  • The model learning device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory).
  • The model learning device executes each process under the control of the central processing unit, for example.
  • The data input to the model learning device and the data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used for other processing.
  • At least a part of each processing unit of the model learning device may be implemented by hardware such as an integrated circuit.
  • Each storage unit of the model learning device can be implemented, for example, by a main storage device such as RAM (Random Access Memory), an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.
  • The model learning device takes an analog learning audio signal x(k) and the corresponding correct label r(j) as input, trains an acoustic model based on this information, and outputs the trained acoustic model f.
  • k is an index indicating time.
  • The correct label is, for example, a phoneme label, and j is an index indicating the order of the phonemes.
  • Information indicating which portion of the analog audio signal each correct label (phoneme label) corresponds to is assumed to be included in the learning data in advance, and the audio digital signal and the features described below are assumed to be linked in the same way.
  • q is an index indicating the perceptual intensity; the larger q is, the higher the perceptual intensity. Note that the easier the speech in a learning sample is to perceive, the lower its perceptual intensity.
  • The perceptual intensity is set to Q levels.
  • The signal conversion unit 125 generates Q-1 audio digital signals r_q(t) from the same original audio digital signal x(t) by changing only the length of the time segment to be deleted for the continuity illusion, or the length of the segment to be reversed for locally time-reversed speech (hereinafter simply referred to as the segment length).
  • A conversion rule is adopted such that the converted audio digital signal r_q(t) is an audio signal that can induce an auditory illusion.
  • Specifically, a conversion rule is adopted that converts the signal into an audio signal from which the continuity illusion described above can be obtained, or into a locally time-reversed audio signal.
  • When converting to an audio signal from which the continuity illusion can be obtained, the signal conversion unit 125 deletes a portion (of a certain time length) of the audio digital signal x(t) along the time axis and embeds in the deleted portion noise whose sound pressure at each frequency is equal to or higher than that of the signal before and after the deletion, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t) (see FIG. 3).
  • The lengths of the deleted portion and the embedded portion are lengths that can induce the illusion, and Q-1 different lengths are applied.
  • The deletion and embedding are performed at intervals that can induce the auditory illusion.
  • The noise to be embedded is, for example, white noise.
  • The noise is prepared in advance, prior to process S125. For example, by setting the length of the deleted and embedded portion to 100 ms, 200 ms, and 300 ms, three audio digital signals r_q(t) with different perceptual intensities are obtained from the audio digital signal x(t).
  • When converting to a locally time-reversed audio signal, the signal conversion unit 125 cuts the audio digital signal x(t) into waveform segments of a fixed short time window, reverses each waveform segment on the time axis, and concatenates the reversed segments, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t) (see FIG. 4).
  • The segment length used for cutting is a length that can induce the illusion, and Q-1 different lengths are applied. For example, by setting the length of the reversed segments to 20 ms, 40 ms, and 60 ms, three audio digital signals r_q(t) with different perceptual intensities are obtained from the audio digital signal x(t).
  • The audio digital signal x(t) is converted into an audio signal from which the continuity illusion can be obtained, into a locally time-reversed audio signal, or into both.
  • The feature analysis unit 130 takes the audio digital signal x(t) from the audio digital signal storage unit 120 and the audio digital signal r_q(t) from the audio digital signal storage unit 126, divides x(t) and r_q(t) into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequences X and R_q.
  • The m-th sample of the audio digital signal x(t) in the n-th frame can be expressed as x(D(n-1)+m).
  • For each frame n, the feature analysis unit 130 extracts acoustic features from the audio digital signal samples x(D(n-1)+1), x(D(n-1)+2), ..., x(D(n-1)+M) and obtains the feature X(n).
  • The extracted features include, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficients) based on short-time frame analysis of the audio signal, dynamic parameters such as ΔMFCC and ΔΔMFCC, which are their dynamic features, as well as power, Δpower, and ΔΔpower. CMN (cepstral mean normalization) processing may also be applied to the MFCC.
  • The features are not limited to MFCC and power; parameters used for identifying special utterances (for example, the autocorrelation peak value and group delay) may also be used.
  • The acoustic model f is a model that takes a feature sequence as input and outputs a phoneme label.
  • GMM-HMM and DNN-HMM models are often used as acoustic models in speech recognition, and in recent years end-to-end speech recognition models have also been used. Since there are no restrictions on the speech recognition model here, it may be either a GMM/DNN-HMM or an end-to-end speech recognition model.
  • The correct label r(j) corresponds to the analog learning audio signal x(k); it also corresponds to the feature sequence X obtained from the audio signal x(k) and to the feature sequences R_q obtained by conversion.
  • A configuration that does not include the feature analysis unit 130, the feature storage unit 140, and the learning unit 160 of the model learning device 100 is also referred to as a learning data generation device. That is, the learning data generation device includes the audio signal acquisition unit 110, the audio digital signal storage unit 120, the signal conversion unit 125, and the audio digital signal storage unit 126.
  • The learning data generation device takes an analog learning audio signal x(k) and a correct label r(j) as input, generates the audio digital signal x(t) and the audio digital signals r_q(t) from the audio signal x(k), and outputs the combination of the audio digital signal x(t), the audio digital signals r_q(t), and the correct label r(j) as learning data.
  • In this embodiment, the audio digital signal r_q(t) is an audio signal that can induce an auditory illusion; however, experiments showed that the same effect can be obtained with audio signals that cannot induce an auditory illusion. Therefore, an audio signal that cannot induce an auditory illusion may be prepared as learning data with high perceptual intensity.
  • In other words, it suffices that at least the audio digital signal r_2(t) with the lowest perceptual intensity among the Q-1 audio digital signals r_q(t) is an audio signal that can induce an auditory illusion.
  • In this case, the signal conversion unit 125 deletes a portion of the audio digital signal x(t) along the time axis and embeds in the deleted portion noise whose sound pressure at each frequency is equal to or higher than that of the signal before and after the deletion, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t).
  • The lengths of the deleted portion and the embedded portion may be so long that the continuity illusion does not occur.
  • The interval between the deletion and embedding processes may be so short that the continuity illusion does not occur. Even when such data augmentation is executed, an acoustic model with accuracy comparable to that of the first embodiment can be trained.
  • Similarly, the signal conversion unit 125 may cut the audio digital signal x(t) into waveform segments of a fixed short time window, reverse each waveform segment on the time axis, and concatenate the reversed segments to convert the audio digital signal x(t) into the audio digital signal r_q(t).
  • The length of the cut waveform segments may be so long that the illusion does not occur. Even when such data augmentation is executed, an acoustic model with accuracy comparable to that of the first embodiment can be trained.
  • Above, a conversion method was illustrated that generates Q-1 audio digital signals r_q(t) from the same original audio digital signal x(t) by changing only the length of the time segment to be deleted or the length of the segment to be reversed (hereinafter simply referred to as the segment length); however, the signal conversion unit 125 may perform the conversion in any way, as long as it can convert the original audio digital signal into Q-1 audio digital signals with different perceptual intensities.
  • v_qs and v_qe indicate the sample numbers of the first and last samples of the audio digital signal in the q-th time interval, respectively.
  • Alternatively, the utterances contained in the original audio digital signal x(t) may be divided into Q-1 groups, and a segment length may be set for each group to generate the audio digital signals r_q(t).
  • In the second embodiment, the data augmentation process using auditory illusions is not executed on the speech waveform as in the first embodiment but on the feature space; this also makes it possible to construct a speech recognition device that is robust to long-term information from the learning data.
  • When data augmentation is executed on the waveform, the amount of learning data simply becomes Q times larger, and storing the data requires Q times the capacity, including the original data. By executing the data augmentation process on the feature space, the features serving as learning data can be converted during training, so only the capacity for the original learning data is needed.
  • With the continuity illusion, the speech recognition device can acquire the human robustness of perceiving speech even when a portion is missing; as a result, a speech recognition device that is robust to long-term information is constructed.
  • Similar expressions are possible in the feature space. For example, by deleting a certain segment on the time axis of the feature sequence and embedding in that segment values larger than the features before and after it, an expression equivalent to the continuity illusion is obtained.
  • With time-reversed speech, the speech recognition device can acquire the human robustness of perceiving speech even when the time series is locally reversed (destroyed); as a result, a speech recognition device that is robust to long-term information is constructed.
  • In the feature space, the feature sequence is reversed within each segment on the time axis, and all segments are re-concatenated; the result is used as augmented data.
  • FIG. 5 shows a functional block diagram of the model learning device according to the second embodiment
  • FIG. 6 shows a processing flow thereof.
  • the model learning device 100 includes a voice signal acquisition unit 110, a voice digital signal storage unit 120, a feature amount analysis unit 130, a feature amount storage unit 140, a feature amount conversion unit 150, and a learning unit 160.
  • the processing contents of the audio signal acquisition unit 110 and the audio digital signal storage unit 120 are the same as those in the first embodiment.
  • The feature analysis unit 130 takes the audio digital signal x(t) for each utterance p from the audio digital signal storage unit 120, divides it into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequence X(p) for each utterance p.
  • The m-th sample of the audio digital signal x(t) in the n_p-th frame of an utterance p can be expressed as x(D(n_p-1)+m).
  • The subscript p indicates that the value corresponds to the utterance p.
  • The extracted features are the same as in the first embodiment.
  • <Feature storage unit 140> Input: feature sequence X(p). Processing: accumulation of feature sequences. The feature storage unit 140 accumulates the feature sequences X(p) analyzed by the feature analysis unit 130 (S140).
  • The feature conversion unit 150 executes the data augmentation process on the feature sequence X(p) and converts the feature sequence X(p) into the feature sequences R_q(p) (S150).
  • The feature conversion unit 150 converts the feature sequence X(p) into Q-1 feature sequences R_q(p) with different perceptual intensities.
  • For example, the feature conversion unit 150 generates the Q-1 feature sequences R_q(p) from the same original feature sequence X(p) by changing only the segment length.
  • When a feature sequence X(p') corresponding to a certain utterance p' (p' being one of 1, 2, ..., P) is used for training, the data augmentation process is executed and the feature sequence X(p') is converted into the feature sequences R_q(p').
  • P represents the total number of utterances contained in the analog learning audio signal x(k).
  • Since the inflated learning data are used only during training and do not need to be stored, the amount of learning data to be stored can be reduced. Because the input is a feature sequence, all data augmentation processing is performed on the feature space, and no data augmentation processing needs to be performed on the audio digital signal.
  • A conversion rule is adopted such that the audio signal corresponding to the converted feature sequence R_q(p) is an audio signal that can induce an auditory illusion.
  • In the first embodiment, the processing is performed on the audio waveform, whereas in the present embodiment the conversion processing is performed on the feature sequence. Specifically, a conversion rule is adopted that converts the feature sequence into a feature sequence corresponding to an audio signal from which the continuity illusion can be obtained, or to a locally time-reversed audio signal.
  • When converting to a feature sequence corresponding to an audio signal from which the continuity illusion can be obtained, the feature conversion unit 150 deletes a segment of the feature sequence X(p) and embeds features whose values are equal to or greater than the feature values before and after the deleted segment, thereby converting the feature sequence X(p) into the feature sequence R_q(p) (a rough sketch of this feature-space conversion is given after this list).
  • The segment length is a length that can induce the illusion, and Q-1 different lengths are applied.
  • The deletion and embedding are performed at intervals that can induce the auditory illusion.
  • The embedded features are features corresponding to noise, the noise being, for example, white noise.
  • Features corresponding to noise are prepared in advance.
  • For example, from the feature sequence ..., X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p), X(s+6_p), X(s+7_p), X(s+8_p), X(s+9_p), X(s+10_p), ..., the three features X(s+3_p), X(s+4_p), X(s+5_p) are deleted, and three features X(1_n), X(2_n), X(3_n) corresponding to noise are embedded (see FIG. 7). The values of X(1_n), X(2_n), X(3_n) are set to be equal to or greater than the values of the preceding feature X(s+2_p) and the following feature X(s+6_p). This processing is performed, for example, every 20 frames. By changing the length of the deleted and embedded portion from 3 frames to 4 and 5 frames, three feature sequences R_q(p) with different perceptual intensities are obtained from the feature sequence X(p).
  • When converting to a feature sequence corresponding to a locally time-reversed audio signal, the feature conversion unit 150 divides the feature sequence X(p) into segments of a predetermined time length, reverses the features within each segment in time, and concatenates the reversed features, thereby converting the feature sequence X(p) into the feature sequence R_q(p).
  • The segment length is a length that can induce the illusion, and Q-1 different lengths are applied.
  • For example, the feature conversion unit 150 reverses the feature sequence ..., X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p), ... in time within each segment and concatenates the results.
  • The feature sequence X(p) is converted into a feature sequence corresponding to an audio signal from which the continuity illusion can be obtained, into a feature sequence corresponding to a locally time-reversed audio signal, or into both.
  • The processing content of the learning unit 160 is the same as in the first embodiment.
  • A configuration that does not include the learning unit 160 of the model learning device 100 is also referred to as a learning data generation device. That is, the learning data generation device includes the audio signal acquisition unit 110, the audio digital signal storage unit 120, the feature analysis unit 130, the feature storage unit 140, and the feature conversion unit 150.
  • The learning data generation device takes an analog learning audio signal x(k) and a correct label r(j) as input, generates the feature sequence X(p) and the feature sequences R_q(p) from the audio signal x(k), and outputs the combination of the feature sequence X(p), the feature sequences R_q(p), and the correct label r(j) as learning data.
  • In this embodiment, the audio signal corresponding to the feature sequence R_q(p) is an audio signal that can induce an auditory illusion; however, experiments showed that the same effect can be obtained with audio signals that cannot induce an auditory illusion. Therefore, an audio signal that cannot induce an auditory illusion may be prepared as learning data with high perceptual intensity. In other words, it suffices that at least the audio signal with the lowest perceptual intensity among the Q-1 audio signals corresponding to the Q-1 feature sequences R_q(p) is an audio signal that can induce an auditory illusion.
  • In this case, the feature conversion unit 150 deletes a segment of the feature sequence X(p) and embeds features whose values are equal to or greater than the feature values before and after the deleted segment, thereby converting the feature sequence X(p) into the feature sequence R_q(p).
  • The segment length of the deleted or embedded segment may be so long that the continuity illusion does not occur.
  • The interval between the deletion and embedding processes may be so short that the continuity illusion does not occur. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the second embodiment can be trained.
  • Similarly, to convert the feature sequence X(p) into a sequence corresponding to locally time-reversed speech, the feature conversion unit 150 divides the feature sequence X(p) into segments of a predetermined time length, reverses the features within each segment in time, and concatenates the reversed features to convert the feature sequence X(p) into the feature sequence R_q(p).
  • The segment length may be so long that the illusion does not occur. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the second embodiment can be trained.
  • Above, a conversion method was illustrated that generates Q-1 feature sequences from the same original feature sequence X(p) by changing only the length of the time segment to be deleted or the length of the segment to be reversed (hereinafter simply referred to as the segment length); however, as with the signal conversion unit 125 of the first embodiment, the feature conversion unit 150 may perform the conversion in any way, as long as it can convert the original feature sequence into Q-1 feature sequences with different perceptual intensities.
  • For example, one feature sequence R_q may be generated from one time interval V(q) = {X(v_qs), ..., X(v_qe)}.
  • v qs and v qe indicate the first and last frame numbers of the qth time interval, respectively.
  • Alternatively, the feature sequences X(p) corresponding to the original P utterances may be divided into Q-1 groups, the same segment length may be set within each group, and the feature sequences R_q(p) may be generated.
  • The program describing this processing content can be recorded on a computer-readable recording medium.
  • The computer-readable recording medium may be, for example, a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
  • The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes the processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the processing according to it, or it may execute processing according to the received program each time the program is transferred from the server computer to the computer.
  • ASP (Application Service Provider)
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
  • In the present embodiment, the device is configured by executing a predetermined program on a computer, but at least a part of the processing content may be realized by hardware.
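
As a rough illustration of the feature-space augmentation described in the list above (second embodiment), the following sketch shows the two conversions on a feature sequence: deleting a block of frames and embedding noise-like frames whose values are at least as large as the neighboring frames (continuity-illusion style), and reversing the frames within each segment (locally time-reversed style). The frame counts, the noise construction, and the function names are illustrative assumptions, not the publication's implementation.

```python
import numpy as np

def feature_continuity_augment(X, start=10, del_frames=3, seed=0):
    """X: feature sequence of shape (num_frames, feature_dim).
    Delete del_frames frames starting at `start` and embed noise-like frames whose
    values are >= the features just before and after the deleted block
    (feature-space counterpart of the continuity illusion; illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Y = X.copy()
    end = min(start + del_frames, len(Y))
    # Element-wise upper envelope of the neighbouring frames, so the embedded
    # frames are at least as large as the features before and after the block.
    bound = np.maximum(Y[max(0, start - 1)], Y[min(end, len(Y) - 1)])
    Y[start:end] = bound + np.abs(rng.normal(0.0, 0.1, (end - start, X.shape[1])))
    return Y

def feature_local_reverse(X, seg_frames=5):
    """Reverse the frame order within each fixed-length segment and re-concatenate
    (feature-space counterpart of locally time-reversed speech)."""
    pieces = [X[i:i + seg_frames][::-1] for i in range(0, len(X), seg_frames)]
    return np.concatenate(pieces, axis=0)
```

Applying these with Q-1 different values of del_frames or seg_frames (for example 3, 4, and 5 frames, as in the example above) would give feature sequences R_q(p) with different perceptual intensities.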

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Provided is a training data generation device, and related methods and programs, that generate training data for training an acoustic model that simulates the robustness of human speech perception. With Q being an integer of 2 or more, the training data generation device includes a signal conversion unit that converts a first training audio signal into second training audio signals, namely Q-1 audio signals with different perceptual intensities. Among the Q-1 second training audio signals, at least the second training audio signal with the lowest perceptual intensity is an audio signal capable of inducing auditory illusions.

Description

Training data generation device, model training device, training data generation method, model training method, and program
 The present invention relates to a learning data generation device that generates learning data used when training an acoustic model used in a speech recognition device, a model learning device that uses the learning data, a learning data generation method, a model learning method, and a program.
 In speech recognition devices using acoustic models, Patent Document 1 describes a technique for adapting the acoustic model to the task to be recognized in order to ensure a practical level of speech recognition performance. In other words, Patent Document 1 describes a technique for adapting an original acoustic model to a task whose acoustic characteristics, such as speaker, noise type, and speaking style, differ. In general, speech recognition performance rises or falls depending on the amount of learning data for the target task and on its acoustic coverage. Therefore, desired learning data are usually collected by gathering a sufficient amount of speech for the target task and transcribing it.
 However, this conventional approach has the problem of requiring enormous financial and time costs.
 Data Augmentation is one solution to this problem. Data augmentation adds some variation to the original learning data to generate new learning data, thereby inflating the training set. This reduces repeated training on identical samples and yields better generalization performance.
 For example, in Non-Patent Document 1, various speaker data are generated by changing the speaking rate of the original data, improving generalization to a wider range of speakers.
 In Non-Patent Document 2, to improve noise robustness and recognition performance for reverberant speech, noise is superimposed on the original learning data, pseudo reverberant speech is generated by convolving the impulse response of a highly reverberant room, and the reverberant speech is superimposed on the original learning data to improve generalization performance.
Patent Document 1: Japanese Unexamined Patent Publication No. 2007-249051
 Here, data augmentation that enables a speech recognition device to capture long-term information is considered. First, speech recognition devices and long-term information are explained. There are many reports that incorporating long-term information into a speech recognition device makes it robust to various acoustic events and improves speech recognition accuracy.
 For example, unlike the multi-layer perceptron (MLP) model, the recurrent neural network (RNN) model is designed to explicitly capture long-term information, and it greatly improved accuracy on tasks that handle time-series information, such as speech recognition.
 In Reference 1, speech recognition accuracy is improved by explicitly incorporating long-range linguistic context into an end-to-end speech recognition model.
(Reference 1) R. Masumura, T. Tanaka, T. Moriya, Y. Shinohara, T. Oba and Y. Aono, "Large Context End-to-end Automatic Speech Recognition via Extension of Hierarchical Recurrent Encoder-decoder Models", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 5661-5665.
 As described above, there are various methods for incorporating long-term information into a speech recognition device by devising the model. However, no approach has so far existed that obtains a speech recognition device incorporating long-term information by devising the learning data itself.
 In the present invention, the learning data itself is devised. By training an acoustic model with learning data generated by the learning data generation device according to the present invention, it is possible to acquire the robustness to speech perception that humans acquire innately or through experience. In this embodiment, learning data are generated by executing a data augmentation process that exploits auditory illusions. Furthermore, the model learning device according to the present invention further improves the performance of the acoustic model through curriculum learning based on the perceptual intensity of the auditory illusion. The learning data generation device according to the present invention generates learning data suitable for curriculum learning based on perceptual intensity.
 An object of the present invention is to provide a learning data generation device that generates learning data for training an acoustic model that simulates the robustness of human speech perception, a model learning device that trains an acoustic model using the learning data, a learning data generation method, a model learning method, and a program.
 To solve the above problem, according to one aspect of the present invention, a learning data generation device generates learning data used when training the acoustic model used in a speech recognition device. With Q being an integer of 2 or more, the learning data generation device includes a signal conversion unit that converts a first learning audio signal into second learning audio signals, namely Q-1 audio signals with different perceptual intensities; among the Q-1 second learning audio signals, at least the second learning audio signal with the lowest perceptual intensity is an audio signal that can induce an auditory illusion.
 To solve the above problem, according to another aspect of the present invention, a learning data generation device generates learning data used when training an acoustic model used in a speech recognition device. With Q being an integer of 2 or more, the learning data generation device includes a feature conversion unit that converts a first feature sequence, which is the acoustic feature sequence obtained from a first learning audio signal, into Q-1 second feature sequences; the Q-1 second learning audio signals corresponding to the Q-1 second feature sequences are Q-1 audio signals with different perceptual intensities, and at least the audio signal with the lowest perceptual intensity among them is an audio signal that can induce an auditory illusion.
 According to the present invention, an acoustic model that simulates the robustness of human speech perception can be trained.
FIG. 1 is a functional block diagram of the model learning device according to the first embodiment.
FIG. 2 shows an example of the processing flow of the model learning device according to the first embodiment.
FIG. 3 shows an example of conversion into an audio signal from which the continuity illusion can be obtained.
FIG. 4 shows an example of conversion into a locally time-reversed audio signal.
FIG. 5 is a functional block diagram of the model learning device according to the second embodiment.
FIG. 6 shows an example of the processing flow of the model learning device according to the second embodiment.
FIG. 7 shows an example of conversion into a feature sequence corresponding to an audio signal from which the continuity illusion can be obtained.
FIG. 8 shows an example of conversion into a feature sequence corresponding to a locally time-reversed audio signal.
FIG. 9 shows a configuration example of a computer to which the present method is applied.
 Hereinafter, embodiments of the present invention will be described. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. In the following description, processing performed for each element of a vector or matrix is applied to all elements of that vector or matrix unless otherwise specified.
<Points of the first embodiment>
 In this embodiment, a data augmentation process using auditory illusions is executed so that the speech recognition device acquires the robustness of speech perception that humans possess.
 An auditory illusion is an illusory phenomenon in which, owing to the characteristics of human hearing, a physically presented sound stimulus is not necessarily perceived as it is; it can be regarded as the auditory counterpart of an optical illusion.
 For example, in the continuity illusion, a portion of a frequency-varying pure tone or of speech is deleted, and noise that sufficiently masks the original sound is superimposed on the deleted portion; the sound interval that should physically be missing is then perceived as if it were restored (see Reference 2).
(Reference 2) R. M. Warren: "Perceptual restoration of missing speech sounds", Science, 167, pp. 392-393 (1970).
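
As a rough illustration only (not the publication's own procedure), the following Python sketch shows one way a continuity-illusion-style stimulus could be produced from a waveform: a short interval is deleted and filled with white noise loud enough to mask it. The sample rate, gap position, gap length, and noise margin are illustrative assumptions.

```python
import numpy as np

def continuity_illusion_augment(x, fs=16000, start_s=0.5, gap_ms=200, margin_db=6.0, seed=0):
    """Delete a gap_ms-long interval from waveform x (sampled at fs) and fill it
    with white noise whose RMS level exceeds the surrounding signal by margin_db,
    so that the deleted portion is plausibly masked (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y = x.astype(np.float64).copy()
    start = int(start_s * fs)
    length = int(gap_ms / 1000.0 * fs)
    end = min(start + length, len(y))

    # Estimate the local signal level just before and after the deleted interval.
    context = np.concatenate([y[max(0, start - length):start], y[end:end + length]])
    rms = np.sqrt(np.mean(context ** 2)) if context.size else 1e-3

    # White noise a few dB above the surrounding speech level.
    noise_rms = rms * 10 ** (margin_db / 20.0)
    y[start:end] = rng.normal(0.0, noise_rms, end - start)
    return y
```

Varying gap_ms (for example 100 ms, 200 ms, and 300 ms, as in the first embodiment) would yield variants with different perceptual intensities.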
 Locally time-reversed speech is obtained by cutting the speech waveform into segments of a certain short time length, reversing the waveform of each segment on the time axis, and re-concatenating the reversed segments (see Reference 3).
(Reference 3) K. Saberi and D. R. Perrott, "Cognitive restoration of reversed speech", Nature, 398, 6730, pp. 760-760 (1999).
 When humans listen to such locally time-reversed speech, intelligibility remains sufficiently high as long as the segment length is relatively short, for example around 25 ms. However, as the segment length increases, intelligibility decreases in a sigmoid-like manner, and it has been shown experimentally that speech perception becomes nearly impossible at around 100 ms. In other words, local destruction of the time series up to a certain degree does not affect human speech perception (perception is robust to it).
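
A minimal sketch of locally time-reversed speech, assuming a plain NumPy waveform; the segment length in milliseconds is the parameter that controls perceptual intensity:

```python
import numpy as np

def locally_time_reverse(x, fs=16000, seg_ms=25):
    """Cut waveform x into fixed-length segments, reverse each segment on the
    time axis, and concatenate the reversed segments (locally time-reversed
    speech). Illustrative sketch: around 25 ms intelligibility stays high,
    while around 100 ms speech becomes hard to perceive (Reference 3)."""
    seg = max(1, int(seg_ms / 1000.0 * fs))
    pieces = [x[i:i + seg][::-1] for i in range(0, len(x), seg)]
    return np.concatenate(pieces)
```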
 If an acoustic model is trained using audio signals that can produce such a continuity illusion, or locally time-reversed audio signals, the model is naturally trained to take into account time intervals longer than the deleted or masked portion, or longer than the reversed segment; the acoustic model thereby incorporates long-term information and acquires the robustness of speech perception that humans possess.
 In this embodiment, by using such auditory-illusion speech waveforms as augmented data, an acoustic model that is robust with respect to long-term information can be trained from the learning data. Among auditory illusions, this embodiment uses the continuity illusion and locally time-reversed speech described above.
 With the continuity illusion, the speech recognition device can acquire the human robustness of perceiving speech even when a portion is missing; as a result, a speech recognition device that is robust to long-term information is constructed.
 With time-reversed speech, the speech recognition device can acquire the human robustness of perceiving speech even when the time series is locally reversed (destroyed); as a result, a speech recognition device that is robust to long-term information is constructed.
 In addition, further performance improvement is realized during training through curriculum learning.
 Curriculum learning is a method in which the difficulty of each learning data sample is determined in advance according to some criterion, and training proceeds gradually from easy samples to difficult samples. It has been shown experimentally that curriculum learning accelerates convergence to the optimal solution and leads to a better local optimum (see Reference 4).
(Reference 4) Bengio, Y., et al.: "Curriculum learning", in ICML, pp. 41-48 (2009)
 For example, for a language model task, it has been reported that gradually increasing the vocabulary size of the learning data improves performance compared with ordinary training. This embodiment also adopts this learning method to achieve further performance improvement.
 Specifically, since auditory illusions have parameters that can control human perceptual intensity, curriculum learning can be realized by defining the perceptual intensity as the difficulty of the learning data. Controlling the perceptual intensity means controlling how easy or difficult the speech is to perceive: the lower the perceptual intensity, the easier the speech is to perceive, and the less difficult the learning data sample.
 例えば、連続長効果では、欠如させる時間セグメント長を短くした学習データサンプルから、時間セグメント長を長くした学習データサンプルに段階的に変化させていくことで、タスクの難易度を上げていくことが可能になる。 For example, in the continuous length effect, it is possible to increase the difficulty of the task by gradually changing from the learning data sample with a short time segment length to the training data sample with a long time segment length. It will be possible.
 With time-reversed speech, the perceptual intensity can be manipulated by the length of the reversed segments, and it has often been shown experimentally that perception generally becomes more difficult as the segments become longer. Therefore, in curriculum learning as well, the difficulty of the task can be raised by gradually changing from learning data samples with short reversed segments to samples with long reversed segments.
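
As a sketch of how such a curriculum could be organized (the concrete training loop is an assumption; train_one_pass and the data layout are hypothetical names), the augmented variants are simply presented to the optimizer in order of increasing perceptual intensity q:

```python
def curriculum_train(model, datasets_by_q, train_one_pass):
    """datasets_by_q: dict mapping the perceptual-intensity index q
    (q = 1 is the original, easiest data) to a list of
    (feature_sequence, label_sequence) pairs.
    train_one_pass: function that updates the model on one dataset.

    Training proceeds from low to high perceptual intensity, i.e. from easy
    samples (short deleted/reversed segments) to difficult samples."""
    for q in sorted(datasets_by_q):   # q = 1, 2, ..., Q
        model = train_one_pass(model, datasets_by_q[q])
    return model
```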
<First embodiment>
 FIG. 1 shows a functional block diagram of the model learning device according to the first embodiment, and FIG. 2 shows its processing flow.
 The model learning device 100 includes an audio signal acquisition unit 110, an audio digital signal storage unit 120, a signal conversion unit 125, an audio digital signal storage unit 126, a feature analysis unit 130, a feature storage unit 140, and a learning unit 160.
 The model learning device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The model learning device executes each process under the control of the central processing unit, for example. The data input to the model learning device and the data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used for other processing. At least a part of each processing unit of the model learning device may be implemented by hardware such as an integrated circuit. Each storage unit of the model learning device can be implemented, for example, by a main storage device such as RAM, an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.
 The model learning device takes an analog learning audio signal x(k) and the corresponding correct label r(j) as input, trains an acoustic model based on this information, and outputs the trained acoustic model f. Here, k is an index indicating time. The correct label is, for example, a phoneme label, and j is an index indicating the order of the phonemes. Information indicating which portion of the analog audio signal each correct label (phoneme label) corresponds to is assumed to be included in the learning data in advance, and the audio digital signal and the features described below are assumed to be linked in the same way.
 以下、各部の処理内容について説明する。 The processing contents of each part will be explained below.
<Audio signal acquisition unit 110>
Input: audio signal x(k)
Output: audio digital signal x(t)
Processing: A/D conversion
The audio signal acquisition unit 110 acquires the analog audio signal x(k) and converts it into the digital audio signal x(t) (S110). Here, t is an index indicating the sample number of the audio digital signal.
<Audio digital signal storage unit 120>
Input: audio digital signal x(t)
Processing: storage of the audio digital signal
The audio digital signal storage unit 120 stores the audio digital signal x(t) (S120).
<Signal conversion unit 125>
Input: audio digital signal x(t)
Output: audio digital signals r_q(t)
Processing: Data Augmentation
The signal conversion unit 125 performs Data Augmentation on the audio digital signal x(t) and converts it into audio digital signals r_q(t) (S125). Here, q is an index indicating the perceptual intensity, and a larger q means a higher perceptual intensity; training data with a lower perceptual intensity is easier to perceive as speech. The perceptual intensity is set in Q levels. To perform curriculum learning, Q is an integer of 2 or more and q = 2, 3, ..., Q. The signal conversion unit 125 converts the audio digital signal x(t) into Q-1 audio digital signals r_q(t) with different perceptual intensities. Since the training data with the lowest perceptual intensity is the audio digital signal x(t) corresponding to the original training data, we also write x(t) = r_1(t) and denote the Q audio digital signals with different perceptual intensities as r_q(t), q = 1, 2, ..., Q. For example, the signal conversion unit 125 generates the Q-1 audio digital signals r_q(t) from the same original audio digital signal x(t) by changing only the length of the time segment to be removed for the continuous listening effect or the length of the segment to be reversed for locally time-reversed speech (hereinafter simply referred to as the segment length).
The Data Augmentation in this embodiment converts the audio digital signal x(t) into the audio digital signals r_q(t) (q = 2, 3, ..., Q) according to a conversion rule, thereby generating pseudo, inflated training data.
In this embodiment, a conversion rule is adopted such that the converted audio digital signal r_q(t) is an audio signal that can give rise to an auditory illusion.
Specifically, the conversion rule converts the signal into an audio signal that produces the continuous listening effect described above, or into an audio signal that becomes locally time-reversed speech.
(i) When converting into an audio signal that produces the continuous listening effect, the signal conversion unit 125 removes a portion (of fixed duration) of the audio digital signal x(t) along the time axis and fills the removed portion with noise whose level is at or above the sound pressure, at each frequency, of the signal immediately before and after the removed portion, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t) (see FIG. 3). The lengths of the removed and filled portions are lengths that can give rise to an auditory illusion, and Q-1 different lengths are used. The removal and filling are performed at intervals that can give rise to an auditory illusion. The noise to be embedded is, for example, white noise, and is prepared before step S125. For example, by changing the length of the removed and filled portions from 100 ms to 200 ms and 300 ms, three audio digital signals r_q(t) with different perceptual intensities are obtained from the audio digital signal x(t).
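As an illustration only, the following is a minimal Python (NumPy) sketch of conversion (i). The function name, the use of uniform white noise scaled to the peak level of the neighbouring signal, and the fixed processing interval are assumptions made for the sketch; the embodiment only requires that the removed portion be filled with noise at or above the surrounding sound pressure, at intervals that can give rise to an auditory illusion.

```python
import numpy as np

def continuity_effect_augment(x, fs, seg_ms=200, interval_ms=1000, rng=None):
    """Remove seg_ms-long portions of the waveform x every interval_ms and
    fill each gap with white noise at or above the level of the surrounding
    signal (sketch of conversion (i))."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(x, dtype=np.float64).copy()
    seg = int(fs * seg_ms / 1000)
    step = int(fs * interval_ms / 1000)
    for start in range(step, len(y) - seg, step):
        # level of the signal just before and after the removed portion
        ctx = np.concatenate([y[max(0, start - seg):start],
                              y[start + seg:start + 2 * seg]])
        level = np.max(np.abs(ctx)) if ctx.size else 1.0
        y[start:start + seg] = rng.uniform(-level, level, size=seg)
    return y
```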
(ii) When converting into an audio signal that becomes locally time-reversed speech, the signal conversion unit 125 cuts the audio digital signal x(t) into waveform segments of a fixed, short time-window width, reverses each segment on the time axis, and then concatenates the reversed segments, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t) (see FIG. 4). The segment length used for cutting is a length that can give rise to an auditory illusion, and Q-1 different lengths are used. For example, by changing the length of the reversed segments from 20 ms to 40 ms and 60 ms, three audio digital signals r_q(t) with different perceptual intensities are obtained from the audio digital signal x(t).
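The following is a similarly minimal sketch of conversion (ii); the function name and the default 20 ms window are illustrative assumptions, and the final window is simply reversed as-is even if it is shorter than the chosen segment length.

```python
import numpy as np

def locally_time_reverse(x, fs, seg_ms=20):
    """Cut the waveform x into windows of seg_ms, reverse each window on the
    time axis, and concatenate the reversed windows (sketch of conversion (ii))."""
    x = np.asarray(x)
    seg = max(1, int(fs * seg_ms / 1000))
    pieces = [x[i:i + seg][::-1] for i in range(0, len(x), seg)]
    return np.concatenate(pieces) if pieces else x
```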
In this embodiment, the audio digital signal x(t) is converted into audio signals that produce the continuous listening effect, into audio signals that become locally time-reversed speech, or into both.
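Under the same assumptions, the two sketches above can be combined to produce the Q-1 signals r_q(t), q = 2, ..., Q, from one original signal by varying only the segment length, matching the 100/200/300 ms and 20/40/60 ms examples in the text. The helper name make_augmented_signals is hypothetical, and the snippet reuses the functions and the NumPy import from the sketches above.

```python
def make_augmented_signals(x, fs, mode="reverse"):
    """Return [r_1, r_2, ..., r_Q]: the original signal followed by Q-1
    variants of increasing perceptual intensity (here Q-1 = 3)."""
    if mode == "continuity":
        seg_lengths_ms = [100, 200, 300]   # conversion (i)
        convert = lambda s: continuity_effect_augment(x, fs, seg_ms=s)
    else:
        seg_lengths_ms = [20, 40, 60]      # conversion (ii)
        convert = lambda s: locally_time_reverse(x, fs, seg_ms=s)
    return [np.asarray(x)] + [convert(s) for s in seg_lengths_ms]
```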
<Audio digital signal storage unit 126>
Input: audio digital signals r_q(t)
Processing: storage of the audio digital signals
The audio digital signal storage unit 126 stores the audio digital signals r_q(t) (S126).
<Feature analysis unit 130>
Input: audio digital signals x(t), r_q(t)
Output: feature sequences X, R_q
Processing: feature analysis
The feature analysis unit 130 performs feature analysis on the audio digital signals x(t) and r_q(t) to obtain the feature sequences X and R_q.
For example, the feature analysis unit 130 takes the audio digital signal x(t) from the audio digital signal storage unit 120 and the audio digital signals r_q(t) from the audio digital signal storage unit 126, divides each of x(t) and r_q(t) into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequences X and R_q.
For example, let N be the total number of frames contained in the audio digital signal x(t), n = 1, 2, ..., N, let M be the frame length, m = 1, 2, ..., M, and let D be the shift width. The m-th sample of the n-th frame of the audio digital signal x(t) can then be written as x(D(n-1)+m). The feature analysis unit 130 extracts acoustic features from the samples x(D(n-1)+1), x(D(n-1)+2), ..., x(D(n-1)+M) of each frame n to obtain the feature X(n). Processing all frames 1, 2, ..., N gives the feature sequence X = {X(1), X(2), ..., X(N)}. The feature analysis unit 130 performs the same processing on the audio digital signals r_q(t) to obtain the feature sequences R_q = {R_q(1), R_q(2), ..., R_q(N)}.
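To make the indexing concrete, the following sketch frames a signal with frame length M and shift D so that the m-th sample of frame n is x[D(n-1)+m] (0-indexed in the code). The log-energy feature is only a stand-in for the MFCC-based features listed next, and the default M and D assume 16 kHz audio with 25 ms frames and a 10 ms shift; all of these are assumptions of the sketch.

```python
import numpy as np

def frame_signal(x, M, D):
    """Split x into overlapping frames of length M with shift D."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < M:
        return np.empty((0, M))
    N = 1 + (len(x) - M) // D              # total number of full frames
    return np.stack([x[D * n: D * n + M] for n in range(N)])

def analyze(x, M=400, D=160):
    """Sketch of the feature analysis: one feature vector per frame.
    Log energy stands in for the MFCC/power features named in the text."""
    frames = frame_signal(x, M, D)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]   # series X
```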
The features to be extracted include, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficient) based on short-time frame analysis of the audio signal, dynamic parameters such as the corresponding dynamic features ΔMFCC and ΔΔMFCC, and power, Δpower, ΔΔpower, and so on. CMN (cepstral mean normalization) may also be applied to the MFCC. The features are not limited to MFCC and power; parameters used for identifying special utterances (for example, autocorrelation peak values or group delay) may also be used.
<Feature storage unit 140>
Input: feature sequences X, R_q
Processing: storage of the feature sequences
The feature storage unit 140 stores the feature sequences X and R_q analyzed by the feature analysis unit 130 (S140).
<Learning unit 160>
Input: feature sequence X, feature sequences R_q, correct label r(j)
Output: trained acoustic model f
Processing: model training
The learning unit 160 trains the acoustic model f by curriculum learning based on perceptual intensity, using the feature sequence X, the feature sequences R_q, and the correct label r(j) (S160). That is, the learning unit 160 first trains the acoustic model f using the feature sequence X and the correct label r(j), and then continues training the acoustic model f while adding the feature sequences R_q to the training data step by step, in the order R_2, R_3, ..., R_Q, as training progresses. The acoustic model f is a model that takes a feature sequence as input and outputs phoneme labels. As acoustic models for speech recognition, GMM-HMM and DNN-HMM models are often used, and in recent years end-to-end speech recognition models have also been used; in this embodiment there is no particular restriction on the speech recognition model to be trained, so it may be a GMM/DNN-HMM or an end-to-end speech recognition model. The correct label r(j) corresponds to the analog audio signal x(k) for training, and therefore also corresponds to the feature sequence X obtained from the audio signal x(k) and to the feature sequences R_q obtained by converting the feature sequence X.
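The schedule itself can be illustrated with the following sketch, which starts from the original feature sequence X and adds R_2, R_3, ..., R_Q stage by stage. scikit-learn's SGDClassifier with partial_fit is used here purely as a stand-in for the acoustic model f (a GMM/DNN-HMM or end-to-end model in practice), and the frame-level label array and the number of passes per stage are assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def curriculum_train(X, R_list, y, classes, passes_per_stage=3):
    """Curriculum learning based on perceptual intensity: train on X first,
    then keep adding R_2, R_3, ... (R_list[0], R_list[1], ...) to the pool.

    X      : (frames, dims) original feature sequence
    R_list : list of (frames, dims) converted feature sequences R_2..R_Q
    y      : (frames,) frame-level labels aligned with X (and each R_q)
    classes: array of all label values (required by partial_fit)
    """
    model = SGDClassifier()
    pool_X, pool_y = [X], [y]
    for stage in range(len(R_list) + 1):
        if stage > 0:
            pool_X.append(R_list[stage - 1])   # add R_{stage+1} to the pool
            pool_y.append(y)
        Xs, ys = np.vstack(pool_X), np.concatenate(pool_y)
        for _ in range(passes_per_stage):
            model.partial_fit(Xs, ys, classes=classes)
    return model
```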
<Effects>
With the above configuration, an acoustic model that mimics the robustness of human speech perception can be trained, and financial and time costs can be reduced.
<Modification example>
A configuration obtained by removing the feature analysis unit 130, the feature storage unit 140, and the learning unit 160 from the model learning device 100 is also referred to as a training data generation device. That is, the training data generation device includes the audio signal acquisition unit 110, the audio digital signal storage unit 120, the signal conversion unit 125, and the audio digital signal storage unit 126. The training data generation device receives the analog audio signal x(k) for training and the correct label r(j) as input, generates the audio digital signal x(t) and the audio digital signals r_q(t) from the audio signal x(k), and outputs the combination of the audio digital signal x(t), the audio digital signals r_q(t), and the correct label r(j) as training data.
The first embodiment assumes that the audio digital signals r_q(t) are audio signals that can give rise to an auditory illusion, but experiments have shown that a similar effect can be obtained even with audio signals that cannot give rise to an auditory illusion. Therefore, audio signals that cannot give rise to an auditory illusion may be prepared as training data with high perceptual intensity. In other words, it suffices that, among the Q-1 audio digital signals r_q(t), at least the audio digital signal r_2(t) with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
For example, in the first embodiment, to obtain the continuous listening effect, the signal conversion unit 125 removes a portion of the audio digital signal x(t) along the time axis and fills the removed portion with noise whose level is at or above the sound pressure, at each frequency, of the signal before and after the removed portion, thereby converting x(t) into r_q(t). When converting into training data with high perceptual intensity, the lengths of the removed and filled portions may be so long that the continuous listening effect cannot arise, and the interval between the removal and filling operations may be so short that the continuous listening effect cannot arise. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the first embodiment can be trained.
Likewise, in the first embodiment, to convert into an audio signal that becomes locally time-reversed speech, the signal conversion unit 125 cuts the audio digital signal x(t) into waveform segments of a fixed, short time-window width, reverses each segment on the time axis, and concatenates the reversed segments to convert x(t) into r_q(t). When converting into training data with high perceptual intensity, the length of the cut waveform segments may be so long that an auditory illusion cannot arise. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the first embodiment can be trained.
Furthermore, the first embodiment illustrates a conversion method in which the Q-1 audio digital signals r_q(t) are generated from the same original audio digital signal x(t) by changing only the length of the time segment to be removed or the length of the segment to be reversed (hereinafter simply referred to as the segment length); however, the signal conversion unit 125 may perform the conversion in any manner as long as it can convert the original audio digital signal into Q-1 audio digital signals with different perceptual intensities. For example, the original audio digital signal x(t) may be divided into Q-1 time intervals V(2) = {x(v_2s), ..., x(v_2e)}, V(3) = {x(v_3s), ..., x(v_3e)}, ..., V(Q) = {x(v_Qs), ..., x(v_Qe)}, and one audio digital signal r_q may be generated from each time interval V(q) = {x(v_qs), ..., x(v_qe)}, where v_qs and v_qe denote the sample numbers of the first and last samples of the q-th time interval, respectively. Alternatively, the utterances contained in the original audio digital signal x(t) may be divided into Q-1 groups, and a segment length may be set for each group to generate the audio digital signals r_q(t).
<Points of the Second Embodiment>
In this embodiment, instead of performing the illusion-based Data Augmentation of the first embodiment on the speech waveform, the Data Augmentation is performed in the feature space, which makes it possible to build a speech recognition device that is robust to long-span information from the training data. When Data Augmentation is performed on the speech waveform as in the first embodiment, the amount of training data simply becomes Q times larger, and storing that data requires Q times the capacity, including the original data. By performing Data Augmentation in the feature space, however, the features serving as training data can be converted during training itself, so the required data capacity is only that of the original training data.
Among auditory illusions, the continuous listening effect and locally time-reversed speech described above are taken up in this embodiment as examples that can be processed in the feature space.
With the continuous listening effect, the speech recognition device can be made to acquire the robustness with which humans perceive speech even when part of it is missing, and as a result a speech recognition device that is robust to long-span information is built. A similar representation is possible in the feature space: for example, deleting a segment on the time axis of the feature sequence and embedding values equal to or greater than the feature values before and after that segment in its place is equivalent to the continuous listening effect.
With locally time-reversed speech, the speech recognition device can be made to acquire the robustness with which humans perceive speech even when the time series is locally reversed (disrupted), and as a result a speech recognition device that is robust to long-span information is built. Similarly, to obtain a comparable representation in the feature space, the feature sequence is reversed within each segment along the time axis of the features, and the data obtained by re-concatenating all the segments is used as augmented data.
<Second Embodiment>
The description below focuses on the differences from the first embodiment.
FIG. 5 shows a functional block diagram of the model learning device according to the second embodiment, and FIG. 6 shows its processing flow.
The model learning device 100 includes an audio signal acquisition unit 110, an audio digital signal storage unit 120, a feature analysis unit 130, a feature storage unit 140, a feature conversion unit 150, and a learning unit 160.
The processing performed by each unit is described below.
The processing of the audio signal acquisition unit 110 and the audio digital signal storage unit 120 is the same as in the first embodiment.
<Feature analysis unit 130>
Input: audio digital signal x(t)
Output: feature sequences X(p)
Processing: feature analysis
The feature analysis unit 130 performs feature analysis on the audio digital signal x(t) to obtain the feature sequences X(p).
For example, the feature analysis unit 130 takes the audio digital signal x(t) of each utterance p from the audio digital signal storage unit 120, divides it into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequence X(p) for each utterance p.
For example, let N_p be the total number of frames contained in an utterance p, n_p = 1_p, 2_p, ..., N_p, let M be the frame length, m = 1, 2, ..., M, and let D be the shift width. The m-th sample of the n_p-th frame of the utterance p can then be written as x(D(n_p-1)+m), where the subscript p indicates a value associated with the utterance p. The feature analysis unit 130 extracts acoustic features from the samples x(D(n_p-1)+1), x(D(n_p-1)+2), ..., x(D(n_p-1)+M) of each frame n_p to obtain the feature X(n_p). Processing all frames 1_p, 2_p, ..., N_p contained in the utterance p gives the feature sequence X(p) = {X(1_p), X(2_p), ..., X(N_p)} for each utterance p.
The features to be extracted are the same as in the first embodiment.
<Feature storage unit 140>
Input: feature sequences X(p)
Processing: storage of the feature sequences
The feature storage unit 140 stores the feature sequences X(p) analyzed by the feature analysis unit 130 (S140).
<Feature conversion unit 150>
Input: feature sequence X(p)
Output: feature sequences R_q(p)
Processing: Data Augmentation
The feature conversion unit 150 performs Data Augmentation on the feature sequence X(p) and converts it into the feature sequences R_q(p) (S150). In other words, the feature conversion unit 150 converts the feature sequence X(p) into Q-1 feature sequences R_q(p) with different perceptual intensities. For example, the feature conversion unit 150 generates the Q-1 feature sequences R_q(p) from the same original feature sequence X(p) by changing only the segment length.
The Data Augmentation is performed online, at the same time as the training in the learning unit 160 described later. More specifically, rather than applying Data Augmentation in advance to the feature sequences X(p) corresponding to all utterances p (here p = 1, 2, ..., P) used by the learning unit 160, the feature sequence X(p') corresponding to a given utterance p' (p' being one of 1, 2, ..., P) is converted into the feature sequences R_q(p') at the moment it is used for training. Here, P denotes the total number of utterances contained in the analog audio signal x(k) for training. The inflated training data is used only during training and does not need to be stored, so the amount of training data to be stored can be reduced. Since the input is a feature sequence, all Data Augmentation is performed in the feature space, and there is no need to perform Data Augmentation on the audio digital signal.
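A minimal sketch of this online augmentation is shown below as a Python generator. The names feature_store, labels, and feature_augment (which could be either of the feature-space conversions sketched under (i) and (ii) below) are hypothetical and introduced only for illustration.

```python
def online_augmented_batches(feature_store, labels, feature_augment, q_max):
    """Yield (features, labels) pairs, converting the stored original feature
    sequence X(p) into R_q(p) only at training time, so that only the original
    data ever needs to be kept."""
    for p, X_p in feature_store.items():       # X_p: (frames, dims) array
        yield X_p, labels[p]                    # q = 1: the original sequence
        for q in range(2, q_max + 1):
            yield feature_augment(X_p, q), labels[p]   # R_q(p), built on the fly
```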
The Data Augmentation in this embodiment converts the feature sequence X(p) into the feature sequences R_q(p) (q = 2, 3, ..., Q) according to a conversion rule, thereby generating pseudo, inflated training data.
In this embodiment, a conversion rule is adopted such that the audio signal corresponding to the converted feature sequence R_q(p) is an audio signal that can give rise to an auditory illusion. In general, an audio signal that can give rise to an auditory illusion is produced by processing the speech waveform, but in this embodiment the conversion is performed on the feature sequence.
Specifically, the conversion rule converts the feature sequence into a feature sequence corresponding to an audio signal that produces the continuous listening effect described above, or to an audio signal that becomes locally time-reversed speech.
(i) When converting into a feature sequence corresponding to an audio signal that produces the continuous listening effect, the feature conversion unit 150 deletes a segment of the feature sequence X(p) and embeds, in the deleted portion, features whose values are equal to or greater than the feature values before and after the deleted segment, thereby converting the feature sequence X(p) into the feature sequence R_q(p). The segment length is a length that can give rise to an auditory illusion, and Q-1 different lengths are used; the deletion and embedding are performed at intervals that can give rise to an auditory illusion. The embedded features are features corresponding to noise, for example white noise, and are prepared before step S150. For example, among the features ..., X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p), X(s+6_p), X(s+7_p), X(s+8_p), X(s+9_p), X(s+10_p), ..., the three features X(s+3_p), X(s+4_p), X(s+5_p) are deleted and three noise features X(1_n), X(2_n), X(3_n) are embedded in their place (see FIG. 7). The values of X(1_n), X(2_n), X(3_n) are set to be equal to or greater than the values of the preceding feature X(s+2_p) and the following feature X(s+6_p). This processing is performed, for example, every 20 frames. By changing the length of the deleted and embedded portion from 3 frames to 4 and 5 frames, for example, three feature sequences R_q(p) with different perceptual intensities are obtained from the feature sequence X(p).
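A minimal sketch of this feature-space version of conversion (i) follows. The Gaussian offset used to keep the embedded "noise" frames at or above the neighbouring feature values is an assumption of the sketch, as are the default segment length (3 frames) and interval (20 frames) taken from the example above.

```python
import numpy as np

def feature_continuity_augment(X, seg_frames=3, interval_frames=20, rng=None):
    """Every interval_frames frames, overwrite seg_frames consecutive feature
    vectors with values not below the features just before and after the gap."""
    rng = rng or np.random.default_rng(0)
    Y = np.asarray(X, dtype=np.float64).copy()
    for s in range(interval_frames, len(Y) - seg_frames, interval_frames):
        ceiling = np.maximum(Y[s - 1], Y[s + seg_frames])        # neighbouring frames
        noise = np.abs(rng.normal(0.0, 0.1, size=(seg_frames, Y.shape[1])))
        Y[s:s + seg_frames] = ceiling + noise                     # >= both neighbours
    return Y
```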
(ii) When converting into a feature sequence corresponding to an audio signal that becomes locally time-reversed speech, the feature conversion unit 150 divides the feature sequence X(p) into segments of a predetermined time length, reverses the feature sequence within each segment in time, and concatenates the reversed segments, thereby converting the feature sequence X(p) into the feature sequence R_q(p). The segment length is a length that can give rise to an auditory illusion, and Q-1 different lengths are used. For example, the feature conversion unit 150 divides the features ..., X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p), X(s+6_p), X(s+7_p), X(s+8_p), X(s+9_p), X(s+10_p), ... into segments of five frames, ..., s(1) = {X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p)}, s(2) = {X(s+6_p), X(s+7_p), X(s+8_p), X(s+9_p), X(s+10_p)}, ... . The feature conversion unit 150 then reverses the feature sequence within each segment in time, giving ..., s'(1) = {X(s+5_p), X(s+4_p), X(s+3_p), X(s+2_p), X(s+1_p)}, s'(2) = {X(s+10_p), X(s+9_p), X(s+8_p), X(s+7_p), X(s+6_p)}, ..., and concatenates them in the order ..., s'(1), s'(2), ... (see FIG. 8). By changing the length of the reversed portion from 5 frames to 6 and 7 frames, for example, three feature sequences R_q(p) with different perceptual intensities are obtained from the feature sequence X(p).
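The feature-space version of conversion (ii) reduces to reversing the frame order within each segment, as in the following sketch; the default segment length of 5 frames follows the example above and is otherwise an arbitrary choice.

```python
import numpy as np

def feature_time_reverse(X, seg_frames=5):
    """Reverse the frame order inside every seg_frames-long segment of the
    feature sequence X and concatenate the reversed segments again."""
    X = np.asarray(X)
    parts = [X[i:i + seg_frames][::-1] for i in range(0, len(X), seg_frames)]
    return np.concatenate(parts, axis=0) if parts else X
```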
In this embodiment, the feature sequence X(p) is converted into feature sequences corresponding to audio signals that produce the continuous listening effect, into feature sequences corresponding to audio signals that become locally time-reversed speech, or into both.
The processing of the learning unit 160 is the same as in the first embodiment.
<Effects>
With the above configuration, the same effects as in the first embodiment can be obtained. Furthermore, by performing Data Augmentation in the feature space instead of on the speech waveform, the processing S110 to S140 for the inflated training data can be eliminated, and by performing Data Augmentation at the same time as training, the storage capacity required for the training data can be reduced.
<Modification example>
A configuration obtained by removing the learning unit 160 from the model learning device 100 is also referred to as a training data generation device. That is, the training data generation device includes the audio signal acquisition unit 110, the audio digital signal storage unit 120, the feature analysis unit 130, the feature storage unit 140, and the feature conversion unit 150. The training data generation device receives the analog audio signal x(k) for training and the correct label r(j) as input, generates the feature sequences X(p) and R_q(p) from the audio signal x(k), and outputs the combination of the feature sequences X(p), the feature sequences R_q(p), and the correct label r(j) as training data.
The second embodiment assumes that the audio signals corresponding to the feature sequences R_q(p) are audio signals that can give rise to an auditory illusion, but experiments have shown that a similar effect can be obtained even with audio signals that cannot give rise to an auditory illusion. Therefore, audio signals that cannot give rise to an auditory illusion may be prepared as training data with high perceptual intensity. In other words, it suffices that, among the Q-1 audio signals corresponding to the Q-1 feature sequences R_q(p), at least the audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
For example, in the second embodiment, to obtain the continuous listening effect, the feature conversion unit 150 deletes a segment of the feature sequence X(p) and embeds, in the deleted portion, features whose values are equal to or greater than the feature values before and after the deleted segment, thereby converting X(p) into R_q(p). When converting into training data with high perceptual intensity, the length of the deleted or embedded segment may be so long that the continuous listening effect cannot arise, and the interval between the deletion and embedding operations may be so short that the continuous listening effect cannot arise. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the second embodiment can be trained.
Likewise, in the second embodiment, to convert into a feature sequence corresponding to locally time-reversed speech, the feature conversion unit 150 divides the feature sequence X(p) into segments of a predetermined time length, reverses the feature sequence within each segment in time, and concatenates the reversed segments to convert X(p) into R_q(p). When converting into training data with high perceptual intensity, the segment length may be so long that an auditory illusion cannot arise. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the second embodiment can be trained.
Furthermore, the second embodiment illustrates a conversion method in which the Q-1 feature sequences R_q(p) are generated from the same original feature sequence X(p) by changing only the time segment length to be removed or the segment length to be reversed (hereinafter simply referred to as the segment length); however, as with the signal conversion unit 125 of the first embodiment, the feature conversion unit 150 may perform the conversion in any manner as long as it can convert the original feature sequence into Q-1 feature sequences with different perceptual intensities. For example, the feature sequence X corresponding to the original audio digital signal x(t) may be divided into Q-1 feature subsequences V(2) = {X(v_2s), ..., X(v_2e)}, V(3) = {X(v_3s), ..., X(v_3e)}, ..., V(Q) = {X(v_Qs), ..., X(v_Qe)}, and one feature sequence R_q may be generated from each interval V(q) = {X(v_qs), ..., X(v_qe)}, where v_qs and v_qe denote the first and last frame numbers of the q-th interval, respectively. Alternatively, the feature sequences X(p) corresponding to the original P utterances may be divided into Q-1 groups, the same segment length may be set within each group, and the feature sequences R_q(p) may be generated accordingly.
<Other modifications>
The present invention is not limited to the embodiments and modifications described above. For example, the various processes described above may be executed not only sequentially in the order described but also in parallel or individually, depending on the processing capacity of the device executing the processes or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processes described above can be carried out by loading a program for executing the steps of the above methods into the storage unit 2020 of the computer shown in FIG. 5 and causing the control unit 2010, the input unit 2030, the output unit 2040, and so on to operate.
The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be carried out by a so-called ASP (Application Service Provider) type service, in which the processing functions are realized only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (for example, data that is not a direct command to a computer but has the property of defining the processing of the computer).
In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

Claims (7)

  1.  A training data generation device that generates training data used for training an acoustic model used in a speech recognition device, the training data generation device comprising:
      a signal conversion unit that, with Q being an integer of 2 or more, converts a first training audio signal into second training audio signals, the second training audio signals being Q-1 audio signals with different perceptual intensities,
      wherein, among the Q-1 second training audio signals, at least the second training audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
  2.  A training data generation device that generates training data used for training an acoustic model used in a speech recognition device, the training data generation device comprising:
      a feature conversion unit that, with Q being an integer of 2 or more, converts a first feature sequence, which is an acoustic feature sequence obtained from a first training audio signal, into Q-1 second feature sequences,
      wherein the Q-1 second training audio signals corresponding to the Q-1 second feature sequences are Q-1 audio signals with different perceptual intensities, and among them at least the audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
  3.  A model learning device including the training data generation device according to claim 1 or claim 2, the model learning device comprising:
      a learning unit that trains an acoustic model by curriculum learning based on perceptual intensity, using the first feature sequence obtained from the first training audio signal, the Q-1 second feature sequences corresponding to the second training audio signals, and a correct label corresponding to the first training audio signal.
  4.  A training data generation method for generating training data used for training an acoustic model used in a speech recognition device, the method comprising:
      a signal conversion step of, with Q being an integer of 2 or more, converting a first training audio signal into second training audio signals, the second training audio signals being Q-1 audio signals with different perceptual intensities,
      wherein, among the Q-1 second training audio signals, at least the second training audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
  5.  A training data generation method for generating training data used for training an acoustic model used in a speech recognition device, the method comprising:
      a feature conversion step of, with Q being an integer of 2 or more, converting a first feature sequence, which is an acoustic feature sequence obtained from a first training audio signal, into Q-1 second feature sequences,
      wherein the Q-1 second training audio signals corresponding to the Q-1 second feature sequences are Q-1 audio signals with different perceptual intensities, and among them at least the audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
  6.  A model learning method including the training data generation method according to claim 4 or claim 5, the method comprising:
      a learning step of training an acoustic model by curriculum learning based on perceptual intensity, using the first feature sequence obtained from the first training audio signal, the Q-1 second feature sequences corresponding to the second training audio signals, and a correct label corresponding to the first training audio signal.
  7.  A program for causing a computer to function as the training data generation device according to claim 1 or claim 2, or as the model learning device according to claim 3.
PCT/JP2020/021699 2020-06-02 2020-06-02 Training data generation device, model training device, training data generation method, model training method, and program WO2021245771A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/021699 WO2021245771A1 (en) 2020-06-02 2020-06-02 Training data generation device, model training device, training data generation method, model training method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/021699 WO2021245771A1 (en) 2020-06-02 2020-06-02 Training data generation device, model training device, training data generation method, model training method, and program

Publications (1)

Publication Number Publication Date
WO2021245771A1 true WO2021245771A1 (en) 2021-12-09

Family

ID=78830220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/021699 WO2021245771A1 (en) 2020-06-02 2020-06-02 Training data generation device, model training device, training data generation method, model training method, and program

Country Status (1)

Country Link
WO (1) WO2021245771A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
US20190354808A1 (en) * 2018-05-18 2019-11-21 Google Llc Augmentation of Audiographic Images for Improved Machine Learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAKANORI ASHIHARA, TOMOHIRO TANAKA, TAKASHI MORIYA, RYO MASUMURA, YUSUKE SHINOHARA, MAKIO KASHIWANO: "1-P-1 Data expansion for speech recognition using auditory illusion: Examination based on time-reversed speech", PROCEEDINGS OF THE ACOUSTICAL SOCIETY OF JAPAN; MARCH 1-3, 2020, 2 March 2020 (2020-03-02) - 3 March 2020 (2020-03-03), JP, pages 793 - 794, XP009532980 *
TAKANORI ASHIHARA, TOMOHIRO TANAKA, TAKASHI MORIYA, RYO MASUMURA, YUSUKE SHINOHARA, MAKIO KASHIWANO: "Data augmentation for ASR system by using locally time-reversed speech: Temporal inversion of feature sequence", IEICE TECHNICAL REPORT, SP, vol. 119, no. 441 (SP2019-59), 24 February 2020 (2020-02-24), pages 53 - 58, XP009532837 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20938934; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20938934; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: JP