WO2021245771A1 - Training data generation device, model training device, training data generation method, model training method, and program - Google Patents

Training data generation device, model training device, training data generation method, model training method, and program

Info

Publication number
WO2021245771A1
Authority
WO
WIPO (PCT)
Prior art keywords
learning
feature
signal
series
audio
Prior art date
Application number
PCT/JP2020/021699
Other languages
French (fr)
Japanese (ja)
Inventor
孝典 芦原
雄介 篠原
義和 山口
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2020/021699
Publication of WO2021245771A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/16 Speech classification or search using artificial neural networks

Definitions

  • The present invention relates to a learning data generation device that generates learning data used when training an acoustic model used in a speech recognition device, a model learning device that uses the learning data, a learning data generation method, a model learning method, and a program.
  • Patent Document 1 describes a technique for adapting an acoustic model to the task to be recognized in order to ensure a practical level of speech recognition performance.
  • In other words, Patent Document 1 describes a technique for adapting an original acoustic model to a task whose acoustic characteristics, such as speaker, noise type, and speaking style, differ from the original data.
  • Speech recognition performance generally rises or falls depending on the amount of learning data available for the target task and on its acoustic coverage; therefore, desired learning data are usually collected by gathering a sufficient amount of speech for the target task and transcribing it.
  • Data Augmentation is one solution to this problem.
  • Data augmentation adds some variation to the original learning data to generate new learning data, thereby inflating the training set; this reduces repeated training on identical samples and yields better generalization performance.
  • In Non-Patent Document 1, various speaker data are generated by changing the speaking rate of the original data, improving generalization to a wider range of speakers.
  • In Non-Patent Document 2, to improve noise robustness and recognition performance for reverberant speech, noise is superimposed on the original learning data, pseudo reverberant speech is generated by convolving the impulse response of a highly reverberant room, and the reverberant speech is superimposed on the original learning data to improve generalization performance.
  • Unlike the multi-layer perceptron (MLP) model, the recurrent neural network (RNN) model is designed to explicitly capture long-term information, and it greatly improved accuracy on tasks that handle time-series information, such as speech recognition.
  • Speech recognition accuracy has also been improved by explicitly incorporating long-range linguistic context into an end-to-end speech recognition model.
  • In the present invention, the learning data itself is devised.
  • Learning data are generated by executing a data augmentation process that exploits auditory illusions.
  • The model learning device according to the present invention further improves the performance of the acoustic model through curriculum learning based on the perceptual intensity of the auditory illusion.
  • The learning data generation device according to the present invention generates learning data suitable for curriculum learning based on perceptual intensity.
  • The present invention provides a learning data generation device that generates learning data for training an acoustic model that simulates the robustness of human speech perception, a model learning device that trains an acoustic model using the learning data, a learning data generation method, a model learning method, and a program.
  • The learning data generation device generates learning data used when training the acoustic model used in the speech recognition device.
  • With Q being an integer of 2 or more, the learning data generation device includes a signal conversion unit that converts a first learning audio signal into second learning audio signals, namely Q-1 audio signals with different perceptual intensities.
  • Among the Q-1 second learning audio signals, at least the second learning audio signal with the lowest perceptual intensity is an audio signal that can induce an auditory illusion.
  • According to another aspect, the learning data generation device generates learning data used when training an acoustic model used in a speech recognition device.
  • With Q being an integer of 2 or more, the device includes a feature conversion unit that converts a first feature sequence, which is the acoustic feature sequence obtained from a first learning audio signal, into Q-1 second feature sequences.
  • The Q-1 second learning audio signals corresponding to the Q-1 second feature sequences are Q-1 audio signals with different perceptual intensities, and at least the audio signal with the lowest perceptual intensity among them is an audio signal that can induce an auditory illusion.
  • the present invention has the effect of being able to learn an acoustic model that simulates the robustness of human speech perception.
  • FIG. 1 is a functional block diagram of the model learning device according to the first embodiment; FIG. 2 shows an example of its processing flow.
  • FIG. 5 is a functional block diagram of the model learning device according to the second embodiment; FIG. 6 shows an example of its processing flow.
  • <Points of the first embodiment> In this embodiment, a data augmentation process using auditory illusions is executed so that the speech recognition device acquires the robustness of speech perception that humans possess.
  • An auditory illusion is an illusory phenomenon in which, owing to the characteristics of human hearing, a physically presented sound stimulus is not necessarily perceived as it is; it can be regarded as the auditory counterpart of an optical illusion.
  • If an acoustic model is trained using audio signals that can produce such a continuity illusion, or locally time-reversed audio signals, the model is naturally trained to take into account time intervals longer than the deleted or masked portion, or longer than the reversed segment; the acoustic model thereby incorporates long-term information and acquires the robustness of speech perception that humans possess.
  • By using such auditory-illusion speech waveforms as augmented data, an acoustic model that is robust with respect to long-term information can be trained from the learning data.
  • Among auditory illusions, this embodiment uses the continuity illusion and locally time-reversed speech mentioned above.
  • With time-reversed speech, the speech recognition device can acquire the human robustness of perceiving speech even when the time series is locally reversed (destroyed); as a result, a speech recognition device that is robust to long-term information is constructed.
  • Curriculum learning is a method in which the difficulty of each learning data sample is determined in advance according to some criterion, and training proceeds gradually from easy samples to difficult samples. It has been shown experimentally that curriculum learning accelerates convergence to the optimal solution and leads to a better local optimum (see Reference 4).
  • Curriculum learning can be realized by defining the perceptual intensity as the difficulty of the learning data. Controlling the perceptual intensity means controlling how easy or difficult the speech is to perceive: the lower the perceptual intensity, the easier the speech is to perceive, and the less difficult the learning data sample.
  • FIG. 1 shows a functional block diagram of the model learning device according to the first embodiment
  • FIG. 2 shows a processing flow thereof.
  • the model learning device 100 includes a voice signal acquisition unit 110, a voice digital signal storage unit 120, a signal conversion unit 125, a voice digital signal storage unit 126, a feature amount analysis unit 130, a feature amount storage unit 140, and a learning unit 160.
  • The model learning device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory).
  • The model learning device executes each process under the control of the central processing unit, for example.
  • The data input to the model learning device and the data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used for other processing.
  • At least a part of each processing unit of the model learning device may be implemented by hardware such as an integrated circuit.
  • Each storage unit of the model learning device can be implemented, for example, by a main storage device such as RAM (Random Access Memory), an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.
  • The model learning device takes an analog learning audio signal x(k) and the corresponding correct label r(j) as input, trains an acoustic model based on this information, and outputs the trained acoustic model f.
  • k is an index indicating time.
  • The correct label is, for example, a phoneme label, and j is an index indicating the order of the phonemes.
  • Information indicating which portion of the analog audio signal each correct label (phoneme label) corresponds to is assumed to be included in the learning data in advance, and the audio digital signal and the features described below are assumed to be linked in the same way.
  • q is an index indicating the perceptual intensity; the larger q is, the higher the perceptual intensity. Note that the easier the speech in a learning sample is to perceive, the lower its perceptual intensity.
  • The perceptual intensity is set to Q levels.
  • The signal conversion unit 125 generates Q-1 audio digital signals r_q(t) from the same original audio digital signal x(t) by changing only the length of the time segment to be deleted for the continuity illusion, or the length of the segment to be reversed for locally time-reversed speech (hereinafter simply referred to as the segment length).
  • A conversion rule is adopted such that the converted audio digital signal r_q(t) is an audio signal that can induce an auditory illusion.
  • Specifically, a conversion rule is adopted that converts the signal into an audio signal from which the continuity illusion described above can be obtained, or into a locally time-reversed audio signal.
  • When converting to an audio signal from which the continuity illusion can be obtained, the signal conversion unit 125 deletes a portion (of a certain time length) of the audio digital signal x(t) along the time axis and embeds in the deleted portion noise whose sound pressure at each frequency is equal to or higher than that of the signal before and after the deletion, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t) (see FIG. 3).
  • The lengths of the deleted portion and the embedded portion are lengths that can induce the illusion, and Q-1 different lengths are applied.
  • The deletion and embedding are performed at intervals that can induce the auditory illusion.
  • The noise to be embedded is, for example, white noise.
  • The noise is prepared in advance, prior to process S125. For example, by setting the length of the deleted and embedded portion to 100 ms, 200 ms, and 300 ms, three audio digital signals r_q(t) with different perceptual intensities are obtained from the audio digital signal x(t).
  • When converting to a locally time-reversed audio signal, the signal conversion unit 125 cuts the audio digital signal x(t) into waveform segments of a fixed short time window, reverses each waveform segment on the time axis, and concatenates the reversed segments, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t) (see FIG. 4).
  • The segment length used for cutting is a length that can induce the illusion, and Q-1 different lengths are applied. For example, by setting the length of the reversed segments to 20 ms, 40 ms, and 60 ms, three audio digital signals r_q(t) with different perceptual intensities are obtained from the audio digital signal x(t).
  • The audio digital signal x(t) is converted into an audio signal from which the continuity illusion can be obtained, into a locally time-reversed audio signal, or into both.
  • The feature analysis unit 130 takes the audio digital signal x(t) from the audio digital signal storage unit 120 and the audio digital signal r_q(t) from the audio digital signal storage unit 126, divides x(t) and r_q(t) into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequences X and R_q.
  • The m-th sample of the audio digital signal x(t) in the n-th frame can be expressed as x(D(n-1)+m).
  • For each frame n, the feature analysis unit 130 extracts acoustic features from the audio digital signal samples x(D(n-1)+1), x(D(n-1)+2), ..., x(D(n-1)+M) and obtains the feature X(n).
  • The extracted features include, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficients) based on short-time frame analysis of the audio signal, dynamic parameters such as ΔMFCC and ΔΔMFCC, which are their dynamic features, as well as power, Δpower, and ΔΔpower. CMN (cepstral mean normalization) processing may also be applied to the MFCC.
  • The features are not limited to MFCC and power; parameters used for identifying special utterances (for example, the autocorrelation peak value and group delay) may also be used.
  • The acoustic model f is a model that takes a feature sequence as input and outputs a phoneme label.
  • GMM-HMM and DNN-HMM models are often used as acoustic models in speech recognition, and in recent years end-to-end speech recognition models have also been used. Since there are no restrictions on the speech recognition model here, it may be either a GMM/DNN-HMM or an end-to-end speech recognition model.
  • The correct label r(j) corresponds to the analog learning audio signal x(k); it also corresponds to the feature sequence X obtained from the audio signal x(k) and to the feature sequences R_q obtained by conversion.
  • A configuration that does not include the feature analysis unit 130, the feature storage unit 140, and the learning unit 160 of the model learning device 100 is also referred to as a learning data generation device. That is, the learning data generation device includes the audio signal acquisition unit 110, the audio digital signal storage unit 120, the signal conversion unit 125, and the audio digital signal storage unit 126.
  • The learning data generation device takes an analog learning audio signal x(k) and a correct label r(j) as input, generates the audio digital signal x(t) and the audio digital signals r_q(t) from the audio signal x(k), and outputs the combination of the audio digital signal x(t), the audio digital signals r_q(t), and the correct label r(j) as learning data.
  • In this embodiment, the audio digital signal r_q(t) is an audio signal that can induce an auditory illusion; however, experiments showed that the same effect can be obtained with audio signals that cannot induce an auditory illusion. Therefore, an audio signal that cannot induce an auditory illusion may be prepared as learning data with high perceptual intensity.
  • In other words, it suffices that at least the audio digital signal r_2(t) with the lowest perceptual intensity among the Q-1 audio digital signals r_q(t) is an audio signal that can induce an auditory illusion.
  • In this case, the signal conversion unit 125 deletes a portion of the audio digital signal x(t) along the time axis and embeds in the deleted portion noise whose sound pressure at each frequency is equal to or higher than that of the signal before and after the deletion, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t).
  • The lengths of the deleted portion and the embedded portion may be so long that the continuity illusion does not occur.
  • The interval between the deletion and embedding processes may be so short that the continuity illusion does not occur. Even when such data augmentation is executed, an acoustic model with accuracy comparable to that of the first embodiment can be trained.
  • Similarly, the signal conversion unit 125 may cut the audio digital signal x(t) into waveform segments of a fixed short time window, reverse each waveform segment on the time axis, and concatenate the reversed segments to convert the audio digital signal x(t) into the audio digital signal r_q(t).
  • The length of the cut waveform segments may be so long that the illusion does not occur. Even when such data augmentation is executed, an acoustic model with accuracy comparable to that of the first embodiment can be trained.
  • Above, a conversion method was illustrated that generates Q-1 audio digital signals r_q(t) from the same original audio digital signal x(t) by changing only the length of the time segment to be deleted or the length of the segment to be reversed (hereinafter simply referred to as the segment length); however, the signal conversion unit 125 may perform the conversion in any way, as long as it can convert the original audio digital signal into Q-1 audio digital signals with different perceptual intensities.
  • v_qs and v_qe indicate the sample numbers of the first and last samples of the audio digital signal in the q-th time interval, respectively.
  • Alternatively, the utterances contained in the original audio digital signal x(t) may be divided into Q-1 groups, and a segment length may be set for each group to generate the audio digital signals r_q(t).
  • In the second embodiment, the data augmentation process using auditory illusions is not executed on the speech waveform as in the first embodiment but on the feature space; this also makes it possible to construct a speech recognition device that is robust to long-term information from the learning data.
  • When data augmentation is executed on the waveform, the amount of learning data simply becomes Q times larger, and storing the data requires Q times the capacity, including the original data. By executing the data augmentation process on the feature space, the features serving as learning data can be converted during training, so only the capacity for the original learning data is needed.
  • With the continuity illusion, the speech recognition device can acquire the human robustness of perceiving speech even when a portion is missing; as a result, a speech recognition device that is robust to long-term information is constructed.
  • Similar expressions are possible in the feature space. For example, by deleting a certain segment on the time axis of the feature sequence and embedding in that segment values larger than the features before and after it, an expression equivalent to the continuity illusion is obtained.
  • With time-reversed speech, the speech recognition device can acquire the human robustness of perceiving speech even when the time series is locally reversed (destroyed); as a result, a speech recognition device that is robust to long-term information is constructed.
  • In the feature space, the feature sequence is reversed within each segment on the time axis, and all segments are re-concatenated; the result is used as augmented data.
  • FIG. 5 shows a functional block diagram of the model learning device according to the second embodiment
  • FIG. 6 shows a processing flow thereof.
  • the model learning device 100 includes a voice signal acquisition unit 110, a voice digital signal storage unit 120, a feature amount analysis unit 130, a feature amount storage unit 140, a feature amount conversion unit 150, and a learning unit 160.
  • the processing contents of the audio signal acquisition unit 110 and the audio digital signal storage unit 120 are the same as those in the first embodiment.
  • The feature analysis unit 130 takes the audio digital signal x(t) for each utterance p from the audio digital signal storage unit 120, divides it into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequence X(p) for each utterance p.
  • The m-th sample of the audio digital signal x(t) in the n_p-th frame of an utterance p can be expressed as x(D(n_p-1)+m).
  • The subscript p indicates that the value corresponds to the utterance p.
  • The extracted features are the same as in the first embodiment.
  • <Feature storage unit 140> Input: feature sequence X(p). Processing: accumulation of feature sequences. The feature storage unit 140 accumulates the feature sequences X(p) analyzed by the feature analysis unit 130 (S140).
  • The feature conversion unit 150 executes the data augmentation process on the feature sequence X(p) and converts the feature sequence X(p) into the feature sequences R_q(p) (S150).
  • The feature conversion unit 150 converts the feature sequence X(p) into Q-1 feature sequences R_q(p) with different perceptual intensities.
  • For example, the feature conversion unit 150 generates the Q-1 feature sequences R_q(p) from the same original feature sequence X(p) by changing only the segment length.
  • When a feature sequence X(p') corresponding to a certain utterance p' (p' being one of 1, 2, ..., P) is used for training, the data augmentation process is executed and the feature sequence X(p') is converted into the feature sequences R_q(p').
  • P represents the total number of utterances contained in the analog learning audio signal x(k).
  • Since the inflated learning data are used only during training and do not need to be stored, the amount of learning data to be stored can be reduced. Because the input is a feature sequence, all data augmentation processing is performed on the feature space, and no data augmentation processing needs to be performed on the audio digital signal.
  • A conversion rule is adopted such that the audio signal corresponding to the converted feature sequence R_q(p) is an audio signal that can induce an auditory illusion.
  • In the first embodiment, the processing is performed on the audio waveform, whereas in the present embodiment the conversion processing is performed on the feature sequence. Specifically, a conversion rule is adopted that converts the feature sequence into a feature sequence corresponding to an audio signal from which the continuity illusion can be obtained, or to a locally time-reversed audio signal.
  • When converting to a feature sequence corresponding to an audio signal from which the continuity illusion can be obtained, the feature conversion unit 150 deletes a segment of the feature sequence X(p) and embeds features whose values are equal to or greater than the feature values before and after the deleted segment, thereby converting the feature sequence X(p) into the feature sequence R_q(p) (a rough sketch of this feature-space conversion is given after this list).
  • The segment length is a length that can induce the illusion, and Q-1 different lengths are applied.
  • The deletion and embedding are performed at intervals that can induce the auditory illusion.
  • The embedded features are features corresponding to noise, the noise being, for example, white noise.
  • Features corresponding to noise are prepared in advance.
  • For example, from the feature sequence ..., X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p), X(s+6_p), X(s+7_p), X(s+8_p), X(s+9_p), X(s+10_p), ..., the three features X(s+3_p), X(s+4_p), X(s+5_p) are deleted, and three features X(1_n), X(2_n), X(3_n) corresponding to noise are embedded (see FIG. 7). The values of X(1_n), X(2_n), X(3_n) are set to be equal to or greater than the values of the preceding feature X(s+2_p) and the following feature X(s+6_p). This processing is performed, for example, every 20 frames. By changing the length of the deleted and embedded portion from 3 frames to 4 and 5 frames, three feature sequences R_q(p) with different perceptual intensities are obtained from the feature sequence X(p).
  • When converting to a feature sequence corresponding to a locally time-reversed audio signal, the feature conversion unit 150 divides the feature sequence X(p) into segments of a predetermined time length, reverses the features within each segment in time, and concatenates the reversed features, thereby converting the feature sequence X(p) into the feature sequence R_q(p).
  • The segment length is a length that can induce the illusion, and Q-1 different lengths are applied.
  • For example, the feature conversion unit 150 reverses the feature sequence ..., X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p), ... in time within each segment and concatenates the results.
  • The feature sequence X(p) is converted into a feature sequence corresponding to an audio signal from which the continuity illusion can be obtained, into a feature sequence corresponding to a locally time-reversed audio signal, or into both.
  • The processing content of the learning unit 160 is the same as in the first embodiment.
  • A configuration that does not include the learning unit 160 of the model learning device 100 is also referred to as a learning data generation device. That is, the learning data generation device includes the audio signal acquisition unit 110, the audio digital signal storage unit 120, the feature analysis unit 130, the feature storage unit 140, and the feature conversion unit 150.
  • The learning data generation device takes an analog learning audio signal x(k) and a correct label r(j) as input, generates the feature sequence X(p) and the feature sequences R_q(p) from the audio signal x(k), and outputs the combination of the feature sequence X(p), the feature sequences R_q(p), and the correct label r(j) as learning data.
  • In this embodiment, the audio signal corresponding to the feature sequence R_q(p) is an audio signal that can induce an auditory illusion; however, experiments showed that the same effect can be obtained with audio signals that cannot induce an auditory illusion. Therefore, an audio signal that cannot induce an auditory illusion may be prepared as learning data with high perceptual intensity. In other words, it suffices that at least the audio signal with the lowest perceptual intensity among the Q-1 audio signals corresponding to the Q-1 feature sequences R_q(p) is an audio signal that can induce an auditory illusion.
  • In this case, the feature conversion unit 150 deletes a segment of the feature sequence X(p) and embeds features whose values are equal to or greater than the feature values before and after the deleted segment, thereby converting the feature sequence X(p) into the feature sequence R_q(p).
  • The segment length of the deleted or embedded segment may be so long that the continuity illusion does not occur.
  • The interval between the deletion and embedding processes may be so short that the continuity illusion does not occur. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the second embodiment can be trained.
  • Similarly, to convert the feature sequence X(p) into a sequence corresponding to locally time-reversed speech, the feature conversion unit 150 divides the feature sequence X(p) into segments of a predetermined time length, reverses the features within each segment in time, and concatenates the reversed features to convert the feature sequence X(p) into the feature sequence R_q(p).
  • The segment length may be so long that the illusion does not occur. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the second embodiment can be trained.
  • Above, a conversion method was illustrated that generates Q-1 feature sequences from the same original feature sequence X(p) by changing only the length of the time segment to be deleted or the length of the segment to be reversed (hereinafter simply referred to as the segment length); however, as with the signal conversion unit 125 of the first embodiment, the feature conversion unit 150 may perform the conversion in any way, as long as it can convert the original feature sequence into Q-1 feature sequences with different perceptual intensities.
  • For example, one feature sequence R_q may be generated from one time interval V(q) = {X(v_qs), ..., X(v_qe)}.
  • v qs and v qe indicate the first and last frame numbers of the qth time interval, respectively.
  • Alternatively, the feature sequences X(p) corresponding to the original P utterances may be divided into Q-1 groups, the same segment length may be set within each group, and the feature sequences R_q(p) may be generated.
  • The program describing this processing content can be recorded on a computer-readable recording medium.
  • The computer-readable recording medium may be, for example, a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
  • The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes the processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the processing according to it, or it may execute processing according to the received program each time the program is transferred from the server computer to the computer.
  • ASP (Application Service Provider)
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
  • In the present embodiment, the device is configured by executing a predetermined program on a computer, but at least a part of the processing content may be realized by hardware.
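
As a rough illustration of the feature-space augmentation described in the list above (second embodiment), the following sketch shows the two conversions on a feature sequence: deleting a block of frames and embedding noise-like frames whose values are at least as large as the neighboring frames (continuity-illusion style), and reversing the frames within each segment (locally time-reversed style). The frame counts, the noise construction, and the function names are illustrative assumptions, not the publication's implementation.

```python
import numpy as np

def feature_continuity_augment(X, start=10, del_frames=3, seed=0):
    """X: feature sequence of shape (num_frames, feature_dim).
    Delete del_frames frames starting at `start` and embed noise-like frames whose
    values are >= the features just before and after the deleted block
    (feature-space counterpart of the continuity illusion; illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Y = X.copy()
    end = min(start + del_frames, len(Y))
    # Element-wise upper envelope of the neighbouring frames, so the embedded
    # frames are at least as large as the features before and after the block.
    bound = np.maximum(Y[max(0, start - 1)], Y[min(end, len(Y) - 1)])
    Y[start:end] = bound + np.abs(rng.normal(0.0, 0.1, (end - start, X.shape[1])))
    return Y

def feature_local_reverse(X, seg_frames=5):
    """Reverse the frame order within each fixed-length segment and re-concatenate
    (feature-space counterpart of locally time-reversed speech)."""
    pieces = [X[i:i + seg_frames][::-1] for i in range(0, len(X), seg_frames)]
    return np.concatenate(pieces, axis=0)
```

Applying these with Q-1 different values of del_frames or seg_frames (for example 3, 4, and 5 frames, as in the example above) would give feature sequences R_q(p) with different perceptual intensities.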

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Provided is a training data generation device, and related methods and programs, that generate training data for training an acoustic model that simulates the robustness of human speech perception. With Q being an integer of 2 or more, the training data generation device includes a signal conversion unit that converts a first training audio signal into second training audio signals, namely Q-1 audio signals with different perceptual intensities. Among the Q-1 second training audio signals, at least the second training audio signal with the lowest perceptual intensity is an audio signal capable of inducing auditory illusions.

Description

Training data generation device, model training device, training data generation method, model training method, and program
 The present invention relates to a learning data generation device that generates learning data used when training an acoustic model used in a speech recognition device, a model learning device that uses the learning data, a learning data generation method, a model learning method, and a program.
 In speech recognition devices using acoustic models, Patent Document 1 describes a technique for adapting the acoustic model to the task to be recognized in order to ensure a practical level of speech recognition performance. In other words, Patent Document 1 describes a technique for adapting an original acoustic model to a task whose acoustic characteristics, such as speaker, noise type, and speaking style, differ. In general, speech recognition performance rises or falls depending on the amount of learning data for the target task and on its acoustic coverage. Therefore, desired learning data are usually collected by gathering a sufficient amount of speech for the target task and transcribing it.
 However, this conventional approach has the problem of requiring enormous financial and time costs.
 Data Augmentation is one solution to this problem. Data augmentation adds some variation to the original learning data to generate new learning data, thereby inflating the training set. This reduces repeated training on identical samples and yields better generalization performance.
 For example, in Non-Patent Document 1, various speaker data are generated by changing the speaking rate of the original data, improving generalization to a wider range of speakers.
 In Non-Patent Document 2, to improve noise robustness and recognition performance for reverberant speech, noise is superimposed on the original learning data, pseudo reverberant speech is generated by convolving the impulse response of a highly reverberant room, and the reverberant speech is superimposed on the original learning data to improve generalization performance.
Patent Document 1: Japanese Unexamined Patent Publication No. 2007-249051
 Here, data augmentation that enables a speech recognition device to capture long-term information is considered. First, speech recognition devices and long-term information are explained. There are many reports that incorporating long-term information into a speech recognition device makes it robust to various acoustic events and improves speech recognition accuracy.
 For example, unlike the multi-layer perceptron (MLP) model, the recurrent neural network (RNN) model is designed to explicitly capture long-term information, and it greatly improved accuracy on tasks that handle time-series information, such as speech recognition.
 In Reference 1, speech recognition accuracy is improved by explicitly incorporating long-range linguistic context into an end-to-end speech recognition model.
(Reference 1) R. Masumura, T. Tanaka, T. Moriya, Y. Shinohara, T. Oba and Y. Aono, "Large Context End-to-end Automatic Speech Recognition via Extension of Hierarchical Recurrent Encoder-decoder Models", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 5661-5665.
 As described above, there are various methods for incorporating long-term information into a speech recognition device by devising the model. However, no approach has so far existed that obtains a speech recognition device incorporating long-term information by devising the learning data itself.
 In the present invention, the learning data itself is devised. By training an acoustic model with learning data generated by the learning data generation device according to the present invention, it is possible to acquire the robustness to speech perception that humans acquire innately or through experience. In this embodiment, learning data are generated by executing a data augmentation process that exploits auditory illusions. Furthermore, the model learning device according to the present invention further improves the performance of the acoustic model through curriculum learning based on the perceptual intensity of the auditory illusion. The learning data generation device according to the present invention generates learning data suitable for curriculum learning based on perceptual intensity.
 An object of the present invention is to provide a learning data generation device that generates learning data for training an acoustic model that simulates the robustness of human speech perception, a model learning device that trains an acoustic model using the learning data, a learning data generation method, a model learning method, and a program.
 To solve the above problem, according to one aspect of the present invention, a learning data generation device generates learning data used when training the acoustic model used in a speech recognition device. With Q being an integer of 2 or more, the learning data generation device includes a signal conversion unit that converts a first learning audio signal into second learning audio signals, namely Q-1 audio signals with different perceptual intensities; among the Q-1 second learning audio signals, at least the second learning audio signal with the lowest perceptual intensity is an audio signal that can induce an auditory illusion.
 To solve the above problem, according to another aspect of the present invention, a learning data generation device generates learning data used when training an acoustic model used in a speech recognition device. With Q being an integer of 2 or more, the learning data generation device includes a feature conversion unit that converts a first feature sequence, which is the acoustic feature sequence obtained from a first learning audio signal, into Q-1 second feature sequences; the Q-1 second learning audio signals corresponding to the Q-1 second feature sequences are Q-1 audio signals with different perceptual intensities, and at least the audio signal with the lowest perceptual intensity among them is an audio signal that can induce an auditory illusion.
 According to the present invention, an acoustic model that simulates the robustness of human speech perception can be trained.
FIG. 1 is a functional block diagram of the model learning device according to the first embodiment.
FIG. 2 shows an example of the processing flow of the model learning device according to the first embodiment.
FIG. 3 shows an example of conversion into an audio signal from which the continuity illusion can be obtained.
FIG. 4 shows an example of conversion into a locally time-reversed audio signal.
FIG. 5 is a functional block diagram of the model learning device according to the second embodiment.
FIG. 6 shows an example of the processing flow of the model learning device according to the second embodiment.
FIG. 7 shows an example of conversion into a feature sequence corresponding to an audio signal from which the continuity illusion can be obtained.
FIG. 8 shows an example of conversion into a feature sequence corresponding to a locally time-reversed audio signal.
FIG. 9 shows a configuration example of a computer to which the present method is applied.
 Hereinafter, embodiments of the present invention will be described. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. In the following description, processing performed for each element of a vector or matrix is applied to all elements of that vector or matrix unless otherwise specified.
<Points of the first embodiment>
 In this embodiment, a data augmentation process using auditory illusions is executed so that the speech recognition device acquires the robustness of speech perception that humans possess.
 An auditory illusion is an illusory phenomenon in which, owing to the characteristics of human hearing, a physically presented sound stimulus is not necessarily perceived as it is; it can be regarded as the auditory counterpart of an optical illusion.
 For example, in the continuity illusion, a portion of a frequency-varying pure tone or of speech is deleted, and noise that sufficiently masks the original sound is superimposed on the deleted portion; the sound interval that should physically be missing is then perceived as if it were restored (see Reference 2).
(Reference 2) R. M. Warren: "Perceptual restoration of missing speech sounds", Science, 167, pp. 392-393 (1970).
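
As a rough illustration only (not the publication's own procedure), the following Python sketch shows one way a continuity-illusion-style stimulus could be produced from a waveform: a short interval is deleted and filled with white noise loud enough to mask it. The sample rate, gap position, gap length, and noise margin are illustrative assumptions.

```python
import numpy as np

def continuity_illusion_augment(x, fs=16000, start_s=0.5, gap_ms=200, margin_db=6.0, seed=0):
    """Delete a gap_ms-long interval from waveform x (sampled at fs) and fill it
    with white noise whose RMS level exceeds the surrounding signal by margin_db,
    so that the deleted portion is plausibly masked (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y = x.astype(np.float64).copy()
    start = int(start_s * fs)
    length = int(gap_ms / 1000.0 * fs)
    end = min(start + length, len(y))

    # Estimate the local signal level just before and after the deleted interval.
    context = np.concatenate([y[max(0, start - length):start], y[end:end + length]])
    rms = np.sqrt(np.mean(context ** 2)) if context.size else 1e-3

    # White noise a few dB above the surrounding speech level.
    noise_rms = rms * 10 ** (margin_db / 20.0)
    y[start:end] = rng.normal(0.0, noise_rms, end - start)
    return y
```

Varying gap_ms (for example 100 ms, 200 ms, and 300 ms, as in the first embodiment) would yield variants with different perceptual intensities.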
 Locally time-reversed speech is obtained by cutting the speech waveform into segments of a certain short time length, reversing the waveform of each segment on the time axis, and re-concatenating the reversed segments (see Reference 3).
(Reference 3) K. Saberi and D. R. Perrott, "Cognitive restoration of reversed speech", Nature, 398, 6730, pp. 760-760 (1999).
 When humans listen to such locally time-reversed speech, intelligibility remains sufficiently high as long as the segment length is relatively short, for example around 25 ms. However, as the segment length increases, intelligibility decreases in a sigmoid-like manner, and it has been shown experimentally that speech perception becomes nearly impossible at around 100 ms. In other words, local destruction of the time series up to a certain degree does not affect human speech perception (perception is robust to it).
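
A minimal sketch of locally time-reversed speech, assuming a plain NumPy waveform; the segment length in milliseconds is the parameter that controls perceptual intensity:

```python
import numpy as np

def locally_time_reverse(x, fs=16000, seg_ms=25):
    """Cut waveform x into fixed-length segments, reverse each segment on the
    time axis, and concatenate the reversed segments (locally time-reversed
    speech). Illustrative sketch: around 25 ms intelligibility stays high,
    while around 100 ms speech becomes hard to perceive (Reference 3)."""
    seg = max(1, int(seg_ms / 1000.0 * fs))
    pieces = [x[i:i + seg][::-1] for i in range(0, len(x), seg)]
    return np.concatenate(pieces)
```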
 If an acoustic model is trained using audio signals that can produce such a continuity illusion, or locally time-reversed audio signals, the model is naturally trained to take into account time intervals longer than the deleted or masked portion, or longer than the reversed segment; the acoustic model thereby incorporates long-term information and acquires the robustness of speech perception that humans possess.
 In this embodiment, by using such auditory-illusion speech waveforms as augmented data, an acoustic model that is robust with respect to long-term information can be trained from the learning data. Among auditory illusions, this embodiment uses the continuity illusion and locally time-reversed speech described above.
 With the continuity illusion, the speech recognition device can acquire the human robustness of perceiving speech even when a portion is missing; as a result, a speech recognition device that is robust to long-term information is constructed.
 With time-reversed speech, the speech recognition device can acquire the human robustness of perceiving speech even when the time series is locally reversed (destroyed); as a result, a speech recognition device that is robust to long-term information is constructed.
 In addition, further performance improvement is realized during training through curriculum learning.
 Curriculum learning is a method in which the difficulty of each learning data sample is determined in advance according to some criterion, and training proceeds gradually from easy samples to difficult samples. It has been shown experimentally that curriculum learning accelerates convergence to the optimal solution and leads to a better local optimum (see Reference 4).
(Reference 4) Bengio, Y., et al.: "Curriculum learning", in ICML, pp. 41-48 (2009)
 For example, for a language model task, it has been reported that gradually increasing the vocabulary size of the learning data improves performance compared with ordinary training. This embodiment also adopts this learning method to achieve further performance improvement.
 Specifically, since auditory illusions have parameters that can control human perceptual intensity, curriculum learning can be realized by defining the perceptual intensity as the difficulty of the learning data. Controlling the perceptual intensity means controlling how easy or difficult the speech is to perceive: the lower the perceptual intensity, the easier the speech is to perceive, and the less difficult the learning data sample.
 例えば、連続長効果では、欠如させる時間セグメント長を短くした学習データサンプルから、時間セグメント長を長くした学習データサンプルに段階的に変化させていくことで、タスクの難易度を上げていくことが可能になる。 For example, in the continuous length effect, it is possible to increase the difficulty of the task by gradually changing from the learning data sample with a short time segment length to the training data sample with a long time segment length. It will be possible.
 With time-reversed speech, the perceptual intensity can be manipulated by the length of the reversed segments, and it has often been shown experimentally that perception generally becomes more difficult as the segments become longer. Therefore, in curriculum learning as well, the difficulty of the task can be raised by gradually changing from learning data samples with short reversed segments to samples with long reversed segments.
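
As a sketch of how such a curriculum could be organized (the concrete training loop is an assumption; train_one_pass and the data layout are hypothetical names), the augmented variants are simply presented to the optimizer in order of increasing perceptual intensity q:

```python
def curriculum_train(model, datasets_by_q, train_one_pass):
    """datasets_by_q: dict mapping the perceptual-intensity index q
    (q = 1 is the original, easiest data) to a list of
    (feature_sequence, label_sequence) pairs.
    train_one_pass: function that updates the model on one dataset.

    Training proceeds from low to high perceptual intensity, i.e. from easy
    samples (short deleted/reversed segments) to difficult samples."""
    for q in sorted(datasets_by_q):   # q = 1, 2, ..., Q
        model = train_one_pass(model, datasets_by_q[q])
    return model
```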
<First embodiment>
 FIG. 1 shows a functional block diagram of the model learning device according to the first embodiment, and FIG. 2 shows its processing flow.
 The model learning device 100 includes an audio signal acquisition unit 110, an audio digital signal storage unit 120, a signal conversion unit 125, an audio digital signal storage unit 126, a feature analysis unit 130, a feature storage unit 140, and a learning unit 160.
 The model learning device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The model learning device executes each process under the control of the central processing unit, for example. The data input to the model learning device and the data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used for other processing. At least a part of each processing unit of the model learning device may be implemented by hardware such as an integrated circuit. Each storage unit of the model learning device can be implemented, for example, by a main storage device such as RAM, an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.
 The model learning device takes an analog learning audio signal x(k) and the corresponding correct label r(j) as input, trains an acoustic model based on this information, and outputs the trained acoustic model f. Here, k is an index indicating time. The correct label is, for example, a phoneme label, and j is an index indicating the order of the phonemes. Information indicating which portion of the analog audio signal each correct label (phoneme label) corresponds to is assumed to be included in the learning data in advance, and the audio digital signal and the features described below are assumed to be linked in the same way.
 以下、各部の処理内容について説明する。 The processing contents of each part will be explained below.
<Audio signal acquisition unit 110>
Input: audio signal x(k)
Output: audio digital signal x(t)
Processing: A/D conversion
The audio signal acquisition unit 110 acquires the analog audio signal x(k) and converts it into the digital audio signal x(t) (S110). Here, t is an index indicating the sample number of the audio digital signal.
<Audio digital signal storage unit 120>
Input: audio digital signal x(t)
Processing: storage of the audio digital signal
The audio digital signal storage unit 120 stores the audio digital signal x(t) (S120).
<Signal conversion unit 125>
Input: audio digital signal x(t)
Output: audio digital signals r_q(t)
Processing: Data Augmentation
The signal conversion unit 125 performs Data Augmentation on the audio digital signal x(t) and converts it into audio digital signals r_q(t) (S125). Here, q is an index indicating the perceptual intensity, and a larger q means a higher perceptual intensity; training data with a lower perceptual intensity is easier to perceive as speech. The perceptual intensity is set in Q levels. To perform curriculum learning, Q is an integer of 2 or more and q = 2, 3, ..., Q. The signal conversion unit 125 converts the audio digital signal x(t) into Q-1 audio digital signals r_q(t) with different perceptual intensities. Since the training data with the lowest perceptual intensity is the audio digital signal x(t) corresponding to the original training data, we also write x(t) = r_1(t) and denote the Q audio digital signals with different perceptual intensities as r_q(t), q = 1, 2, ..., Q. For example, the signal conversion unit 125 generates the Q-1 audio digital signals r_q(t) from the same original audio digital signal x(t) by changing only the length of the time segment to be removed for the continuous listening effect or the length of the segment to be reversed for locally time-reversed speech (hereinafter simply referred to as the segment length).
The Data Augmentation in this embodiment converts the audio digital signal x(t) into the audio digital signals r_q(t) (q = 2, 3, ..., Q) according to a conversion rule, thereby generating pseudo, inflated training data.
In this embodiment, a conversion rule is adopted such that the converted audio digital signal r_q(t) is an audio signal that can give rise to an auditory illusion.
Specifically, the conversion rule converts the signal into an audio signal that produces the continuous listening effect described above, or into an audio signal that becomes locally time-reversed speech.
(i) When converting into an audio signal that produces the continuous listening effect, the signal conversion unit 125 removes a portion (of fixed duration) of the audio digital signal x(t) along the time axis and fills the removed portion with noise whose level is at or above the sound pressure, at each frequency, of the signal immediately before and after the removed portion, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t) (see FIG. 3). The lengths of the removed and filled portions are lengths that can give rise to an auditory illusion, and Q-1 different lengths are used. The removal and filling are performed at intervals that can give rise to an auditory illusion. The noise to be embedded is, for example, white noise, and is prepared before step S125. For example, by changing the length of the removed and filled portions from 100 ms to 200 ms and 300 ms, three audio digital signals r_q(t) with different perceptual intensities are obtained from the audio digital signal x(t).
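As an illustration only, the following is a minimal Python (NumPy) sketch of conversion (i). The function name, the use of uniform white noise scaled to the peak level of the neighbouring signal, and the fixed processing interval are assumptions made for the sketch; the embodiment only requires that the removed portion be filled with noise at or above the surrounding sound pressure, at intervals that can give rise to an auditory illusion.

```python
import numpy as np

def continuity_effect_augment(x, fs, seg_ms=200, interval_ms=1000, rng=None):
    """Remove seg_ms-long portions of the waveform x every interval_ms and
    fill each gap with white noise at or above the level of the surrounding
    signal (sketch of conversion (i))."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(x, dtype=np.float64).copy()
    seg = int(fs * seg_ms / 1000)
    step = int(fs * interval_ms / 1000)
    for start in range(step, len(y) - seg, step):
        # level of the signal just before and after the removed portion
        ctx = np.concatenate([y[max(0, start - seg):start],
                              y[start + seg:start + 2 * seg]])
        level = np.max(np.abs(ctx)) if ctx.size else 1.0
        y[start:start + seg] = rng.uniform(-level, level, size=seg)
    return y
```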
(ii) When converting into an audio signal that becomes locally time-reversed speech, the signal conversion unit 125 cuts the audio digital signal x(t) into waveform segments of a fixed, short time-window width, reverses each segment on the time axis, and then concatenates the reversed segments, thereby converting the audio digital signal x(t) into the audio digital signal r_q(t) (see FIG. 4). The segment length used for cutting is a length that can give rise to an auditory illusion, and Q-1 different lengths are used. For example, by changing the length of the reversed segments from 20 ms to 40 ms and 60 ms, three audio digital signals r_q(t) with different perceptual intensities are obtained from the audio digital signal x(t).
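The following is a similarly minimal sketch of conversion (ii); the function name and the default 20 ms window are illustrative assumptions, and the final window is simply reversed as-is even if it is shorter than the chosen segment length.

```python
import numpy as np

def locally_time_reverse(x, fs, seg_ms=20):
    """Cut the waveform x into windows of seg_ms, reverse each window on the
    time axis, and concatenate the reversed windows (sketch of conversion (ii))."""
    x = np.asarray(x)
    seg = max(1, int(fs * seg_ms / 1000))
    pieces = [x[i:i + seg][::-1] for i in range(0, len(x), seg)]
    return np.concatenate(pieces) if pieces else x
```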
In this embodiment, the audio digital signal x(t) is converted into audio signals that produce the continuous listening effect, into audio signals that become locally time-reversed speech, or into both.
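Under the same assumptions, the two sketches above can be combined to produce the Q-1 signals r_q(t), q = 2, ..., Q, from one original signal by varying only the segment length, matching the 100/200/300 ms and 20/40/60 ms examples in the text. The helper name make_augmented_signals is hypothetical, and the snippet reuses the functions and the NumPy import from the sketches above.

```python
def make_augmented_signals(x, fs, mode="reverse"):
    """Return [r_1, r_2, ..., r_Q]: the original signal followed by Q-1
    variants of increasing perceptual intensity (here Q-1 = 3)."""
    if mode == "continuity":
        seg_lengths_ms = [100, 200, 300]   # conversion (i)
        convert = lambda s: continuity_effect_augment(x, fs, seg_ms=s)
    else:
        seg_lengths_ms = [20, 40, 60]      # conversion (ii)
        convert = lambda s: locally_time_reverse(x, fs, seg_ms=s)
    return [np.asarray(x)] + [convert(s) for s in seg_lengths_ms]
```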
<Audio digital signal storage unit 126>
Input: audio digital signals r_q(t)
Processing: storage of the audio digital signals
The audio digital signal storage unit 126 stores the audio digital signals r_q(t) (S126).
<Feature analysis unit 130>
Input: audio digital signals x(t), r_q(t)
Output: feature sequences X, R_q
Processing: feature analysis
The feature analysis unit 130 performs feature analysis on the audio digital signals x(t) and r_q(t) to obtain the feature sequences X and R_q.
For example, the feature analysis unit 130 takes the audio digital signal x(t) from the audio digital signal storage unit 120 and the audio digital signals r_q(t) from the audio digital signal storage unit 126, divides each of x(t) and r_q(t) into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequences X and R_q.
For example, let N be the total number of frames contained in the audio digital signal x(t), n = 1, 2, ..., N, let M be the frame length, m = 1, 2, ..., M, and let D be the shift width. The m-th sample of the n-th frame of the audio digital signal x(t) can then be written as x(D(n-1)+m). The feature analysis unit 130 extracts acoustic features from the samples x(D(n-1)+1), x(D(n-1)+2), ..., x(D(n-1)+M) of each frame n to obtain the feature X(n). Processing all frames 1, 2, ..., N gives the feature sequence X = {X(1), X(2), ..., X(N)}. The feature analysis unit 130 performs the same processing on the audio digital signals r_q(t) to obtain the feature sequences R_q = {R_q(1), R_q(2), ..., R_q(N)}.
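To make the indexing concrete, the following sketch frames a signal with frame length M and shift D so that the m-th sample of frame n is x[D(n-1)+m] (0-indexed in the code). The log-energy feature is only a stand-in for the MFCC-based features listed next, and the default M and D assume 16 kHz audio with 25 ms frames and a 10 ms shift; all of these are assumptions of the sketch.

```python
import numpy as np

def frame_signal(x, M, D):
    """Split x into overlapping frames of length M with shift D."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < M:
        return np.empty((0, M))
    N = 1 + (len(x) - M) // D              # total number of full frames
    return np.stack([x[D * n: D * n + M] for n in range(N)])

def analyze(x, M=400, D=160):
    """Sketch of the feature analysis: one feature vector per frame.
    Log energy stands in for the MFCC/power features named in the text."""
    frames = frame_signal(x, M, D)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]   # series X
```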
The features to be extracted include, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficient) based on short-time frame analysis of the audio signal, dynamic parameters such as the corresponding dynamic features ΔMFCC and ΔΔMFCC, and power, Δpower, ΔΔpower, and so on. CMN (cepstral mean normalization) may also be applied to the MFCC. The features are not limited to MFCC and power; parameters used for identifying special utterances (for example, autocorrelation peak values or group delay) may also be used.
<Feature storage unit 140>
Input: feature sequences X, R_q
Processing: storage of the feature sequences
The feature storage unit 140 stores the feature sequences X and R_q analyzed by the feature analysis unit 130 (S140).
<Learning unit 160>
Input: feature sequence X, feature sequences R_q, correct label r(j)
Output: trained acoustic model f
Processing: model training
The learning unit 160 trains the acoustic model f by curriculum learning based on perceptual intensity, using the feature sequence X, the feature sequences R_q, and the correct label r(j) (S160). That is, the learning unit 160 first trains the acoustic model f using the feature sequence X and the correct label r(j), and then continues training the acoustic model f while adding the feature sequences R_q to the training data step by step, in the order R_2, R_3, ..., R_Q, as training progresses. The acoustic model f is a model that takes a feature sequence as input and outputs phoneme labels. As acoustic models for speech recognition, GMM-HMM and DNN-HMM models are often used, and in recent years end-to-end speech recognition models have also been used; in this embodiment there is no particular restriction on the speech recognition model to be trained, so it may be a GMM/DNN-HMM or an end-to-end speech recognition model. The correct label r(j) corresponds to the analog audio signal x(k) for training, and therefore also corresponds to the feature sequence X obtained from the audio signal x(k) and to the feature sequences R_q obtained by converting the feature sequence X.
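The schedule itself can be illustrated with the following sketch, which starts from the original feature sequence X and adds R_2, R_3, ..., R_Q stage by stage. scikit-learn's SGDClassifier with partial_fit is used here purely as a stand-in for the acoustic model f (a GMM/DNN-HMM or end-to-end model in practice), and the frame-level label array and the number of passes per stage are assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def curriculum_train(X, R_list, y, classes, passes_per_stage=3):
    """Curriculum learning based on perceptual intensity: train on X first,
    then keep adding R_2, R_3, ... (R_list[0], R_list[1], ...) to the pool.

    X      : (frames, dims) original feature sequence
    R_list : list of (frames, dims) converted feature sequences R_2..R_Q
    y      : (frames,) frame-level labels aligned with X (and each R_q)
    classes: array of all label values (required by partial_fit)
    """
    model = SGDClassifier()
    pool_X, pool_y = [X], [y]
    for stage in range(len(R_list) + 1):
        if stage > 0:
            pool_X.append(R_list[stage - 1])   # add R_{stage+1} to the pool
            pool_y.append(y)
        Xs, ys = np.vstack(pool_X), np.concatenate(pool_y)
        for _ in range(passes_per_stage):
            model.partial_fit(Xs, ys, classes=classes)
    return model
```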
<Effects>
With the above configuration, an acoustic model that mimics the robustness of human speech perception can be trained, and financial and time costs can be reduced.
<Modification example>
A configuration obtained by removing the feature analysis unit 130, the feature storage unit 140, and the learning unit 160 from the model learning device 100 is also referred to as a training data generation device. That is, the training data generation device includes the audio signal acquisition unit 110, the audio digital signal storage unit 120, the signal conversion unit 125, and the audio digital signal storage unit 126. The training data generation device receives the analog audio signal x(k) for training and the correct label r(j) as input, generates the audio digital signal x(t) and the audio digital signals r_q(t) from the audio signal x(k), and outputs the combination of the audio digital signal x(t), the audio digital signals r_q(t), and the correct label r(j) as training data.
The first embodiment assumes that the audio digital signals r_q(t) are audio signals that can give rise to an auditory illusion, but experiments have shown that a similar effect can be obtained even with audio signals that cannot give rise to an auditory illusion. Therefore, audio signals that cannot give rise to an auditory illusion may be prepared as training data with high perceptual intensity. In other words, it suffices that, among the Q-1 audio digital signals r_q(t), at least the audio digital signal r_2(t) with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
For example, in the first embodiment, to obtain the continuous listening effect, the signal conversion unit 125 removes a portion of the audio digital signal x(t) along the time axis and fills the removed portion with noise whose level is at or above the sound pressure, at each frequency, of the signal before and after the removed portion, thereby converting x(t) into r_q(t). When converting into training data with high perceptual intensity, the lengths of the removed and filled portions may be so long that the continuous listening effect cannot arise, and the interval between the removal and filling operations may be so short that the continuous listening effect cannot arise. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the first embodiment can be trained.
Likewise, in the first embodiment, to convert into an audio signal that becomes locally time-reversed speech, the signal conversion unit 125 cuts the audio digital signal x(t) into waveform segments of a fixed, short time-window width, reverses each segment on the time axis, and concatenates the reversed segments to convert x(t) into r_q(t). When converting into training data with high perceptual intensity, the length of the cut waveform segments may be so long that an auditory illusion cannot arise. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the first embodiment can be trained.
Furthermore, the first embodiment illustrates a conversion method in which the Q-1 audio digital signals r_q(t) are generated from the same original audio digital signal x(t) by changing only the length of the time segment to be removed or the length of the segment to be reversed (hereinafter simply referred to as the segment length); however, the signal conversion unit 125 may perform the conversion in any manner as long as it can convert the original audio digital signal into Q-1 audio digital signals with different perceptual intensities. For example, the original audio digital signal x(t) may be divided into Q-1 time intervals V(2) = {x(v_2s), ..., x(v_2e)}, V(3) = {x(v_3s), ..., x(v_3e)}, ..., V(Q) = {x(v_Qs), ..., x(v_Qe)}, and one audio digital signal r_q may be generated from each time interval V(q) = {x(v_qs), ..., x(v_qe)}, where v_qs and v_qe denote the sample numbers of the first and last samples of the q-th time interval, respectively. Alternatively, the utterances contained in the original audio digital signal x(t) may be divided into Q-1 groups, and a segment length may be set for each group to generate the audio digital signals r_q(t).
<Points of the Second Embodiment>
In this embodiment, instead of performing the illusion-based Data Augmentation of the first embodiment on the speech waveform, the Data Augmentation is performed in the feature space, which makes it possible to build a speech recognition device that is robust to long-span information from the training data. When Data Augmentation is performed on the speech waveform as in the first embodiment, the amount of training data simply becomes Q times larger, and storing that data requires Q times the capacity, including the original data. By performing Data Augmentation in the feature space, however, the features serving as training data can be converted during training itself, so the required data capacity is only that of the original training data.
Among auditory illusions, the continuous listening effect and locally time-reversed speech described above are taken up in this embodiment as examples that can be processed in the feature space.
With the continuous listening effect, the speech recognition device can be made to acquire the robustness with which humans perceive speech even when part of it is missing, and as a result a speech recognition device that is robust to long-span information is built. A similar representation is possible in the feature space: for example, deleting a segment on the time axis of the feature sequence and embedding values equal to or greater than the feature values before and after that segment in its place is equivalent to the continuous listening effect.
With locally time-reversed speech, the speech recognition device can be made to acquire the robustness with which humans perceive speech even when the time series is locally reversed (disrupted), and as a result a speech recognition device that is robust to long-span information is built. Similarly, to obtain a comparable representation in the feature space, the feature sequence is reversed within each segment along the time axis of the features, and the data obtained by re-concatenating all the segments is used as augmented data.
<Second Embodiment>
The description below focuses on the differences from the first embodiment.
FIG. 5 shows a functional block diagram of the model learning device according to the second embodiment, and FIG. 6 shows its processing flow.
The model learning device 100 includes an audio signal acquisition unit 110, an audio digital signal storage unit 120, a feature analysis unit 130, a feature storage unit 140, a feature conversion unit 150, and a learning unit 160.
The processing performed by each unit is described below.
The processing of the audio signal acquisition unit 110 and the audio digital signal storage unit 120 is the same as in the first embodiment.
<Feature analysis unit 130>
Input: audio digital signal x(t)
Output: feature sequences X(p)
Processing: feature analysis
The feature analysis unit 130 performs feature analysis on the audio digital signal x(t) to obtain the feature sequences X(p).
For example, the feature analysis unit 130 takes the audio digital signal x(t) of each utterance p from the audio digital signal storage unit 120, divides it into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequence X(p) for each utterance p.
For example, let N_p be the total number of frames contained in an utterance p, n_p = 1_p, 2_p, ..., N_p, let M be the frame length, m = 1, 2, ..., M, and let D be the shift width. The m-th sample of the n_p-th frame of the utterance p can then be written as x(D(n_p-1)+m), where the subscript p indicates a value associated with the utterance p. The feature analysis unit 130 extracts acoustic features from the samples x(D(n_p-1)+1), x(D(n_p-1)+2), ..., x(D(n_p-1)+M) of each frame n_p to obtain the feature X(n_p). Processing all frames 1_p, 2_p, ..., N_p contained in the utterance p gives the feature sequence X(p) = {X(1_p), X(2_p), ..., X(N_p)} for each utterance p.
The features to be extracted are the same as in the first embodiment.
<Feature storage unit 140>
Input: feature sequences X(p)
Processing: storage of the feature sequences
The feature storage unit 140 stores the feature sequences X(p) analyzed by the feature analysis unit 130 (S140).
<Feature conversion unit 150>
Input: feature sequence X(p)
Output: feature sequences R_q(p)
Processing: Data Augmentation
The feature conversion unit 150 performs Data Augmentation on the feature sequence X(p) and converts it into the feature sequences R_q(p) (S150). In other words, the feature conversion unit 150 converts the feature sequence X(p) into Q-1 feature sequences R_q(p) with different perceptual intensities. For example, the feature conversion unit 150 generates the Q-1 feature sequences R_q(p) from the same original feature sequence X(p) by changing only the segment length.
The Data Augmentation is performed online, at the same time as the training in the learning unit 160 described later. More specifically, rather than applying Data Augmentation in advance to the feature sequences X(p) corresponding to all utterances p (here p = 1, 2, ..., P) used by the learning unit 160, the feature sequence X(p') corresponding to a given utterance p' (p' being one of 1, 2, ..., P) is converted into the feature sequences R_q(p') at the moment it is used for training. Here, P denotes the total number of utterances contained in the analog audio signal x(k) for training. The inflated training data is used only during training and does not need to be stored, so the amount of training data to be stored can be reduced. Since the input is a feature sequence, all Data Augmentation is performed in the feature space, and there is no need to perform Data Augmentation on the audio digital signal.
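A minimal sketch of this online augmentation is shown below as a Python generator. The names feature_store, labels, and feature_augment (which could be either of the feature-space conversions sketched under (i) and (ii) below) are hypothetical and introduced only for illustration.

```python
def online_augmented_batches(feature_store, labels, feature_augment, q_max):
    """Yield (features, labels) pairs, converting the stored original feature
    sequence X(p) into R_q(p) only at training time, so that only the original
    data ever needs to be kept."""
    for p, X_p in feature_store.items():       # X_p: (frames, dims) array
        yield X_p, labels[p]                    # q = 1: the original sequence
        for q in range(2, q_max + 1):
            yield feature_augment(X_p, q), labels[p]   # R_q(p), built on the fly
```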
The Data Augmentation in this embodiment converts the feature sequence X(p) into the feature sequences R_q(p) (q = 2, 3, ..., Q) according to a conversion rule, thereby generating pseudo, inflated training data.
In this embodiment, a conversion rule is adopted such that the audio signal corresponding to the converted feature sequence R_q(p) is an audio signal that can give rise to an auditory illusion. In general, an audio signal that can give rise to an auditory illusion is produced by processing the speech waveform, but in this embodiment the conversion is performed on the feature sequence.
Specifically, the conversion rule converts the feature sequence into a feature sequence corresponding to an audio signal that produces the continuous listening effect described above, or to an audio signal that becomes locally time-reversed speech.
(i) When converting into a feature sequence corresponding to an audio signal that produces the continuous listening effect, the feature conversion unit 150 deletes a segment of the feature sequence X(p) and embeds, in the deleted portion, features whose values are equal to or greater than the feature values before and after the deleted segment, thereby converting the feature sequence X(p) into the feature sequence R_q(p). The segment length is a length that can give rise to an auditory illusion, and Q-1 different lengths are used; the deletion and embedding are performed at intervals that can give rise to an auditory illusion. The embedded features are features corresponding to noise, for example white noise, and are prepared before step S150. For example, among the features ..., X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p), X(s+6_p), X(s+7_p), X(s+8_p), X(s+9_p), X(s+10_p), ..., the three features X(s+3_p), X(s+4_p), X(s+5_p) are deleted and three noise features X(1_n), X(2_n), X(3_n) are embedded in their place (see FIG. 7). The values of X(1_n), X(2_n), X(3_n) are set to be equal to or greater than the values of the preceding feature X(s+2_p) and the following feature X(s+6_p). This processing is performed, for example, every 20 frames. By changing the length of the deleted and embedded portion from 3 frames to 4 and 5 frames, for example, three feature sequences R_q(p) with different perceptual intensities are obtained from the feature sequence X(p).
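A minimal sketch of this feature-space version of conversion (i) follows. The Gaussian offset used to keep the embedded "noise" frames at or above the neighbouring feature values is an assumption of the sketch, as are the default segment length (3 frames) and interval (20 frames) taken from the example above.

```python
import numpy as np

def feature_continuity_augment(X, seg_frames=3, interval_frames=20, rng=None):
    """Every interval_frames frames, overwrite seg_frames consecutive feature
    vectors with values not below the features just before and after the gap."""
    rng = rng or np.random.default_rng(0)
    Y = np.asarray(X, dtype=np.float64).copy()
    for s in range(interval_frames, len(Y) - seg_frames, interval_frames):
        ceiling = np.maximum(Y[s - 1], Y[s + seg_frames])        # neighbouring frames
        noise = np.abs(rng.normal(0.0, 0.1, size=(seg_frames, Y.shape[1])))
        Y[s:s + seg_frames] = ceiling + noise                     # >= both neighbours
    return Y
```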
(ii) When converting into a feature sequence corresponding to an audio signal that becomes locally time-reversed speech, the feature conversion unit 150 divides the feature sequence X(p) into segments of a predetermined time length, reverses the feature sequence within each segment in time, and concatenates the reversed segments, thereby converting the feature sequence X(p) into the feature sequence R_q(p). The segment length is a length that can give rise to an auditory illusion, and Q-1 different lengths are used. For example, the feature conversion unit 150 divides the features ..., X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p), X(s+6_p), X(s+7_p), X(s+8_p), X(s+9_p), X(s+10_p), ... into segments of five frames, ..., s(1) = {X(s+1_p), X(s+2_p), X(s+3_p), X(s+4_p), X(s+5_p)}, s(2) = {X(s+6_p), X(s+7_p), X(s+8_p), X(s+9_p), X(s+10_p)}, ... . The feature conversion unit 150 then reverses the feature sequence within each segment in time, giving ..., s'(1) = {X(s+5_p), X(s+4_p), X(s+3_p), X(s+2_p), X(s+1_p)}, s'(2) = {X(s+10_p), X(s+9_p), X(s+8_p), X(s+7_p), X(s+6_p)}, ..., and concatenates them in the order ..., s'(1), s'(2), ... (see FIG. 8). By changing the length of the reversed portion from 5 frames to 6 and 7 frames, for example, three feature sequences R_q(p) with different perceptual intensities are obtained from the feature sequence X(p).
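The feature-space version of conversion (ii) reduces to reversing the frame order within each segment, as in the following sketch; the default segment length of 5 frames follows the example above and is otherwise an arbitrary choice.

```python
import numpy as np

def feature_time_reverse(X, seg_frames=5):
    """Reverse the frame order inside every seg_frames-long segment of the
    feature sequence X and concatenate the reversed segments again."""
    X = np.asarray(X)
    parts = [X[i:i + seg_frames][::-1] for i in range(0, len(X), seg_frames)]
    return np.concatenate(parts, axis=0) if parts else X
```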
In this embodiment, the feature sequence X(p) is converted into feature sequences corresponding to audio signals that produce the continuous listening effect, into feature sequences corresponding to audio signals that become locally time-reversed speech, or into both.
The processing of the learning unit 160 is the same as in the first embodiment.
<Effects>
With the above configuration, the same effects as in the first embodiment can be obtained. Furthermore, by performing Data Augmentation in the feature space instead of on the speech waveform, the processing S110 to S140 for the inflated training data can be eliminated, and by performing Data Augmentation at the same time as training, the storage capacity required for the training data can be reduced.
<Modification example>
A configuration obtained by removing the learning unit 160 from the model learning device 100 is also referred to as a training data generation device. That is, the training data generation device includes the audio signal acquisition unit 110, the audio digital signal storage unit 120, the feature analysis unit 130, the feature storage unit 140, and the feature conversion unit 150. The training data generation device receives the analog audio signal x(k) for training and the correct label r(j) as input, generates the feature sequences X(p) and R_q(p) from the audio signal x(k), and outputs the combination of the feature sequences X(p), the feature sequences R_q(p), and the correct label r(j) as training data.
The second embodiment assumes that the audio signals corresponding to the feature sequences R_q(p) are audio signals that can give rise to an auditory illusion, but experiments have shown that a similar effect can be obtained even with audio signals that cannot give rise to an auditory illusion. Therefore, audio signals that cannot give rise to an auditory illusion may be prepared as training data with high perceptual intensity. In other words, it suffices that, among the Q-1 audio signals corresponding to the Q-1 feature sequences R_q(p), at least the audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
For example, in the second embodiment, to obtain the continuous listening effect, the feature conversion unit 150 deletes a segment of the feature sequence X(p) and embeds, in the deleted portion, features whose values are equal to or greater than the feature values before and after the deleted segment, thereby converting X(p) into R_q(p). When converting into training data with high perceptual intensity, the length of the deleted or embedded segment may be so long that the continuous listening effect cannot arise, and the interval between the deletion and embedding operations may be so short that the continuous listening effect cannot arise. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the second embodiment can be trained.
Likewise, in the second embodiment, to convert into a feature sequence corresponding to locally time-reversed speech, the feature conversion unit 150 divides the feature sequence X(p) into segments of a predetermined time length, reverses the feature sequence within each segment in time, and concatenates the reversed segments to convert X(p) into R_q(p). When converting into training data with high perceptual intensity, the segment length may be so long that an auditory illusion cannot arise. Even when such data augmentation is performed, an acoustic model with accuracy comparable to that of the second embodiment can be trained.
Furthermore, the second embodiment illustrates a conversion method in which the Q-1 feature sequences R_q(p) are generated from the same original feature sequence X(p) by changing only the time segment length to be removed or the segment length to be reversed (hereinafter simply referred to as the segment length); however, as with the signal conversion unit 125 of the first embodiment, the feature conversion unit 150 may perform the conversion in any manner as long as it can convert the original feature sequence into Q-1 feature sequences with different perceptual intensities. For example, the feature sequence X corresponding to the original audio digital signal x(t) may be divided into Q-1 feature subsequences V(2) = {X(v_2s), ..., X(v_2e)}, V(3) = {X(v_3s), ..., X(v_3e)}, ..., V(Q) = {X(v_Qs), ..., X(v_Qe)}, and one feature sequence R_q may be generated from each interval V(q) = {X(v_qs), ..., X(v_qe)}, where v_qs and v_qe denote the first and last frame numbers of the q-th interval, respectively. Alternatively, the feature sequences X(p) corresponding to the original P utterances may be divided into Q-1 groups, the same segment length may be set within each group, and the feature sequences R_q(p) may be generated accordingly.
<Other modifications>
The present invention is not limited to the embodiments and modifications described above. For example, the various processes described above may be executed not only sequentially in the order described but also in parallel or individually, depending on the processing capacity of the device executing the processes or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processes described above can be carried out by loading a program for executing the steps of the above methods into the storage unit 2020 of the computer shown in FIG. 5 and causing the control unit 2010, the input unit 2030, the output unit 2040, and so on to operate.
The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be carried out by a so-called ASP (Application Service Provider) type service, in which the processing functions are realized only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (for example, data that is not a direct command to a computer but has the property of defining the processing of the computer).
In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

Claims (7)

  1.  A training data generation device that generates training data used for training an acoustic model used in a speech recognition device, the training data generation device comprising:
      a signal conversion unit that, with Q being an integer of 2 or more, converts a first training audio signal into second training audio signals, the second training audio signals being Q-1 audio signals with different perceptual intensities,
      wherein, among the Q-1 second training audio signals, at least the second training audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
  2.  A training data generation device that generates training data used for training an acoustic model used in a speech recognition device, the training data generation device comprising:
      a feature conversion unit that, with Q being an integer of 2 or more, converts a first feature sequence, which is an acoustic feature sequence obtained from a first training audio signal, into Q-1 second feature sequences,
      wherein the Q-1 second training audio signals corresponding to the Q-1 second feature sequences are Q-1 audio signals with different perceptual intensities, and among them at least the audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
  3.  A model learning device including the training data generation device according to claim 1 or claim 2, the model learning device comprising:
      a learning unit that trains an acoustic model by curriculum learning based on perceptual intensity, using the first feature sequence obtained from the first training audio signal, the Q-1 second feature sequences corresponding to the second training audio signals, and a correct label corresponding to the first training audio signal.
  4.  A training data generation method for generating training data used for training an acoustic model used in a speech recognition device, the method comprising:
      a signal conversion step of, with Q being an integer of 2 or more, converting a first training audio signal into second training audio signals, the second training audio signals being Q-1 audio signals with different perceptual intensities,
      wherein, among the Q-1 second training audio signals, at least the second training audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
  5.  A training data generation method for generating training data used for training an acoustic model used in a speech recognition device, the method comprising:
      a feature conversion step of, with Q being an integer of 2 or more, converting a first feature sequence, which is an acoustic feature sequence obtained from a first training audio signal, into Q-1 second feature sequences,
      wherein the Q-1 second training audio signals corresponding to the Q-1 second feature sequences are Q-1 audio signals with different perceptual intensities, and among them at least the audio signal with the lowest perceptual intensity is an audio signal that can give rise to an auditory illusion.
  6.  A model learning method including the training data generation method according to claim 4 or claim 5, the method comprising:
      a learning step of training an acoustic model by curriculum learning based on perceptual intensity, using the first feature sequence obtained from the first training audio signal, the Q-1 second feature sequences corresponding to the second training audio signals, and a correct label corresponding to the first training audio signal.
  7.  A program for causing a computer to function as the training data generation device according to claim 1 or claim 2, or as the model learning device according to claim 3.
PCT/JP2020/021699 2020-06-02 2020-06-02 Training data generation device, model training device, training data generation method, model training method, and program WO2021245771A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/021699 WO2021245771A1 (en) 2020-06-02 2020-06-02 Training data generation device, model training device, training data generation method, model training method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/021699 WO2021245771A1 (en) 2020-06-02 2020-06-02 Training data generation device, model training device, training data generation method, model training method, and program

Publications (1)

Publication Number Publication Date
WO2021245771A1 true WO2021245771A1 (en) 2021-12-09

Family

ID=78830220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/021699 WO2021245771A1 (en) 2020-06-02 2020-06-02 Training data generation device, model training device, training data generation method, model training method, and program

Country Status (1)

Country Link
WO (1) WO2021245771A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
US20190354808A1 (en) * 2018-05-18 2019-11-21 Google Llc Augmentation of Audiographic Images for Improved Machine Learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAKANORI ASHIHARA, TOMOHIRO TANAKA, TAKASHI MORIYA, RYO MASUMURA, YUSUKE SHINOHARA, MAKIO KASHIWANO: "1-P-1 Data expansion for speech recognition using auditory illusion: Examination based on time-reversed speech", PROCEEDINGS OF THE ACOUSTICAL SOCIETY OF JAPAN; MARCH 1-3, 2020, 2 March 2020 (2020-03-02) - 3 March 2020 (2020-03-03), JP, pages 793 - 794, XP009532980 *
TAKANORI ASHIHARA, TOMOHIRO TANAKA, TAKASHI MORIYA, RYO MASUMURA, YUSUKE SHINOHARA, MAKIO KASHIWANO: "Data augmentation for ASR system by using locally time-reversed speech: Temporal inversion of feature sequence", IEICE TECHNICAL REPORT, SP, vol. 119, no. 441 (SP2019-59), 24 February 2020 (2020-02-24), pages 53 - 58, XP009532837 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20938934; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20938934; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: JP