WO2021234905A1 - Learning data generation device, model learning device, learning data generation method, and program - Google Patents

Learning data generation device, model learning device, learning data generation method, and program Download PDF

Info

Publication number
WO2021234905A1
WO2021234905A1 (PCT/JP2020/020106)
Authority
WO
WIPO (PCT)
Prior art keywords
learning
audio signal
voice
data generation
signal
Prior art date
Application number
PCT/JP2020/020106
Other languages
French (fr)
Japanese (ja)
Inventor
孝典 芦原
雄介 篠原
義和 山口
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2020/020106 priority Critical patent/WO2021234905A1/en
Publication of WO2021234905A1 publication Critical patent/WO2021234905A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks


Abstract

Provided are a learning data generation device and related devices that generate learning data for learning an acoustic model that simulates the robustness of human speech perception. The learning data generation device generates learning data used when learning an acoustic model used in a speech recognition device. The learning data generation device includes a signal conversion unit that converts a first learning speech signal into a second learning speech signal, which is a speech signal capable of producing an auditory illusion.

Description

Learning data generation device, model learning device, learning data generation method, and program
The present invention relates to a learning data generation device that generates learning data used when learning an acoustic model used in a speech recognition device, a model learning device that uses the learning data, a learning data generation method, and a program.
In speech recognition devices that use an acoustic model, Patent Document 1 describes a technique for adapting the acoustic model to the task to be recognized in order to ensure a practical level of speech recognition performance. In other words, Patent Document 1 adapts an original acoustic model to tasks with different acoustic characteristics, such as speaker, noise type, and speaking style. In general, speech recognition performance rises or falls depending on the amount of learning data available for the target task and on its acoustic coverage. Therefore, the desired learning data is usually collected by gathering a sufficient amount of speech for the target task and transcribing it.
However, this conventional approach has the problem of requiring enormous financial and time costs.
Data augmentation is one solution to this problem. Data augmentation adds some variation to the original learning data to generate new learning data, thereby inflating the learning data. Data augmentation reduces repeated training on identical data and yields better generalization performance.
For example, in Non-Patent Document 1, data for various speakers is generated by changing the speaking rate of the original data, improving generalization performance for a wider range of speakers.
In Non-Patent Document 2, to improve noise robustness and recognition performance for reverberant speech, noise is superimposed on the original learning data, pseudo reverberant speech is generated by convolving the impulse response of a highly reverberant room, and the reverberant speech is superimposed on the original learning data to improve generalization performance.
Japanese Unexamined Patent Application Publication No. 2007-249051
Here, data augmentation for capturing long-term information in a speech recognition device is considered. First, speech recognition devices and long-term information are described. Many reports show that incorporating long-term information into a speech recognition device makes it more robust to various acoustic events and improves recognition accuracy.
For example, recurrent neural network (RNN) models, unlike multi-layer perceptron (MLP) models, are designed to explicitly capture long-term information within the model itself, and they have greatly improved accuracy on tasks that handle time-series information, such as speech recognition.
In Reference 1, speech recognition accuracy is improved by explicitly incorporating long linguistic context into an end-to-end speech recognition model.
(Reference 1) R. Masumura, T. Tanaka, T. Moriya, Y. Shinohara, T. Oba and Y. Aono, "Large Context End-to-end Automatic Speech Recognition via Extension of Hierarchical Recurrent Encoder-decoder Models", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 5661-5665.
As described above, there are various ways to incorporate long-term information into a speech recognition device by modifying the model. However, no existing approach obtains a speech recognition device that incorporates long-term information by modifying the learning data itself.
In the present invention, the learning data itself is modified. By learning an acoustic model with the learning data generated by the learning data generation device according to the present invention, it is possible to obtain the robustness of speech perception that humans acquire innately and through experience. In this embodiment, learning data is generated by performing a data augmentation process that exploits auditory illusions.
An object of the present invention is to provide a learning data generation device that generates learning data for learning an acoustic model that simulates the robustness of human speech perception, a model learning device that learns an acoustic model using that learning data, a learning data generation method, and a program.
To solve the above problem, according to one aspect of the present invention, a learning data generation device generates learning data used when learning an acoustic model used in a speech recognition device. The learning data generation device includes a signal conversion unit that converts a first learning speech signal into a second learning speech signal, which is a speech signal capable of producing an auditory illusion.
To solve the above problem, according to another aspect of the present invention, a learning data generation device generates learning data used when learning an acoustic model used in a speech recognition device. The learning data generation device includes a signal conversion unit that converts the first learning speech signal into a second learning speech signal by deleting a portion of the first learning speech signal along the time axis and embedding, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after the deleted portion.
To solve the above problem, according to yet another aspect of the present invention, a learning data generation device generates learning data used when learning an acoustic model used in a speech recognition device. The learning data generation device includes a signal conversion unit that converts the first learning speech signal into a second learning speech signal by dividing the first learning speech signal into waveform segments of a fixed short time window, reversing each waveform segment on the time axis, and concatenating the reversed waveform segments.
The present invention has the effect of making it possible to learn an acoustic model that simulates the robustness of human speech perception.
FIG. 1 is a functional block diagram of the model learning device according to the first embodiment.
FIG. 2 shows an example of the processing flow of the model learning device according to the first embodiment.
FIG. 3 shows an example of converting a speech signal so that the continuity illusion can be obtained.
FIG. 4 shows an example of converting a speech signal into locally time-reversed speech.
FIG. 5 shows a configuration example of a computer to which the present method is applied.
Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. In the following description, processing performed element by element on a vector or matrix applies to all elements of that vector or matrix unless otherwise specified.
<Points of the first embodiment>
In the present embodiment, a data augmentation process that exploits auditory illusions is performed so that the speech recognition device acquires the robustness of speech perception that humans possess.
An auditory illusion is a phenomenon in which, owing to the characteristics of human hearing, a physically presented sound stimulus is not necessarily perceived as it is; it can be regarded as the auditory counterpart of an optical illusion.
For example, in the continuity illusion, when a portion of a frequency-varying pure tone or of speech is deleted and noise sufficient to mask the original sound is superimposed on the deleted portion, the sound interval that was physically removed is perceived as if it had been restored (see Reference 2).
(Reference 2) R. M. Warren: "Perceptual restoration of missing speech sounds", Science, 167, pp. 392-393 (1970).
Locally time-reversed speech is speech produced by dividing a speech waveform into short time segments of fixed length, reversing the waveform of each segment on the time axis, and then concatenating the reversed segments again (see Reference 3).
(Reference 3) K. Saberi and D. R. Perrott, "Cognitive restoration of reversed speech", Nature, 398, 6730, pp. 760-760 (1999).
When a person listens to such locally time-reversed speech, intelligibility remains sufficiently high as long as the segment length is relatively short, for example around 25 ms. However, it has been shown experimentally that intelligibility decreases in a sigmoid-like manner as the segment length grows, and speech perception becomes almost impossible at around 100 ms. In other words, local disruption of the time series up to a certain extent does not affect human speech perception (perception is robust to it).
If an acoustic model is learned using speech signals that can produce such a continuity illusion or speech signals that constitute locally time-reversed speech, the model is necessarily learned over time intervals longer than the deleted or masked portion or the reversed segment. The acoustic model therefore incorporates long-term information and acquires the robustness of speech perception that humans possess.
In the present embodiment, by using such auditory-illusion speech waveforms as augmented data, it becomes possible to learn, from the learning data, an acoustic model that is robust with respect to long-term information. Among auditory illusions, this embodiment uses the continuity illusion and locally time-reversed speech described above.
With the continuity illusion, the speech recognition device can acquire the human robustness of perceiving speech even when part of it is missing; as a result, a speech recognition device that is robust with respect to long-term information is built.
With locally time-reversed speech, the speech recognition device can acquire the human robustness of perceiving speech even when the time series is locally reversed (disrupted); as a result, a speech recognition device that is robust with respect to long-term information is built.
<First embodiment>
FIG. 1 shows a functional block diagram of the model learning device according to the first embodiment, and FIG. 2 shows its processing flow.
The model learning device 100 includes a speech signal acquisition unit 110, a speech digital signal storage unit 120, a signal conversion unit 125, a speech digital signal storage unit 126, a feature analysis unit 130, a feature storage unit 140, and a learning unit 160.
The model learning device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The model learning device executes each process under the control of the central processing unit, for example. Data input to the model learning device and data obtained in each process are stored, for example, in the main storage device, and data stored in the main storage device is read into the central processing unit as needed and used for other processing. At least some of the processing units of the model learning device may be implemented by hardware such as integrated circuits. Each storage unit of the model learning device can be implemented, for example, by a main storage device such as RAM, by an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.
The model learning device receives an analog learning speech signal x(k) and the corresponding correct label r(j), learns an acoustic model based on this information, and outputs a trained acoustic model f. Here, k is an index indicating time. The correct label is, for example, a phoneme label, and j is an index indicating the order of phonemes. Information indicating which part of the analog speech signal each correct label (phoneme label) corresponds to is assumed to be included in the learning data in advance, and the speech digital signals and features described below are assumed to be linked to the labels in the same way.
The processing of each unit is described below.
<Speech signal acquisition unit 110>
Input: speech signal x(k)
Output: speech digital signal x(t)
Processing: A/D conversion
The speech signal acquisition unit 110 acquires the analog speech signal x(k) and converts it into a digital speech signal x(t) (S110). Here, t is an index indicating the sample number of the speech digital signal.
<Speech digital signal storage unit 120>
Input: speech digital signal x(t)
Processing: storage of the speech digital signal
The speech digital signal storage unit 120 stores the speech digital signal x(t) (S120).
<Signal conversion unit 125>
Input: speech digital signal x(t)
Output: speech digital signal r(t)
Processing: data augmentation
The signal conversion unit 125 performs a data augmentation process on the speech digital signal x(t) and converts x(t) into the speech digital signal r(t) (S125).
The data augmentation process in this embodiment converts the speech digital signal x(t) into the speech digital signal r(t) according to a conversion rule. This conversion generates pseudo, inflated learning data.
In this embodiment, a conversion rule is adopted such that the converted speech digital signal r(t) is a speech signal capable of producing an auditory illusion.
As conversion rules that make the converted speech digital signal r(t) a speech signal capable of producing an auditory illusion, this embodiment adopts rules that convert the signal into a speech signal yielding the continuity illusion described above, or into locally time-reversed speech.
(i) When converting into a speech signal that yields the continuity illusion, the signal conversion unit 125 deletes a portion (of fixed duration) of the speech digital signal x(t) along the time axis and embeds, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after the deleted portion, thereby converting x(t) into the speech digital signal r(t) (see FIG. 3). The lengths of the deleted and embedded portions are lengths that can produce the illusion, and the deletion and embedding are performed at intervals that can produce the illusion. The embedded noise is, for example, white noise, and it is prepared in advance, prior to process S125.
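Purely as an illustration (not part of the patent text), the following Python sketch shows one way such a deletion-and-noise-embedding conversion could be realized. The segment length, spacing, and noise scaling are hypothetical choices, and the per-frequency sound-pressure condition is approximated here by a broadband RMS level.

```python
import numpy as np

def continuity_illusion_augment(x, sr, seg_ms=50, period_ms=300, rng=None):
    """Delete short segments of x and fill them with white noise whose level
    is at or above that of the surrounding samples (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    r = x.astype(np.float64).copy()
    seg = int(sr * seg_ms / 1000)        # length of each deleted segment
    period = int(sr * period_ms / 1000)  # spacing between deletions
    for start in range(period, len(r) - seg, period):
        # broadband RMS of the samples just before and after the gap
        neighbors = np.concatenate([r[max(0, start - seg):start],
                                    r[start + seg:start + 2 * seg]])
        level = np.sqrt(np.mean(neighbors ** 2) + 1e-12)
        # replace the deleted segment with white noise at that level
        r[start:start + seg] = rng.normal(0.0, level, seg)
    return r
```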
(ii) When converting into locally time-reversed speech, the signal conversion unit 125 divides the speech digital signal x(t) into waveform segments of a fixed short time window, reverses each segment on the time axis, and concatenates the reversed segments, thereby converting x(t) into the speech digital signal r(t) (see FIG. 4). The segment length is a length that can produce the illusion.
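A minimal sketch of such local time reversal follows, again only as an illustration; the 25 ms window width is a hypothetical choice motivated by the intelligibility discussion above, not a value prescribed by the patent.

```python
import numpy as np

def locally_time_reverse(x, sr, win_ms=25):
    """Split x into fixed-width windows, reverse each window on the time
    axis, and concatenate the reversed windows (illustrative sketch)."""
    win = max(1, int(sr * win_ms / 1000))
    segments = [x[i:i + win][::-1] for i in range(0, len(x), win)]
    return np.concatenate(segments)
```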
<Speech digital signal storage unit 126>
Input: speech digital signal r(t)
Processing: storage of the speech digital signal
The speech digital signal storage unit 126 stores the speech digital signal r(t) (S126).
<Feature analysis unit 130>
Input: speech digital signals x(t), r(t)
Output: feature sequences X, R
Processing: feature analysis
The feature analysis unit 130 performs feature analysis on the speech digital signals x(t) and r(t) to obtain the feature sequences X and R.
For example, the feature analysis unit 130 retrieves the speech digital signal x(t) from the speech digital signal storage unit 120 and the speech digital signal r(t) from the speech digital signal storage unit 126, divides x(t) and r(t) into frames, extracts acoustic features for each frame, and obtains the (acoustic) feature sequences X and R.
For example, let N be the total number of frames contained in the speech digital signal x(t), with n = 1, 2, ..., N; let M be the frame length, with m = 1, 2, ..., M; and let D be the shift width. Then the m-th sample of the n-th frame of x(t) can be written as x(D(n-1)+m). For each frame n, the feature analysis unit 130 extracts acoustic features from the samples x(D(n-1)+1), x(D(n-1)+2), ..., x(D(n-1)+M) and obtains the feature X(n). The feature analysis unit 130 processes all frames 1, 2, ..., N and obtains the feature sequence X = {X(1), X(2), ..., X(N)}. The feature analysis unit 130 performs the same processing on the speech digital signal r(t) to obtain the feature sequence R = {R(1), R(2), ..., R(N)}.
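As a hedged illustration of this indexing only (not from the patent), a framing routine with frame length M and shift width D might look like this:

```python
import numpy as np

def frame_signal(x, M, D):
    """Frame n (1-based) covers the samples x(D(n-1)+1), ..., x(D(n-1)+M)
    of the text; code indices are 0-based (illustrative sketch)."""
    if len(x) < M:
        return np.empty((0, M), dtype=x.dtype)
    N = 1 + (len(x) - M) // D   # number of complete frames
    return np.stack([x[(n - 1) * D:(n - 1) * D + M] for n in range(1, N + 1)])
```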
The extracted features include, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficient) based on short-time frame analysis of the speech signal, dynamic parameters such as ΔMFCC and ΔΔMFCC (its dynamic features), and power, Δpower, and ΔΔpower. CMN (cepstral mean normalization) may also be applied to the MFCCs. The features are not limited to MFCCs and power; parameters used for identifying special utterances (for example, autocorrelation peak values and group delay) may also be used.
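As one possible realization only, the sketch below assumes the librosa library (an assumption, not something the patent prescribes) and approximates the feature set described above; the number of coefficients and the normalization are hypothetical choices.

```python
import numpy as np
import librosa

def extract_features(x, sr, n_mfcc=12):
    """MFCCs (with CMN), their delta and delta-delta, and frame power terms
    stacked into a (frames, dims) feature sequence (illustrative sketch)."""
    y = x.astype(float)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc -= mfcc.mean(axis=1, keepdims=True)        # cepstral mean normalization
    d1 = librosa.feature.delta(mfcc)                 # ΔMFCC
    d2 = librosa.feature.delta(mfcc, order=2)        # ΔΔMFCC
    power = librosa.feature.rms(y=y)                 # frame power
    dp1 = librosa.feature.delta(power)               # Δpower
    dp2 = librosa.feature.delta(power, order=2)      # ΔΔpower
    return np.vstack([mfcc, d1, d2, power, dp1, dp2]).T
```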
<Feature storage unit 140>
Input: feature sequences X, R
Processing: storage of the feature sequences
The feature storage unit 140 stores the feature sequences X and R analyzed by the feature analysis unit 130 (S140).
<Learning unit 160>
Input: feature sequence X, feature sequence R, correct label r(j)
Output: trained acoustic model f
Processing: model learning
The learning unit 160 learns the acoustic model f using the feature sequence X, the feature sequence R, and the correct label r(j) (S160). The acoustic model f takes a feature sequence as input and outputs phoneme labels. GMM-HMM and DNN-HMM models are often used as acoustic models in speech recognition, and end-to-end speech recognition models have also been used in recent years. This embodiment places no particular restriction on the speech recognition model to be learned, so it may be a GMM/DNN-HMM or an end-to-end speech recognition model. The correct label r(j) corresponds to the analog learning speech signal x(k), and therefore also corresponds to the feature sequence X obtained from x(k) and to the feature sequence R obtained by converting X.
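As noted above, the patent does not fix the model family. Purely for illustration, the following sketch assumes a small frame-wise neural classifier in PyTorch and shows how both feature sequences can share the labels of the original utterance; the model, loss, and names are assumptions, not the patent's method.

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Frame-wise phoneme classifier used only to illustrate training on
    both the original (X) and augmented (R) feature sequences."""
    def __init__(self, feat_dim, n_phonemes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_phonemes))

    def forward(self, feats):            # feats: (frames, feat_dim)
        return self.net(feats)

def train_step(model, optimizer, X, R, labels):
    """One hypothetical training step: X and R use the same labels,
    as described in the text above."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = loss_fn(model(X), labels) + loss_fn(model(R), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```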
<Effect>
With the above configuration, it is possible to learn an acoustic model that simulates the robustness of human speech perception. Financial and time costs can also be reduced.
<Modification>
A configuration that excludes the feature analysis unit 130, the feature storage unit 140, and the learning unit 160 of the model learning device 100 is also referred to as a learning data generation device. That is, the learning data generation device includes the speech signal acquisition unit 110, the speech digital signal storage unit 120, the signal conversion unit 125, and the speech digital signal storage unit 126. The learning data generation device receives an analog learning speech signal x(k) and a correct label r(j), generates the speech digital signal x(t) and the speech digital signal r(t) from x(k), and outputs the combination of the speech digital signal x(t), the speech digital signal r(t), and the correct label r(j) as learning data. A data-flow sketch is given below.
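As an illustration of this data flow only, the sketch below reuses the two hypothetical conversion functions given earlier; the wiring and names are assumptions, not part of the patent.

```python
def generate_learning_data(x_digital, label, sr):
    """Return (x(t), r(t), correct label) combinations, one per conversion
    rule, reusing the sketches defined above (illustrative wiring)."""
    conversions = (lambda s: continuity_illusion_augment(s, sr),
                   lambda s: locally_time_reverse(s, sr))
    return [{"x": x_digital, "r": convert(x_digital), "label": label}
            for convert in conversions]
```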
The first embodiment assumes that the speech digital signal r(t) is a speech signal capable of producing an auditory illusion; however, experiments have shown that the same effect can be obtained even with speech signals that cannot produce an auditory illusion.
For example, in the first embodiment, to obtain the continuity illusion, the signal conversion unit 125 deletes a portion of the speech digital signal x(t) along the time axis and embeds, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after it, thereby converting x(t) into r(t). In this modification, the lengths of the deleted and embedded portions may be so long that the continuity illusion cannot occur, and the interval between the deletion and embedding operations may be so short that the continuity illusion cannot occur. Even with such data augmentation, an acoustic model with accuracy comparable to that of the first embodiment can be learned.
Also, for example, in the first embodiment, to convert into locally time-reversed speech, the signal conversion unit 125 divides the speech digital signal x(t) into waveform segments of a fixed short time window, reverses each segment on the time axis, and concatenates the reversed segments, thereby converting x(t) into r(t). In this modification, the length of the cut waveform segments may be so long that the auditory illusion cannot occur. Even with such data augmentation, an acoustic model with accuracy comparable to that of the first embodiment can be learned.
<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various processes described above may be executed not only sequentially in the order described but also in parallel or individually, depending on the processing capacity of the executing device or as needed. Other changes can be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processes described above can be carried out by loading a program for executing each step of the above method into the storage unit 2020 of the computer shown in FIG. 5 and operating the control unit 2010, the input unit 2030, the output unit 2040, and the like.
The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.
A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or transferred from a server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from a portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing function only through execution instructions and result acquisition, without transferring the program from the server computer to this computer. The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).
In this embodiment, the present device is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized in hardware.

Claims (8)

  1.  A learning data generation device that generates learning data used when learning an acoustic model used in a speech recognition device, the learning data generation device comprising:
      a signal conversion unit that converts a first learning speech signal into a second learning speech signal, the second learning speech signal being a speech signal capable of producing an auditory illusion.
  2.  A learning data generation device that generates learning data used when learning an acoustic model used in a speech recognition device, the learning data generation device comprising:
      a signal conversion unit that converts a first learning speech signal into a second learning speech signal by deleting a portion of the first learning speech signal along the time axis and embedding, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after the deleted portion.
  3.  A learning data generation device that generates learning data used when learning an acoustic model used in a speech recognition device, the learning data generation device comprising:
      a signal conversion unit that converts a first learning speech signal into a second learning speech signal by dividing the first learning speech signal into waveform segments of a fixed short time window, reversing each waveform segment on the time axis, and concatenating the reversed waveform segments.
  4.  A model learning device comprising the learning data generation device according to any one of claims 1 to 3, the model learning device further comprising:
      a learning unit that learns an acoustic model using a feature sequence obtained from the first learning speech signal, a feature sequence obtained from the second learning speech signal, and a correct label corresponding to the first learning speech signal.
  5.  A learning data generation method for generating learning data used when learning an acoustic model used in a speech recognition device, the method comprising:
      a signal conversion step of converting a first learning speech signal into a second learning speech signal, the second learning speech signal being a speech signal capable of producing an auditory illusion.
  6.  A learning data generation method for generating learning data used when learning an acoustic model used in a speech recognition device, the method comprising:
      a signal conversion step of converting a first learning speech signal into a second learning speech signal by deleting a portion of the first learning speech signal along the time axis and embedding, in the deleted portion, noise whose level is at or above the sound pressure of each frequency component before and after the deleted portion.
  7.  A learning data generation method for generating learning data used when learning an acoustic model used in a speech recognition device, the method comprising:
      a signal conversion step of converting a first learning speech signal into a second learning speech signal by dividing the first learning speech signal into waveform segments of a fixed short time window, reversing each waveform segment on the time axis, and concatenating the reversed waveform segments.
  8.  A program for causing a computer to function as the learning data generation device according to any one of claims 1 to 3 or as the model learning device according to claim 4.
PCT/JP2020/020106 2020-05-21 2020-05-21 Learning data generation device, model learning device, learning data generation method, and program WO2021234905A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/020106 WO2021234905A1 (en) 2020-05-21 2020-05-21 Learning data generation device, model learning device, learning data generation method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/020106 WO2021234905A1 (en) 2020-05-21 2020-05-21 Learning data generation device, model learning device, learning data generation method, and program

Publications (1)

Publication Number Publication Date
WO2021234905A1 true WO2021234905A1 (en) 2021-11-25

Family

ID=78708583

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/020106 WO2021234905A1 (en) 2020-05-21 2020-05-21 Learning data generation device, model learning device, learning data generation method, and program

Country Status (1)

Country Link
WO (1) WO2021234905A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016161823A (en) * 2015-03-03 2016-09-05 株式会社日立製作所 Acoustic model learning support device and acoustic model learning support method

Similar Documents

Publication Publication Date Title
KR102514990B1 (en) Synthesis of speech from text with the speech of the target speaker using neural networks
JP7243760B2 (en) Audio feature compensator, method and program
Wang et al. A multiobjective learning and ensembling approach to high-performance speech enhancement with compact neural network architectures
KR20060044629A (en) Isolating speech signals utilizing neural networks
CN113436609B (en) Voice conversion model, training method thereof, voice conversion method and system
Yuliani et al. Speech enhancement using deep learning methods: A review
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
Dendani et al. Speech enhancement based on deep AutoEncoder for remote Arabic speech recognition
CN112863489A (en) Speech recognition method, apparatus, device and medium
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
Mandel et al. Audio super-resolution using concatenative resynthesis
CN116368563A (en) Real-time packet loss concealment using deep-drawn networks
WO2021234905A1 (en) Learning data generation device, model learning device, learning data generation method, and program
Xu et al. Speaker Recognition Based on Long Short-Term Memory Networks
WO2021234904A1 (en) Training data generation device, model training device, training data generation method, and program
WO2021245771A1 (en) Training data generation device, model training device, training data generation method, model training method, and program
JP2016186516A (en) Pseudo-sound signal generation device, acoustic model application device, pseudo-sound signal generation method, and program
Zhao et al. Time Domain Speech Enhancement using self-attention-based subspace projection
JP7028311B2 (en) Learning audio data generator, its method, and program
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
WO2022082607A1 (en) Vocal track removal by convolutional neural network embedded voice finger printing on standard arm embedded platform
US11335321B2 (en) Building a text-to-speech system from a small amount of speech data
Song et al. Speaker-adaptive neural vocoders for parametric speech synthesis systems
Dahy et al. Dilated Multi-Activation Autoencoder to Improve the Performance of Sound Separation Mechanisms
Bakoria et al. Face Recognition and Speaker Recognition on the basis of their Facial Features and Skin Tones

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936691

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20936691

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP