JP2012220607A - Sound recognition method and apparatus

Info

Publication number: JP2012220607A
Application number: JP2011084323A
Authority: JP
Inventors: Hitoshi Nakayama; 仁史中山
Original assignee: Institute of National Colleges of Technologies Japan
Current assignee: Institute of National Colleges of Technologies Japan
Priority date: 2011-04-06
Filing date: 2011-04-06
Publication date: 2012-11-12

Abstract

PROBLEM TO BE SOLVED: To provide a sound recognition method and apparatus capable of detecting the sound interval of a target sound in a noisy environment. SOLUTION: A sound recognition method capable of detecting the sound interval of a target sound having periodic stationarity in a noisy environment comprises: a first step of collecting an analog acoustic signal with sound input means and converting it into a digital waveform signal composed of frames; a second step of analyzing the digital waveform signal frame by frame to calculate an autocorrelation function and a second-order autocorrelation function; and a third step of determining, as the sound interval, the range in which the sum of absolute differences of the second-order autocorrelation function calculated for each frame exceeds a preset threshold. An apparatus implementing the method is also provided.

Description

The present invention relates to a sound recognition method and apparatus capable of detecting the sound interval of a target sound having periodic stationarity in a noisy environment.

In recent years, speech recognition has become a familiar technology, used for place-name input in car navigation systems and as an application interface on computers and mobile phones. Because of its computational load and the need for real-time processing, speech recognition has been studied actively with the aim of building and applying systems in software on computers. As processing power has improved, speech recognition systems once realized only on computers have spread to embedded systems and to smartphones, which have become widespread in recent years. Given this trend, the development of leading-edge speech recognition systems in basic research is expected to continue to rely on software implementation.

Meanwhile, embedded speech recognition systems do not require a computer and have made it possible to realize speech recognition based on statistical methods. Further technical innovation is expected, and large-vocabulary continuous word recognition will clearly be realized on embedded systems as well. Speech recognition can therefore be said to have progressed from conventional software implementations to middleware implementations, and the technology can be expected to extend further to hardware speech recognition systems.

Hardware speech recognition systems are built with a DSP (Digital Signal Processor) or an FPGA (Field-Programmable Gate Array), and small-vocabulary, speaker-dependent recognition systems have been realized with them. This is attributable to faster and more precise microcomputers and to the availability of compilers and simulators for hardware implementation. Implementations of speech recognition systems that use statistical methods based on hidden Markov models have also been reported. In a study of a speech recognition system based on continuous-density hidden Markov models, a comparison of software and hardware implementations confirmed a speedup of more than 40 times at the same recognition accuracy (see Non-Patent Document 1), confirming that hardware speech recognition systems are faster than software implementations.

In Patent Document 1, the inventor proposed a speech recognition device capable of high-speed speech recognition, comprising: a feature vector extraction unit that analyzes a speech waveform signal frame by frame and extracts feature vectors representing the features of the speech; a feature vector storage unit that stores the feature vectors for a plurality of frames in time series; a recognition candidate speech storage unit that stores a plurality of speech items serving as recognition candidates; a first analysis unit that calculates the likelihood of each candidate based on the feature vectors for the plurality of frames stored in the feature vector storage unit; a second analysis unit that calculates an average feature vector per frame from the feature vectors for the plurality of frames and calculates the likelihood of each candidate from that average feature vector; and a speech determination unit that selects one speech item based on the candidate likelihoods calculated by the first analysis unit and by the second analysis unit.

Patent Document 2 proposes a speech interval detection device that analyzes the input sound to detect periodicity, detects a speech interval based on the power information of the input sound, and, from these two detection results, detects the speech interval according to predetermined rules for distinguishing speech intervals from non-speech intervals.

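Patent Document 1: Japanese Unexamined Patent Application Publication No. 2010-224020
Patent Document 2: Japanese Unexamined Patent Application Publication No. H8-305388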

Non-Patent Document 1: S. J. Melnikoff, S. F. Quigley and M. J. Russell, “Speech Recognition on an FPGA Using and Continuous Hidden Markov Models”, Proceedings of 12th International Conference on Field Programmable Logic and Applications, pp.201-211, 2002.

As described above, hardware speech recognition systems have made faster speech recognition possible, but practical use requires sound interval detection in noisy environments.
An object of the present invention is therefore to provide a sound recognition method and apparatus capable of detecting the sound interval of a target sound in a noisy environment, and in particular a sound recognition method and apparatus capable of speech recognition in a noisy environment.

A first invention is a sound recognition method capable of detecting the sound interval of a target sound having periodic stationarity in a noisy environment, the method comprising: a first step of collecting an analog acoustic signal with sound input means and converting it into a digital waveform signal composed of frames; a second step of analyzing the digital waveform signal frame by frame to calculate an autocorrelation function and a second-order autocorrelation function; and a third step of determining, as the sound interval, the range in which the sum of absolute differences of the second-order autocorrelation function calculated for each frame exceeds a preset threshold.
A second invention is the first invention, wherein the target sound is speech.

A third invention is a sound recognition apparatus comprising sound input means for collecting an analog acoustic signal, a signal processing unit for detecting a sound interval, and a sound recognition unit for determining the type of sound, the apparatus being capable of detecting the sound interval of a target sound having periodic stationarity in a noisy environment, wherein the signal processing unit has: means for converting the signal into a digital waveform signal composed of frames; means for analyzing the digital waveform signal frame by frame to calculate an autocorrelation function and a second-order autocorrelation function; and means for determining, as the sound interval, the range in which the sum of absolute differences of the second-order autocorrelation function calculated for each frame exceeds a preset threshold.
A fourth invention is the third invention, having a speech recognition function that, for the sound interval determined by the signal processing unit, recognizes a vowel group based on the ratio of frequency band power across a plurality of channels and determines the vowel within the recognized vowel group based on spectral distance.

According to the present invention, the sound interval of a target sound having periodic stationarity can be detected in a noisy environment.
In addition, speech recognition in a noisy environment can be performed with high accuracy.

FIG. 1: Conceptual diagram of the hardware speech recognition system
FIG. 2: Speech signal of the vowel /a/
FIG. 3: Autocorrelation function of speech
FIG. 4: Second-order autocorrelation function of speech
FIG. 5: Noise signal
FIG. 6: Autocorrelation function of noise
FIG. 7: Second-order autocorrelation function of noise
FIG. 8: Noisy speech of the vowel /a/ in a high-SNR environment
FIG. 9: Speech interval detection for the vowel /a/ in a high-SNR environment
FIG. 10: Noisy speech of the vowel /a/ in a low-SNR environment
FIG. 11: Speech interval detection for the vowel /a/ in a low-SNR environment
FIG. 12: Noisy speech of the uttered phrase “ねじ曲げたのだ”
FIG. 13: Speech interval detection for the uttered phrase “ねじ曲げたのだ”

The sound recognition apparatus of the present invention comprises sound input means for collecting an analog acoustic signal, a signal processing unit for detecting a sound interval, and a sound recognition unit for determining the type of sound. Because it performs speech interval detection and speech recognition on a signal of one frame or a few frames, the apparatus can achieve instantaneous processing and instantaneous recognition.
The signal processing unit has means for converting the signal into a digital waveform signal composed of frames, means for analyzing the digital waveform signal frame by frame to calculate an autocorrelation function and a second-order autocorrelation function, and means for determining, as the sound interval, the range in which the sum of absolute differences of the second-order autocorrelation function calculated for each frame exceeds a preset threshold. Signal processing and speech recognition can thus be carried out by hardware computation as soon as the waveform signal has been framed.

The sound input means is typically a microphone, but an acceleration pickup that extracts solid-borne signals such as body-conducted sound, including bone-conducted sound, may also be used. The analog acoustic signal collected by the sound input means is A/D converted in the signal processing unit into a waveform signal in PCM (pulse code modulation) format.
The signal processing unit can detect the sound interval of any periodically stationary sound, whatever its source. Human speech is the typical example, but the invention is not limited to it; sounds with periodic stationarity from equipment such as engines, motors, and bells can also be detected.

The present invention can be realized in software on a personal computer, but from the standpoint of processing speed it is preferably realized as a hardware speech recognition system that integrates the signal processing unit and the sound recognition unit. Furthermore, the hardware sound recognition system is preferably configured using an FPGA. An FPGA is an integrated circuit composed of logic elements such as AND, OR, and NAND gates together with flip-flops, on which users can build their own logic circuits. Implementation approaches include software development using a compiler and development using a GUI-based simulator; in the embodiment, development was carried out on a simulator because the behavior of the system needed to be verified in advance.

The present invention is particularly suited to applications that do not require complex recognition algorithms, such as detection of vowels with stationary characteristics or of abnormal sounds, and to applications that demand real-time performance. Because the signal processing unit and the sound recognition unit can be modularized as needed, the invention can also serve as the front end of any sound recognition system, for example for noise suppression, sound interval detection, or feature extraction.

The present invention is described in detail below by way of an embodiment, but the invention is in no way limited to the embodiment.

[1] System Overview
The embodiment relates to a hardware speech recognition system whose processing target is speech. The system of this embodiment is assumed to be used in an environment where conditions change continually, for example through changes in ambient noise or changes of speaker, and therefore has a function for preprocessing the acquired signal. In other words, the system integrates everything from front-end processing to recognition processing. It can provide a speech recognition system in which acoustic models and templates can be changed when the speaker changes, and it can also be operated as a dialogue system.

FIG. 1 shows a conceptual diagram of the hardware speech recognition system proposed in this embodiment. The speech recognition system of this embodiment consists of a speech recognition device 1 having a signal processing unit 2 and a speech recognition unit 3, speech input means 4 consisting of a microphone, and an output device 5. The signal processing unit 2 uses the waveform signal collected by the speech input means 4 to perform preprocessing for speech recognition, such as speaker recognition, speech interval detection, and noise suppression. The speech recognition unit 3 uses the results from the signal processing unit 2 to perform efficient speech recognition; it processes the signal frame by frame and is capable of instantaneous (real-time) processing. The recognized speech information is output to the output device 5.

In this embodiment, the signal processing unit 2 and the speech recognition unit 3 were developed entirely in LabVIEW (registered trademark). LabVIEW (registered trademark) is a graphical programming environment from National Instruments, used mainly for measurement and control, that supports intuitive, flowchart-like programming with icons and wires. It integrates with many hardware devices and includes libraries for advanced analysis and data visualization, so the evaluation and testing needed in system development can be carried out within it. Every result reported in this embodiment was obtained with the system implemented on the simulator.

The following describes the noise-robust speech interval detection algorithm of the speech recognition device 1 of this embodiment, based on a vowel recognition experiment. The signal processing unit 2 of the speech recognition device 1 detects speech intervals in a noisy environment using the sum of absolute differences of the second-order autocorrelation function. For the speech recognition unit 3, the effectiveness of a two-stage speech recognition scheme that uses frequency band power outputs and a distance comparison of formant frequencies was verified.

[2] Signal processing unit 2
The signal processing unit 2 performs preprocessing, such as utterance interval detection and noise suppression, to achieve robust speech recognition. The following describes pre-emphasis using the difference between signal samples and robust speech interval detection in a noisy environment.

[2-1] Pre-emphasis
The speech signal collected from a microphone or similar device is divided into frames by a window function. Formant peaks become harder to detect at higher frequencies because the signal level attenuates, and pre-emphasis is applied to address this problem. Pre-emphasis is not always effective, for example in intelligibility enhancement. However, because the speech recognition unit 3 of this embodiment relies on spectral peak detection, the signal processing unit 2 applies pre-emphasis as preprocessing. To keep the system simple, this embodiment performs pre-emphasis using the difference of the time-domain waveform.

Pre-emphasis in speech processing is generally applied with a coefficient of about 0.97, and this value is often used even though the sampling frequency of the speech fed to the system varies. When the sampling frequency changes, the differencing interval of the discrete signal changes and a different frequency response results; if there is no need to determine the scaling coefficient rigorously, a simple difference is considered sufficient. This embodiment therefore adopts a difference calculation, also with a view to speeding up the system.
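The two pre-emphasis variants discussed above can be contrasted in a minimal sketch. The function name `preemphasis` and the choice to pass the first sample through unchanged are assumptions made for illustration; the embodiment itself was built in LabVIEW and its exact boundary handling is not stated.

```python
def preemphasis(frame, alpha=1.0):
    """Return y[n] = x[n] - alpha * x[n-1] for one frame of samples.

    alpha=1.0 is the simple time-domain difference adopted in the embodiment;
    alpha=0.97 is the conventional coefficient mentioned in the text.
    Assumption: the first sample has no predecessor and is passed through.
    """
    if not frame:
        return []
    out = [float(frame[0])]
    for prev, cur in zip(frame, frame[1:]):
        out.append(cur - alpha * prev)
    return out


# Simple difference (embodiment) versus the conventional coefficient.
diff = preemphasis([1, 3, 6, 10])
conventional = preemphasis([1, 3, 6, 10], alpha=0.97)
```

With `alpha=1.0` the multiplication can be dropped altogether in hardware, which fits the stated aim of speeding up the system.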

[2-2] Speech interval detection using the second-order autocorrelation function
Because the surrounding noise environment changes constantly, a system is needed that does not depend on the environment, whether quiet or noisy. This embodiment therefore detects speech intervals using the second-order autocorrelation function. Equation 1 for obtaining the autocorrelation function is shown below.
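Equation 1 is not reproduced in this text. As an assumption, a conventional short-time autocorrelation over a frame of N samples has the form below; the frame length N, the summation limits, and any normalization used in the original Equation 1 may differ.

```latex
R(j) = \sum_{i=0}^{N-1-j} x(i)\, x(i+j), \qquad j = 0, 1, \dots, N-1
```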

The autocorrelation function R(j) is obtained by applying Equation 1 to the waveform signal x(i).
FIG. 2 shows the speech signal when one 20-year-old male uttered the vowel /a/, FIG. 3 shows the autocorrelation function of the speech, and FIG. 4 shows the second-order autocorrelation function of the speech. FIG. 5 shows a noise signal, FIG. 6 shows the autocorrelation function of the noise, and FIG. 7 shows the second-order autocorrelation function of the noise.
In this embodiment, the autocorrelation function is computed again on the autocorrelation function obtained from the speech signal, yielding the second-order autocorrelation function R'(k). This produces a signal that emphasizes stationarity which the autocorrelation function alone cannot fully express. The sum of absolute differences of the second-order autocorrelation function is then used as the utterance estimate A(l). Equation 2 for obtaining the utterance estimate is shown below.
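Equation 2 is not reproduced in this text. The sketch below shows one plausible reading of the computation for a single frame, under two labeled assumptions: the autocorrelation is the unnormalized sum of lagged products, and the "sum of absolute differences" is taken between adjacent lags of R'(k) within the frame. The function names are hypothetical, and the original equations may use different limits or normalization.

```python
def autocorrelation(x):
    """Assumed form: R(j) = sum_i x(i) * x(i + j), for j = 0 .. len(x) - 1."""
    n = len(x)
    return [sum(x[i] * x[i + j] for i in range(n - j)) for j in range(n)]


def second_order_autocorrelation(x):
    """R'(k): the autocorrelation of the autocorrelation function."""
    return autocorrelation(autocorrelation(x))


def utterance_estimate(frame):
    """A(l) for one frame l, assumed to be sum_k |R'(k + 1) - R'(k)|."""
    r2 = second_order_autocorrelation(frame)
    return sum(abs(b - a) for a, b in zip(r2, r2[1:]))


# A silent frame gives 0; any frame with energy gives a positive value.
silent = utterance_estimate([0.0, 0.0, 0.0, 0.0])
active = utterance_estimate([1.0, 1.0])
```

Under these assumptions the estimate grows with the fourth power of the signal amplitude, so a preset threshold would have to be chosen for the input gain in use.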

In a quiet environment, the autocorrelation function has essentially no amplitude when no one is speaking, and in a noisy environment, non-stationary noise and the like lower the amplitude of the autocorrelation function. Vowels, in contrast, are highly stationary, and the amplitude of their autocorrelation function is correspondingly high. Consonants give lower autocorrelation values than vowels, but the values can be expected to be higher than those of an uncorrelated signal such as white noise. Robust speech interval detection can therefore be achieved in a simple manner in both quiet and noisy environments.

First, effectiveness was verified for a 20-year-old male uttering the vowel /a/ in a noisy environment.
The microphone used for signal acquisition was a SONY ECM-31HVC (600 Ω). White noise generated with the free software WaveGene was played back through a general-purpose microphone and served as background noise during the utterance.
FIG. 8 shows noisy speech uttered in a high-SNR environment, and FIG. 9 shows the result of speech interval detection applied to FIG. 8. FIG. 9 confirms that the speech interval is detected from about 0.9 s to about 1.8 s.
FIG. 10 shows noisy speech uttered by the same male under a more difficult condition than FIG. 8, namely a low-SNR environment, and FIG. 11 shows the result of speech interval detection applied to FIG. 10. FIG. 11 confirms that the values from about 0.5 s to about 1.6 s are relatively high.
Speech interval detection relies on a relative comparison between noise intervals and utterance intervals. In FIGS. 9 and 11 the utterance estimate is higher in the utterance interval than in the noise interval, so speech interval detection is judged to be feasible. The same verification for the other Japanese vowels (/i/, /u/, /e/, and /o/) gave similar results.
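The thresholding step, in which the range where the utterance estimate exceeds a preset threshold is taken as the speech interval, can be sketched as follows. The function name `detect_intervals`, the `frame_period_s` parameter, and the example values are hypothetical; the source states neither the frame length nor the threshold.

```python
def detect_intervals(estimates, threshold, frame_period_s):
    """Return (start_s, end_s) pairs for runs of frames above the threshold.

    estimates: per-frame utterance estimates A(l), one value per frame.
    threshold: preset threshold (assumed to be fixed in advance).
    frame_period_s: time between successive frames, in seconds (assumption).
    """
    intervals = []
    start = None
    for idx, value in enumerate(estimates):
        if value > threshold and start is None:
            start = idx  # a run of speech frames begins here
        elif value <= threshold and start is not None:
            intervals.append((start * frame_period_s, idx * frame_period_s))
            start = None
    if start is not None:  # the run extends to the end of the signal
        intervals.append((start * frame_period_s, len(estimates) * frame_period_s))
    return intervals


# Illustrative values only, not measurements from the figures.
found = detect_intervals([0, 5, 6, 0, 7], threshold=1, frame_period_s=0.5)
```

Because the decision for each frame depends only on that frame's estimate, the same logic can run as frames arrive, which matches the frame-by-frame processing described for the apparatus.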

Next, effectiveness was verified for a phrase utterance rather than an isolated vowel.
FIG. 12 shows the noisy speech when a 20-year-old male uttered the phrase “ねじ曲げたのだ” ("nejimageta no da") in a noisy environment, and FIG. 13 shows the result of speech interval detection applied to FIG. 12. FIG. 13 confirms that the autocorrelation values are high in vowel intervals and that, although the values in consonant intervals are lower than those of vowels, an utterance estimate is still obtained there. This indicates that consonant intervals are relatively more periodic than noise, so speech intervals can be detected in consonant intervals as well. Detection is robust against irregular noise such as white noise, but performance is expected to degrade with stationary noise, because such noise raises the utterance estimate. The results also indicate that detection of various desired signals other than speech, such as abnormal sounds and environmental sounds, can be expected.

[3] Speech recognition unit 3
To exploit the advantages of a hardware speech recognition system, this embodiment performs speech recognition successively on each input frame. This approach allows instantaneous frame-level speech recognition, and filter selection for voice quality conversion and abnormal sounds can be estimated frame by frame. In this embodiment, a recognition system using two-stage vowel recognition was built.

The first pass of the speech recognition unit 3 recognizes the vowel group, narrowing down the recognition candidates. The band in which the formant frequencies lie was divided into three channels, and parameters were set for each of the resulting band powers. Using a known formula, each channel was normalized by its maximum value PLocalMax(i) and minimum value PLocalMin(i), and each output P(i) of channel i in the frequency band power was expressed as a ratio Ratio(i). This confirmed that the vowels differ in the bands where their formant frequencies lie.
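The "known formula" is not given in the text. The sketch below assumes a min-max normalization per channel followed by division by the sum over channels, so that the ratios add up to 1; both steps and the function name `band_power_ratios` are assumptions for illustration, not the formula actually used in the embodiment.

```python
def band_power_ratios(power, local_min, local_max):
    """Return Ratio(i) for each channel i.

    power: band power P(i) of the current frame, one value per channel.
    local_min, local_max: PLocalMin(i) and PLocalMax(i) for each channel.
    Assumed normalization: (P - min) / (max - min), then share of the total.
    """
    normalized = []
    for p, lo, hi in zip(power, local_min, local_max):
        span = hi - lo
        normalized.append((p - lo) / span if span > 0 else 0.0)
    total = sum(normalized)
    if total == 0:
        return [0.0] * len(normalized)
    return [value / total for value in normalized]


# Three channels, as in the embodiment; the numbers are placeholders.
ratios = band_power_ratios([10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [10.0, 10.0, 10.0])
```

A first-pass decision could then compare the ratios, for example checking which channel dominates, to pick the vowel group.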

The second pass of the speech recognition unit 3 identifies the vowel within the group. Templates of formant frequencies surveyed in advance were prepared, and the distance between the template and the formant frequencies obtained from the speech signal was compared. Because the first and second formants differ in scale, discrimination by a simple distance comparison is difficult. A known formula was therefore used to resolve the scaling problem and determine the final candidate.
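The scaling formula is likewise not given. One common way to handle the scale difference between formants, shown here purely as an assumption, is to divide each formant difference by the template value before combining them. The function names and the template frequencies are hypothetical placeholders, not values from the embodiment.

```python
def scaled_formant_distance(observed, template):
    """Euclidean distance over relative formant differences (assumed scaling)."""
    return sum(((o - t) / t) ** 2 for o, t in zip(observed, template)) ** 0.5


def classify_vowel(observed, templates):
    """Return the vowel whose template is nearest to the observed formants."""
    return min(templates, key=lambda v: scaled_formant_distance(observed, templates[v]))


# Placeholder (F1, F2) templates in Hz for one vowel group; illustrative only.
templates = {"a": (800.0, 1300.0), "o": (500.0, 900.0)}
best = classify_vowel((780.0, 1250.0), templates)
```

Dividing by the template value makes a 50 Hz error in the first formant count for more than a 50 Hz error in the second, which is the kind of imbalance a plain distance would ignore.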

The first pass used frequency band power to distinguish the vowel group /a/, /u/, /o/ from the vowel group /i/, /e/, and the second pass used a spectral distance measure to identify the vowel within each group. The recognition templates were estimated in advance from three 20-year-old males, and the recognition experiment was conducted with four 20-year-old males: these three speakers plus one more. Each speaker uttered 40 sets of each Japanese vowel in a quiet laboratory environment. The sampling frequency was set to 44.1 kHz because environmental sound recognition was also taken into consideration.
Table 1 shows the error matrix of the recognition results and the vowel recognition rate for all speakers. The results confirm a recognition rate of about 75% for vowels uttered in isolation.

1 Speech recognition device
2 Signal processing unit
3 Speech recognition unit
4 Speech input means (microphone)
5 Output device

Claims

1. A sound recognition method capable of detecting the sound interval of a target sound having periodic stationarity in a noisy environment, the method comprising:
a first step of collecting an analog acoustic signal with sound input means and converting it into a digital waveform signal composed of frames; a second step of analyzing the digital waveform signal frame by frame to calculate an autocorrelation function and a second-order autocorrelation function; and a third step of determining, as the sound interval, the range in which the sum of absolute differences of the second-order autocorrelation function calculated for each frame exceeds a preset threshold.

2. The sound recognition method according to claim 1, wherein the target sound is speech.

3. A sound recognition apparatus comprising sound input means for collecting an analog acoustic signal, a signal processing unit for detecting a sound interval, and a sound recognition unit for determining the type of sound, the apparatus being capable of detecting the sound interval of a target sound having periodic stationarity in a noisy environment,
wherein the signal processing unit has: means for converting the signal into a digital waveform signal composed of frames; means for analyzing the digital waveform signal frame by frame to calculate an autocorrelation function and a second-order autocorrelation function; and means for determining, as the sound interval, the range in which the sum of absolute differences of the second-order autocorrelation function calculated for each frame exceeds a preset threshold.

4. The sound recognition apparatus according to claim 3, having a speech recognition function that, for the sound interval determined by the signal processing unit, recognizes a vowel group based on the ratio of frequency band power across a plurality of channels and determines the vowel within the recognized vowel group based on spectral distance.