JP2007292940A

JP2007292940A - Voice recognition device and voice recognition method

Info

Publication number: JP2007292940A
Application number: JP2006119436A
Authority: JP
Inventors: Kazuhide Okada; 一秀岡田; Atsuo Takasaya; 充生高佐屋
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2006-04-24
Filing date: 2006-04-24
Publication date: 2007-11-08

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device capable of accurately and rapidly recognizing whether the sound indicated by a sound signal is a voiceless sound or a voiced sound.

SOLUTION: The voice recognition device 10 of the invention is provided with: a generating section 13 that segments a sound signal along the time axis into a plurality of frames by shifting each segmentation start time; a calculation section 15 that calculates the number of zero crossings for each of the segmented frames; and a recognition section 17 that compares the number of zero crossings of each frame with that of an adjoining frame and, based on the result of the comparison, recognizes whether the phoneme corresponding to each frame is a voiceless sound or a voiced sound.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a voice identification device and a voice identification method.

There is a voice identification technique for identifying whether the sound indicated by an input voice signal is an unvoiced sound or a voiced sound. Voice identification is used in speech recognition, speech synthesis, and the like. As one such technique, Patent Document 1 below describes identifying whether the voice signal in a predetermined section indicates an unvoiced sound or a voiced sound on the basis of the number of zero crossings of the voice signal in that section and a preset zero-crossing threshold.
Patent Document 1: JP 2003-256000 A

However, the number of zero crossings of a voice signal in a predetermined section indicating an unvoiced sound may be either larger or smaller than that of a voice signal in a predetermined section indicating a voiced sound. It is therefore difficult to set an appropriate threshold for distinguishing unvoiced sounds from voiced sounds when the number of zero crossings of a predetermined section is used.

The modified correlation method is also known as a voice identification method. It obtains the autocorrelation function of the waveform of the input voice signal and then performs linear prediction analysis to obtain linear prediction coefficients for the analysis order. It then obtains the prediction residual, detects the peaks of the residual, measures the time distance between the peaks, and identifies the voice on the basis of the measurement result.

Because the modified correlation method requires such complicated operations, its identification speed is slow. It therefore cannot support the speech recognition and speech synthesis used in a real-time dialogue system.

An object of the present invention is therefore to provide a voice identification device and a voice identification method that can accurately and quickly identify whether the phoneme indicated by a voice signal is an unvoiced sound or a voiced sound.

As a result of their research, the present inventors found that the voice signal of an unvoiced sound has poorer periodicity than that of a voiced sound, and that its number of zero crossings fluctuates widely. The inventors conceived the present invention by pursuing research focused on this fluctuation in the number of zero crossings.

The voice identification device of the present invention comprises: frame generation means for cutting a voice signal along the time axis into a plurality of frames with mutually shifted cut-out start times; zero-cross calculation means for calculating the number of zero crossings corresponding to each of the cut-out frames; zero-cross comparison means for comparing, on the basis of the calculated numbers, the number of zero crossings in each of the frames with the number of zero crossings in a frame close to that frame; and phoneme identification means for identifying, on the basis of the comparison result, whether the phoneme corresponding to each of the frames is an unvoiced sound or a voiced sound.

The voice identification method of the present invention comprises: a first step in which frame generation means cuts a voice signal along the time axis into a plurality of frames with mutually shifted cut-out start times; a second step in which zero-cross calculation means calculates the number of zero crossings corresponding to each of the frames cut out in the first step; a third step in which zero-cross comparison means compares, on the basis of the numbers calculated in the second step, the number of zero crossings in each of the frames with the number of zero crossings in a frame close to that frame; and a fourth step in which phoneme identification means identifies, on the basis of the comparison result of the third step, whether the phoneme corresponding to each of the frames is an unvoiced sound or a voiced sound.

According to the present invention, the voice signal is cut into frames, and whether the phoneme corresponding to each frame is an unvoiced sound or a voiced sound is identified on the basis of the result of comparing the numbers of zero crossings of frames close to each other. Identification is therefore more accurate than when the number of zero crossings of each frame is compared with a threshold, and faster than with the modified correlation method.

In the voice identification device of the present invention, it is also preferable that the phoneme identification means identifies the phoneme corresponding to the later frame as an unvoiced sound when the difference between the number of zero crossings in each of the frames and the number of zero crossings in a frame close to that frame exceeds a predetermined threshold.

In the voice identification method of the present invention, it is also preferable that, in the fourth step, the phoneme identification means identifies the phoneme corresponding to the later frame as an unvoiced sound when the difference between the number of zero crossings in each of the frames and the number of zero crossings in a frame close to that frame exceeds a predetermined threshold.

In the voice identification device of the present invention, it is also preferable that the phoneme identification means identifies the phoneme corresponding to the later frame as a voiced sound when the difference between the number of zero crossings in each of the frames and the number of zero crossings in a frame close to that frame is smaller than a predetermined threshold.

In the voice identification method of the present invention, it is also preferable that, in the fourth step, the phoneme identification means identifies the phoneme corresponding to the later frame as a voiced sound when the difference between the number of zero crossings in each of the frames and the number of zero crossings in a frame close to that frame is smaller than a predetermined threshold.

According to these preferred aspects, the numbers of zero crossings of frames close to each other can be compared more accurately.

According to the present invention, unvoiced and voiced sounds are distinguished on the basis of the variation in the number of zero crossings between neighboring frames, so whether the sound indicated by a voice signal is an unvoiced sound or a voiced sound can be identified accurately and quickly.

The findings of the present invention can be readily understood by considering the following detailed description with reference to the accompanying drawings, which are given by way of example only. Embodiments of the present invention are described below with reference to the accompanying drawings. Where possible, identical parts are given identical reference numerals and redundant description is omitted.

A voice identification system according to an embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a configuration diagram of the voice identification system according to the present embodiment. The voice identification system 1 includes a microphone 3 and a voice identification device 10, and identifies whether each sound making up the words spoken by a person is a voiced sound or an unvoiced sound.

The microphone 3 is a condenser microphone. It samples the vibration of the sound of the words spoken by a person, converts it into a voice signal, and outputs the voice signal to the voice identification device 10. The voice identification device 10 identifies, on the basis of the voice signal input from the microphone 3, whether each sound making up the spoken words is a voiced sound or an unvoiced sound.

The voice identification device 10 according to the embodiment of the present invention is now described in more detail. As physical components, the voice identification device 10 is a personal computer including a CPU, a memory, a power supply, and an input/output interface unit. As functional components, as shown in FIG. 2, the voice identification device 10 includes a detection unit 11, a generation unit (frame generation means) 13, a calculation unit (zero-cross calculation means) 15, and an identification unit (zero-cross comparison means, phoneme identification means) 17. FIG. 2 is a functional block diagram of the voice identification device according to the present embodiment. Each functional component is described in turn below.

As shown in FIG. 3, the detection unit 11 detects an utterance section in the input voice signal S1. FIG. 3 is a graph showing the waveform of the voice signal input to the voice identification device according to the present embodiment. The detection unit 11 receives the voice signal S1 and detects the utterance section by applying a two-stage latch to the amplitude of the voice signal at both the beginning and the end of the utterance section.

That is, in the waveform of FIG. 3, when excitation of the amplitude occurs (point B) within a fixed time after point A, where the amplitude rises, the detection unit 11 takes point A as the beginning of the utterance section. When there is no excitation of the amplitude for 300 ms or more after point D, where the amplitude falls, the detection unit 11 takes point D as the end of the utterance section. In the waveform of FIG. 3, the amplitude is excited again within 300 ms of point C, where the amplitude falls, so the detection unit 11 does not recognize point C as the end of the utterance section.
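The start and end rules described above can be sketched roughly as follows. This is only an illustration, not the detection unit's actual implementation: the function name `detect_utterance`, the amplitude threshold, and the 50 ms onset window are assumptions, since the text gives only the 300 ms figure for the end of the utterance.

```python
def detect_utterance(samples, sample_rate, amp_threshold=0.1,
                     onset_window_ms=50.0, hangover_ms=300.0):
    """Return (start, end) sample indices of the first utterance, or None.

    amp_threshold and onset_window_ms are hypothetical values; only the
    300 ms hangover comes from the description.
    """
    onset_window = int(sample_rate * onset_window_ms / 1000)
    hangover = int(sample_rate * hangover_ms / 1000)
    active = [abs(s) >= amp_threshold for s in samples]
    n = len(samples)

    # Start: a rise (point A) counts only if the amplitude is excited
    # again (point B) within the onset window.
    start = None
    for a in range(n):
        rising = active[a] and (a == 0 or not active[a - 1])
        if rising and any(active[a + 1:a + 1 + onset_window]):
            start = a
            break
    if start is None:
        return None

    # End: a fall (point D) counts only if no excitation follows for at
    # least the hangover time; shorter pauses (point C) are bridged.
    last_active = start
    for i in range(start + 1, n):
        if active[i]:
            last_active = i
        elif i - last_active >= hangover:
            break
    return start, last_active
```

With a 1 kHz sample rate, a 100 ms pause inside the utterance is bridged, while a 400 ms pause ends it.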

After detecting the utterance section, the detection unit 11 applies pre-emphasis to the voice signal S2 of the detected utterance section, as shown in FIG. 4, and generates the voice signal S3. FIG. 4(a) is a graph showing the waveform of the voice signal S2 before pre-emphasis. FIG. 4(b) is a graph showing the waveform of the voice signal S3 after pre-emphasis. Pre-emphasis means emphasizing the high-frequency range of the voice signal. The amplitude of the voice signal S3 after pre-emphasis is larger than that of the voice signal S2 before pre-emphasis.

Specifically, the detection unit 11 performs pre-emphasis of the voice signal S2 according to equation (1) below.

[Equation (1)]

In equation (1), HSR[] denotes the voice signal before pre-emphasis, and HPrSR[] denotes the voice signal after pre-emphasis.
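Equation (1) itself is not reproduced in this text, so the following is only a sketch that assumes the conventional first-order pre-emphasis filter, HPrSR[n] = HSR[n] - alpha * HSR[n-1]. The coefficient `alpha` and the function name `pre_emphasize` are assumptions; the formula and constants actually used in the patent may differ.

```python
def pre_emphasize(hsr, alpha=0.97):
    """Conventional first-order pre-emphasis (assumed form, not equation (1)).

    hsr   : samples before pre-emphasis (HSR[] in the text)
    alpha : hypothetical filter coefficient; not given in the source
    """
    if not hsr:
        return []
    hprsr = [hsr[0]]  # the first sample has no predecessor
    for n in range(1, len(hsr)):
        hprsr.append(hsr[n] - alpha * hsr[n - 1])
    return hprsr
```

A constant (low-frequency) input is attenuated, while a rapidly alternating (high-frequency) input is boosted, which is the high-range emphasis the description refers to.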

FIG. 5(a) is a graph showing the spectrum of the voice signal S2 before pre-emphasis. FIG. 5(b) is a graph showing the spectrum of the voice signal S3 after pre-emphasis. As shown in FIG. 5, pre-emphasis amplifies the higher-frequency components of the signal. The powers "la" and "lb" at specific frequencies in the voice signal before pre-emphasis are emphasized to the larger powers "la'" and "lb'". The detection unit 11 outputs the pre-emphasized voice signal S3 to the generation unit 13.

The generation unit 13 generates a plurality of frames from the voice signal S3. As shown in FIG. 6, the generation unit 13 generates the frames by multiplying the voice signal S3 by a function H (a Hamming window) over 512 samples at a time, shifting by 30% each time. FIG. 6 is a graph showing the waveform of the shifted voice signal and the function H.

Specifically, the generation unit 13 generates the frames according to equation (2) below.

[Equation (2)]

In equation (2), Precise denotes the number of samples (512) and p denotes the order. The generation unit 13 generates the frames in the time-series order of the voice signal S3 and assigns frame numbers to the frames in that order. That is, the frame with frame number N (N is a positive integer) contains about 70% of the frame with frame number N-1, taken from its rear portion, and about 30% of signal corresponding to the part of the voice signal S3 that follows the frame with frame number N-1. The generation unit 13 outputs the generated frames to the calculation unit 15.
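The framing step can be sketched as below. Equation (2) is not reproduced in this text, so the window uses the standard Hamming definition, 0.54 - 0.46 cos(2πp / (Precise - 1)), as an assumption. The helper names `hop_length`, `hamming`, and `make_frames` are hypothetical, and rounding 30% of 512 samples to 154 is also a choice made here.

```python
import math


def hop_length(frame_len=512, shift_ratio=0.30):
    """Shift between consecutive frames, in samples (rounding is an assumption)."""
    return max(1, round(frame_len * shift_ratio))


def hamming(length):
    """Standard Hamming window (assumed to correspond to function H)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * p / (length - 1))
            for p in range(length)]


def make_frames(signal, frame_len=512, shift_ratio=0.30):
    """Cut the signal into windowed frames; the list index is the frame number."""
    hop = hop_length(frame_len, shift_ratio)
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + p] * window[p] for p in range(frame_len)])
    return frames
```

With these values each frame shares 358 of its 512 samples (about 70%) with the previous frame and adds 154 new samples (about 30%).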

The calculation unit 15 calculates the number of zero crossings corresponding to each frame. It removes noise from each frame and calculates the number of zero crossings corresponding to that frame. The number of zero crossings is the number of intersections between the waveform of the noise-removed signal and the horizontal axis of the graph. The calculation unit 15 first calculates, for each frame, the autocorrelation function shown in FIG. 7(a). FIG. 7(a) shows the autocorrelation function of a frame. Specifically, given stationary time-series data {x(t) | t = 0, ..., N-1} with zero mean, the calculation unit 15 calculates the autocorrelation function using equation (3) below.

[Equation (3); the only fragment that survives in this text is "XtXt-τ+1"]

The calculation unit 15 then takes a moving average over three consecutive values of the calculated autocorrelation function to obtain the averaged autocorrelation function shown in FIG. 7(b). FIG. 7(b) shows the averaged autocorrelation function obtained by taking the moving average of the autocorrelation function of FIG. 7(a). The calculation unit 15 uses the averaged autocorrelation function to calculate the number of zero crossings of each frame. The calculation unit 15 outputs the calculated number of zero crossings, together with the frame number of the corresponding frame, to the identification unit 17.
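A rough sketch of this calculation follows. Equation (3) is not reproduced in this text, so the code assumes a common biased autocorrelation estimator, R(τ) = (1/N) Σ x(t) x(t+τ), which may differ from the patent's formula in indexing or normalization. The function names are hypothetical, and subtracting the frame mean is added here only to satisfy the zero-mean condition stated in the text.

```python
import math


def autocorrelation(frame):
    """Biased autocorrelation estimate (assumed form, not equation (3))."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]  # enforce the zero-mean assumption
    return [sum(x[t] * x[t + tau] for t in range(n - tau)) / n
            for tau in range(n)]


def moving_average3(values):
    """Moving average over three consecutive values."""
    return [sum(values[i:i + 3]) / 3 for i in range(len(values) - 2)]


def count_zero_crossings(values):
    """Count sign changes, skipping samples that are exactly zero."""
    count = 0
    prev = 0
    for v in values:
        sign = (v > 0) - (v < 0)
        if sign == 0:
            continue
        if prev != 0 and sign != prev:
            count += 1
        prev = sign
    return count


def frame_zero_cross_count(frame):
    """Zero crossings of the smoothed autocorrelation of one frame."""
    return count_zero_crossings(moving_average3(autocorrelation(frame)))
```

For a periodic frame the smoothed autocorrelation is itself periodic, so a higher-pitched frame yields more zero crossings than a lower-pitched one of the same length.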

The identification unit 17 identifies whether the sound indicated by the frame with frame number N is an unvoiced sound or a voiced sound, on the basis of a comparison between the number of zero crossings of the frame with frame number N-1 and that of the frame with frame number N.

More specifically, when the difference between the number of zero crossings of the frame with frame number N-1 and that of the frame with frame number N is larger than a predetermined value, the identification unit 17 identifies the sound indicated by the frame with frame number N as an unvoiced sound. When that difference is smaller than the predetermined value, the identification unit 17 identifies the sound indicated by the frame with frame number N as a voiced sound.

This is explained more concretely with reference to FIG. 8. FIG. 8 shows the number of zero crossings of each frame. The region in which the numbers of zero crossings are, from the left in FIG. 8, 13, 20, 24, 30, 25, 30, 39, and 8, and thus fluctuate considerably, is frame region A, where the change in the number of zero crossings is unstable.

For each frame in frame region A, the difference from the number of zero crossings of the immediately preceding frame is relatively large. The identification unit 17 therefore identifies the sound indicated by each frame in frame region A as an unvoiced sound. That is, the identification unit 17 identifies a sound indicated by a frame whose number of zero crossings fluctuates widely as an unvoiced sound. The waveform of frame region A shown in FIG. 8 represents the consonant "s", which is an unvoiced sound.

The region that follows frame region A, in which the numbers of zero crossings are, from the left in FIG. 8, 7, 8, 8, and 8, and are thus relatively stable, is frame region B, where the change in the number of zero crossings is stable.

For each frame in frame region B, the difference from the number of zero crossings of the immediately preceding frame is relatively small. The identification unit 17 therefore identifies the sound indicated by each frame in frame region B as a voiced sound. That is, the identification unit 17 identifies a sound indicated by a frame whose number of zero crossings fluctuates little as a voiced sound. The waveform of frame region B shown in FIG. 8 represents the vowel "a", which is a voiced sound.
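The decision rule can be tried on the zero-crossing counts quoted from FIG. 8. The sketch below is illustrative only: the threshold value 3 is a hypothetical choice (the text says only "a predetermined value"), the difference is taken as an absolute value, and a difference exactly equal to the threshold is treated as voiced, following the NO branch of step S29.

```python
def classify_frames(zero_cross_counts, threshold):
    """Label each frame by comparing its count with the previous frame's count.

    Frame 0 has no preceding frame, so its label is None.
    """
    labels = [None]
    for n in range(1, len(zero_cross_counts)):
        diff = abs(zero_cross_counts[n] - zero_cross_counts[n - 1])
        labels.append("unvoiced" if diff > threshold else "voiced")
    return labels


# Counts of frame region A followed by frame region B, as quoted from FIG. 8.
fig8_counts = [13, 20, 24, 30, 25, 30, 39, 8, 7, 8, 8, 8]
labels = classify_frames(fig8_counts, threshold=3)  # threshold is an assumption
print(labels)
```

With this threshold, frames 1 to 7 (region A, "s") come out as unvoiced and frames 8 to 11 (region B, "a") as voiced, matching the description.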

FIG. 9 is a graph showing the variation in the number of zero crossings of the frames. The horizontal axis indicates the frame number and the vertical axis indicates the number of zero crossings of the frame. Curves X1 and X2 each show the variation in the number of zero crossings of a voice signal.

In region Y1 of curve X1 the number of zero crossings fluctuates widely, so the identification unit 17 identifies each sound indicated by the frames in region Y1 as an unvoiced sound. In region Y2 of curve X2 the number of zero crossings fluctuates little, so the identification unit 17 identifies each sound indicated by the frames in region Y2 as a voiced sound. In this way, the identification unit 17 detects the frame-to-frame variation in the number of zero crossings and identifies whether the sound indicated by a frame is an unvoiced sound or a voiced sound.

An unvoiced sound has poorer periodicity than a voiced sound, so its number of zero crossings varies widely from frame to frame. A voiced sound is more periodic than an unvoiced sound, so its number of zero crossings changes little from frame to frame. As described above, detecting the frame-to-frame variation in the number of zero crossings therefore makes it possible to identify whether the sound indicated by a frame is an unvoiced sound or a voiced sound.

Next, with reference to FIG. 10, the operation of the voice identification device 10 when identifying sounds from a voice signal is described, together with the voice identification method according to the present embodiment. FIG. 10 is a flowchart showing the operation of the voice identification device according to the present embodiment.

When the identification process starts, the voice identification system 1 shifts to recording mode and a .wav file is supplied (S21). After the shift to recording mode, the detection unit 11 detects the utterance section of the input voice signal S1 (S22). Once the voice signal S2 of the utterance section has been detected, the detection unit 11 applies pre-emphasis to the waveform of the voice signal S2 (S23).

Once the voice signal S2 has been pre-emphasized, the generation unit 13 generates frames from the pre-emphasized voice signal S3 (S24). Once the frames have been generated, the calculation unit 15 calculates the autocorrelation function of each frame (S25). Once the autocorrelation function has been calculated, the calculation unit 15 calculates its moving average (S26).

Once the moving average has been calculated, the calculation unit 15 calculates the number of zero crossings of the frame with frame number 0 (S27). The calculation unit 15 then calculates the number of zero crossings of the frame with frame number i (S28).

When the identification unit 17 determines that the difference between the number of zero crossings of the frame with frame number i-1 and that of the frame with frame number i is larger than the predetermined value (YES in S29), it identifies the sound (phoneme) indicated by the frame with frame number i as an unvoiced sound (S30).

When the identification unit 17 determines that the difference between the number of zero crossings of the frame with frame number i-1 and that of the frame with frame number i is smaller than the predetermined value (NO in S29), it identifies the sound (phoneme) indicated by the frame with frame number i as a voiced sound (S31).

Once the sound has been identified as unvoiced or voiced, the frame number i is incremented (S32). If the frame being identified is not the last frame of the utterance section (NO in S33), the process returns to step S28 and repeats until the frame being identified is the last frame of the utterance section.

If the frame being identified is the last frame of the utterance section (YES in S33), the identification process ends. In this way, each sound indicated by the voice signal of the utterance section is identified as a voiced sound or an unvoiced sound.
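Steps S23 to S33 can be strung together as in the compact sketch below. It is an illustration under several assumptions rather than the device's implementation: the pre-emphasis coefficient, the Hamming window formula, the autocorrelation estimator, and the threshold are not given in this text, all function names are hypothetical, and utterance detection (S21, S22) is omitted.

```python
import math


def pre_emphasize(x, alpha=0.97):                      # S23; alpha is assumed
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def make_frames(x, frame_len=512, shift_ratio=0.30):   # S24
    hop = max(1, round(frame_len * shift_ratio))
    win = [0.54 - 0.46 * math.cos(2 * math.pi * p / (frame_len - 1))
           for p in range(frame_len)]                  # assumed Hamming form
    return [[x[s + p] * win[p] for p in range(frame_len)]
            for s in range(0, len(x) - frame_len + 1, hop)]


def zero_cross_count(frame):                           # S25 to S28
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]
    acf = [sum(x[t] * x[t + tau] for t in range(n - tau)) / n
           for tau in range(n)]                        # assumed estimator
    smooth = [sum(acf[i:i + 3]) / 3 for i in range(n - 2)]
    signs = [1 if v > 0 else -1 for v in smooth if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def identify(signal, threshold):
    """Return one label per frame; frame 0 has no predecessor and gets None."""
    frames = make_frames(pre_emphasize(signal))
    if not frames:
        return []
    labels = [None]
    prev = zero_cross_count(frames[0])                 # S27
    for frame in frames[1:]:                           # loop S28 to S33
        cur = zero_cross_count(frame)                  # S28
        is_unvoiced = abs(cur - prev) > threshold      # S29
        labels.append("unvoiced" if is_unvoiced else "voiced")  # S30 / S31
        prev = cur                                     # S32
    return labels
```

A steady sine wave, standing in for a strongly periodic voiced sound, produces nearly identical counts in every frame and is therefore labeled voiced throughout.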

According to the present embodiment, whether the sound indicated by the frame with frame number N is an unvoiced sound or a voiced sound is identified on the basis of a comparison between the number of zero crossings of the frame with frame number N-1 and that of the frame with frame number N. The sound indicated by a frame can thus be identified from the variation in the number of zero crossings. That is, whether the sound indicated by the voice signal is an unvoiced sound or a voiced sound can be identified more simply and more accurately.

Further, according to the present embodiment, the identification unit 17 identifies the sound indicated by the frame with frame number N as an unvoiced sound when the difference between the number of zero crossings of the frame with frame number N-1 and that of the frame with frame number N is larger than a predetermined value. A sound indicated by a frame in a region where the number of zero crossings varies relatively widely can thus be identified as an unvoiced sound, so the sound is identified reliably.

Further, according to the present embodiment, the identification unit 17 identifies the sound indicated by the frame with frame number N as a voiced sound when the difference between the number of zero crossings of the frame with frame number N-1 and that of the frame with frame number N is smaller than a predetermined value. A sound indicated by a frame in a region where the number of zero crossings varies relatively little can thus be identified as a voiced sound, so the sound is identified reliably.

An identification test on Japanese consonants was carried out using the voice identification device 10. FIG. 11 is a table showing the results of the identification test using the voice identification device according to the present embodiment. In FIG. 11, "○" indicates that the identification by the voice identification device 10 was correct, and "×" indicates that it was incorrect. According to the test results in FIG. 11, the hit rate of voice identification using the voice identification device 10 is about 90%.

The hit rate of the modified correlation method, which is known as a conventional voice identification method, is about 80%. The modified correlation method first obtains the autocorrelation function of the waveform of the input voice signal and then performs linear prediction analysis to obtain linear prediction coefficients, one for each order. These linear prediction coefficients indicate the degree to which the current sample value can be expressed by its correlation with a number of past sample values. The modified correlation method then obtains the prediction residual, detects the peaks of the prediction residual, measures the time distance between the peaks, and performs identification based on the measurement result. Because the modified correlation method requires such complicated operations, identification takes time.

According to the test results in FIG. 11, the voice identification device 10 of the present embodiment can identify whether the sound indicated by a voice signal is an unvoiced sound or a voiced sound with an accuracy equal to or higher than that of the conventional modified correlation method. In addition, the voice identification device 10 of the present embodiment can make this identification by a simpler and faster method than the modified correlation method.

In the embodiment described above, the numbers of zero crosses of mutually adjacent frames are compared. However, embodiments of the present invention are not limited to this; it is sufficient that the numbers of zero crosses of frames close to each other are compared. In the present embodiment, the frames are generated while shifting by 30% each time, but it is also preferable, for example, to reduce the shift amount and perform finer framing. When the frame interval is shortened in this way, a frame two or more frames earlier may be used as the comparison target, as long as the correlation between the data is not impaired. It is thus also preferable to compare the numbers of zero crosses of frames that are close to each other, that is, frames close enough that the correlation between the data is not impaired.
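The variation described above, with a finer shift and a comparison against an earlier, non-adjacent frame, can be sketched as follows. This is a simplified sketch under stated assumptions: the shift is interpreted as a fraction of the frame length, zero crosses are counted directly on the raw samples of each frame (the embodiment's own processing chain may differ), and the names `make_frames`, `zero_cross_count`, `lagged_differences`, `shift_ratio`, and `lag` are hypothetical.

```python
from typing import List, Sequence


def make_frames(signal: Sequence[float], frame_len: int,
                shift_ratio: float = 0.3) -> List[Sequence[float]]:
    """Cut the signal into frames, shifting the start by shift_ratio * frame_len.

    Assumption: "shifting by 30%" means 30% of the frame length.
    """
    step = max(1, int(round(frame_len * shift_ratio)))
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, step)]


def zero_cross_count(frame: Sequence[float]) -> int:
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for prev, cur in zip(frame, frame[1:])
               if (prev >= 0) != (cur >= 0))


def lagged_differences(counts: Sequence[int], lag: int = 1) -> List[int]:
    """Absolute zero-cross difference between frame n and frame n - lag.

    lag=1 compares adjacent frames; lag>=2 compares with an earlier frame,
    which the text allows when the frame interval is short.
    """
    return [abs(counts[n] - counts[n - lag]) for n in range(lag, len(counts))]


if __name__ == "__main__":
    signal = [1.0, -1.0] * 10  # toy alternating signal, 20 samples
    frames = make_frames(signal, frame_len=10, shift_ratio=0.1)
    counts = [zero_cross_count(f) for f in frames]
    print(len(frames), lagged_differences(counts, lag=2))
```

Reducing `shift_ratio` increases the number of frames and the overlap between them, which is why a larger `lag` can still compare frames that cover nearby portions of the signal.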

FIG. 1 is a configuration diagram of the voice identification system according to the present embodiment.
FIG. 2 is a functional block diagram of the voice identification device according to the present embodiment.
FIG. 3 is a graph showing the waveform of a voice signal input to the voice identification device according to the present embodiment.
FIG. 4 is a graph showing the waveform of the voice signal before and after pre-emphasis.
FIG. 5 is a graph showing the spectrum of the voice signal before and after pre-emphasis.
FIG. 6 is a graph showing the waveform of the voice signal after shifting and the Hamming window.
FIG. 7 is a graph showing the autocorrelation function of a frame.
FIG. 8 is a diagram showing the number of zero crosses of each frame.
FIG. 9 is a graph showing the variation in the number of zero crosses of the frames.
FIG. 10 is a flowchart showing the operation of the voice identification device according to the present embodiment.
FIG. 11 is a table showing the results of the identification test using the voice identification device according to the present embodiment.

Explanation of symbols

1: voice identification system; 3: microphone; 10: voice identification device; 11: detection unit; 13: generation unit; 15: calculation unit; 17: identification unit.

Claims

A voice identification device comprising:
frame generation means for cutting a voice signal along the time axis into a plurality of frames while shifting the cut-out start time of each frame;
zero-cross calculating means for calculating the number of zero crosses corresponding to each of the plurality of cut-out frames;
zero-cross comparison means for comparing, based on the calculated numbers of zero crosses, the number of zero crosses in each of the plurality of frames with the number of zero crosses in a frame adjacent to that frame; and
phoneme identification means for identifying, based on the comparison result, whether the phoneme corresponding to each of the plurality of frames is an unvoiced sound or a voiced sound.

The voice identification device according to claim 1, wherein the phoneme identification means identifies the phoneme corresponding to the later frame as an unvoiced sound when the difference between the number of zero crosses in each of the plurality of frames and the number of zero crosses in a frame adjacent to that frame exceeds a predetermined threshold.

The voice identification device according to claim 1, wherein the phoneme identification means identifies the phoneme corresponding to the later frame as a voiced sound when the difference between the number of zero crosses in each of the plurality of frames and the number of zero crosses in a frame adjacent to that frame is smaller than a predetermined threshold.

A voice identification method comprising:
a first step in which frame generation means cuts a voice signal along the time axis into a plurality of frames while shifting the cut-out start time of each frame;
a second step in which zero-cross calculating means calculates the number of zero crosses corresponding to each of the plurality of frames cut out in the first step;
a third step in which zero-cross comparison means compares, based on the numbers of zero crosses calculated in the second step, the number of zero crosses in each of the plurality of frames with the number of zero crosses in a frame adjacent to that frame; and
a fourth step in which phoneme identification means identifies, based on the comparison result of the third step, whether the phoneme corresponding to each of the plurality of frames is an unvoiced sound or a voiced sound.

The voice identification method according to claim 4, wherein, in the fourth step, the phoneme identification means identifies the phoneme corresponding to the later frame as an unvoiced sound when the difference between the number of zero crosses in each of the plurality of frames and the number of zero crosses in a frame adjacent to that frame exceeds a predetermined threshold.

The voice identification method according to claim 4, wherein, in the fourth step, the phoneme identification means identifies the phoneme corresponding to the later frame as a voiced sound when the difference between the number of zero crosses in each of the plurality of frames and the number of zero crosses in a frame adjacent to that frame is smaller than a predetermined threshold.