JP4349415B2

JP4349415B2 - Sound signal processing apparatus and program

Info

Publication number: JP4349415B2
Application number: JP2006347789A
Authority: JP
Inventors: 靖雄吉岡
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-12-25
Filing date: 2006-12-25
Publication date: 2009-10-21
Anticipated expiration: 2026-12-25
Also published as: JP2008158316A

Description

The present invention relates to a technique for processing signals that represent various sounds such as voice and musical tones (hereinafter "sound signals"), and in particular to a technique for identifying the section of a sound signal in which the intended sound is actually produced (hereinafter a "sounding section").

Voice analysis such as voice recognition and voice authentication (speaker authentication) relies on techniques that divide a sound signal into sounding sections and non-sounding sections (sections containing only ambient noise). For example, a section in which the S/N ratio of the sound signal exceeds a predetermined threshold is identified as a sounding section. Patent Document 1 discloses a technique that decides whether each section of a divided sound signal is a sounding section or a non-sounding section by comparing its S/N ratio with the S/N ratio of sections previously judged to be non-sounding.
Patent Document 1: JP 2001-265367 A

In the technique of Patent Document 1, however, the distinction between sounding and non-sounding sections is settled solely by comparing the S/N ratio of each section of the sound signal with the S/N ratio of past non-sounding sections. As a result, a section containing momentary noise such as the speaker's cough, lip noise, or mouth sounds (a section that should be judged non-sounding) may be erroneously detected as a sounding section. Against this background, an object of the present invention is to improve the accuracy with which sounding sections are identified.

To solve the above problem, a sound signal processing device according to one form of the present invention comprises: frame information generating means for generating frame information for each frame of a sound signal; storage means for storing the frame information generated by the frame information generating means; first section specifying means for identifying a first sounding section of the sound signal (for example, sounding section P1 in FIG. 2); and second section specifying means for identifying a second sounding section (for example, sounding section P2 in FIG. 2), obtained by shortening the first sounding section, on the basis of the frame information that the storage means stores for each frame within the first sounding section identified by the first section specifying means.
In one aspect of the present invention, the frame information generating means generates first frame information and second frame information of different types for each frame of the sound signal; the first section specifying means identifies the first sounding section on the basis of the first frame information of each frame; and the second section specifying means identifies the second sounding section on the basis of the second frame information of each frame within the first sounding section.

With the above configuration, the second sounding section is identified by shortening the first sounding section on the basis of the frame information of each frame. The accuracy of identifying the sounding section can therefore be improved compared with a configuration that fixes the sounding section in a single stage of processing (for example, one that identifies only the first sounding section). The specific content of the frame information, and the specific method of identifying the second sounding section from it, are arbitrary in the present invention; the following aspects are examples.

In a first aspect, the frame information includes a signal index value corresponding to the signal level of the sound signal in each frame (for example, the signal level HIST_LEVEL or the S/N ratio R in the embodiments). The second section specifying means identifies the second sounding section by excluding certain frames from the frames of the first sounding section. The excluded frames are those that belong to at least one of a run of one or more consecutive frames starting at the start point of the first sounding section and a run of one or more consecutive frames extending backward from its end point, and whose signal index value is below a threshold that depends on the maximum signal index value within the first sounding section (for example, threshold TH1 in FIG. 6).
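As a rough illustration of this edge trimming, the following Python sketch drops frames from both ends of the first sounding section while their signal index value stays below a threshold derived from the section maximum. The function name `trim_by_peak_threshold` and the proportional rule `ratio * max(levels)` are assumptions made for illustration; the text only states that the threshold depends on the maximum.

```python
def trim_by_peak_threshold(levels, ratio=0.1):
    """Shorten P1 by dropping weak frames at both ends.

    levels: signal index value (e.g. HIST_LEVEL) of each frame in P1.
    ratio:  hypothetical factor; TH1 is assumed to be ratio * max(levels).
    Returns (start, stop) as inclusive indices into levels.
    """
    if not levels:
        raise ValueError("P1 must contain at least one frame")
    th1 = ratio * max(levels)
    start, stop = 0, len(levels) - 1
    # Advance the start point past leading frames below TH1.
    while start < stop and levels[start] < th1:
        start += 1
    # Pull the end point back past trailing frames below TH1.
    while stop > start and levels[stop] < th1:
        stop -= 1
    return start, stop
```

Only frames contiguous with either edge are removed; a weak frame in the middle of the section is left in place, which matches the "consecutive from the start point / end point" wording.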

Also in the first aspect, when the sum of the signal index values over a plurality of consecutive frames starting at the start point of the first sounding section falls below a threshold that depends on the maximum signal index value within the first sounding section (for example, threshold TH2 in FIG. 6), the second section specifying means identifies the second sounding section by excluding two or more of those frames on the start-point side. Similarly, when the sum of the signal index values over a plurality of consecutive frames extending backward from the end point of the first sounding section falls below a threshold that depends on that maximum, the second section specifying means identifies the second sounding section by excluding two or more of those frames on the end-point side.
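The windowed variant can be sketched in Python as follows. This is only one possible reading: the window length, the choice to drop the whole window at once, and the rule `TH2 = ratio * max(levels)` are all assumptions introduced here, since the text does not fix how many frames are summed or removed beyond "two or more".

```python
def trim_by_window_sum(levels, window=3, ratio=0.5):
    """Drop low-energy windows of frames from both ends of P1.

    levels: signal index value of each frame in P1.
    window: hypothetical number of frames summed (and dropped) per step.
    ratio:  hypothetical factor; TH2 is assumed to be ratio * max(levels).
    Returns (start, stop) as inclusive indices into levels.
    """
    if not levels:
        raise ValueError("P1 must contain at least one frame")
    if window < 2:
        raise ValueError("window must span at least two frames")
    th2 = ratio * max(levels)
    start, stop = 0, len(levels)  # half-open range [start, stop)
    # Start side: drop a window while its summed level is below TH2.
    while stop - start > window and sum(levels[start:start + window]) < th2:
        start += window
    # End side: same test on the window just before the end point.
    while stop - start > window and sum(levels[stop - window:stop]) < th2:
        stop -= window
    return start, stop - 1
```

Summing over several frames lets a short burst followed by near-silence be rejected even when a single frame of the burst is loud.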

Identifying the second sounding section according to the maximum signal index value within the first sounding section, as described above, effectively removes noise that occurs just before or after the actual utterance (for example, the speaker clearing their throat, or lip noise). A specific example of the first aspect is described later as the first embodiment.

In a second aspect, the frame information includes pitch data indicating the result of detecting the pitch of the sound signal in each frame. The second section specifying means identifies the second sounding section by excluding from the first sounding section those frames that belong to at least one of a run of one or more consecutive frames starting at the start point of the first sounding section and a run of one or more consecutive frames extending backward from its end point, and whose pitch data indicates that no pitch was detected. This aspect effectively removes noise that has no clearly identifiable pitch, such as wind noise. A specific example of the second aspect is described later as the second embodiment.
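A minimal Python sketch of the pitch-based trimming is shown below. Representing "pitch not detected" as `None` and the function name `trim_unpitched_edges` are assumptions for illustration; the pitch detector itself is outside the scope of the sketch.

```python
def trim_unpitched_edges(pitches):
    """Drop frames without a detected pitch from both ends of P1.

    pitches: one entry per frame in P1; a float (Hz) when a pitch was
             detected, or None when detection failed (assumed encoding).
    Returns (start, stop) as inclusive indices, or None if no frame in
    P1 has a detected pitch.
    """
    start, stop = 0, len(pitches) - 1
    # Skip leading frames whose pitch was not detected.
    while start <= stop and pitches[start] is None:
        start += 1
    # Skip trailing frames whose pitch was not detected.
    while stop >= start and pitches[stop] is None:
        stop -= 1
    return (start, stop) if start <= stop else None
```

Unpitched frames inside the section (for example, an unvoiced consonant between vowels) are kept, because only runs touching the start or end point are excluded.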

In a third aspect, the frame information includes the number of zero crossings of the sound signal in each frame. When frames whose zero-crossing count exceeds a threshold continue over a plurality of frames extending backward from the end point of the first sounding section, the second section specifying means identifies the second sounding section by excluding all of those frames except a predetermined number on the start-point side. In this aspect, the frames at the end of the first sounding section whose zero-crossing count exceeds the threshold (unvoiced consonants) are excluded except for the predetermined number, so the tail of the utterance (the unvoiced consonant) can be adjusted to a predetermined duration.
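The zero-crossing rule can be illustrated with the following Python sketch. The helper names, the sign-change definition of a zero crossing, and the parameter names `threshold` and `keep` are assumptions for illustration.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))


def trim_trailing_unvoiced(zc_counts, threshold, keep):
    """Limit a trailing run of high zero-crossing frames to `keep` frames.

    zc_counts: zero-crossing count of each frame in P1.
    threshold: count above which a frame is treated as unvoiced.
    keep:      number of frames of the trailing run to retain (start side).
    Returns the inclusive index of the new end point.
    """
    stop = len(zc_counts) - 1
    run = 0
    # Measure the run of frames above the threshold, walking back from the end.
    while run < len(zc_counts) and zc_counts[stop - run] > threshold:
        run += 1
    # Keep the first `keep` frames of the run and drop the rest.
    if run > keep:
        stop -= run - keep
    return stop
```

For example, with eight frames whose last five exceed the threshold and `keep=2`, the end point moves back by three frames.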

A sound signal processing device according to a preferred aspect of the present invention comprises: acquisition means for acquiring a start instruction (for example, the switching unit 583 in FIG. 3); noise level calculating means for calculating the noise level of the frames of the sound signal that precede acquisition of the start instruction; and S/N ratio calculating means for calculating an S/N ratio from the signal level of each frame that follows acquisition of the start instruction and the noise level calculated by the noise level calculating means. The first section specifying means identifies the first sounding section on the basis of the S/N ratio that the S/N ratio calculating means calculates for each frame. In this aspect, the S/N ratio of each frame after the start instruction is calculated by treating the frames before the start instruction as noise, so the first sounding section can be identified with high accuracy.

A sound signal processing device according to a preferred aspect of the present invention comprises: feature amount calculating means for sequentially calculating, for each frame of the sound signal, a feature amount that a sound analysis device separate from the sound signal processing device uses to analyze the sound signal; and output control means for outputting the feature amount of each frame corresponding to the first sounding section identified by the first section specifying means to the sound analysis device, one at a time, each time the feature amount calculating means calculates it. The second section specifying means notifies the sound analysis device of the second sounding section. In this aspect, the feature amounts are output to the sound analysis device as they are calculated, so the sound signal processing device does not need to hold the feature amounts of all frames belonging to the first sounding section. This reduces the circuit scale and processing load of the sound signal processing device. The effect is especially pronounced when the frame information of each frame is smaller in data size than the feature amount of each frame. In addition, because the sound analysis device is notified of the second sounding section, it can selectively use, among the feature amounts received from the output control means, those of the frames belonging to the second sounding section. This also improves the accuracy of the sound analysis device's analysis of the sound signal.
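On the receiving side, the selection described above might look like the following Python sketch. It assumes, purely for illustration, that the sound analysis device buffers each streamed feature amount together with its frame number and is later told the first and last frame numbers of the second sounding section; the function and parameter names are hypothetical.

```python
def select_p2_features(received, d2_start, d2_stop):
    """Keep only the feature amounts of frames inside P2.

    received: list of (frame_number, feature) pairs streamed during P1.
    d2_start: frame number of the first frame of P2 (inclusive).
    d2_stop:  frame number of the last frame of P2 (inclusive).
    """
    return [feature for frame_no, feature in received
            if d2_start <= frame_no <= d2_stop]
```

The sound signal processing device itself only needs to keep the lightweight frame information for P1; the bulky feature amounts live on the analysis side.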

The present invention is also specified as a method of operating the sound signal processing device according to each of the above aspects (a sound signal processing method). A sound signal processing method according to one aspect of the present invention generates frame information for each frame of a sound signal, identifies a first sounding section of the sound signal (for example, sounding section P1 in FIG. 2), and identifies a second sounding section (for example, sounding section P2 in FIG. 2), obtained by shortening the first sounding section, on the basis of the frame information generated for each frame within the identified first sounding section. This method provides the same operation and effects as the sound signal processing device according to the present invention.

The sound signal processing device according to each of the above aspects can be realized by hardware (electronic circuits) dedicated to each process, such as a DSP (Digital Signal Processor), or by cooperation between a general-purpose processing unit such as a CPU (Central Processing Unit) and a program. A program according to the present invention causes a computer to execute: a frame information generation process that generates frame information for each frame of a sound signal; a first section specifying process that identifies a first sounding section of the sound signal (for example, sounding section P1 in FIG. 2); and a second section specifying process that identifies a second sounding section (for example, sounding section P2 in FIG. 2), obtained by shortening the first sounding section, on the basis of the frame information generated by the frame information generation process for each frame within the first sounding section identified by the first section specifying process. This program also provides the same operation and effects as the sound signal processing device according to the present invention. The program of the present invention may be provided to users stored on a portable recording medium such as a CD-ROM and installed on a computer, or distributed from a server device over a network and installed on a computer.

<A: First Embodiment>
<A-1: Configuration>
FIG. 1 is a block diagram showing the configuration of a sound signal processing system according to one form of the present invention. As shown in the figure, the sound signal processing system comprises a sound collection device (microphone) 10, a sound signal processing device 20, an input device 70, and a sound analysis device 80. This embodiment illustrates a configuration in which the sound collection device 10, the input device 70, and the sound analysis device 80 are separate from the sound signal processing device 20, but some or all of these elements may form a single device.

The sound collection device 10 generates a sound signal S representing the waveform of the surrounding sound (voice and noise). FIG. 2 shows an example waveform of the sound signal S. The sound signal processing device 20 identifies, within the sound signal S generated by the sound collection device 10, the sounding section in which the speaker actually spoke. The input device 70 is a device (for example, a keyboard or mouse) that outputs a signal in response to user operation. By operating the input device 70, the user inputs an instruction TR (hereinafter the "start instruction") that triggers the sound signal processing device 20 to begin identifying the sounding section. The sound analysis device 80 is used to analyze the sound signal S. In this embodiment it is a voice authentication device that verifies the legitimacy of the speaker by comparing feature amounts extracted from the sound signal S with feature amounts registered in advance.

The sound signal processing device 20 includes a first section specifying unit 30, a second section specifying unit 40, a frame analysis unit 50, an output control unit 62, and a storage unit 64. The first section specifying unit 30, the second section specifying unit 40, the frame analysis unit 50, and the output control unit 62 may be realized by a processing unit such as a CPU executing a program, or by a hardware circuit such as a DSP.

The first section specifying unit 30 is means for identifying the sounding section P1 shown in FIG. 2 on the basis of the sound signal S. The second section specifying unit 40 is means for identifying the sounding section P2 in FIG. 2. The method by which the first section specifying unit 30 identifies the sounding section P1 differs from the method by which the second section specifying unit 40 identifies the sounding section P2. In this embodiment, the second section specifying unit 40 identifies the sounding section P2 by a more accurate method than the one the first section specifying unit 30 uses for the sounding section P1. Consequently, as shown in FIG. 2, the sounding section P2 is shorter than the sounding section P1.

The frame analysis unit 50 in FIG. 1 includes a dividing unit 52, a feature amount calculation unit 54, and a frame information generation unit 56. As shown in FIG. 2, the dividing unit 52 divides the sound signal S supplied from the sound collection device 10 into frames of a predetermined length (for example, several tens of milliseconds) and outputs them sequentially. The frames are set so that they overlap one another on the time axis.
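The division into overlapping frames can be sketched in Python as below. The frame length and hop size are free parameters here; the text only says that frames last several tens of milliseconds and overlap, so the concrete numbers and the function name are assumptions.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames.

    samples:   list of audio samples.
    frame_len: number of samples per frame.
    hop:       number of samples between frame starts; hop < frame_len
               makes consecutive frames overlap.
    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]
```

With a hop of half the frame length, each sample (away from the edges) appears in two frames.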

The feature amount calculation unit 54 calculates a feature amount C for each frame F of the sound signal S. The feature amount C is a parameter that the sound analysis device 80 uses to analyze the sound signal S. In this embodiment, the feature amount calculation unit 54 calculates mel-frequency cepstral coefficients (MFCC) as the feature amount C by frequency analysis that includes FFT (Fast Fourier Transform) processing. The feature amount C is calculated in real time, in synchronization with the supply of the sound signal S of each frame F (that is, it is calculated sequentially each time a frame of the sound signal S is supplied).

The frame information generation unit 56 generates frame information F_HIST for each frame F of the sound signal S output by the dividing unit 52. In this embodiment, the frame information generation unit 56 also includes a calculation unit 58 that calculates an S/N ratio R for each frame F. The S/N ratio R is the information that the first section specifying unit 30 uses to identify the sounding section P1. The frame information F_HIST, by contrast, is the information that the second section specifying unit 40 uses to shorten the sounding section P1 to the sounding section P2. Both the frame information F_HIST and the S/N ratio R are calculated in real time, in synchronization with the supply of the sound signal S of each frame F.

FIG. 3 is a block diagram showing the specific configuration of the calculation unit 58. As shown in the figure, the calculation unit 58 includes a level calculation unit 581, a switching unit 583, a noise level calculation unit 585, a storage unit 587, and an S/N ratio calculation unit 589. The level calculation unit 581 is means for sequentially calculating the level (intensity) of each frame F of the sound signal S supplied from the dividing unit 52. In this embodiment, the level calculation unit 581 calculates band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n], which are the levels of the components obtained when the sound signal S of one frame F is divided into n frequency bands (n is a natural number of 2 or more). The level calculation unit 581 is therefore realized by, for example, a plurality of band-pass filters with different passbands (a filter bank). Alternatively, the level calculation unit 581 may calculate the band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n] by frequency analysis such as FFT processing.
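As a sketch of the frequency-analysis route, the following Python code computes band-specific levels from a naive discrete Fourier transform. Defining "level" as summed spectral power, splitting the spectrum into equal-width bands, and the function name `band_levels` are assumptions for illustration; a practical implementation would use an FFT library or a filter bank instead of this O(N^2) transform.

```python
import cmath
import math


def band_levels(frame, n_bands):
    """Return n_bands levels (summed power) for one frame.

    Bins 0 .. N/2 - 1 of the DFT are split into n_bands equal-width
    groups; leftover bins at the top are ignored when N/2 is not a
    multiple of n_bands.
    """
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        # Naive DFT of bin k.
        x_k = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                  for t, s in enumerate(frame))
        power.append(abs(x_k) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]
```

For a 32-sample frame holding a pure tone with a period of four samples, the energy lands in DFT bin 8, which falls in the third of four bands.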

The frame information generation unit 56 in FIG. 1 calculates a signal level HIST_LEVEL for each frame F of the sound signal S. The frame information F_HIST of one frame F includes the signal level HIST_LEVEL calculated for that frame. As expressed by equation (1), the signal level HIST_LEVEL is the sum of the band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n]:

HIST_LEVEL = FRAME_LEVEL[1] + FRAME_LEVEL[2] + ... + FRAME_LEVEL[n]   ... (1)

The frame information F_HIST of one frame F is smaller in data size than the feature amount C (for example, MFCC) of that frame.

The switching unit 583 in FIG. 3 is means for selectively switching the destination of the band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n] calculated by the level calculation unit 581, in response to the start instruction TR input from the input device 70. More specifically, the switching unit 583 outputs the band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n] to the noise level calculation unit 585 before the start instruction TR is acquired, and to the S/N ratio calculation unit 589 after it is acquired.
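The routing behavior can be pictured with a small Python sketch. The class name `LevelSwitch` and the use of two lists as stand-ins for the noise level calculation unit and the S/N ratio calculation unit are assumptions for illustration only.

```python
class LevelSwitch:
    """Routes per-frame band levels depending on the start instruction."""

    def __init__(self):
        self.started = False
        self.to_noise = []  # stand-in for the noise level calculation unit
        self.to_snr = []    # stand-in for the S/N ratio calculation unit

    def start_instruction(self):
        """Called once when the start instruction TR is acquired."""
        self.started = True

    def feed(self, band_levels):
        """Forward one frame's band levels to the current destination."""
        target = self.to_snr if self.started else self.to_noise
        target.append(band_levels)
```

Frames fed before `start_instruction()` end up on the noise side; every later frame goes to the S/N side.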

As shown in FIG. 2, the noise level calculation unit 585 is means for calculating the noise levels NOISE_LEVEL[1] to NOISE_LEVEL[n] of the period P0 immediately before the switching unit 583 acquires the start instruction TR. The period P0 ends at the time of the start instruction TR and consists of a plurality of frames F (six in the example of FIG. 2). The noise level NOISE_LEVEL[i] for the i-th frequency band is the average of the band-specific level FRAME_LEVEL[i] over a predetermined number of frames F within the period P0. The noise levels NOISE_LEVEL[1] to NOISE_LEVEL[n] calculated by the noise level calculation unit 585 are stored sequentially in the storage unit 587.
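One way to keep only the most recent frames and average them per band is sketched below in Python. The class name `NoiseEstimator` and the use of a fixed-length `deque` are assumptions; the default of six frames simply mirrors the example of FIG. 2.

```python
from collections import deque


class NoiseEstimator:
    """Averages band levels over the last n_frames frames before TR."""

    def __init__(self, n_frames=6):
        # Older frames fall out automatically once the deque is full.
        self.history = deque(maxlen=n_frames)

    def push(self, band_levels):
        """Record FRAME_LEVEL[1..n] of one pre-instruction frame."""
        self.history.append(list(band_levels))

    def noise_levels(self):
        """Return NOISE_LEVEL[1..n] as per-band means over the history."""
        count = len(self.history)
        return [sum(band) / count for band in zip(*self.history)]
```

Because the start instruction can arrive at any time, the estimator is fed continuously and `noise_levels()` is read once, at the moment the instruction is acquired.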

The S/N ratio calculation unit 589 in FIG. 3 calculates the S/N ratio R for each frame F of the sound signal S and outputs it to the first section specifying unit 30. The S/N ratio R is a value corresponding to the ratio of the intensity of each frame F after the start instruction TR to the intensity of the noise in the period P0. In this embodiment, the S/N ratio calculation unit 589 calculates the S/N ratio R according to equation (2) from the band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n] of each frame F supplied from the switching unit 583 after the start instruction TR and the noise levels NOISE_LEVEL[1] to NOISE_LEVEL[n] stored in the storage unit 587.

The S/N ratio R calculated by equation (2) is an index of how large the current voice level is relative to the level of the noise around the sound collection device 10. When the user is not speaking, the S/N ratio R is close to 1, and it rises above 1 as the volume of the user's utterance increases. The first section specifying unit 30 therefore identifies the sounding section P1 of FIG. 2 on the basis of the S/N ratio R of each frame F. Roughly speaking, the set of frames F whose S/N ratio R exceeds a predetermined value is identified as the sounding section P1. In this embodiment, the S/N ratio R is calculated from the noise level of a predetermined number of frames F immediately before the start instruction TR (that is, immediately before the speaker begins to speak), so the influence of ambient noise on the identification of the sounding section P1 can be reduced.
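The following Python sketch illustrates the idea under explicit assumptions. Equation (2) is not reproduced in this text, so the sketch assumes R is the mean over bands of FRAME_LEVEL[i] / NOISE_LEVEL[i], which is consistent with R being close to 1 in silence but is not confirmed by the source. The P1 rule shown (first to last frame above a threshold) is likewise a simplification of the operation described later, and all names are hypothetical.

```python
def snr(frame_levels, noise_levels):
    """Assumed form of R: mean of per-band FRAME_LEVEL / NOISE_LEVEL.

    Noise levels are assumed to be non-zero.
    """
    ratios = [f / n for f, n in zip(frame_levels, noise_levels)]
    return sum(ratios) / len(ratios)


def find_p1(snr_per_frame, threshold):
    """Return (start, stop) spanning the frames whose R exceeds threshold.

    Simplification: the span runs from the first to the last frame above
    the threshold. Returns None when no frame qualifies.
    """
    above = [i for i, r in enumerate(snr_per_frame) if r > threshold]
    return (above[0], above[-1]) if above else None
```

When a frame's band levels equal the stored noise levels, `snr` returns exactly 1.0, matching the "close to 1 when the user is not speaking" behavior described above.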

As shown in FIG. 1, the first section specifying unit 30 includes a start point specifying unit 32 and an end point specifying unit 34. The start point specifying unit 32 identifies the start point P1_START of the sounding section P1 (FIG. 2) and generates start point data D1_START that identifies it. The end point specifying unit 34 identifies the end point P1_STOP of the sounding section P1 (FIG. 2) and generates end point data D1_STOP that identifies it. The start point data D1_START is the number assigned to the first frame F of the sounding section P1, and the end point data D1_STOP is the number assigned to the last frame F of the sounding section P1. As shown in FIG. 2, the sounding section P1 contains M1 frames F (M1 is a natural number). A specific example of the operation of the first section specifying unit 30 is described later.

The storage unit 64 is means for storing the frame information F_HIST generated by the frame information generation unit 56. Various storage devices, such as semiconductor, magnetic, and optical disc storage devices, are suitable for use as the storage unit 64. The storage unit 64 and the storage unit 587 may be separate storage areas defined within one storage device, or may each be a separate storage device.

本形態の記憶部６４は、フレーム情報生成部５６が順次に算定する多数のフレーム情報Ｆ_HISTのうち発音区間Ｐ1に属するＭ1個のフレームＦのフレーム情報Ｆ_HISTのみを選択的に記憶する。すなわち、記憶部６４は、始点特定部３２が始点Ｐ1_STARTを特定した時点で、当該始点Ｐ1_STARTに対応するフレームＦからフレーム情報Ｆ_HISTの記憶を開始し、終点特定部３４が終点Ｐ1_STOPを特定した時点で、当該終点Ｐ1_STOPに対応するフレームＦをもってフレーム情報Ｆ_HISTの記憶を終了する。 The storage unit 64 of this embodiment selectively stores only the frame information F_HIST of the M1 frames F belonging to the sound generation section P1 among the many pieces of frame information F_HIST calculated sequentially by the frame information generation unit 56. That is, the storage unit 64 starts storing the frame information F_HIST from the frame F corresponding to the start point P1_START when the start point specifying unit 32 specifies the start point P1_START, and when the end point specifying unit 34 specifies the end point P1_STOP. Then, the storage of the frame information F_HIST is terminated with the frame F corresponding to the end point P1_STOP.

第２区間特定部４０は、記憶部６４に格納されたＭ1個のフレーム情報Ｆ_HIST（信号レベルHIST_LEVEL）に基づいて図２の発音区間Ｐ2を特定する。図１に示すように第２区間特定部４０は始点特定部４２と終点特定部４４とを含む。図２に示すように、始点特定部４２は、発音区間Ｐ1の始点Ｐ1_STARTからフレーム情報Ｆ_HISTに応じた時間長（フレーム数）だけ経過した時点を発音区間Ｐ2の始点Ｐ2_STARTとして特定し、当該始点Ｐ2_STARTを識別するための始点データＤ2_STARTを生成する。終点特定部４４は、発音区間Ｐ1の終点Ｐ1_STOPからフレーム情報Ｆ_HISTに応じた時間長（フレーム数）だけ手前の時点を発音区間Ｐ2の終点Ｐ2_STOPとして特定し、当該終点Ｐ2_STOPを識別するための終点データＤ2_STOPを生成する。始点データＤ2_STARTは発音区間Ｐ2の先頭のフレームＦの番号であり、終点データＤ2_STOPは発音区間Ｐ2の最後のフレームＦの番号である。始点データＤ2_STARTと終点データＤ2_STOPとは音解析装置８０に出力される。図２に示すように、発音区間Ｐ2はＭ2個（Ｍ2は自然数）のフレームＦを含む（Ｍ2＜Ｍ1）。なお、第２区間特定部４０の動作の具体例は後述する。 The second section specifying unit 40 specifies the sound generation section P2 of FIG. 2 based on the M1 pieces of frame information F_HIST (signal level HIST_LEVEL) stored in the storage unit 64. As shown in FIG. 1, the second section specifying unit 40 includes a start point specifying unit 42 and an end point specifying unit 44. As shown in FIG. 2, the start point specifying unit 42 specifies, as the start point P2_START of the sound generation section P2, the time point at which a time length (number of frames) corresponding to the frame information F_HIST has elapsed from the start point P1_START of the sound generation section P1, and generates start point data D2_START for identifying the start point P2_START. The end point specifying unit 44 specifies, as the end point P2_STOP of the sound generation section P2, the time point that precedes the end point P1_STOP of the sound generation section P1 by a time length (number of frames) corresponding to the frame information F_HIST, and generates end point data D2_STOP for identifying the end point P2_STOP. The start point data D2_START is the number of the first frame F of the sound generation section P2, and the end point data D2_STOP is the number of the last frame F of the sound generation section P2. The start point data D2_START and the end point data D2_STOP are output to the sound analysis device 80. As shown in FIG. 2, the sound generation section P2 includes M2 frames F (M2 is a natural number, M2 < M1). A specific example of the operation of the second section specifying unit 40 will be described later.

図１の出力制御部６２は、特徴量算定部５４が各フレームＦについて順次に算定する特徴量Ｃを選択的に音解析装置８０に出力する手段である。本形態の出力制御部６２は、発音区間Ｐ1に属する各フレームＦの特徴量Ｃを音解析装置８０に出力する一方、発音区間Ｐ1以外の各フレームＦの特徴量Ｃを破棄する（音解析装置８０に出力しない）。すなわち、出力制御部６２は、始点特定部３２が始点Ｐ1_STARTを特定した時点で、当該始点Ｐ1_STARTに対応したフレームＦから特徴量Ｃの出力を開始し、以後の各フレームＦについては特徴量算定部５４による算定に同期して実時間的に特徴量Ｃを出力する（すなわち各フレームＦの特徴量Ｃが特徴量算定部５４から供給されるたびに音解析装置８０に出力する）。そして、出力制御部６２は、終点特定部３４が終点Ｐ1_STOPを特定した時点で、当該終点Ｐ1_STOPに対応するフレームＦをもって特徴量Ｃの出力を終了する。 The output control unit 62 in FIG. 1 is means for selectively outputting to the sound analysis device 80 the feature amount C that the feature amount calculation unit 54 sequentially calculates for each frame F. The output control unit 62 of this embodiment outputs the feature amount C of each frame F belonging to the sound generation section P1 to the sound analysis device 80, and discards the feature amount C of each frame F outside the sound generation section P1 (does not output it to the sound analysis device 80). That is, when the start point specifying unit 32 specifies the start point P1_START, the output control unit 62 starts outputting the feature amount C from the frame F corresponding to the start point P1_START, and for each subsequent frame F it outputs the feature amount C in real time in synchronization with the calculation by the feature amount calculation unit 54 (that is, it outputs the feature amount C of each frame F to the sound analysis device 80 each time it is supplied from the feature amount calculation unit 54). When the end point specifying unit 34 specifies the end point P1_STOP, the output control unit 62 ends the output of the feature amount C with the frame F corresponding to the end point P1_STOP.

図１に示すように、音解析装置８０は記憶部８２と制御部８４とを具備する。記憶部８２は、特定の発声者の音声から抽出された特徴量（以下「登録特徴量」という）の集合を予め記憶する。さらに、記憶部８２は、出力制御部６２から出力された特徴量Ｃを記憶する。すなわち、発音区間Ｐ1に属するＭ1個のフレームＦの各々の特徴量Ｃが記憶部８２に格納される。 As shown in FIG. 1, the sound analysis device 80 includes a storage unit 82 and a control unit 84. The storage unit 82 stores in advance a set of feature amounts (hereinafter referred to as “registered feature amounts”) extracted from the voice of a specific speaker. Further, the storage unit 82 stores the feature amount C output from the output control unit 62. That is, the feature amount C of each of the M1 frames F belonging to the sound generation section P1 is stored in the storage unit 82.

第２区間特定部４０が生成した始点データＤ2_STARTおよび終点データＤ2_STOPは制御部８４に供給される。制御部８４は、記憶部８２に格納されたＭ1個の特徴量Ｃのうち始点データＤ2_STARTと終点データＤ2_STOPとで画定される発音区間Ｐ2内のＭ2個の特徴量Ｃを使用して音信号Ｓを解析する。例えば、制御部８４は、ＤＰマッチングなど各種のパターンマッチング技術を利用して発音区間Ｐ2内の各特徴量Ｃと各登録特徴量との距離を算定し、この算定した距離に基づいて今回の発声者の正当性（発声者が予め登録された正規の利用者であるか否か）を判定する。 The start point data D2_START and the end point data D2_STOP generated by the second section specifying unit 40 are supplied to the control unit 84. The control unit 84 analyzes the sound signal S using, among the M1 feature amounts C stored in the storage unit 82, the M2 feature amounts C within the sound generation section P2 defined by the start point data D2_START and the end point data D2_STOP. For example, the control unit 84 uses a pattern matching technique such as DP matching to calculate the distance between each feature amount C in the sound generation section P2 and each registered feature amount, and determines the legitimacy of the current speaker (whether or not the speaker is an authorized user registered in advance) based on the calculated distance.
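As an illustrative sketch only, the selection of the M2 feature amounts and a DP-matching style distance could look as follows in Python. The description does not specify the form of the feature amount C, the local distance, or the path constraints, so the Euclidean local distance, the three-way recurrence, and the names `select_p2` and `dp_matching_distance` are all assumptions made for this example.

```python
import math


def select_p2(features_p1, d1_start, d2_start, d2_stop):
    """Pick the feature amounts of section P2 out of those stored for P1.

    features_p1[0] is assumed to belong to frame number d1_start.
    d2_start and d2_stop are inclusive frame numbers.
    """
    return features_p1[d2_start - d1_start : d2_stop - d1_start + 1]


def dp_matching_distance(seq_a, seq_b):
    """Accumulated distance of the best monotonic alignment of two sequences.

    Each element is a tuple of floats (one feature vector per frame).
    The local distance is Euclidean; this is an assumption, not the patent's.
    """
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch seq_b
                cost[i][j - 1],      # stretch seq_a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]
```

A caller would compare the returned distance against a decision threshold of its own choosing; the description leaves that threshold unspecified.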

以上に説明したように、本形態においては、発音区間Ｐ1の特定に並行して各フレームＦの特徴量Ｃが実時間的に音解析装置８０に出力されるから、発音区間Ｐ1内の総てのフレームＦの特徴量Ｃを発音区間Ｐ1の確定（終点Ｐ1_STOPの確定）まで音信号処理装置２０が保持しておく必要はない。したがって、音信号処理装置２０の規模を縮小することが可能である。また、音解析装置８０においては発音区間Ｐ1をさらに絞り込んだ発音区間Ｐ2内の各特徴量Ｃが音信号Ｓの解析に使用されるから、発音区間Ｐ1内の総ての特徴量Ｃを対象として音信号Ｓの解析が実行される構成と比較して、制御部８４による処理の負荷が軽減されるとともに解析の精度（例えば発声者の正当性を認証する精度）が向上するという利点もある。 As described above, in this embodiment the feature amount C of each frame F is output to the sound analysis device 80 in real time, in parallel with the specification of the sound generation section P1. The sound signal processing device 20 therefore does not need to hold the feature amounts C of all the frames F in the sound generation section P1 until the section is fixed (until the end point P1_STOP is fixed), and the scale of the sound signal processing device 20 can be reduced. Furthermore, the sound analysis device 80 analyzes the sound signal S using the feature amounts C in the sound generation section P2, which narrows down the sound generation section P1. Compared with a configuration that analyzes the sound signal S over all the feature amounts C in the sound generation section P1, this reduces the processing load on the control unit 84 and improves the accuracy of the analysis (for example, the accuracy of authenticating the legitimacy of the speaker).

＜Ａ−２：動作＞ <A-2: Operation>
次に、発音区間Ｐ1および発音区間Ｐ2を特定する処理を中心として音信号処理装置２０の具体的な動作を説明する。 Next, a specific operation of the sound signal processing device 20 will be described with a focus on the process of specifying the sound generation section P1 and the sound generation section P2.

音信号処理装置２０が起動すると、図３のレベル算定部５８１は、音信号Ｓの各フレームＦについて帯域別レベルFRAME_LEVEL[1]〜FRAME_LEVEL[n]を継続的に算定する。利用者が自身の発声に先立って入力装置７０から開始指示ＴＲを入力すると、雑音レベル算定部５８５は、開始指示ＴＲの直前の所定個のフレームＦの帯域別レベルFRAME_LEVEL[1]〜FRAME_LEVEL[n]から雑音レベルNOISE_LEVEL[1]〜NOISE_LEVEL[n]を算定して記憶部５８７に格納する。一方、Ｓ/Ｎ比算定部５８９は、開始指示ＴＲ後の各フレームＦの帯域別レベルFRAME_LEVEL[1]〜FRAME_LEVEL[n]と記憶部５８７の雑音レベルNOISE_LEVEL[1]〜NOISE_LEVEL[n]とに応じたＳ/Ｎ比Ｒを算定する。 When the sound signal processing device 20 is activated, the level calculation unit 581 in FIG. 3 continuously calculates the band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n] for each frame F of the sound signal S. When the user inputs the start instruction TR from the input device 70 before speaking, the noise level calculation unit 585 calculates the noise levels NOISE_LEVEL[1] to NOISE_LEVEL[n] from the band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n] of a predetermined number of frames F immediately before the start instruction TR, and stores them in the storage unit 587. The S/N ratio calculation unit 589, in turn, calculates the S/N ratio R from the band-specific levels FRAME_LEVEL[1] to FRAME_LEVEL[n] of each frame F after the start instruction TR and the noise levels NOISE_LEVEL[1] to NOISE_LEVEL[n] in the storage unit 587.

（ａ）第１区間特定部３０の動作 (a) Operation of the first section specifying unit 30
第１区間特定部３０は、開始指示ＴＲを契機として、発音区間Ｐ1を特定するための処理を開始する。すなわち、始点特定部３２が始点Ｐ1_STARTを特定する処理（図４）と、終点特定部３４が終点Ｐ1_STOPを特定する処理（図５）とが実行される。各処理について詳述すると以下の通りである。 The first section specifying unit 30 starts the process for specifying the sound generation section P1 in response to the start instruction TR. That is, a process in which the start point specifying unit 32 specifies the start point P1_START (FIG. 4) and a process in which the end point specifying unit 34 specifies the end point P1_STOP (FIG. 5) are executed. The details of each process are as follows.

図４に示すように、始点特定部３２は、始点データＤ1_STARTをクリアするとともに変数CNT_START1と変数CNT_START2とをゼロに初期化する（ステップＳA1）。次いで、始点特定部３２は、Ｓ/Ｎ比算定部５８９からひとつのフレームＦのＳ/Ｎ比Ｒを取得し（ステップＳA2）、変数CNT_START2に「１」を加算する（ステップＳA3）。 As shown in FIG. 4, the start point specifying unit 32 clears the start point data D1_START and initializes the variables CNT_START1 and CNT_START2 to zero (step SA1). Next, the start point specifying unit 32 acquires the S / N ratio R of one frame F from the S / N ratio calculating unit 589 (step SA2), and adds “1” to the variable CNT_START2 (step SA3).

次に、始点特定部３２は、ステップＳA2で取得したＳ/Ｎ比Ｒが所定の閾値SNR_TH1を上回るか否かを判定する（ステップＳA4）。Ｓ/Ｎ比Ｒが閾値SNR_TH1を上回るフレームＦは発音区間Ｐ1内のフレームＦである可能性が高いが、周囲の雑音や電気的なノイズに起因してＳ/Ｎ比Ｒが突発的に閾値SNR_TH1を上回る場合もある。そこで、本形態においては以下に説明するように、Ｓ/Ｎ比Ｒが最初に閾値SNR_TH1を上回ったフレームＦを始点とした所定個のフレームＦ（以下「候補フレーム群」という）のうちＳ/Ｎ比Ｒが閾値SNR_TH1を超えるフレームＦがＮ1個を上回る場合に、最初のフレームＦを発音区間Ｐ1の始点Ｐ1_STARTとして特定する。 Next, the start point specifying unit 32 determines whether or not the S/N ratio R acquired in step SA2 exceeds a predetermined threshold SNR_TH1 (step SA4). A frame F whose S/N ratio R exceeds the threshold SNR_TH1 is likely to be a frame F within the sound generation section P1, but the S/N ratio R may also exceed the threshold SNR_TH1 momentarily because of ambient noise or electrical noise. In this embodiment, therefore, as described below, when more than N1 frames F have an S/N ratio R exceeding the threshold SNR_TH1 within a predetermined number of frames F starting from the frame F whose S/N ratio R first exceeded the threshold SNR_TH1 (hereinafter referred to as the “candidate frame group”), that first frame F is specified as the start point P1_START of the sound generation section P1.

ステップＳA4の結果が肯定である場合、始点特定部３２は、変数CNT_START1がゼロであるか否かを判定する（ステップＳA5）。変数CNT_START1がゼロであるということは今回のフレームＦが候補フレーム群の最初のフレームＦであることを意味している。したがって、ステップＳA5の結果が肯定である場合、始点特定部３２は、始点データＤ1_STARTを今回のフレームＦの番号に仮設定する（ステップＳA6）とともに変数CNT_START2をゼロに初期化する（ステップＳA7）。すなわち、今回のフレームＦが発音区間Ｐ1の始点Ｐ1_STARTとして仮定される。一方、ステップＳA5の結果が否定である場合、始点特定部３２は、ステップＳA6およびステップＳA7を経ることなく処理をステップＳA8に移行する。 If the result of step SA4 is affirmative, the start point specifying unit 32 determines whether or not the variable CNT_START1 is zero (step SA5). The variable CNT_START1 being zero means that the current frame F is the first frame F of the candidate frame group. Therefore, if the result of step SA5 is affirmative, the start point specifying unit 32 temporarily sets the start point data D1_START to the number of the current frame F (step SA6) and initializes the variable CNT_START2 to zero (step SA7). That is, the current frame F is assumed as the start point P1_START of the sound generation section P1. On the other hand, if the result of step SA5 is negative, the start point specifying unit 32 proceeds to step SA8 without passing through step SA6 and step SA7.

始点特定部３２は、変数CNT_START1に「１」を加算した（ステップＳA8）うえで、加算後の変数CNT_START1が所定値Ｎ1を上回るか否かを判定する（ステップＳA9）。ステップＳA9の結果が肯定である場合、始点特定部３２は、直前のステップＳA6で仮設定したフレームＦの番号を正式な始点データＤ1_STARTとして確定する（ステップＳA10）。すなわち、発音区間Ｐ1の始点Ｐ1_STARTが特定される。ステップＳA10において、始点特定部３２は、始点データＤ1_STARTを第２区間特定部４０に出力するとともに、始点Ｐ1_STARTの確定を出力制御部６２および記憶部６４に通知する。第１区間特定部３０からの通知を契機として、出力制御部６２による特徴量Ｃの出力と記憶部６４によるフレーム情報Ｆ_HISTの記憶とが開始される。 The start point specifying unit 32 adds “1” to the variable CNT_START1 (step SA8), and determines whether or not the added variable CNT_START1 exceeds a predetermined value N1 (step SA9). If the result of step SA9 is affirmative, the start point specifying unit 32 determines the number of the frame F temporarily set in the immediately preceding step SA6 as the official start point data D1_START (step SA10). That is, the starting point P1_START of the sound generation section P1 is specified. In step SA10, the start point specifying unit 32 outputs the start point data D1_START to the second section specifying unit 40, and notifies the output control unit 62 and the storage unit 64 of the confirmation of the start point P1_START. With the notification from the first section specifying unit 30, the output of the feature value C by the output control unit 62 and the storage of the frame information F_HIST by the storage unit 64 are started.

ステップＳA9の結果が否定である場合（すなわち候補フレーム群のうちＳ/Ｎ比Ｒが閾値SNR_TH1を上回るフレームＦが未だＮ1個以下である場合）、始点特定部３２は、次のフレームＦについてＳ/Ｎ比Ｒを取得した（ステップＳA2）うえでステップＳA3以後の処理を実行する。以上のようにひとつのフレームＦのＳ/Ｎ比Ｒが閾値SNR_TH1を上回るだけでは始点Ｐ1_STARTが確定されないから、例えば周囲の雑音や電気的なノイズに起因したＳ/Ｎ比Ｒの上昇を発音区間Ｐ1の始点Ｐ1_STARTと誤認する可能性は低減される。 When the result of step SA9 is negative (that is, when the number of frames F in the candidate frame group whose S/N ratio R exceeds the threshold SNR_TH1 is still N1 or less), the start point specifying unit 32 acquires the S/N ratio R of the next frame F (step SA2) and then executes the processing from step SA3 onward. Because the start point P1_START is not fixed merely because the S/N ratio R of a single frame F exceeds the threshold SNR_TH1, a rise in the S/N ratio R caused by, for example, ambient noise or electrical noise is less likely to be mistaken for the start point P1_START of the sound generation section P1.

一方、ステップＳA4の結果が否定である場合（すなわちＳ/Ｎ比Ｒが閾値SNR_TH1以下である場合）、始点特定部３２は、変数CNT_START2が所定値Ｎ2を上回るか否かを判定する（ステップＳA11）。変数CNT_START2が所定値Ｎ2を上回るということは、候補フレーム群のＮ2個のフレームＦのうちＳ/Ｎ比Ｒが閾値SNR_TH1を上回るフレームＦがＮ1個以下であったことを意味している。そこで、ステップＳA11の結果が肯定である場合、始点特定部３２は、変数CNT_START1をゼロに初期化した（ステップＳA12）うえで処理をステップＳA2に移行する。ステップＳA12の直後にＳ/Ｎ比Ｒが閾値SNR_TH1を上回ると（ステップＳA4：YES）、ステップＳA5の結果が肯定となってステップＳA6およびステップＳA7が実行される。すなわち、新たにＳ/Ｎ比Ｒが閾値SNR_TH1を超えたフレームＦが始点となるように候補フレーム群が更新される。一方、ステップＳA11の結果が否定である場合、始点特定部３２は、ステップＳA12を経ることなく処理をステップＳA2に移行する。 On the other hand, when the result of step SA4 is negative (that is, when the S/N ratio R is equal to or less than the threshold SNR_TH1), the start point specifying unit 32 determines whether or not the variable CNT_START2 exceeds a predetermined value N2 (step SA11). The variable CNT_START2 exceeding the predetermined value N2 means that, among the N2 frames F of the candidate frame group, the number of frames F whose S/N ratio R exceeded the threshold SNR_TH1 was N1 or less. Therefore, when the result of step SA11 is affirmative, the start point specifying unit 32 initializes the variable CNT_START1 to zero (step SA12) and then returns to step SA2. If the S/N ratio R exceeds the threshold SNR_TH1 immediately after step SA12 (step SA4: YES), the result of step SA5 is affirmative and steps SA6 and SA7 are executed. That is, the candidate frame group is updated so that it starts from the frame F whose S/N ratio R has newly exceeded the threshold SNR_TH1. When the result of step SA11 is negative, the start point specifying unit 32 returns to step SA2 without passing through step SA12.
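As an illustrative sketch only, steps SA1 to SA12 of FIG. 4 could be written in Python as follows. The function name `find_start_p1`, the use of a list index as the frame number, and the return value of `None` when no start point is found are assumptions made for this example; the threshold and counter names follow the description.

```python
def find_start_p1(snr, snr_th1, n1, n2):
    """Return the frame number of P1_START, or None if it is never fixed.

    snr     : per-frame S/N ratios R after the start instruction TR
    snr_th1 : threshold SNR_TH1
    n1, n2  : predetermined values N1 and N2
    """
    d1_start = None          # SA1: clear D1_START
    cnt_start1 = 0           # SA1: frames above threshold in the candidate group
    cnt_start2 = 0           # SA1: frames elapsed since the candidate group began
    for frame_no, r in enumerate(snr):       # SA2: acquire R of one frame
        cnt_start2 += 1                      # SA3
        if r > snr_th1:                      # SA4
            if cnt_start1 == 0:              # SA5: first frame of a new group
                d1_start = frame_no          # SA6: provisional start point
                cnt_start2 = 0               # SA7
            cnt_start1 += 1                  # SA8
            if cnt_start1 > n1:              # SA9
                return d1_start              # SA10: start point fixed
        elif cnt_start2 > n2:                # SA11
            cnt_start1 = 0                   # SA12: abandon the candidate group
    return None
```

For example, with `snr_th1=2.0`, `n1=2`, `n2=3`, an isolated spike at frame 1 followed by four quiet frames is abandoned, and a sustained rise beginning at frame 6 is fixed as the start point.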

図４の処理で始点Ｐ1_STARTが特定されると、今度は発音区間Ｐ1の終点Ｐ1_STOPを特定する処理（図５）が終点特定部３４によって実行される。終点特定部３４は、Ｓ/Ｎ比Ｒが閾値SNR_TH2を下回るフレームＦがＮ3個を超えた場合に、Ｓ/Ｎ比Ｒが最初に閾値SNR_TH2を下回ったフレームＦを終点Ｐ1_STOPとして特定する。 When the start point P1_START is specified in the process of FIG. 4, the end point specifying unit 34 executes a process of specifying the end point P1_STOP of the sound generation section P1 (FIG. 5). When the number of frames F whose S / N ratio R is lower than the threshold SNR_TH2 exceeds N3, the end point specifying unit 34 specifies the frame F whose S / N ratio R first falls below the threshold SNR_TH2 as the end point P1_STOP.

図５に示すように、終点特定部３４は、終点データＤ1_STOPをクリアするとともに変数CNT_STOPをゼロに初期化した（ステップＳB1）うえで、Ｓ/Ｎ比算定部５８９からＳ/Ｎ比Ｒを取得する（ステップＳB2）。次いで、終点特定部３４は、ステップＳB2で取得したＳ/Ｎ比Ｒが所定の閾値SNR_TH2を下回るか否かを判定する（ステップＳB3）。 As shown in FIG. 5, the end point specifying unit 34 clears the end point data D1_STOP and initializes the variable CNT_STOP to zero (step SB1), and acquires the S / N ratio R from the S / N ratio calculation unit 589. (Step SB2). Next, the end point specifying unit 34 determines whether or not the S / N ratio R acquired in Step SB2 is lower than a predetermined threshold value SNR_TH2 (Step SB3).

ステップＳB3の結果が肯定である場合、終点特定部３４は、変数CNT_STOPがゼロであるか否かを判定する（ステップＳB4）。ステップＳB4の結果が肯定である場合、終点特定部３４は、終点データＤ1_STOPを今回のフレームＦの番号に仮設定する（ステップＳB5）。一方、ステップＳB4の結果が否定である場合、終点特定部３４は、ステップＳB5を経ることなく処理をステップＳB6に移行する。 If the result of step SB3 is affirmative, the end point specifying unit 34 determines whether or not the variable CNT_STOP is zero (step SB4). If the result of step SB4 is affirmative, the end point specifying unit 34 temporarily sets the end point data D1_STOP to the number of the current frame F (step SB5). On the other hand, if the result of step SB4 is negative, the end point specifying unit 34 proceeds to step SB6 without passing through step SB5.

次いで、終点特定部３４は、変数CNT_STOPに「１」を加算した（ステップＳB6）うえで、加算後の変数CNT_STOPが所定値Ｎ3を上回るか否かを判定する（ステップＳB7）。ステップＳB7の結果が肯定である場合、終点特定部３４は、直前のステップＳB5で仮設定したフレームＦの番号を正式な終点データＤ1_STOPとして確定する（ステップＳB8）。すなわち、発音区間Ｐ1の終点Ｐ1_STOPが特定される。ステップＳB8において、終点特定部３４は、終点データＤ1_STOPを第２区間特定部４０に出力するとともに、終点Ｐ1_STOPの確定を出力制御部６２および記憶部６４に通知する。第１区間特定部３０からの通知を契機として、出力制御部６２による特徴量Ｃの出力と記憶部６４によるフレーム情報Ｆ_HISTの記憶とが終了する。したがって、図５の処理が完了した段階では、発音区間Ｐ1に属するＭ1個のフレームＦの各々について、記憶部６４にフレーム情報Ｆ_HIST（信号レベルHIST_LEVEL）が格納されるとともに音解析装置８０の記憶部８２に特徴量Ｃが格納されることになる。 Next, the end point specifying unit 34 adds “1” to the variable CNT_STOP (step SB6) and determines whether or not the variable CNT_STOP after the addition exceeds a predetermined value N3 (step SB7). If the result of step SB7 is affirmative, the end point specifying unit 34 fixes the number of the frame F provisionally set in the most recent step SB5 as the formal end point data D1_STOP (step SB8). That is, the end point P1_STOP of the sound generation section P1 is specified. In step SB8, the end point specifying unit 34 outputs the end point data D1_STOP to the second section specifying unit 40 and notifies the output control unit 62 and the storage unit 64 that the end point P1_STOP has been fixed. On this notification from the first section specifying unit 30, the output of the feature amount C by the output control unit 62 and the storage of the frame information F_HIST by the storage unit 64 end. Accordingly, when the processing of FIG. 5 is complete, for each of the M1 frames F belonging to the sound generation section P1, the frame information F_HIST (signal level HIST_LEVEL) is stored in the storage unit 64 and the feature amount C is stored in the storage unit 82 of the sound analysis device 80.

ステップＳB7の結果が否定である場合（すなわちＳ/Ｎ比Ｒが閾値SNR_TH2を下回るフレームＦがＮ3個以下である場合）、終点特定部３４は、次のフレームＦについてＳ/Ｎ比Ｒを取得した（ステップＳB2）うえでステップＳB3以後の処理を実行する。以上のようにひとつのフレームＦのＳ/Ｎ比Ｒが閾値SNR_TH2を下回るだけでは終点Ｐ1_STOPは確定されないから、突発的にＳ/Ｎ比Ｒが低下した時点を終点Ｐ1_STOPと誤認する可能性が低減される。 When the result of step SB7 is negative (that is, when the number of frames F whose S/N ratio R falls below the threshold SNR_TH2 is N3 or less), the end point specifying unit 34 acquires the S/N ratio R of the next frame F (step SB2) and then executes the processing from step SB3 onward. Because the end point P1_STOP is not fixed merely because the S/N ratio R of a single frame F falls below the threshold SNR_TH2, a momentary drop in the S/N ratio R is less likely to be mistaken for the end point P1_STOP.

一方、ステップＳB3の結果が否定である場合、終点特定部３４は、始点Ｐ1_STARTの特定に使用した閾値SNR_TH1を今回のＳ/Ｎ比Ｒが上回るか否かを判定する（ステップＳB9）。ステップＳB9の結果が否定である場合、終点特定部３４は、ステップＳB2に処理を移行して新たなＳ/Ｎ比Ｒを取得する。 On the other hand, when the result of step SB3 is negative, the end point specifying unit 34 determines whether or not the current S / N ratio R exceeds the threshold value SNR_TH1 used for specifying the start point P1_START (step SB9). When the result of step SB9 is negative, the end point specifying unit 34 proceeds to step SB2 and acquires a new S / N ratio R.

ところで、利用者の発声時のＳ/Ｎ比Ｒは基本的には閾値SNR_TH1を上回る。したがって、図５の処理を開始してからＳ/Ｎ比Ｒが閾値SNR_TH1を上回った場合（ステップＳB9：YES）には、利用者が発声中である可能性が高い。そこで、ステップＳB9の結果が肯定である場合、終点特定部３４は、変数CNT_STOPをゼロに初期化した（ステップＳB10）うえでステップＳB2以後の処理を実行する。ステップＳB10の実行後にＳ/Ｎ比Ｒが閾値SNR_TH2を下回ると（ステップＳB3：YES）、ステップＳB4の結果が肯定となってステップＳB5が実行される。すなわち、Ｓ/Ｎ比Ｒが閾値SNR_TH2を下回ることで終点データＤ1_STOPが仮設定された場合であっても、Ｓ/Ｎ比Ｒが閾値SNR_TH2を下回るフレームＦの個数が所定値Ｎ3以下の段階でひとつのフレームＦのＳ/Ｎ比Ｒが閾値SNR_TH1を上回った場合（すなわち利用者が発声中である可能性が高い場合）には、終点データＤ1_STOPの仮設定が解除される。 The S/N ratio R while the user is speaking basically exceeds the threshold SNR_TH1. Therefore, if the S/N ratio R exceeds the threshold SNR_TH1 after the processing of FIG. 5 has started (step SB9: YES), the user is likely to be speaking. When the result of step SB9 is affirmative, the end point specifying unit 34 initializes the variable CNT_STOP to zero (step SB10) and then executes the processing from step SB2 onward. If the S/N ratio R falls below the threshold SNR_TH2 after step SB10 (step SB3: YES), the result of step SB4 is affirmative and step SB5 is executed. In other words, even when the end point data D1_STOP has been provisionally set because the S/N ratio R fell below the threshold SNR_TH2, the provisional setting is cancelled if the S/N ratio R of any one frame F exceeds the threshold SNR_TH1 (that is, if the user is likely to be speaking) while the number of frames F whose S/N ratio R is below the threshold SNR_TH2 is still N3 or less.
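As an illustrative sketch only, steps SB1 to SB10 of FIG. 5 could be written in Python as follows. The function name `find_stop_p1`, the `start` argument (the frame from which scanning begins), the use of a list index as the frame number, and the `None` return value are assumptions made for this example.

```python
def find_stop_p1(snr, start, snr_th1, snr_th2, n3):
    """Return the frame number of P1_STOP, or None if it is never fixed.

    snr     : per-frame S/N ratios R
    start   : index at which scanning begins (assumed: after P1_START is fixed)
    snr_th1 : threshold SNR_TH1 used for the start point
    snr_th2 : threshold SNR_TH2 used for the end point
    n3      : predetermined value N3
    """
    d1_stop = None                           # SB1: clear D1_STOP
    cnt_stop = 0                             # SB1
    for frame_no in range(start, len(snr)):
        r = snr[frame_no]                    # SB2: acquire R
        if r < snr_th2:                      # SB3
            if cnt_stop == 0:                # SB4
                d1_stop = frame_no           # SB5: provisional end point
            cnt_stop += 1                    # SB6
            if cnt_stop > n3:                # SB7
                return d1_stop               # SB8: end point fixed
        elif r > snr_th1:                    # SB9: user is probably speaking
            cnt_stop = 0                     # SB10: cancel the provisional setting
    return None
```

Frames whose S/N ratio lies between SNR_TH2 and SNR_TH1 change neither counter, which matches the negative branch of step SB9.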

（ｂ）第２区間特定部４０の動作 (b) Operation of the second section specifying unit 40
発声者が実際に発声した区間を確実に検出する（すなわち検出の漏れを確実に防止する）ためには、例えば図４における閾値SNR_TH1を比較的に小さい数値に設定するとともに図５の閾値SNR_TH2を比較的に大きい数値に設定せざるを得ない。したがって、例えば実際の発声に先立って発声者の咳の音やリップノイズや口中音などの雑音が発生すると、当該雑音の発生した時点が発音区間Ｐ1の始点Ｐ1_STARTと認定される場合がある。そこで、第２区間特定部４０は、第１区間特定部３０による発音区間Ｐ1の特定後に、雑音に該当する可能性が高いフレームＦを、発音区間Ｐ1の先頭および最後尾のフレームＦから順次に除外する（すなわち発音区間Ｐ1を短縮する）ことで発音区間Ｐ2を特定する。 To reliably detect the section in which the speaker actually spoke (that is, to reliably prevent missed detections), the threshold SNR_TH1 in FIG. 4 has to be set to a relatively small value and the threshold SNR_TH2 in FIG. 5 to a relatively large value. Consequently, if noise such as the speaker's cough, lip noise, or mouth sounds occurs before the actual utterance, the time at which that noise occurred may be recognized as the start point P1_START of the sound generation section P1. Therefore, after the first section specifying unit 30 has specified the sound generation section P1, the second section specifying unit 40 specifies the sound generation section P2 by sequentially excluding frames F that are likely to be noise, working inward from the first and last frames F of the sound generation section P1 (that is, by shortening the sound generation section P1).

図６は、第２区間特定部４０の始点特定部４２が実行する処理の内容を示すフローチャートである。第２区間特定部４０の始点特定部４２は、記憶部６４に格納されたＭ1個のフレーム情報Ｆ_HISTのなかから信号レベルHIST_LEVELの最大値MAX_LEVELを特定する（ステップＳC1）。次いで、始点特定部４２は、変数CNT_FRAMEをゼロに初期化するとともに最大値MAX_LEVELに応じた閾値ＴＨ1を設定する（ステップＳC2）。本形態における閾値ＴＨ1は、ステップＳC1で特定した最大値MAX_LEVELと係数αとの乗算値である。係数αは、予め設定された「１」未満の数値である。 FIG. 6 is a flowchart showing the contents of processing executed by the start point specifying unit 42 of the second section specifying unit 40. The start point specifying unit 42 of the second section specifying unit 40 specifies the maximum value MAX_LEVEL of the signal level HIST_LEVEL from among the M1 pieces of frame information F_HIST stored in the storage unit 64 (step SC1). Next, the start point specifying unit 42 initializes the variable CNT_FRAME to zero and sets a threshold value TH1 corresponding to the maximum value MAX_LEVEL (step SC2). The threshold value TH1 in this embodiment is a multiplication value of the maximum value MAX_LEVEL specified in step SC1 and the coefficient α. The coefficient α is a numerical value less than “1” set in advance.

次いで、始点特定部４２は、発音区間Ｐ1のＭ1個のフレームＦのなかからひとつのフレームＦを選択する（ステップＳC3）。本形態の始点特定部４２は、発音区間Ｐ1内の各フレームＦを先頭から最後尾に向けてステップＳC3ごとに順番に選択する。すなわち、図６の処理を開始してから最初のステップＳC3においては発音区間Ｐ1の先頭のフレームＦが選択され、次回以降のステップＳC3においては前回のステップＳC3で選択されたフレームＦの直後のフレームＦが選択される。 Next, the start point specifying unit 42 selects one frame F from among the M1 frames F in the sound generation section P1 (step SC3). The start point specifying unit 42 of the present embodiment sequentially selects each frame F in the sound generation section P1 from the head toward the tail at every step SC3. That is, in the first step SC3 after starting the processing of FIG. 6, the first frame F of the sound generation section P1 is selected, and in the next step SC3, the frame immediately after the frame F selected in the previous step SC3. F is selected.

次に、始点特定部４２は、ステップＳC3で選択したフレームＦに対応するフレーム情報Ｆ_HISTの信号レベルHIST_LEVELが閾値ＴＨ1を下回るか否かを判定する（ステップＳC4）。最大値MAX_LEVELと比較すると雑音のレベルは小さいから、信号レベルHIST_LEVELが閾値ＴＨ1を下回るフレームＦは、本来の発声の直前に発生した雑音である可能性が高い。そこで、ステップＳC4の結果が肯定である場合、始点特定部４２は、ステップＳC3で選択したフレームＦを発音区間Ｐ1から除外する（ステップＳC5）。さらに詳述すると、始点特定部４２は、ステップＳC3で選択したフレームＦの直後のフレームＦを暫定的な始点ｐ_STARTとして選定する。次いで、始点特定部４２は、変数CNT_FRAMEをゼロに初期化した（ステップＳC6）うえでステップＳC3に処理を移行する。ステップＳC3においては、現時点で選択しているフレームＦの直後のフレームＦを新たに選択する。 Next, the start point specifying unit 42 determines whether or not the signal level HIST_LEVEL of the frame information F_HIST corresponding to the frame F selected in step SC3 is below the threshold TH1 (step SC4). Because the level of noise is small compared with the maximum value MAX_LEVEL, a frame F whose signal level HIST_LEVEL is below the threshold TH1 is likely to be noise that occurred immediately before the actual utterance. Therefore, if the result of step SC4 is affirmative, the start point specifying unit 42 excludes the frame F selected in step SC3 from the sound generation section P1 (step SC5). More specifically, the start point specifying unit 42 selects the frame F immediately after the frame F selected in step SC3 as the provisional start point p_START. The start point specifying unit 42 then initializes the variable CNT_FRAME to zero (step SC6) and returns to step SC3, where it newly selects the frame F immediately after the currently selected frame F.

ステップＳC4の結果が否定である場合（すなわち信号レベルHIST_LEVELが閾値ＴＨ1以上である場合）、始点特定部４２は、変数CNT_FRAMEに「１」を加算した（ステップＳC7）うえで、加算後の変数CNT_FRAMEが所定値Ｎ4を上回るか否かを判定する（ステップＳC8）。ステップＳC8の結果が否定である場合、始点特定部４２はステップＳC3に処理を移行して新たなフレームＦを選択する。一方、ステップＳC8の結果が肯定である場合、始点特定部４２はステップＳC9に処理を移行する。すなわち、Ｎ4個を上回る個数のフレームＦにわたって連続してステップＳC4の判定（HIST_LEVEL＜ＴＨ1）が否定された場合に処理がステップＳC9に移行する。 When the result of step SC4 is negative (that is, when the signal level HIST_LEVEL is greater than or equal to the threshold value TH1), the start point specifying unit 42 adds “1” to the variable CNT_FRAME (step SC7) and then adds the variable CNT_FRAME after the addition. It is determined whether or not exceeds a predetermined value N4 (step SC8). If the result of step SC8 is negative, the start point identifying unit 42 proceeds to step SC3 and selects a new frame F. On the other hand, when the result of step SC8 is affirmative, the start point specifying unit 42 shifts the process to step SC9. That is, if the determination in step SC4 (HIST_LEVEL <TH1) is denied continuously over the number of frames F exceeding N4, the process proceeds to step SC9.

ステップＳC9において、始点特定部４２は、ステップＳC1で特定した最大値MAX_LEVELに応じて閾値ＴＨ2を設定する。本形態の閾値ＴＨ2は、最大値MAX_LEVELと予め定められた係数βとの乗算値である。 In step SC9, the start point specifying unit 42 sets the threshold value TH2 according to the maximum value MAX_LEVEL specified in step SC1. The threshold value TH2 in this embodiment is a multiplication value of the maximum value MAX_LEVEL and a predetermined coefficient β.

次に、始点特定部４２は、発音区間Ｐ1のうち現段階の暫定的な始点ｐ_START以降の複数のフレームＦ（すなわちステップＳC5を経た場合には先頭側の幾つかのフレームＦの除外後の発音区間Ｐ1）のなかから相連続する所定個のフレームＦを選択する（ステップＳC10）。図７は、ステップＳC10で選択されるフレームＦの集合Ｇ（Ｇ1，Ｇ2，Ｇ3，……）を示す概念図である。同図に示すように、図６の処理を開始してから最初のステップＳC10においては、先頭から所定個のフレームＦの集合Ｇ1が選択される。 Next, the start point specifying unit 42 selects a predetermined number of consecutive frames F from among the frames F of the sound generation section P1 at and after the current provisional start point p_START (that is, if step SC5 has been executed, from the sound generation section P1 with some leading frames F excluded) (step SC10). FIG. 7 is a conceptual diagram showing the sets G (G1, G2, G3, ...) of frames F selected in step SC10. As shown in the figure, in the first execution of step SC10 after the processing of FIG. 6 starts, the set G1 consisting of the predetermined number of frames F from the head is selected.

次いで、始点特定部４２は、ステップＳC10で選択した所定個のフレームＦの信号レベルHIST_LEVELについて加算値SUM_LEVELを算定する（ステップＳC11）。さらに、始点特定部４２は、ステップＳC11で算定した加算値SUM_LEVELがステップＳC9で算定した閾値ＴＨ2を下回るか否かを判定する（ステップＳC12）。 Next, the start point specifying unit 42 calculates an addition value SUM_LEVEL for the signal level HIST_LEVEL of the predetermined number of frames F selected in step SC10 (step SC11). Further, the start point specifying unit 42 determines whether or not the added value SUM_LEVEL calculated in step SC11 is lower than the threshold value TH2 calculated in step SC9 (step SC12).

図４を参照して説明したように、本形態においては候補フレーム群のうちＳ/Ｎ比Ｒが閾値SNR_TH1を超えるフレームＦがＮ1個を上回る場合に最初のフレームＦが発音区間Ｐ1の始点Ｐ1_STARTとして特定される。したがって、候補フレーム群のなかの複数のフレームＦにわたって雑音が発生した場合には当該候補フレーム群の先頭が始点Ｐ1_STARTと認定され得る。一方、最大値MAX_LEVELと比較すると雑音のレベルは充分に小さいから、所定個のフレームＦにわたる信号レベルHIST_LEVELの加算値SUM_LEVELが閾値ＴＨ2を下回るフレームＦは、本来の発音の直前に発生した雑音である可能性が高い。 As described with reference to FIG. 4, in this embodiment the first frame F of the candidate frame group is specified as the start point P1_START of the sound generation section P1 when more than N1 frames F in the group have an S/N ratio R exceeding the threshold SNR_TH1. Consequently, if noise extends over several frames F of a candidate frame group, the head of that group may be recognized as the start point P1_START. However, because the level of noise is sufficiently small compared with the maximum value MAX_LEVEL, frames F for which the sum SUM_LEVEL of the signal levels HIST_LEVEL over the predetermined number of frames F is below the threshold TH2 are likely to be noise that occurred immediately before the actual utterance.

そこで、ステップＳC12の結果が肯定である場合、始点特定部４２は、図７に示すように、ステップＳC10で選択した集合Ｇのうち先頭側の半数のフレームＦを除外する（ステップＳC13）。すなわち、集合Ｇを分割した後半の部分のなかの先頭のフレームＦが暫定的な始点ｐ_STARTとして選定される。次いで、始点特定部４２は、ステップＳC10に処理を移行し、図７に示すように、現段階における先頭から所定個のフレームＦの集合Ｇ2を選択してステップＳC11以後の処理を実行する。 Therefore, if the result of step SC12 is affirmative, the start point specifying unit 42 excludes the first half of the frames F from the set G selected in step SC10 as shown in FIG. 7 (step SC13). In other words, the first frame F in the latter half of the set G is selected as the temporary start point p_START. Next, the start point specifying unit 42 shifts the process to step SC10, and as shown in FIG. 7, selects a set G2 of a predetermined number of frames F from the beginning at the current stage, and executes the processes after step SC11.

When the result of step SC12 is negative, on the other hand, the start point specifying unit 42 confirms the currently set start point p_START as the start point P2_START and outputs start point data D2_START designating that start point P2_START (frame number) to the sound analysis device 80 (step SC14). For example, if the result of step SC12 becomes negative when the set G3 is selected as shown in FIG. 7, the head of the set G3 (the head of the latter half of the set G2) is specified as the start point P2_START.
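The loop over steps SC10 to SC14 can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function name `refine_start`, the parameters `group_size` and `th2`, and the choice to stop when fewer than `group_size` frames remain are assumptions that do not appear in the description.

```python
from typing import Sequence


def refine_start(levels: Sequence[float], th2: float, group_size: int) -> int:
    """Return the index (within section P1) of the refined start frame P2_START.

    levels     -- HIST_LEVEL of each frame of P1, head first
    th2        -- threshold TH2 (assumed to be supplied by step SC9)
    group_size -- the "predetermined number" of frames in a set G (assumed name)
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    start = 0                           # provisional start point p_START
    half = max(1, group_size // 2)      # leading half of G dropped per pass
    # Assumption: stop when fewer than group_size frames remain.
    while start + group_size <= len(levels):
        group = levels[start:start + group_size]   # step SC10: select set G
        sum_level = sum(group)                     # step SC11: SUM_LEVEL
        if sum_level >= th2:                       # step SC12 negative
            break
        start += half                              # step SC13: drop leading half
    return start                                   # step SC14: confirm P2_START
```

With `levels = [1, 1, 1, 1, 1, 1, 9, 9, 9, 9]`, `th2 = 10` and `group_size = 4`, the sets starting at frames 0 and 2 fall below the threshold and are halved away, and the set starting at frame 4 passes, so frame 4 becomes the start point.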

The end point specifying unit 44 of the second section specifying unit 40 specifies the end point P2_STOP by sequentially excluding frames F of the sound generation section P1 from the tail, using processing similar to that of FIG. 6. That is, the end point specifying unit 44 selects the frames F of the sound generation section P1 one at a time in step SC3, from the tail toward the head, and excludes a frame F when its signal level HIST_LEVEL is below the threshold TH1 (step SC5). The end point specifying unit 44 also selects a set G of the predetermined number of consecutive frames F extending backward from the tail (step SC10) and calculates the sum SUM_LEVEL of their signal levels HIST_LEVEL (step SC11). When the sum SUM_LEVEL is below the threshold TH2, the end point specifying unit 44 excludes the latter half of the frames F of the set G (step SC13); when the sum SUM_LEVEL is equal to or greater than the threshold TH2, it outputs to the sound analysis device 80 end point data D2_STOP designating the last frame F at that point as the end point P2_STOP of the sound generation section P2 (step SC14).

As described above, the maximum value MAX_LEVEL of the signal level HIST_LEVEL in the sound generation section P1 has already been determined when the second section specifying unit 40 specifies the sound generation section P2. By using the maximum value MAX_LEVEL as illustrated above, the second section specifying unit 40 can therefore specify the sound generation section P2 more accurately than the first section specifying unit 30, which has to specify the sound generation section P1 while the maximum value MAX_LEVEL is still undetermined. That is, frames F that were included in the sound generation section P1 because of noise such as the speaker's throat clearing or lip noise are excluded by the second section specifying unit 40. The sound analysis device 80 can thus analyze the sound signal S with high accuracy using the frames F of the sound generation section P2, from which the influence of noise has been removed.

The above embodiment illustrates a configuration in which the signal level HIST_LEVEL is used as the frame information F_HIST, but the content of the frame information F_HIST may be changed as appropriate. For example, the signal level HIST_LEVEL in the above operation may be replaced with the S/N ratio R that the S/N ratio calculation unit 589 calculates for each frame F. That is, the frame information F_HIST that the second section specifying unit 40 uses to specify the sound generation section P2 need only be a numerical value corresponding to the signal level of the sound signal S (a signal index value); its specific content does not matter.

<B: Second Embodiment>
Next, a second embodiment of the present invention will be described. Elements of this embodiment whose operations and functions are the same as in the first embodiment are denoted by the same reference numerals as above, and their detailed descriptions are omitted as appropriate.

When wind outdoors or the speaker's nasal breath blows onto the sound collection device 10 (that is, when wind noise is picked up), the sound signal S stays at a high level for a long time. The first section specifying unit 30 may therefore recognize a section containing wind noise as the sound generation section P1 even though the speaker is not actually speaking in it. The second section specifying unit 40 of this embodiment accordingly specifies the sound generation section P2 by excluding from the sound generation section P1 the frames that are likely to be wind noise.

The frame information generation unit 56 of this embodiment detects the pitch of each frame F of the sound signal S and generates pitch data HIST_PITCH indicating the detection result. The frame information F_HIST stored in the storage unit 64 contains the pitch data HIST_PITCH together with the same signal level HIST_LEVEL as in the first embodiment. The pitch data HIST_PITCH indicates the pitch when a clear pitch is detected for the frame F of the sound signal S, and indicates non-detection of pitch (for example, it is set to zero) when no clear pitch is detected. Because a pitch can generally be detected in a human voice of sufficiently high level, pitch data HIST_PITCH containing that pitch is generated for such a voice. In contrast, no clear pitch is detected in wind noise, which lacks a regular harmonic structure, so pitch data HIST_PITCH indicating non-detection is generated when wind noise is picked up.

FIG. 8 is a flowchart showing the operation of the start point specifying unit 42 of the second section specifying unit 40. The start point specifying unit 42 initializes a variable CNT_FRAME to zero (step SD1) and then selects one frame F from the sound generation section P1 (step SD2). The frames F are selected one at a time in step SD2, in order from the head of the sound generation section P1 toward the tail. The start point specifying unit 42 then determines whether the signal level HIST_LEVEL contained in the frame information F_HIST of the frame F selected in step SD2 exceeds a predetermined threshold L_TH (step SD3).

When the result of step SD3 is affirmative, the start point specifying unit 42 determines whether the pitch data HIST_PITCH contained in the frame information F_HIST of the frame F selected in step SD2 indicates non-detection of pitch (step SD4). When the result of step SD4 is affirmative, the start point specifying unit 42 adds 1 to the variable CNT_FRAME (step SD5) and determines whether the incremented variable CNT_FRAME exceeds a predetermined value N5 (step SD6). When only wind noise is picked up, the sound signal S stays at a high level continuously over a plurality of frames F and no pitch is detected. Accordingly, when the result of step SD6 is affirmative (that is, when the determinations of steps SD3 and SD4 have been affirmative for more than N5 consecutive frames F), the start point specifying unit 42 excludes the predetermined number (N5+1) of frames F up to the currently selected frame F (step SD7) and returns to step SD1. That is, the start point specifying unit 42 selects the frame F immediately after the frame F selected in the preceding step SD2 as the provisional start point p_START. When the result of step SD6 is negative (the number of consecutive frames F satisfying the conditions of steps SD3 and SD4 is N5 or fewer), the start point specifying unit 42 returns to step SD2, selects a new frame F, and executes the processing from step SD3 onward.

When the result of either step SD3 or step SD4 is negative (that is, when the sound of the frame F is unlikely to be wind noise alone), the frame F at the current head is selected as the start point P2_START. That is, the start point specifying unit 42 confirms the provisional start point p_START as the start point P2_START and outputs start point data D2_START designating that start point P2_START to the sound analysis device 80 (step SD8).
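The flow of steps SD1 to SD8 can be sketched in Python as follows. This is only an illustrative sketch: the function name `trim_wind_noise_start` and its parameter names are assumptions, pitch non-detection is encoded as zero (the example the description itself gives), and returning the current provisional start when the frames run out is an assumption the description does not cover.

```python
from typing import Sequence


def trim_wind_noise_start(levels: Sequence[float],
                          pitches: Sequence[float],
                          l_th: float,
                          n5: int) -> int:
    """Return the index (within section P1) of the start frame P2_START.

    levels  -- HIST_LEVEL of each frame of P1, head first
    pitches -- HIST_PITCH of each frame; 0 is taken to mean "no pitch detected"
    l_th    -- level threshold L_TH
    n5      -- run-length threshold N5
    """
    start = 0   # provisional start point p_START
    count = 0   # CNT_FRAME (step SD1)
    for i in range(len(levels)):             # step SD2: next frame, head first
        loud = levels[i] > l_th              # step SD3
        unpitched = pitches[i] == 0          # step SD4
        if not (loud and unpitched):
            break                            # step SD8: confirm current start
        count += 1                           # step SD5
        if count > n5:                       # step SD6
            start = i + 1                    # step SD7: drop the N5+1 frames
            count = 0                        # back to step SD1
    return start
```

Note that a run of N5 or fewer wind-like frames directly before the first voiced frame is kept, because the counter has not yet exceeded N5 when step SD8 is reached.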

The end point specifying unit 44 of the second section specifying unit 40 specifies the end point P2_STOP by sequentially excluding frames F of the sound generation section P1 from the tail, using processing similar to that of FIG. 8. That is, the end point specifying unit 44 selects the frames F of the sound generation section P1 one at a time in step SD2, from the tail toward the head, and in step SD7 excludes the predetermined number of frames F for which the determinations of steps SD3 and SD4 were consecutively affirmative. In step SD8, end point data D2_STOP designating the last frame F at that point as the end point P2_STOP is generated. With the above embodiment, frames F that were recognized as part of the sound generation section P1 because of wind noise are excluded, so the accuracy with which the sound analysis device 80 analyzes the sound signal S can be improved.

<C: Third Embodiment>
Next, a third embodiment of the present invention will be described. Elements of this embodiment whose operations and functions are the same as in the first embodiment are denoted by the same reference numerals as above, and their detailed descriptions are omitted as appropriate.

The sound analysis device 80 authenticates a speaker by comparing a registered feature amount, extracted when an authorized user uttered a specific word (password), with the feature amount C extracted from the sound signal S. To maintain authentication accuracy, the phoneme at the end of the password should have the same duration at authentication as at registration; in practice, however, the duration of the unvoiced consonant at the end of the password varies from one authentication to the next. In this embodiment, therefore, a plurality of consecutive frames F extending backward from the end point P1_STOP of the sound generation section P1 are excluded so that the unvoiced consonant at the end of the password at authentication is unified to a predetermined time length.

The frame information generation unit 56 of this embodiment generates the zero-cross count HIST_ZXCNT of the sound signal S in each frame F as the frame information F_HIST. The zero-cross count HIST_ZXCNT is the number of times the level of the sound signal S crosses the reference value (zero) within one frame F. When the sound picked up by the sound collection device 10 is an unvoiced consonant, the zero-cross count HIST_ZXCNT of each frame F is large.
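A zero-cross count of this kind can be computed per frame roughly as in the following Python sketch. The function name `zero_cross_count` is hypothetical, and treating a sample that is exactly zero as non-negative is an assumption; the description does not say how such samples are handled.

```python
from typing import Iterable


def zero_cross_count(samples: Iterable[float]) -> int:
    """Count how often the signal changes sign within one frame."""
    count = 0
    prev_non_negative = None
    for s in samples:
        non_negative = s >= 0   # assumption: exact zero counts as non-negative
        if prev_non_negative is not None and non_negative != prev_non_negative:
            count += 1          # the level crossed the reference value (zero)
        prev_non_negative = non_negative
    return count
```

A noise-like frame such as an unvoiced consonant changes sign often and yields a large count, whereas a frame dominated by a low-frequency voiced component yields a small one.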

FIG. 9 is a flowchart showing the operation of the end point specifying unit 44 of the second section specifying unit 40, and FIG. 10 is a conceptual diagram illustrating its processing. The end point specifying unit 44 initializes the variable CNT_FRAME to zero (step SE1) and then selects one frame F of the sound generation section P1 (step SE2). The frames F are selected one at a time in step SE2, in order from the tail of the sound generation section P1 toward the head. The end point specifying unit 44 then determines whether the zero-cross count HIST_ZXCNT contained in the frame information F_HIST of the frame F selected in step SE2 exceeds a predetermined threshold Z_TH (step SE3). The threshold Z_TH is set experimentally or statistically so that the determination in step SE3 is affirmative when the sound signal S of the frame F is an unvoiced consonant.

When the result of step SE3 is affirmative, the end point specifying unit 44 excludes the frame F selected in step SE2 from the sound generation section P1 (step SE4). That is, the end point specifying unit 44 selects the frame F immediately preceding the frame F selected in step SE2 as the provisional end point p_STOP. The end point specifying unit 44 then returns to step SE1, initializes the variable CNT_FRAME to zero, and executes the processing from step SE2 onward.

When the result of step SE3 is negative, the end point specifying unit 44 adds 1 to the variable CNT_FRAME (step SE5) and determines whether the incremented variable CNT_FRAME exceeds a predetermined value N6 (step SE6). When the result of step SE6 is negative, the end point specifying unit 44 returns to step SE2.

Because the variable CNT_FRAME is reset to zero whenever the zero-cross count HIST_ZXCNT exceeds the threshold Z_TH (step SE1), the determination in step SE6 is affirmative when the zero-cross count HIST_ZXCNT is at or below the threshold Z_TH for more than N6 consecutive frames F. When the result of step SE6 is affirmative, the end point specifying unit 44 confirms the point at which a predetermined time length T has elapsed from the current last frame F (the provisional end point p_STOP) as the end point P2_STOP of the sound generation section P2, and outputs the end point data D2_STOP (step SE7). For example, when a plurality of (twelve) frames F are removed from the end of the sound generation section P1 by repeating step SE4 as shown in FIG. 10, the point at which the time length T has elapsed from the last frame F remaining after the removal is confirmed as the end point P2_STOP.
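Steps SE1 to SE7 can be sketched in Python as follows. This is an illustrative sketch only: the function name `refine_end` is hypothetical, the time length T is expressed as a number of frames (`tail_frames`), and clamping the result to the original end of P1 is an assumption that the description does not state.

```python
from typing import Sequence


def refine_end(zx_counts: Sequence[int], z_th: int, n6: int,
               tail_frames: int) -> int:
    """Return the index (within section P1) of the end frame P2_STOP.

    zx_counts   -- HIST_ZXCNT of each frame of P1, head first
    z_th        -- zero-cross threshold Z_TH
    n6          -- run-length threshold N6
    tail_frames -- time length T expressed in frames (assumed unit)
    """
    last = len(zx_counts) - 1
    end = last      # provisional end point p_STOP
    count = 0       # CNT_FRAME (step SE1)
    for i in range(last, -1, -1):        # step SE2: next frame, tail first
        if zx_counts[i] > z_th:          # step SE3 affirmative
            end = i - 1                  # step SE4: exclude frame i
            count = 0                    # back to step SE1
        else:
            count += 1                   # step SE5
            if count > n6:               # step SE6 affirmative
                break
    # Step SE7: extend by T; clamping to the end of P1 is an assumption.
    return min(end + tail_frames, last)
```

For `zx_counts = [2, 2, 2, 2, 9, 9, 9, 9, 9, 9]` with `z_th = 5`, `n6 = 2` and `tail_frames = 2`, the six high-count frames at the tail are excluded and two frames are then restored, so the end point is frame 5 however long the trailing consonant actually lasted.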

As described above, in this embodiment the sound at the end of the password (the unvoiced consonant) at authentication is adjusted to the predetermined time length T regardless of how the speaker actually utters it. The accuracy of authentication by the sound analysis device 80 can therefore be improved compared with the case where the feature amounts C of all the frames F of the sound generation section P1 are used.

<D: Modifications>
Various modifications can be made to the above embodiments. Specific examples are given below. The following aspects may also be combined as appropriate.

(1) Various known techniques can be used for the first section specifying unit 30 to specify the sound generation section P1. For example, a configuration may be adopted in which a set of frames F of the sound signal S whose volume (energy) exceeds a predetermined threshold is specified as the sound generation section P1. In a configuration in which the user indicates the start and end of sound generation through the input device 70, the section from the start instruction to the end instruction may be specified as the sound generation section P1.

Similarly, the method by which the second section specifying unit 40 specifies the sound generation section P2 may be changed as appropriate. For example, the second section specifying unit 40 may include only one of the start point specifying unit 42 and the end point specifying unit 44. When the second section specifying unit 40 includes only the start point specifying unit 42, the section from the start point P2_START, obtained by moving the start point P1_START of the sound generation section P1 later in time, to the end point P1_STOP is specified as the sound generation section P2. Likewise, when the second section specifying unit 40 includes only the end point specifying unit 44, the section from the start point P1_START of the sound generation section P1 to the end point P2_STOP is specified as the sound generation section P2.

A configuration may also be adopted in which the second section specifying unit 40 (the start point specifying unit 42 or the end point specifying unit 44) executes only one of the processing up to step SC8 in FIG. 6 and the processing from step SC9 onward. Furthermore, the operations of the second section specifying unit 40 in the respective embodiments may be combined as appropriate. For example, the second section specifying unit 40 may specify the start point P2_START or the end point P2_STOP on the basis of both the signal level HIST_LEVEL (first embodiment) and the zero-cross count HIST_ZXCNT (third embodiment).

The second embodiment illustrates a configuration in which a frame F is excluded when both the condition that the signal level HIST_LEVEL exceeds the threshold L_TH (step SD3) and the condition that the pitch data HIST_PITCH indicates non-detection (step SD4) are satisfied, but only the condition of step SD4 may be evaluated instead. As these examples show, the second section specifying unit 40 need only be a means for specifying a sound generation section P2 shorter than the sound generation section P1 on the basis of the frame information F_HIST generated for each frame F.

(2) The above embodiments illustrate a configuration in which the storage unit 64 starts or ends storing the frame information F_HIST when the start point P1_START or the end point P1_STOP is determined. The same effect is obtained in a configuration in which the frame information generation unit 56 starts generating the frame information F_HIST when the start point P1_START is determined and stops generating it when the end point P1_STOP is determined.

The information stored in the storage unit 64 is not, however, limited to the frame information F_HIST within the sound generation section P1. That is, the frame information F_HIST generated for all frames F of the sound signal S may be stored in the storage unit 64. Storing only the frame information F_HIST within the sound generation section P1, as in the above embodiments, nevertheless has the advantage of reducing the capacity required of the storage unit 64.

(3) The information designating the start points (P1_START, P2_START) and end points (P1_STOP, P2_STOP) is not limited to frame numbers. For example, the start point data (D1_START, D2_START) and end point data (D1_STOP, D2_STOP) may designate the start and end points as times measured from a predetermined point (for example, the time at which the start instruction TR is generated).

(4) The trigger for generating the start instruction TR is not limited to an operation on the input device 70. For example, when the sound signal processing system issues a notification (by image or sound) prompting the user to start speaking, the start instruction TR may be generated in response to that notification.

(5) The content of the sound analysis performed by the sound analysis device 80 is arbitrary. For example, the sound analysis device 80 may perform speaker recognition, which identifies a speaker by comparing registered feature amounts extracted for a plurality of users with the speaker's feature amount C, or speech recognition, which identifies from the sound signal S the phonemes (character data) uttered by the speaker. The technique of specifying the sound generation section P2 (excluding noise-only sections from the sound signal S) as in the above embodiments is well suited to improving accuracy in any such sound analysis. The content of the feature amount C is selected according to the processing performed by the sound analysis device 80, and the mel-cepstrum coefficients in the above embodiments are merely one example of the feature amount C. For example, the sound signal S divided into frames F may itself be output to the sound analysis device 80 as the feature amount C.

FIG. 1 is a block diagram showing the configuration of a sound signal processing system according to the first embodiment of the present invention.
FIG. 2 is a conceptual diagram showing the relationship between the sound signal S and the sound generation sections (P1, P2).
FIG. 3 is a block diagram showing the specific configuration of the arithmetic unit.
FIG. 4 is a flowchart showing the process of specifying the start point of the sound generation section P1.
FIG. 5 is a flowchart showing the process of specifying the end point of the sound generation section P1.
FIG. 6 is a flowchart showing the process of specifying the sound generation section P2.
FIG. 7 is a conceptual diagram illustrating the process of specifying the sound generation section P2.
FIG. 8 is a flowchart showing the process of specifying the sound generation section P2 in the second embodiment.
FIG. 9 is a flowchart showing the process of specifying the sound generation section P2 in the third embodiment.
FIG. 10 is a conceptual diagram illustrating the process of specifying the sound generation section P2 in the third embodiment.

Explanation of symbols

10 ... sound collection device, 20 ... sound signal processing device, 30 ... first section specifying unit, 40 ... second section specifying unit, 32, 42 ... start point specifying unit, 34, 44 ... end point specifying unit, 50 ... frame analysis unit, 52 ... dividing unit, 54 ... feature amount calculation unit, 56 ... frame information generation unit, 58 ... arithmetic unit, 581 ... level calculation unit, 583 ... switching unit, 585 ... noise level calculation unit, 587, 64 ... storage unit, 589 ... S/N ratio calculation unit, 62 ... output control unit, 70 ... input device, 80 ... sound analysis device, 82 ... storage unit, 84 ... control unit, S ... sound signal, F ... frame, TR ... start instruction, F_HIST ... frame information, R ... S/N ratio, C ... feature amount, P1, P2 ... sound generation section.

Claims

A sound signal processing device comprising:
frame information generating means for generating first frame information and second frame information of different types for each frame of a sound signal;
first section specifying means for specifying a first sound generation section of the sound signal based on the first frame information of each frame; and
second section specifying means for specifying a second sound generation section, obtained by shortening the first sound generation section, based on the second frame information of each frame in the first sound generation section specified by the first section specifying means.

The sound signal processing device according to claim 1, wherein
the second frame information includes the number of zero crossings of the sound signal in each frame, and
when frames whose number of zero crossings included in the second frame information exceeds a threshold continue over a plurality of frames backward from the end point of the first sound generation section, the second section specifying means specifies the second sound generation section by excluding, from the plurality of frames, the frames other than a predetermined number of frames on the start point side.

The sound signal processing device according to claim 1, further comprising:
feature amount calculating means for sequentially calculating, for each frame of the sound signal, a feature amount that a sound analysis device separate from the sound signal processing device uses to analyze the sound signal; and
output control means for sequentially outputting to the sound analysis device, each time the feature amount calculating means calculates a feature amount, the feature amount of each frame in the first sound generation section specified by the first section specifying means,
wherein the second section specifying means notifies the sound analysis device of the second sound generation section.

A program that causes the following to be executed:
frame information generation processing for generating first frame information and second frame information of different types for each frame of a sound signal;
first section specifying processing for specifying a first sound generation section of the sound signal based on the first frame information of each frame; and
second section specifying processing for specifying a second sound generation section, obtained by shortening the first sound generation section, based on the second frame information of each frame in the first sound generation section specified by the first section specifying processing.