JP4521673B2 - Utterance section detection device, computer program, and computer

Info

Publication number: JP4521673B2
Application number: JP2004101094A
Authority: JP
Inventors: フランクガーピンスーン; 哲中村; 豊葦苅; 玄伊藤
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-06-19
Filing date: 2004-03-30
Publication date: 2010-08-11
Anticipated expiration: 2024-03-30
Also published as: JP2005031632A

Abstract

PROBLEM TO BE SOLVED: To provide an utterance section detecting device capable of properly detecting an utterance section regardless of environmental noise. SOLUTION: The utterance section detecting device includes a speech input part 104 which generates speech data in frames; a frame buffer 110 which stores the energy values of the framed speech on a FIFO basis; an initial environmental noise calculation part 112 which processes the energy values of the frames in the frame buffer 110 by a specified statistical method to calculate an initial value of the environmental noise estimate; a dynamic threshold calculation part 116 which calculates, frame by frame from the energy values stored in the frame buffer 110, energy thresholds for detecting an utterance section so that the thresholds follow the environmental noise included in the speech data; and a state decision part 118 which decides the state of each frame according to the thresholds. COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to an apparatus for detecting utterance sections as preprocessing for speech recognition and similar processing. In particular, it relates to an utterance section detection apparatus that can avoid erroneous detection of utterance sections caused by environmental noise in real-time speech recognition, and to a speech energy normalization apparatus that calculates normalized speech energy as a per-frame feature.

In processing such as speech recognition, the utterance sections in the input signal must be distinguished from the other sections (called silent sections) before recognition is performed. Otherwise, running speech recognition on a section that contains no utterance produces meaningless results.

Conventionally, utterance sections (or silent sections) are detected by calculating the power (energy) of the input speech signal and treating a section as an utterance section if the value is at or above a predetermined threshold and as a silent section if it is below the threshold. The length of time for which the condition has continued to hold is usually also taken into account when detecting an utterance section or a silent section.

Such a technique is disclosed in Patent Document 1. Patent Document 1 discloses a technique for extracting the portions to be summarized in order to automatically create a summary from video information with audio. It is known that the level of environmental noise in video with audio differs depending on the content (genre). For example, environmental noise is low in news programs and high in programs such as live sports broadcasts. Consequently, if the same threshold is used to detect utterance sections, the results differ depending on the genre of the video information. To address this, the technique disclosed in Patent Document 1 attaches supplementary information indicating the genre to the video information and selects a threshold assigned in advance to each genre according to that supplementary information.

Japanese Unexamined Patent Application Publication No. 2003-101939 (paragraphs 209 and 210, FIGS. 1 and 7)

However, with the technique described in Patent Document 1, only one threshold can be used for one piece of video information. As a result, utterance section detection becomes unreliable when the environmental noise changes within a program.

In particular, supplementary information of the kind described above cannot be expected to be available when performing real-time speech recognition. Moreover, when speech recognition is used for automated telephone response and similar applications, the environmental noise present in the background of the speech signal cannot be predicted. For example, when sudden environmental noise occurs, utterance sections are likely to be detected incorrectly.

It is also known that, in speech recognition, it is effective to use a feature obtained by normalizing the speech energy of each frame by the maximum speech energy within the utterance. Doing so, however, requires waiting until the end of the utterance, calculating the maximum power within the utterance, and then normalizing the speech energy of each frame of that utterance by the calculated maximum power. Waiting for the end of the utterance makes real-time speech recognition impossible.

Accordingly, an object of the present invention is to provide an utterance section detection apparatus capable of appropriately detecting utterance sections regardless of environmental noise.

Another object of the present invention is to provide an utterance section detection apparatus capable of appropriately detecting utterance sections even when the environmental noise changes.

Still another object of the present invention is to provide an utterance section detection apparatus capable of appropriately detecting utterance sections in real time even when the environmental noise changes.

Still another object of the present invention is to provide an utterance section detection apparatus capable of appropriately detecting utterance sections in real time even when the environmental noise changes suddenly.

Another object of the present invention is to provide a speech energy normalization apparatus capable of normalizing the speech energy of each frame in real time.

An utterance section detection apparatus according to a first aspect of the present invention includes: framing means for sequentially framing speech data; frame energy calculation and storage means for calculating, for each frame, the energy value of the speech framed by the framing means and storing the energy values of a first number of frames in FIFO (First-In First-Out) fashion; initial value calculation means for, in response to the energy values of a second number of frames having been stored in the frame energy calculation and storage means, processing the energy values of the second number of frames according to a predetermined statistical method to calculate an initial value of an estimate of the environmental noise contained in the speech data; means for sequentially calculating, for each frame, an energy threshold for detecting utterance sections, based on the initial value of the estimate and the speech energy values sequentially stored in the frame energy calculation and storage means, such that the threshold follows changes in the environmental noise contained in the speech data; and utterance section estimation means for estimating, based on the threshold and from among the frames after the second number of frames, the frame corresponding to the start position or end position of an utterance section of the speech data.

The initial value of the environmental noise estimate is calculated by statistically processing the energy values of the second number of frames. Thereafter, based on this initial value and the speech energy values sequentially stored in the frame energy calculation and storage means, the energy threshold for detecting utterance sections is calculated for each frame so that it follows changes in the environmental noise contained in the speech data. The threshold is then used to estimate the frame corresponding to the start position or end position of an utterance section of the speech data. Because the threshold follows changes in the environmental noise, the start position or end position of an utterance section can be estimated accurately.

Preferably, the initial value calculation means includes means for clustering the second number of frames, according to the magnitude of each frame's energy value, into a first cluster centered on a first energy value and a second cluster centered on a second energy value larger than the first energy value, and means for outputting the first energy value as the initial value of the environmental noise estimate.

A speech signal contains environmental noise and uttered speech. When the frames are clustered, they can be expected to fall into two groups: frames containing only environmental noise, and frames containing both environmental noise and uttered speech. When the frames are clustered into two clusters according to energy, frames consisting only of environmental noise make up a high proportion of the first cluster, which consists of the low-energy frames. Therefore, by taking the average of the energy values of the frames in this first cluster as the initial value of the environmental noise estimate, the initial value can be estimated with high reliability.

More preferably, the means for clustering includes means for determining a boundary value for clustering the second number of frames into the first and second clusters, and means for classifying frames whose energy value is smaller than the boundary value into the first cluster and all other frames into the second cluster.

The means for determining the boundary value may include: means for selecting, from among the second number of frames, the two frames that occupy a predetermined first sort rank and a predetermined second sort rank when the frames are sorted with the energy value as the key; first average calculation means for calculating the average of the energy values of the two selected frames; means for classifying the second number of frames into first and second groups according to whether each frame's energy value is smaller than the average calculated by the first average calculation means; second average calculation means for calculating the average energy value of the frames belonging to each of the first and second groups; and third average calculation means for calculating the average of the two averages calculated by the second average calculation means and outputting it as the boundary value.
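The boundary-value procedure and the two-cluster initial estimate described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name and the two sort ranks (here the 10% and 90% positions of the sorted list) are hypothetical, since the text only says the ranks are predetermined.

```python
from statistics import mean


def initial_noise_estimate(energies, low_rank=0.1, high_rank=0.9):
    """Estimate the initial environmental-noise energy from buffered frames.

    low_rank and high_rank are hypothetical choices for the two
    "predetermined sort ranks"; the source does not give their values.
    """
    ordered = sorted(energies)
    n = len(ordered)

    # Step 1: pick the two frames at the predetermined sort ranks
    # and average their energy values.
    low = ordered[int(low_rank * (n - 1))]
    high = ordered[int(high_rank * (n - 1))]
    seed = (low + high) / 2.0

    # Step 2: split the frames into two groups around that average.
    group1 = [e for e in ordered if e < seed]
    group2 = [e for e in ordered if e >= seed]
    if not group1 or not group2:
        # Degenerate case (e.g. all frames equal): fall back to the mean.
        return mean(ordered)

    # Step 3: the boundary value is the average of the two group means.
    boundary = (mean(group1) + mean(group2)) / 2.0

    # Step 4: frames below the boundary form the first (noise) cluster;
    # its mean energy is the initial noise estimate.
    cluster1 = [e for e in ordered if e < boundary]
    return mean(cluster1)
```

For a buffer in which six frames have energy 1.0 and four frames have energy 10.0, the sketch places the boundary at 5.5 and returns 1.0 as the initial noise estimate.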

Preferably, the means for sequentially calculating the threshold for each frame includes: means for estimating, for each frame, the energy value of the environmental noise, based on the energy values of the frames stored in the frame energy calculation and storage means and the initial value of the environmental noise estimate; means for sequentially estimating, for each frame, the maximum of the combined energy of stationary background noise and uttered speech among the energy values of the frames stored in the frame energy calculation and storage means; and means for calculating, for each frame, the energy threshold for detecting utterance sections, based on the estimated energy value of the environmental noise and the estimated combined energy value of the background noise and uttered speech.

More preferably, the utterance section estimation means includes means for determining, based on the threshold, the state of each frame after the second number of frames, the states including a non-speech state. The means for sequentially estimating the energy value of the environmental noise for each frame includes: means for storing the energy value of the environmental noise estimated one frame earlier; means for causing the storing means to store the initial value of the environmental noise estimate when that initial value is calculated; and means for calculating the background noise b(t) at time t, based on the value stored in the storing means, the energy values of the frames held in the frame energy calculation and storage means, and the determination result of the state determining means, according to the following equations:
b(t) = b(t−1) × α + E(t) × (1−α) (when the state is the non-speech state)
b(t) = b(t−1) (when the state is other than the non-speech state)
where α is a predetermined forgetting factor and E(t) is the energy value of the frame at time t. The storing means stores the calculated background noise b(t).
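The update rule for b(t) can be sketched as a single Python function. The function name and the default value of alpha are assumptions made for illustration; the source only states that α is a predetermined forgetting factor.

```python
def update_noise(prev_b, energy, non_speech, alpha=0.98):
    """Return b(t) given b(t-1), the frame energy E(t), and the frame state.

    alpha=0.98 is a hypothetical forgetting factor, not a value from the source.
    """
    if non_speech:
        # Non-speech frame: blend the new frame energy into the estimate.
        return prev_b * alpha + energy * (1.0 - alpha)
    # Any other state: hold the estimate so speech does not inflate it.
    return prev_b
```

Holding the estimate outside the non-speech state keeps speech energy from leaking into the noise estimate, while the exponential blend lets the estimate drift with slow changes in the noise during silence.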

The means for estimating the maximum of the combined energy value for each frame may include means for sorting the frames stored in the frame energy calculation and storage means with the energy value as the key, and means for selecting, as the maximum combined energy value Emax(t), the energy value of the frame that occupies a predetermined rank in the sorted result.
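The rank-based selection of Emax(t) can be illustrated with the following Python sketch. The function name and the default rank are hypothetical; the source says only that a frame at a predetermined rank in the sorted buffer is chosen.

```python
def estimate_emax(buffer_energies, rank_from_top=5):
    """Pick Emax(t) as the energy at a fixed rank from the top of the buffer.

    rank_from_top=5 is an assumed value. Rank 0 would be the true maximum.
    """
    ordered = sorted(buffer_energies, reverse=True)
    # Clamp the rank so short buffers still yield a value.
    index = min(rank_from_top, len(ordered) - 1)
    return ordered[index]
```

Choosing a rank below the top, rather than the true maximum, presumably makes the estimate less sensitive to a few frames of sudden loud noise, which would be consistent with the stated goal of robustness to abrupt noise.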

Preferably, the means for sequentially calculating the threshold for each frame includes means for calculating the threshold Eth₁(t) for detecting the utterance start position at time t according to
Eth₁(t) = b(t) + max(β, Emax(t) − b(t)) × (first constant).

More preferably, the means for sequentially calculating the threshold for each frame further includes means for calculating the threshold Eth₂(t) for detecting the utterance end position at time t according to
Eth₂(t) = b(t) + max(β, Emax(t) − b(t)) × (second constant)
where the second constant < the first constant.
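The two threshold formulas share the term max(β, Emax(t) − b(t)) and can be computed together, as in the Python sketch below. The function and parameter names are assumptions; the source does not give values for β or for the two constants, so the numbers in the usage note are illustrative only.

```python
def thresholds(b, e_max, beta, c_start, c_end):
    """Return (Eth1, Eth2) for one frame.

    b       : current noise estimate b(t)
    e_max   : estimated maximum energy Emax(t)
    beta    : floor on the dynamic range term
    c_start : first constant (start threshold)
    c_end   : second constant (end threshold), with c_end < c_start
    """
    # The floor beta keeps the thresholds from collapsing onto the noise
    # level when the buffer contains little or no speech.
    span = max(beta, e_max - b)
    return b + span * c_start, b + span * c_end
```

With b = 10, Emax = 30, β = 5, and constants 0.5 and 0.25, the sketch yields a start threshold of 20 and an end threshold of 15. Because the second constant is smaller, the end threshold sits below the start threshold, which gives the detector hysteresis.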

The utterance section detection apparatus may further include speech energy normalization means for normalizing the speech data of each frame using the larger of the maximum energy value of the frames from the beginning of the utterance and a predetermined default reference value, and outputting the result as a speech feature parameter of each frame.

Because normalization uses the larger of the maximum energy value of the frames from the beginning of the utterance and a predetermined default reference value, an approximate normalization can be performed in real time without waiting for the end of the utterance. Speech energy can therefore be obtained in real time as one of the speech feature parameters.

Preferably, the speech energy normalization means includes: reference value storage means for storing a reference value for normalization; detection means for detecting that the speech energy calculated by the frame energy calculation and storage means exceeds the reference value stored in the reference value storage means and outputting a detection signal; means for replacing, in response to the detection signal output by the detection means, the reference value stored in the reference value storage means with the value calculated by the frame energy calculation and storage means; and division means for normalizing the speech energy of the frame by dividing the speech energy value calculated by the frame energy calculation and storage means by the reference value stored in the reference value storage means.

More preferably, the utterance section detection apparatus further includes means for replacing the content of the reference value storage means with a predetermined default value in response to the utterance section estimation means having estimated the frame corresponding to the end position of an utterance section.

The utterance section detection apparatus may further include means for setting the predetermined default value based on an option value given when the utterance section detection apparatus is started.

A computer program according to a second aspect of the present invention causes a computer to operate as any of the utterance section detection apparatuses described above.

A speech energy normalization apparatus according to a third aspect of the present invention is a speech energy normalization apparatus for calculating the normalized speech energy of framed speech data in real time, and includes: reference value storage means for storing a reference value for normalization; speech energy calculation means for calculating the speech energy of the speech data of each frame; detection means for detecting that the speech energy calculated by the speech energy calculation means exceeds the reference value stored in the reference value storage means and outputting a detection signal; means for replacing, in response to the detection signal output by the detection means, the reference value stored in the reference value storage means with the value calculated by the speech energy calculation means; and division means for normalizing the speech energy of the frame by dividing the speech energy calculated by the speech energy calculation means by the reference value stored in the reference value storage means.

At the beginning of an utterance section, the speech energy is normalized using the default value as the reference value. If the speech energy of a frame exceeds the reference value partway through the utterance section, that frame's speech energy becomes the new reference value for normalization. An approximate normalization of the speech energy can thus be performed in real time without reaching the end of the utterance section. Errors arise at the beginning of the utterance section, but once the speech energy actually reaches its maximum within the utterance section, normalization is accurate from then on. Choosing the default value appropriately also keeps the errors at the beginning of the utterance section small.

Preferably, the speech energy normalization apparatus further includes means for detecting the end of an utterance section and outputting an utterance end detection signal, and means for replacing the content of the reference value storage means with a predetermined default value in response to the utterance end detection signal.

When an utterance section ends, the reference value can be reset to the default value. The speech energy can thus be normalized using an appropriate reference value for each frame.
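The running-reference normalization and its reset at the end of an utterance can be sketched as a small Python class. The class and method names and the default reference value are hypothetical; only the behavior (replace the reference when exceeded, divide by the reference, restore the default at utterance end) comes from the description.

```python
class EnergyNormalizer:
    """Approximate real-time normalization by the running maximum energy."""

    def __init__(self, default_ref=1.0):
        # default_ref stands in for the predetermined default value,
        # which the description says may be set from a startup option.
        self.default_ref = default_ref
        self.ref = default_ref

    def normalize(self, energy):
        # If the frame energy exceeds the stored reference, it becomes
        # the new reference before the division is performed.
        if energy > self.ref:
            self.ref = energy
        return energy / self.ref

    def end_of_utterance(self):
        # Restore the default reference for the next utterance.
        self.ref = self.default_ref
```

With a default reference of 4.0, frame energies 2.0, 8.0, and 4.0 normalize to 0.5, 1.0, and 0.5. The first value is only approximate because the true maximum of the utterance (8.0) has not yet been seen, which is the start-of-utterance error the description mentions.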

More preferably, the speech energy normalization apparatus further includes means for setting the predetermined default value based on an option value given when the speech energy normalization apparatus is started.

Because the default value can be set by an option value at startup, the speech energy normalization apparatus can be operated with various option values as the default value. This makes it easier to achieve appropriate normalization of the speech energy.

A computer program according to a fourth aspect of the present invention causes a computer to operate as any of the speech energy normalization apparatuses described above.

A computer according to a fifth aspect of the present invention is programmed by the computer program according to the second aspect or the computer program according to the fourth aspect, and operates as an utterance section detection apparatus or a speech energy normalization apparatus.

The utterance section detection apparatus according to the present embodiment uses a statistical method to vary the threshold for utterance section detection based on the framed input speech signal. The statistical method is designed to minimize the delay when the apparatus starts up and to allow utterance sections to be detected stably even in the presence of sudden noise. In addition, when the normalized speech energy of a frame is calculated as a feature parameter for speech recognition, the apparatus is designed so that an approximate normalization can be performed in real time.

[Principle of utterance section detection]
FIG. 1 shows a speech signal and the various parameters used by the utterance section detection method of the present embodiment. Referring to FIG. 1, utterance start position 26 and end position 28 are determined for speech signal 20 using two thresholds, utterance start threshold 22 and utterance end threshold 24. Utterance start threshold 22 and utterance end threshold 24 are determined by a statistical method from the energy calculated frame by frame from the input waveform data. The method for determining them is described later.

In FIG. 1, the temporal parameters T1 to T6 used in determining an utterance section have the following meanings.

T1: Pre-roll time. When a frame is determined to be the start position of an utterance, the frame located this pre-roll time before it (reference numeral 26 in FIG. 1) is marked as the utterance start frame.

T2: Utterance start determination time. As a first condition for determining that an utterance has started, the time for which the per-frame energy value must continuously exceed the utterance start threshold.

T3: Shortest utterance time. The minimum time for which the per-frame energy value must continuously exceed the utterance start threshold for an utterance start to be determined. An utterance start is determined only when the energy value has exceeded the utterance start threshold continuously for time T2 and continuously for time T3.

T4: Longest silence time. The longest time for which an utterance is not determined to have ended even though, in the utterance state, the per-frame energy value has fallen below the utterance end threshold.

T5: Utterance end determination time. As a first condition for determining that an utterance has ended, the time for which the per-frame energy value must continuously remain below the utterance end threshold. An utterance end is determined when the energy value has remained below the utterance end threshold continuously for time T5 and continuously for time T4.

T6: After-roll time. When an utterance is determined to have ended at a frame, the frame located this after-roll time after it (reference numeral 28 in FIG. 1) is marked as the utterance end frame.

The symbols S1 to S4 shown near the horizontal axis in FIG. 1 indicate the state of each frame, determined by a method described later. FIG. 2 shows the frame state transitions.

Referring to FIG. 2, a frame transitions among four states: non-speech state (S1) 30, utterance start state (S2) 32, utterance state (S3) 34, and utterance end state (S4) 36. Transitions between states occur as follows.

(1) In non-speech state (S1) 30, when the frame energy value rises above utterance start threshold 22, the state transitions to utterance start state (S2) 32 (arc 42).

(2) When utterance start state (S2) 32 has continued for the fixed time T3, the state becomes utterance state (S3) 34 (arc 48).

(3) In utterance start state (S2) 32, when the frame energy value falls below utterance start threshold 22, the state transitions to non-speech state (S1) 30 (arc 46).

(4) In utterance state (S3) 34, when the frame energy value falls below utterance end threshold 24, the state transitions to utterance end state (S4) 36 (arc 52).

(5) When utterance end state (S4) 36 has continued for the fixed time T4, the state transitions to non-speech state (S1) 30 (arc 58).

(6) In utterance end state (S4) 36, when the frame energy value rises above utterance end threshold 24, the state returns to utterance state (S3) 34 (arc 54).

(7) Otherwise, the current state is maintained (arcs 40, 44, 50 and 56).
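The seven transition rules can be sketched as a per-frame step function in Python. This is an illustrative simplification, not the embodiment's implementation: the names are hypothetical, T3 and T4 are expressed as assumed frame counts, the thresholds are held fixed rather than recomputed per frame, and the T1, T2, T5, and T6 parameters are not modeled.

```python
from enum import Enum


class State(Enum):
    NON_SPEECH = 1    # S1
    SPEECH_START = 2  # S2
    SPEECH = 3        # S3
    SPEECH_END = 4    # S4


def step(state, held, energy, th_start, th_end, t3_frames, t4_frames):
    """Return (next_state, frames already spent in next_state)."""
    if state is State.NON_SPEECH:
        if energy > th_start:                      # rule (1)
            return State.SPEECH_START, 1
    elif state is State.SPEECH_START:
        if energy < th_start:                      # rule (3)
            return State.NON_SPEECH, 0
        if held >= t3_frames:                      # rule (2)
            return State.SPEECH, 0
    elif state is State.SPEECH:
        if energy < th_end:                        # rule (4)
            return State.SPEECH_END, 1
    elif state is State.SPEECH_END:
        if energy > th_end:                        # rule (6)
            return State.SPEECH, 0
        if held >= t4_frames:                      # rule (5)
            return State.NON_SPEECH, 0
    return state, held + 1                         # rule (7)


def run(energies, th_start, th_end, t3_frames, t4_frames):
    """Apply step() to a sequence of frame energies; return state names."""
    state, held, names = State.NON_SPEECH, 0, []
    for energy in energies:
        state, held = step(state, held, energy,
                           th_start, th_end, t3_frames, t4_frames)
        names.append(state.name)
    return names
```

Because the start threshold is higher than the end threshold, a frame energy between the two keeps an ongoing utterance alive but cannot start a new one.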

In the apparatus of the present embodiment, the various parameters described above are set manually when the apparatus is started. Default values are used for any parameters that are not set. Because parameter setting is not directly related to the present invention, it is not described in detail below.

[Frame structure]
As described later, the apparatus according to the present embodiment processes the speech input signal in units of frames. FIG. 3 is a schematic diagram illustrating the concepts of a frame and a frame shift.

Referring to FIG. 3, each of frames 70, 72, 74, ... is a speech signal with a frame length Tw = 30 milliseconds. In the present embodiment, the speech signal is framed sequentially while this frame is moved along the time axis in steps of 10 milliseconds. This step is called the frame shift amount. Accordingly, the speech data processed by the apparatus of the present embodiment has a frame length of 30 milliseconds and a frame shift amount of 10 milliseconds.

The energy of each frame is obtained by multiplying the data in the frame by the values of window function 80 (a Hamming window) and computing the sum. The method of calculating the energy of each frame is described later.
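The framing and windowing described above can be sketched as follows. The frame length (30 ms) and frame shift (10 ms) come from the text. The energy expression is an assumption: this text does not reproduce the exact equation, so the sketch uses the sum of squared windowed samples expressed in dB. The function names and the small floor added before the logarithm are also hypothetical.

```python
import math


def hamming(n_samples):
    """Hamming window coefficients for a frame of n_samples samples."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n_samples - 1))
            for i in range(n_samples)]


def frame_energies_db(samples, sample_rate, frame_ms=30, shift_ms=10):
    """Split samples into overlapping frames and return one energy per frame.

    Assumed energy definition: 10 * log10(sum((S_i * H_i) ** 2)).
    """
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    window = hamming(frame_len)
    energies = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len]
        total = sum((s * h) ** 2 for s, h in zip(frame, window))
        energies.append(10.0 * math.log10(total + 1e-12))  # floor avoids log(0)
    return energies
```

With a 16 kHz sampling rate (an example value, not stated in the text), each frame holds 480 samples and consecutive frames start 160 samples apart.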

The apparatus of the present embodiment normally calculates utterance start threshold 22 and utterance end threshold 24 dynamically by statistically processing 100 frames of data. With dynamic processing of this kind, processing cannot start until a certain amount of data has accumulated. On the other hand, if too much data is used for the statistical processing, the delay before the apparatus operates properly becomes long, and the beginning of an utterance may not be detected correctly.

Therefore, the apparatus of the present embodiment assumes silence for the first 400 milliseconds after processing starts, and collects 40 frames of data in the frame buffer during this period. The initial value of the environmental noise is obtained from these 40 frames, and the initial values of the thresholds are then determined from that value. Thereafter, until 100 frames of data have been collected, frame data is accumulated in the frame buffer and the thresholds are calculated dynamically from the data collected so far. Once 100 frames have been reached, the thresholds are calculated while the frame data is kept at 100 entries in FIFO (First-In First-Out) fashion. This maximum number of frames (the maximum number of frames stored and used in the frame buffer) is called the frame buffer size. The number of frames used to obtain the initial value of the environmental noise is called the initial buffer size. That is, in the apparatus of the present embodiment, the frame buffer size is 100 and the initial buffer size is 40.
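The buffering scheme can be illustrated with a bounded FIFO. This is a minimal sketch: the sizes 100 and 40 come from the text, while the class, its methods, and the use of `collections.deque` are illustrative choices rather than part of the embodiment.

```python
from collections import deque

FRAME_BUFFER_SIZE = 100   # frame buffer size given in the text
INITIAL_BUFFER_SIZE = 40  # initial buffer size (first 400 ms at a 10 ms shift)


class FrameBuffer:
    """Holds the most recent frame energies in FIFO order."""

    def __init__(self, size=FRAME_BUFFER_SIZE, initial=INITIAL_BUFFER_SIZE):
        self.energies = deque(maxlen=size)  # oldest entry is dropped when full
        self.initial = initial
        self.frames_seen = 0

    def push(self, energy):
        self.energies.append(energy)
        self.frames_seen += 1

    def in_initial_phase(self):
        """True while the first `initial` frames are still being collected."""
        return self.frames_seen < self.initial
```

Between frame 40 and frame 100 the buffer simply grows; after that, each `push` discards the oldest energy so that exactly 100 values remain.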

These values of the frame buffer size and the initial buffer size are merely examples, and other values may be used.

In the following description, the number of an input frame is denoted by t (0 ≤ t). Because a frame is input every 10 milliseconds, t also represents time. In the following description, therefore, "the t-th frame" may also be referred to simply as "the frame at time t".

With this processing, the delay at the start of processing is 400 milliseconds, which poses no practical problem. Because the thresholds are normally calculated from 100 frames of data, utterance sections can be detected with high reliability.

[Device configuration]
FIG. 4 is a functional block diagram showing the configuration of the utterance section detection apparatus according to the present embodiment. Referring to FIG. 4, utterance section detection apparatus 100 detects utterance sections in a speech signal supplied from microphone 102. Utterance section detection apparatus 100 includes a voice input unit 104, which digitizes the speech signal supplied from microphone 102 by sampling and quantizing it, outputs it every 10 milliseconds as frame data in the format described above, and outputs a frame output signal 124 indicating that frame data has been output, and an input buffer 106 for storing a plurality of frame data items supplied from voice input unit 104.

Utterance section detection apparatus 100 further includes a frame information calculation unit 108 for reading frame data from input buffer 106 and calculating frame information such as the energy value, and a frame buffer 110 for storing the frame information output by frame information calculation unit 108. As described above, the size of frame buffer 110 is 100 frames. Frame buffer 110 can hold 100 items of input frame information in FIFO fashion.

In the present embodiment, frame information calculation unit 108 calculates the speech energy E(t) of the frame at time t according to the following equation.

Here, N is the number of data samples in one frame, S_i (i = 1 to N) is the value of each data sample, and H_i (i = 1 to N) is the value of the Hamming window function.

Utterance section detection apparatus 100 further includes a frame speech energy normalization processing unit 126 for normalizing the speech energy of the frame calculated by frame information calculation unit 108 with reference to the maximum power in the utterance, and writing it to input buffer 106 as one element of the feature vector of the frame. Normalizing the speech energy of a frame by the maximum energy within one utterance and using it as one of the feature quantities is known to be effective for speech recognition. To do so, however, the maximum frame energy can be calculated only after the utterance has ended, and real-time processing is then impossible.

Frame speech energy normalization processing unit 126 therefore updates the dynamic range of the speech energy in real time and thereby normalizes the speech energy in real time, albeit in a pseudo manner. For this purpose, frame speech energy normalization processing unit 126 is configured as shown in FIG. 5.

Referring to FIG. 5, frame speech energy normalization processing unit 126 includes: a default maximum value storage unit 132 for storing a default maximum value, which is used as the default value of the maximum speech energy at the beginning of an utterance, when no frame with sufficiently large speech energy has yet appeared; a maximum value storage unit 134 for storing, in the first part of an utterance, the default maximum value supplied from default maximum value storage unit 132 and, when a frame with speech energy larger than the default maximum value is detected during the utterance, the value of that speech energy; a division unit 136 for dividing speech energy 128 from frame information calculation unit 108 by the maximum value stored in maximum value storage unit 134 and writing the result to input buffer 106 as one of the feature quantities of the corresponding frame; and a comparison unit 138 for receiving the output of maximum value storage unit 134 and speech energy 128 from frame information calculation unit 108, comparing the two values, and supplying a comparison result signal 139 to maximum value storage unit 134. Comparison result signal 139 is at the H (high) level when the value indicated by speech energy 128 exceeds the maximum value stored in maximum value storage unit 134, and at the L (low) level otherwise. If a value is given as an option when this apparatus (program) is started, the default value is overwritten with that value.

When signal 200 supplied from state determination unit 118 indicates that the utterance has ended, maximum value storage unit 134 stores the value of default maximum value storage unit 132 as the new maximum value. When comparison result signal 139 from comparison unit 138 goes to the H level, it stores the value indicated by speech energy 128 as the new maximum value. Accordingly, the value stored in maximum value storage unit 134 is the default value stored in default maximum value storage unit 132 at the start of an utterance, and becomes the speech energy of a frame once a frame whose speech energy exceeds the default value appears as the utterance proceeds. The same processing is repeated while the utterance is in progress. By using this value as the maximum speech energy in the utterance to normalize the speech energy of each frame, the speech energy can be normalized in real time, albeit in a pseudo manner.

It is desirable to determine an appropriate default value in advance by experiment.
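The running-maximum normalization performed by units 132 to 138 can be sketched as below. The class and method names are hypothetical. The sketch also assumes that the stored maximum is updated before the division, so that a normalized value never exceeds 1; the text does not state the order. The default maximum must be nonzero.

```python
class EnergyNormalizer:
    """Pseudo real-time normalization by the largest energy seen so far."""

    def __init__(self, default_max):
        self.default_max = default_max  # default maximum value storage unit 132
        self.current_max = default_max  # maximum value storage unit 134

    def normalize(self, energy):
        if energy > self.current_max:   # comparison result signal 139 at H level
            self.current_max = energy
        return energy / self.current_max  # division unit 136

    def end_of_utterance(self):
        """Reset to the default when signal 200 indicates the utterance ended."""
        self.current_max = self.default_max
```

Because the maximum can only grow within an utterance, early frames are scaled by the default value and later frames by the largest energy observed up to that point.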

Utterance section detection apparatus 100 further includes: an input/output and address management unit 114 for receiving frame output signal 124 from voice input unit 104 and managing the read and write points of input buffer 106, frame information calculation unit 108, and frame buffer 110, as well as the timing of writes to and reads from them; an initial environmental noise calculation unit 112 for reading frame data 160 stored in frame buffer 110 during the first 400 milliseconds after utterance section detection apparatus 100 starts processing, and calculating the initial environmental noise; and a dynamic threshold calculation unit 116 for receiving frame data 192 from frame buffer 110, estimated value 194 of the initial environmental noise from initial environmental noise calculation unit 112, and signal 190 indicating whether the current state is non-speech state (S1) 30 (see FIG. 2), dynamically calculating utterance start threshold 22 and utterance end threshold 24 from them, and outputting the result as signal 198 indicating the threshold values.

Input buffer 106, frame buffer 110, and the like are implemented by a semiconductor memory device or similar. Input/output and address management unit 114 has a timer and, in synchronization with the digitization of speech data by voice input unit 104, manages the write pointers to and the read pointers from input buffer 106, frame buffer 110, and the like. Input/output and address management unit 114 also supplies dynamic threshold calculation unit 116 with a first-time flag 196, which is at the H level while frames up to 400 milliseconds after startup are processed and at the L level thereafter. The processing of dynamic threshold calculation unit 116 is controlled by the values of first-time flag 196 and signal 190.

Utterance section detection apparatus 100 further includes: a state determination unit 118 for determining the state of a frame, by a method described later, from signal 198 indicating the threshold values output from dynamic threshold calculation unit 116 and frame data 192 from frame buffer 110, and outputting a signal 200 representing the state; and a feature vector output unit 120 for receiving signal 200 representing the state output by state determination unit 118, reading from input buffer 106 the input data corresponding to the frame whose state has been determined, calculating the speech feature vector of that frame by a predetermined calculation method and, if the frame is the start or end frame of an utterance section, attaching a mark indicating this to feature vector 122 before outputting it. State determination unit 118 also generates signal 190, which indicates whether the current state is non-speech state (S1) 30, and supplies it to dynamic threshold calculation unit 116.

FIG. 6 is a block diagram of initial environmental noise calculation unit 112. Initial environmental noise calculation unit 112 includes: a sort processing unit 140 for sorting, in ascending order, the per-frame energy values in the frame information supplied from frame buffer 110 and storing them in a sorted frame energy storage unit 142; a seed calculation unit 144 for obtaining, from the per-frame energy values stored by sort processing unit 140, the energies of the frames at the positions corresponding to 25% and 75% from the bottom, and outputting them as values em1 and em2, which serve as the seeds of the clustering processing described later; and a storage unit 146 for storing values em1 and em2.

Initial environmental noise calculation unit 112 further includes: a first average value calculation unit 148 for reading values em1 and em2 from storage unit 146 and calculating their average e_average; a frame classification unit 150 for classifying each frame in sorted frame energy storage unit 142 into one of two clusters C1 and C2, according to whether its energy value is larger than the average output by first average value calculation unit 148, which is used as the boundary value; and a second average value calculation unit 152 for calculating, according to the following equation, the averages Em1 and Em2 of the energy values of the frames belonging to clusters C1 and C2, respectively, obtained by frame classification unit 150.

Here, N is the number of frames in frame buffer 110, I1 is the number of frames that have an energy value smaller than e_average and belong to cluster C1, and I2 is the number of frames that have an energy value larger than e_average and belong to cluster C2.

Initial environmental noise calculation unit 112 further includes a determination unit 154 for storing the two averages Em1 and Em2 calculated by second average value calculation unit 152 in storage unit 146 as new values em1 and em2, causing first average value calculation unit 148, frame classification unit 150, and second average value calculation unit 152 to repeat the processing described above, and supplying the resulting average Em1 to dynamic threshold calculation unit 116 shown in FIG. 4 as estimated value (em1) 194 of the initial environmental noise.

The processing performed by first average value calculation unit 148, frame classification unit 150, and second average value calculation unit 152 is described below with reference to FIG. 4 and FIGS. 6 to 9. In general, the energy value of each frame stored in frame buffer 110 shown in FIG. 4 varies with the energy of the input speech signal, as shown in FIG. 7. Sorting these values in ascending order of energy is expected to give the result shown in FIG. 8. This is the sort processing performed by sort processing unit 140, and the frame information stored in sorted frame energy storage unit 142 corresponds to what is shown in FIG. 8.

Sorting as in FIG. 8 makes it easy to obtain a histogram of the energy values. FIG. 9 shows an example. If the speech signal contains environmental noise and utterance components, the energy values of frames containing only environmental noise and those of frames containing utterance components can be expected to be distributed around different values. In a histogram such as that of FIG. 9, they would then form two peaks, one in the relatively low-energy region and one in the relatively high-energy region.

First average value calculation unit 148, frame classification unit 150, and second average value calculation unit 152 shown in FIG. 6 take the 25% and 75% positions of the energy values as the initial positions of the peaks, obtain the two peaks described above by subsequent computation, and cluster the frames stored in sorted frame energy storage unit 142 into two clusters: frames close to the peak on the environmental noise side and frames close to the peak on the utterance side.
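The seeding and two-cluster refinement can be sketched as follows. The 25% and 75% seeds, the boundary at the mean of the two centers, and the use of the lower cluster mean as the noise estimate follow the text. The function name, the index rounding for the seed positions, the stopping rule (a fixed iteration cap or no change in the centers), and the assignment of values equal to the boundary to the lower cluster are assumptions made for illustration.

```python
def estimate_initial_noise(energies, iterations=10):
    """Return the mean of the low-energy cluster as the initial noise estimate.

    `energies` must be a non-empty sequence of frame energies.
    """
    ordered = sorted(energies)                 # sort processing unit 140
    n = len(ordered)
    em1 = ordered[int(0.25 * (n - 1))]         # seed at the 25% position
    em2 = ordered[int(0.75 * (n - 1))]         # seed at the 75% position
    for _ in range(iterations):
        e_average = (em1 + em2) / 2.0          # first average value unit 148
        c1 = [e for e in ordered if e <= e_average]  # noise-side cluster C1
        c2 = [e for e in ordered if e > e_average]   # utterance-side cluster C2
        if not c1 or not c2:
            break                              # only one cluster is populated
        new_em1 = sum(c1) / len(c1)            # second average value unit 152
        new_em2 = sum(c2) / len(c2)
        if new_em1 == em1 and new_em2 == em2:
            break                              # centers no longer move
        em1, em2 = new_em1, new_em2
    return em1
```

This is a one-dimensional two-means clustering; when the buffer contains only noise, the second cluster is empty and the seed at the 25% position is returned unchanged.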

FIG. 10 is a functional block diagram of dynamic threshold calculation unit 116 shown in FIG. 4. Referring to FIG. 10, dynamic threshold calculation unit 116 includes: a maximum energy calculation unit 176 for receiving frame data 192 and outputting, from the sorted frame information stored in frame buffer 110, the energy of the frame at the 90% position from the smallest as the maximum energy e_max(t) (maximum energy signal 182) over the frame-buffer-size number of frames up to the t-th frame; an environmental noise calculation unit 170 for receiving frame data 192 and calculating an estimate of the environmental noise according to an equation described later; and a storage unit 174 for storing the estimate b(t−1) of the environmental noise calculated in the processing one frame earlier.

Dynamic threshold calculation unit 116 further includes a selection unit 172, which receives the estimate b(t−1) from one frame earlier stored in storage unit 174, the estimate of the environmental noise supplied from environmental noise calculation unit 170, and estimated value (em1) 194 of the initial environmental noise, and selects one of them for output as the environmental noise b(t) for the t-th frame: estimated value (em1) 194 of the initial environmental noise if first-time flag 196 is at the H level; the output of environmental noise calculation unit 170 if first-time flag 196 is at the L level and signal 190 indicates the non-speech state; and the output of storage unit 174 if first-time flag 196 is at the L level and signal 190 does not indicate the non-speech state. The output of selection unit 172 is supplied to storage unit 174 and stored there.

Dynamic threshold calculation unit 116 further includes a threshold calculation unit 178 for dynamically calculating utterance start threshold 22 and utterance end threshold 24 from the maximum energy value supplied by maximum energy calculation unit 176 and the environmental noise b(t) for the t-th frame supplied by selection unit 172. Signal 198 representing the thresholds, output by threshold calculation unit 178, is supplied to state determination unit 118 and used for state determination.

Environmental noise calculation unit 170 calculates the estimate b′(t) of the environmental noise according to Formula 1 below, from the energy E(t) of the t-th frame in frame data 192 stored in frame buffer 110 and the environmental noise b(t−1) for the (t−1)-th frame stored in storage unit 174.
[Formula 1]
b′(t) = b(t−1) × α + E(t) × (1−α)
Here, α is a predetermined forgetting factor and E(t) is the energy of the t-th frame. The forgetting factor takes a value from 0 to 1 inclusive; the present embodiment uses 0.8.

If the state is other than the non-speech state, selection unit 172 selects the environmental noise b(t−1) for the (t−1)-th frame output from storage unit 174. In this case, therefore, the environmental noise does not change. If the state is the non-speech state, selection unit 172 selects the estimate b′(t) of the environmental noise output from environmental noise calculation unit 170.
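The recursive noise estimate of Formula 1 and the three-way choice made by selection unit 172 can be sketched as a single function. The forgetting factor 0.8 and the selection conditions come from the text; the function and parameter names are hypothetical.

```python
ALPHA = 0.8  # forgetting factor used in the embodiment


def update_noise(prev_noise, energy, initial_noise, first_flag, non_speech,
                 alpha=ALPHA):
    """Return the environmental noise b(t) for the current frame.

    prev_noise    -- b(t-1) held in storage unit 174
    energy        -- E(t), energy of the current frame
    initial_noise -- estimated value (em1) 194 of the initial noise
    first_flag    -- True while first-time flag 196 is at the H level
    non_speech    -- True when signal 190 indicates the non-speech state
    """
    if first_flag:
        return initial_noise                                # first 400 ms
    if non_speech:
        return prev_noise * alpha + energy * (1.0 - alpha)  # Formula 1
    return prev_noise                                       # frozen during speech
```

Freezing the estimate outside the non-speech state keeps the energy of the utterance itself from being absorbed into the noise level.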

Therefore, the environmental noise b(t) at time t output from environmental noise calculation unit 170 is expressed by the following equations, where E(t) is the energy value of the frame at time t and α is the forgetting factor described above.
[Formula 2]
b(t) = b(t−1) × α + E(t) × (1−α) (when the state is the non-speech state)
b(t) = b(t−1) (when the state is other than the non-speech state)
Threshold calculation unit 178 dynamically calculates the utterance start threshold Eth₁ and the utterance end threshold Eth₂ according to the following equations.
[Formula 3]
For 0 ≤ t < 400 milliseconds:
Eth₁(t) = b(t) + β × γ₁
Eth₂(t) = b(t) + β × γ₂
For 400 milliseconds ≤ t:
Eth₁(t) = b(t) + max(β, Emax(t) − b(t)) × γ₁
Eth₂(t) = b(t) + max(β, Emax(t) − b(t)) × γ₂
Here, β is the minimum dynamic range of an utterance, which is 20 dB in the present embodiment. γ₁ and γ₂ are the utterance start threshold ratio and the utterance end threshold ratio, respectively; each is an experimentally determined constant between 0 and 1 inclusive. The present embodiment uses γ₁ = 0.25 and γ₂ = 0.20.
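Formula 3, together with the 90% position used for the maximum energy, can be sketched as follows. The constants are the values given for the embodiment. The function names and the index rounding used to pick the 90% position are assumptions for illustration.

```python
BETA = 20.0    # minimum dynamic range of an utterance, in dB
GAMMA1 = 0.25  # utterance start threshold ratio
GAMMA2 = 0.20  # utterance end threshold ratio


def max_energy(buffer_energies):
    """Energy at the 90% position of the sorted buffer contents."""
    ordered = sorted(buffer_energies)
    return ordered[int(0.9 * (len(ordered) - 1))]  # assumed index rounding


def thresholds(noise, e_max, initial_phase):
    """Return (start threshold, end threshold) for the current frame.

    noise         -- environmental noise b(t)
    e_max         -- maximum energy for the current buffer contents
    initial_phase -- True for the first 400 ms, when only beta is used
    """
    span = BETA if initial_phase else max(BETA, e_max - noise)
    return noise + span * GAMMA1, noise + span * GAMMA2
```

For example, with b(t) = 30 dB and a maximum energy of 70 dB, the span is 40 dB, so the start threshold is 40 dB and the end threshold is 38 dB. Taking the 90% position rather than the true maximum makes the span less sensitive to isolated loud frames.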

The utterance start threshold Eth₁ and the utterance end threshold Eth₂ calculated in this way are used as utterance start threshold 22 and utterance end threshold 24 when detecting utterance sections as described with reference to FIG. 1.

[Device operation]
The apparatus whose configuration is described above operates as follows.

-At startup-
At startup, the buffers needed for processing and an area for storing option values are allocated in the storage device. The option values given at startup are checked, and each option whose value is valid is set to the given value. Options for which no value was given are set to their default values. If a given option value is invalid, a message to that effect is displayed and processing ends. For default maximum value storage unit 132 of frame speech energy normalization processing unit 126 shown in FIG. 5, if an option value is given at startup, that value is stored as the default value and is also stored in maximum value storage unit 134. If no option value is given, the default value defined in the program is stored in default maximum value storage unit 132 and also in maximum value storage unit 134.

The write point and read point of each buffer are set to their initial values.

The time (frame number) at which actual processing starts after startup is taken as t = 0. The state of the frame at this time is set to the non-speech state. Thereafter, voice input unit 104 shown in FIG. 4 digitizes the electrical signal from microphone 102 every 10 milliseconds with a frame length of 30 milliseconds.

-From 0 milliseconds to 400 milliseconds-
First-time flag 196 from input/output and address management unit 114 is at the H level. When the amount of data needed for utterance determination has accumulated, voice input unit 104 writes a predetermined number of data items, the number handed over in one processing step, to the address in input buffer 106 specified by the buffer write pointer.

Frame information calculation unit 108 reads one frame of data from the address in input buffer 106 specified by the read pointer, calculates the frame energy, and writes it to the area of frame buffer 110 corresponding to that frame. Frame information calculation unit 108 also supplies the calculated frame energy, as speech energy 128 of this frame, to division unit 136, comparison unit 138, and maximum value storage unit 134 shown in FIG. 5. Comparison unit 138 compares the value stored in maximum value storage unit 134 with the value indicated by speech energy 128 and supplies comparison result signal 139 to maximum value storage unit 134. When the value indicated by speech energy 128 is found to exceed the value stored in maximum value storage unit 134, comparison result signal 139 goes to the H level, and in response maximum value storage unit 134 stores the value represented by speech energy 128 in place of the value stored until then.

The division unit 136 divides the value represented by the speech energy 128 by the value stored in the maximum value storage unit 134 to obtain the normalized speech energy. The normalized speech energy 130 is written into the normalized speech energy field of the corresponding frame in the input buffer 106. Thereafter, the frame information calculation unit 108 and the frame speech energy normalization processing unit 126 repeat the same operation for each frame.

The initial environmental noise calculation unit 112 reads the frame energies written to the frame buffer 110 by the frame information calculation unit 108 and begins calculating the initial environmental noise. No state determination is performed between 0 and 400 milliseconds.

Next, the operation of the initial environmental noise calculation unit 112 is described with reference to FIG. 6. The sort processing unit 140 sorts the frame energy values 160 read from the frame buffer 110 and stores them in the post-sort frame energy storage unit 142. At t = 0 only one frame energy value, E(0), has been read, so that value is written to the first area of the post-sort frame energy storage unit 142. From the second time onward, the result of the previous sort is already in the post-sort frame energy storage unit 142, and it is enough to add the one new frame energy at the position determined by its magnitude (heap sort). The sort can therefore be carried out with little computation.
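The incremental sort described above can be illustrated with a short sketch. The source calls the procedure heap sort; the sketch instead uses binary-search insertion into an already sorted list, which matches the described behavior of placing one new value at the position given by its magnitude. The sample energy values are made up.

```python
import bisect

# Sorted store standing in for the post-sort frame energy storage unit 142.
sorted_energies = []

# Hypothetical frame energies arriving one per frame.
for energy in [52.0, 47.5, 60.1, 49.3]:
    # Only the new value is placed; the existing order is reused.
    bisect.insort(sorted_energies, energy)
```

Each insertion needs a binary search plus one list shift, so no full re-sort of the stored values is required per frame.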

Between 0 and 400 milliseconds after start-up, the seed calculation unit 144 and the processing units that follow it do not operate.

-When 400 milliseconds have elapsed-
When 400 milliseconds have elapsed after start-up, the frame buffer 110 holds the energy values of 40 frames (E(0) to E(39)). This state corresponds to FIG. 7. The post-sort frame energy storage unit 142 holds the energy values of these 40 frames sorted in ascending order. This state corresponds to FIG. 8.

フレーム情報算出部１０８及びフレーム音声エネルギ正規化処理部１２６は、４００ミリ秒経過までと同様に動作する。 The frame information calculation unit 108 and the frame audio energy normalization processing unit 126 operate in the same manner up to 400 milliseconds.

The division unit 136 divides the value represented by the speech energy 128 by the value stored in the maximum value storage unit 134 to obtain the normalized speech energy. The normalized speech energy 130 is written into the normalized speech energy field of the corresponding frame in the input buffer 106.

The seed calculation unit 144 determines, among the 40 frame energies held in the post-sort frame energy storage unit 142, the values at the 25% and 75% positions counted from the smallest. These values are stored in the storage unit 146 and serve as the seeds for the clustering performed by the first average value calculation unit 148, the frame classification unit 150, and the second average value calculation unit 152.

The first average value calculation unit 148 reads the seeds em1 and em2 from the storage unit 146, calculates their average, and supplies it to the frame classification unit 150. The frame classification unit 150 classifies the 40 frames into two clusters according to whether the energy value of each frame is closer to seed em1 or to seed em2, and supplies the classification result to the second average value calculation unit 152.

For each of the two clusters, the second average value calculation unit 152 calculates the average of the energy values of the frames belonging to that cluster, Em1 and Em2, and supplies them to the determination unit 154.

The determination unit 154 stores the Em1 and Em2 supplied from the second average value calculation unit 152 in the storage unit 146 as the new em1 and em2, and causes the first average value calculation unit 148, the frame classification unit 150, and the second average value calculation unit 152 to execute the same processing once more. Of the Em1 and Em2 obtained again in this way, Em1 is supplied to the dynamic threshold calculation unit 116 as the estimate 194 (em1) of the initial environmental noise.
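The seeding and two-pass clustering described above can be sketched as follows. This is a hedged illustration, not the patented implementation: the function name, the exact index used for the 25% and 75% positions, and the handling of an empty cluster are assumptions.

```python
def estimate_initial_noise(energies, passes=2):
    """Two-cluster split of frame energies; returns (Em1, Em2).

    Em1, the mean of the lower cluster, plays the role of the initial
    environmental noise estimate.
    """
    ordered = sorted(energies)
    n = len(ordered)
    em1 = ordered[n // 4]          # seed at the 25% position (assumed indexing)
    em2 = ordered[(3 * n) // 4]    # seed at the 75% position (assumed indexing)
    for _ in range(passes):
        # Being closer to em1 than to em2 is the same as lying below their midpoint.
        boundary = (em1 + em2) / 2.0
        low = [e for e in ordered if e < boundary]
        high = [e for e in ordered if e >= boundary]
        if not low or not high:    # assumed guard: keep the previous centers
            break
        em1 = sum(low) / len(low)
        em2 = sum(high) / len(high)
    return em1, em2


# 30 quiet frames and 10 loud frames (made-up log-energy values).
noise, speech = estimate_initial_noise([40.0] * 30 + [70.0] * 10)
```

The midpoint comparison is why an average of the two seeds is computed before classification: it is the decision boundary between the two clusters.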

The operation of the dynamic threshold calculation unit 116 is described with reference to FIG. 10. The selection unit 172 of the dynamic threshold calculation unit 116 selects em1, the estimate 194 of the initial environmental noise, as the initial value of b(t) and supplies it to the storage unit 174 and the threshold calculation unit 178. The storage unit 174 stores this value.

Meanwhile, the maximum energy calculation unit 176 determines, among the sorted frame energy values held in the post-sort frame energy storage unit 142, the energy value at the 90% position counted from the smallest, and supplies it to the threshold calculation unit 178 as the maximum energy value (Emax) 182.

Based on the environmental noise estimate em1 supplied from the selection unit 172 and the maximum energy value (Emax) 182 from the maximum energy calculation unit 176, the threshold calculation unit 178 calculates the utterance start threshold 22 and the utterance end threshold 24 according to Equation 3 given earlier (198), and supplies them to the state determination unit 118 shown in FIG. 4.
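As a rough sketch of this step, the code below picks Emax at the 90% position and forms the two thresholds in the shape stated in the claims, Eth(t) = b(t) + max(β, Emax(t) − b(t)) × constant. The values of β and of the two constants, as well as the function names, are hypothetical; the source requires only that the end constant be smaller than the start constant.

```python
def percentile_energy(sorted_energies, fraction=0.9):
    """Energy at the given position counted from the smallest (assumed indexing)."""
    index = min(int(len(sorted_energies) * fraction), len(sorted_energies) - 1)
    return sorted_energies[index]


def thresholds(noise, e_max, beta=10.0, start_coeff=0.5, end_coeff=0.25):
    """Return (start threshold, end threshold).

    beta, start_coeff and end_coeff are illustrative values only.
    """
    span = max(beta, e_max - noise)   # never below the minimum dynamic range
    return noise + span * start_coeff, noise + span * end_coeff


sorted_energies = [float(v) for v in range(40, 80)]   # 40 made-up values
e_max = percentile_energy(sorted_energies)
start_th, end_th = thresholds(40.0, e_max)
```

The `max(beta, ...)` term keeps the thresholds a fixed distance above the noise floor even when the buffer contains no speech and Emax is close to the noise estimate.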

Based on the utterance start threshold 22 and the utterance end threshold 24 supplied from the dynamic threshold calculation unit 116, the state determination unit 118 determines the state of the frame by the determination method shown in FIGS. 1 and 2 and supplies a signal 200 representing the result to the feature vector output unit 120 and the frame speech energy normalization processing unit 126. The state determination unit 118 also supplies the dynamic threshold calculation unit 116 with a signal 190 indicating whether or not the frame is in the non-speech state.

When the signal 200 representing the state indicates that an utterance section has ended, the maximum value storage unit 134 (see FIG. 5) of the frame speech energy normalization processing unit 126 replaces the value it had been holding with the value of the default maximum value storage unit 132. As a result, when normalization of the speech energy starts for the next utterance, the default value (or a value supplied as an option) is again used as the maximum power.
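The running-maximum normalization and its reset at the end of an utterance can be sketched as below. The class and method names are hypothetical, and updating the maximum before dividing is an assumed ordering.

```python
class RunningMaxNormalizer:
    """Normalizes frame energy by the largest value seen so far in the utterance."""

    def __init__(self, default_max):
        self.default_max = default_max    # role of default maximum value storage 132
        self.current_max = default_max    # role of maximum value storage 134

    def normalize(self, energy):
        if energy > self.current_max:     # comparison unit 138
            self.current_max = energy
        return energy / self.current_max  # division unit 136

    def end_of_utterance(self):
        # Restore the default so the next utterance starts from the same maximum.
        self.current_max = self.default_max


normalizer = RunningMaxNormalizer(default_max=60.0)
values = [normalizer.normalize(e) for e in (30.0, 80.0, 40.0)]
normalizer.end_of_utterance()
after_reset = normalizer.normalize(30.0)
```

In this example the first frame is scaled by the default, while later frames are scaled by the larger maximum observed within the utterance.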

The feature vector output unit 120 reads from the input buffer 106 the data of a frame whose state has been fixed by the state determination unit 118, calculates the feature vector of that frame, and outputs it (122). If the frame is an utterance start frame or an utterance end frame, the feature vector output unit 120 attaches a mark indicating this to the feature vector before outputting it.

-From 400 milliseconds to 1 second-
The initial flag 196 from the input/output and address management unit 114 is turned off. The processing for the frames after the 40th and up to the 100th is almost the same as the processing for the 40th frame. During this period, one frame of data is added to the frame buffer 110 every 10 milliseconds, and the state determination is carried out using all the frame information held in the frame buffer 110 at that point.

In the dynamic threshold calculation unit 116 shown in FIG. 10, the storage unit 174 already holds the environmental noise estimate b(t−1) calculated in the processing of the previous frame. From the estimate b(t−1) held in the storage unit 174 and the energy E(t) of the t-th frame obtained from the frame data 192, the environmental noise calculation unit 170 calculates the environmental noise estimate b'(t) according to Equation 1 and supplies it to the selection unit 172.

Because the initial flag 196 is off, the selection unit 172 selects either the output of the storage unit 174 or the output of the environmental noise calculation unit 170 according to the value of the signal 190 indicating the state. That is, if the state represented by the signal 190 is the non-speech state, the selection unit 172 selects the output of the environmental noise calculation unit 170; otherwise it selects the output of the storage unit 174. The selection unit 172 supplies a signal indicating the selected value to the storage unit 174 and the threshold calculation unit 178.
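The selection logic above, combined with the update formula stated in the claims, b(t) = b(t−1) × α + E(t) × (1−α) in the non-speech state and b(t) = b(t−1) otherwise, can be sketched as follows. The function name and the default value of the forgetting factor α are assumptions.

```python
def update_noise(prev_noise, frame_energy, is_non_speech, alpha=0.9):
    """Return b(t) from b(t-1) and E(t); alpha is an illustrative value."""
    if is_non_speech:
        # Output of the environmental noise calculation unit 170 is selected.
        return prev_noise * alpha + frame_energy * (1.0 - alpha)
    # During speech the stored estimate from storage unit 174 is kept unchanged.
    return prev_noise


tracked = update_noise(40.0, 50.0, is_non_speech=True, alpha=0.5)
frozen = update_noise(40.0, 80.0, is_non_speech=False)
```

Freezing the estimate outside the non-speech state prevents loud speech frames from being absorbed into the noise floor.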

In all other respects, the dynamic threshold calculation unit 116 performs the same processing as for the 40th frame. The same applies to the state determination unit 118, the feature vector output unit 120, and the frame speech energy normalization processing unit 126.

-After 1 second-
The processing from the 101st frame onward is also almost the same as the processing from 400 milliseconds to 1 second. In this phase, however, the oldest frame information is deleted whenever new frame information is added to the frame buffer 110. In other words, the frame buffer 110 stores data in FIFO fashion, so it always holds the frame information of 100 frames. The sort processing unit 140 behaves in the same way: the energy value of the oldest frame is deleted from the post-sort frame energy storage unit 142, and the energy value of the new frame is written at the position determined by its magnitude.
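A minimal sketch of this FIFO behavior, keeping an arrival-order queue and a sorted list in step, might look like the following. The class name and the use of `deque` and `bisect` are illustrative choices, not taken from the source.

```python
import bisect
from collections import deque


class SlidingEnergyWindow:
    """Holds the most recent `capacity` frame energies, both in order and sorted."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.fifo = deque()          # arrival order, like frame buffer 110
        self.sorted_energies = []    # ascending order, like storage unit 142

    def push(self, energy):
        if len(self.fifo) == self.capacity:
            oldest = self.fifo.popleft()
            # Remove one copy of the oldest value from the sorted list.
            index = bisect.bisect_left(self.sorted_energies, oldest)
            del self.sorted_energies[index]
        self.fifo.append(energy)
        bisect.insort(self.sorted_energies, energy)


window = SlidingEnergyWindow(capacity=3)
for value in (5.0, 1.0, 3.0, 2.0):
    window.push(value)
```

With a capacity of 100 frames and a 10 ms shift, the window would always cover the most recent second of input.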

The initial environmental noise calculation unit 112, the dynamic threshold calculation unit 116, the state determination unit 118, and the feature vector output unit 120 all operate on the 100 frames of data held in the frame buffer 110, repeatedly estimating the background noise, calculating the thresholds, determining the state, and creating the feature vectors.

In this way, each per-frame feature vector 122 output from the feature vector output unit 120 carries an utterance start marker if the frame is an utterance start position and an utterance end marker if it is an utterance end position. These markers make it possible to detect the utterance sections (from utterance start position to utterance end position) of the original speech data.
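A downstream consumer could recover the utterance sections from the markers along the following lines. The marker representation (`"start"`, `"end"`, or `None` per frame) is purely hypothetical; the source does not specify how the marks are encoded.

```python
def utterance_sections(markers):
    """Return (start_frame, end_frame) pairs from a per-frame marker sequence."""
    sections = []
    start = None
    for index, marker in enumerate(markers):
        if marker == "start":
            start = index
        elif marker == "end" and start is not None:
            sections.append((start, index))
            start = None
    return sections


sections = utterance_sections(
    [None, "start", None, None, "end", None, "start", "end"]
)
```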

The feature vector 122 also contains the normalized per-frame speech energy, which can be used as a feature in speech recognition. This speech energy is pseudo-normalized: it is divided not by the maximum found by examining the whole utterance, but by a running maximum that is updated in real time from the beginning of the utterance. FIG. 11 illustrates this.

The transition of the maximum speech energy used by this normalization is described with reference to FIG. 11. In the conventional method, the maximum speech energy of an utterance is determined once the utterance has finished, and the speech energy is normalized by that value. In FIG. 11, this maximum is represented by the dotted line 212 followed by the thick solid line 218.

In the embodiment described above, by contrast, the maximum speech energy indicated by the dotted line 212 is approximated at the start of the utterance by a fixed default value (or option value) 214. Once the speech energy exceeds this default value (the thick solid curve 216 in FIG. 11), the approximation of the maximum is replaced by that speech energy value. After the point of the actual maximum speech energy in the utterance has been reached, the approximation equals the actual maximum (the thick solid line 218).

This normalization makes it possible to normalize the speech energy in real time. Because the default value is used as the maximum at the beginning of each utterance, some error arises, but if the default is set to a suitable magnitude, a sufficient effect is obtained even though the normalization is only approximate.

-Effects of the embodiment-
In the device of the embodiment described above, the utterance start threshold and the utterance end threshold used to detect the start and end of an utterance are changed dynamically in accordance with the actual speech data, by statistically processing that data. Utterance sections can thus be detected using thresholds that follow changes in the environmental noise. As a result, utterance sections can be detected correctly while the influence of environmental noise is kept to a minimum.

In the device of the above embodiment, the maximum frame energy used in calculating the thresholds is not the actual maximum but the 90% value, that is, the energy at the 90% position counted from the smallest, as determined by the maximum energy calculation unit 176. This suppresses large changes in the thresholds caused by sudden changes in the environmental noise. In addition, because the thresholds are calculated by statistical processing over as many frames as the frame buffer holds, a prominent change in energy in a few frames has relatively little influence on the thresholds. As a result, the thresholds can be calculated stably.

Furthermore, the device of this embodiment starts state determination when 40 frames of data are available. Statistical processing needs a certain number of samples, so thresholds calculated from too few frames make the state determination unreliable. The threshold calculation should therefore start from at least about 300 milliseconds of speech data, and preferably from about 400 milliseconds as in this embodiment. Because state determination starts when the number of frames to be processed reaches 40, the delay from start-up to the start of state determination is about 400 milliseconds, which is small enough not to be a problem in practice. If the delay were made too large, detection of the start of an utterance section could fail. Also, although the delay in the above embodiment is 400 milliseconds, 1000 milliseconds of data are used for the threshold calculation thereafter, so reliable thresholds are obtained with a small delay.

[Modifications]
In the embodiment described above, a Hamming window is used as the window function when calculating the frame energy. The usable window functions are not limited to this, however. Any suitable window function, such as the Hann (Hanning), Blackman, Kaiser, or Blackman-Harris window, may be used.

In the above embodiment, the frame buffer size is 100 frames and the initial buffer size is 40 frames. These values are only an example, and other combinations are possible. If the frame buffer is made too large, however, it becomes difficult for the thresholds to follow changes in the environmental noise. If it is made too small, the thresholds change in response to slight changes in the environmental noise, and utterance sections can no longer be detected stably. If the initial buffer is made too large, the delay until the environmental noise is estimated grows, and the beginning of an utterance section is more likely to be missed. The initial buffer size must of course be no larger than the frame buffer size. Accordingly, a frame buffer size of about 300 to 2000 milliseconds and an initial buffer size of about 200 to 500 milliseconds are appropriate. A frame buffer size of about 600 to 1000 milliseconds with an initial buffer size of about 300 to 450 milliseconds is particularly suitable.

In the embodiment described above, a fixed value calculated in advance is used as the default value for normalizing the speech energy. The present invention is not limited to such an embodiment, however. For example, the default value may be updated at the end of each utterance using the maximum power of the utterance that has just ended. In that case, the maximum energy is preferably multiplied by a predetermined coefficient a (0 < a ≦ 1, preferably 0.7 < a < 0.9, and more preferably about a = 0.8). The default value may also be updated as a function of the maximum energies of several past utterances rather than of the immediately preceding utterance alone.
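The default-value update in this modification amounts to a single scaling step, sketched below. The function name is hypothetical; only the coefficient range and the suggested value of about 0.8 come from the text.

```python
def updated_default(previous_utterance_max, a=0.8):
    """New default maximum derived from the utterance that just ended."""
    if not 0.0 < a <= 1.0:
        raise ValueError("coefficient a must satisfy 0 < a <= 1")
    return a * previous_utterance_max


halved_default = updated_default(50.0, a=0.5)
next_default = updated_default(50.0)
```

Choosing a below 1 makes the next utterance start from a slightly lower maximum than the last one reached.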

In the embodiment described above, the speech energy feature parameter is the normalized logarithmic speech energy, where the logarithmic speech energy is obtained by multiplying the absolute value of each speech sample in the frame by the window function value, taking the logarithm of the mean of these products, and multiplying by a factor of 20. The present invention is not limited to such an embodiment, however. It applies equally when, for example, the logarithmic speech energy is calculated by multiplying the square of each speech sample by the window function value, taking the logarithm of the mean of these products, and multiplying by a factor of 10.
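The two energy definitions can be compared with a small sketch. The function names, the base-10 logarithm, and the small floor that avoids taking the logarithm of zero are assumptions; the Hamming coefficients 0.54 and 0.46 are the standard ones.

```python
import math


def hamming(n):
    """Standard Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def log_energy_abs(frame, window, floor=1e-10):
    """20 * log10 of the mean of |x| * w (form used in the embodiment)."""
    mean = sum(abs(x) * w for x, w in zip(frame, window)) / len(frame)
    return 20.0 * math.log10(max(mean, floor))   # floor is an assumed safeguard


def log_energy_square(frame, window, floor=1e-10):
    """10 * log10 of the mean of x^2 * w (alternative form)."""
    mean = sum(x * x * w for x, w in zip(frame, window)) / len(frame)
    return 10.0 * math.log10(max(mean, floor))


flat = [1.0] * 4                     # rectangular window, for an easy check
e_abs = log_energy_abs([10.0] * 4, flat)
e_sq = log_energy_square([10.0] * 4, flat)
```

For a constant-amplitude frame with a flat window the two definitions give the same value, which is why the factors 20 and 10 pair with the absolute value and the square respectively.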

The device of the embodiment described above is expected to be implemented by a processor such as a DSP (Digital Signal Processor) and a program (including a microprogram) executed on that processor. Given the above description, writing such a program will be straightforward for a person skilled in the art.

The embodiment disclosed here is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram for explaining the utterance section determination method of the present invention and its parameters.
FIG. 2 is a state transition diagram of the utterance section processing of the present invention.
FIG. 3 is a diagram for explaining the frame length and the frame shift amount.
FIG. 4 is a functional block diagram of an utterance section detection device according to an embodiment of the present invention.
FIG. 5 is a block diagram of the speech energy normalization processing unit of the device shown in FIG. 4.
FIG. 6 is a functional block diagram of the initial environmental noise calculation unit of the device shown in FIG. 4.
FIG. 7 is a diagram showing an example of changes in frame energy.
FIG. 8 is a diagram showing the result of sorting the frame energies in ascending order.
FIG. 9 is a histogram of frame energy.
FIG. 10 is a functional block diagram of the dynamic threshold calculation unit of the device shown in FIG. 4.
FIG. 11 is a diagram for explaining the speech energy normalization processing in an embodiment of the present invention.

Explanation of symbols

20 speech signal, 22 utterance start threshold, 24 utterance end threshold, 30 non-speech state (S1), 32 utterance start state (S2), 34 utterance state (S3), 36 utterance end state (S4), 100 utterance section detection device, 102 microphone, 104 speech input unit, 106 input buffer, 108 frame information calculation unit, 110 frame buffer, 112 initial environmental noise calculation unit, 114 input/output and address management unit, 116 dynamic threshold calculation unit, 118 state determination unit, 120 feature vector output unit, 122 feature vector, 124 frame output signal, 126 frame speech energy normalization processing unit, 140 sort processing unit, 142 post-sort frame energy storage unit, 144 seed calculation unit, 146, 174 storage unit, 148 first average value calculation unit, 150 frame classification unit, 152 second average value calculation unit, 154 determination unit, 160 frame data, 170 environmental noise calculation unit, 172 selection unit, 176 maximum energy calculation unit, 178 threshold calculation unit

Claims

1. An utterance section detection device comprising:
framing means for sequentially framing speech data;
frame energy calculation and storage means for calculating, for each frame, the energy value of the speech framed by the framing means, and for storing the energy values of a first number of frames in FIFO (First-In First-Out) format;
initial value calculation means for, in response to the energy values of a second number of frames having been stored in the frame energy calculation and storage means, processing the energy values of the second number of frames according to a predetermined statistical technique to calculate an initial value of an estimate of the environmental noise contained in the speech data;
means for sequentially calculating, for each frame, based on the initial value of the estimate and the energy values of the speech sequentially stored in the frame energy calculation and storage means, a threshold of the energy value for detecting an utterance section, the threshold changing so as to follow changes in the environmental noise contained in the speech data; and
utterance section estimation means for estimating, based on the threshold, which of the frames after the second number of frames corresponds to a start position or an end position of an utterance section of the speech data,
wherein the initial value calculation means includes:
means for clustering the second number of frames, according to the magnitude of the energy value of each frame, into a first cluster centered on a first energy value and a second cluster centered on a second energy value greater than the first energy value; and
means for outputting the first energy value as the initial value of the estimate of the environmental noise.

2. The utterance section detection device according to claim 1, wherein the means for clustering includes:
means for determining a boundary value for clustering the second number of frames into the first and second clusters; and
means for classifying frames whose energy value is smaller than the boundary value into the first cluster and all other frames into the second cluster.

3. The utterance section detection device according to claim 2, wherein the means for determining the boundary value includes:
means for selecting, from the second number of frames, the two frames that occupy a predetermined first sort rank and a predetermined second sort rank when the frames are sorted using the energy value as the key;
first average value calculation means for calculating the average of the energy values of the two selected frames;
means for classifying the second number of frames into first and second groups according to whether the energy value is smaller than the average calculated by the first average value calculation means;
second average value calculation means for calculating the average of the energy values of the frames belonging to each of the first and second groups; and
third average value calculation means for calculating the average of the two averages calculated by the second average value calculation means and outputting it as the boundary value.

4. The utterance section detection device according to claim 1, wherein the means for sequentially calculating the threshold for each frame includes:
means for sequentially estimating, for each frame, the energy value of the environmental noise of the frames stored in the frame energy calculation and storage means, based on the energy values of those frames and the initial value of the estimate of the environmental noise;
means for sequentially estimating, for each frame, the maximum value of the total energy value of stationary background noise and uttered speech among the energy values of the frames stored in the frame energy calculation and storage means; and
means for calculating, for each frame, the energy threshold for detecting the utterance section, based on the estimated energy value of the environmental noise and the estimated total energy value of the background noise and the uttered speech.

The utterance section detection device according to claim 4, wherein
the utterance interval estimation means includes means for determining, based on the threshold value, the state of each frame after the second number of frames,
the state includes a non-speech state, and
the means for sequentially estimating the energy value of the environmental noise for each frame comprises:
means for storing the energy value of the environmental noise estimated one frame earlier;
means for storing the initial value of the estimated value of the environmental noise in the means for storing at the time when that initial value is calculated; and
means for calculating the background noise b(t) at time t, based on the value stored in the means for storing, the energy value of the frame held in the frame energy calculation and storage means, and the determination result of the means for determining the state of the frame, according to the following formulas:
b(t) = b(t−1) × α + E(t) × (1−α) (when the state is the non-speech state)
b(t) = b(t−1) (when the state is other than the non-speech state)
where α is a predetermined forgetting factor and E(t) is the energy value of the frame at time t,
and wherein the means for storing stores the calculated background noise b(t).
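
For illustration only, the recursive noise update recited above can be sketched in Python. This is a minimal sketch, not the claimed implementation: the function names `update_noise` and `track_noise`, the default value of `alpha`, and the use of a boolean flag for the frame state are assumptions introduced here.

```python
def update_noise(prev_noise, frame_energy, is_non_speech, alpha=0.95):
    """Return b(t) from b(t-1), E(t), and the frame-state decision.

    alpha is the forgetting factor; 0.95 is an assumed placeholder value.
    """
    if is_non_speech:
        # b(t) = b(t-1) * alpha + E(t) * (1 - alpha)
        return prev_noise * alpha + frame_energy * (1.0 - alpha)
    # Any state other than non-speech: hold the previous estimate.
    return prev_noise


def track_noise(energies, non_speech_flags, initial_noise, alpha=0.95):
    """Apply the update frame by frame, starting from the initial estimate."""
    noise = initial_noise  # plays the role of the stored value b(t-1)
    history = []
    for energy, is_non_speech in zip(energies, non_speech_flags):
        noise = update_noise(noise, energy, is_non_speech, alpha)
        history.append(noise)
    return history
```

With an initial estimate of 20, α = 0.5, frame energies 10, 50, 10 and states non-speech, speech, non-speech, the estimate moves to 15, is held at 15 during the speech frame, and then moves to 12.5.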

The utterance section detection device according to claim 5, wherein the means for estimating the maximum value of the total energy value for each frame comprises:
means for sorting the frames stored in the frame energy calculation and storage means using their energy values as keys; and
means for selecting, as the maximum value Emax(t) of the total energy value, the energy value of the frame that falls at a predetermined position in the order produced by the means for sorting.
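
As a hedged illustration of the sort-and-select step, the following Python sketch picks the energy at a fixed rank of the sorted buffer. The claim only requires "a predetermined order"; the descending sort direction, the parameter name `rank`, its default value, and the clamping for short buffers are assumptions made here.

```python
def estimate_emax(buffer_energies, rank=2):
    """Select Emax(t) as the energy at a fixed rank of the sorted buffer.

    rank counts from the largest value (0 = largest). Choosing a rank
    slightly below the top is one way to ignore isolated energy spikes;
    the value 2 is an assumed placeholder.
    """
    if not buffer_energies:
        raise ValueError("frame buffer is empty")
    ordered = sorted(buffer_energies, reverse=True)  # energy value as key
    index = min(rank, len(ordered) - 1)  # clamp for short buffers
    return ordered[index]
```

For the buffer `[3.0, 9.0, 1.0, 7.0, 5.0]`, `rank=0` selects 9.0 and `rank=1` selects 7.0.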

The utterance section detection device according to claim 6, wherein the means for sequentially calculating the threshold value for each frame comprises
means for calculating the threshold value Eth1(t) for detecting the utterance start position at time t according to
Eth1(t) = b(t) + max(β, Emax(t) − b(t)) × first constant,
where β is a constant predetermined as the minimum dynamic range of the audio data signal.

The utterance section detection device according to claim 7, wherein the means for sequentially calculating the threshold value for each frame further comprises
means for calculating the threshold value Eth2(t) for detecting the utterance end position at time t according to
Eth2(t) = b(t) + max(β, Emax(t) − b(t)) × second constant,
where the second constant < the first constant, and
β is the constant predetermined as the minimum dynamic range of the audio data signal.
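
The two threshold formulas for the utterance start and end positions can be evaluated together, as in the following Python sketch. It is illustrative only: the function name `thresholds` and the parameter names `start_factor` and `end_factor` (standing for the first and second constants) are assumptions, and no particular values for the constants or for β are implied.

```python
def thresholds(noise, emax, beta, start_factor, end_factor):
    """Return (Eth1, Eth2) for one frame.

    noise        : b(t), the estimated background noise energy
    emax         : Emax(t), the estimated maximum of noise plus speech
    beta         : minimum dynamic range of the audio data signal
    start_factor : the first constant (utterance start)
    end_factor   : the second constant, required to be smaller
    """
    if not end_factor < start_factor:
        raise ValueError("second constant must be less than first constant")
    dynamic_range = max(beta, emax - noise)  # never below beta
    eth1 = noise + dynamic_range * start_factor
    eth2 = noise + dynamic_range * end_factor
    return eth1, eth2
```

With b(t) = 10, Emax(t) = 50, β = 20 and constants 0.5 and 0.25, the dynamic range is 40 and the thresholds are 30 and 20. If Emax(t) drops to 15, the range is floored at β = 20 and the thresholds become 20 and 15.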

The utterance section detection device according to any one of claims 1 to 8, further comprising speech energy normalization means for normalizing the speech energy of each frame using the larger of the maximum energy value of the audio data of the frames from the head of the utterance and a predetermined default reference value, and outputting the result as a speech feature parameter of each frame.

The utterance section detection device according to claim 9, wherein the speech energy normalization means comprises:
reference value storage means for storing a normalization reference value;
detecting means for detecting that the speech energy calculated by the frame energy calculation and storage means exceeds the reference value stored in the reference value storage means, and outputting a detection signal;
means for replacing the reference value stored in the reference value storage means with the value calculated by the frame energy calculation and storage means in response to the detection signal output by the detecting means; and
dividing means for normalizing the speech energy of the frame by dividing the speech energy value calculated by the frame energy calculation and storage means by the reference value stored in the reference value storage means.

The utterance section detection device according to claim 10, further comprising means for replacing the stored content of the reference value storage means with a predetermined default value in response to the utterance interval estimation means estimating the frame corresponding to the end position of the utterance section.
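
The reference-value handling recited in the two preceding claims (replace the reference when a frame exceeds it, divide by the reference, reset to the default at the end of the utterance) might be sketched in Python as follows. This is an assumption-laden sketch: the class name `EnergyNormalizer`, its method names, and the choice to update the reference before dividing the same frame are introduced here for illustration.

```python
class EnergyNormalizer:
    """Running-maximum energy normalization with an end-of-utterance reset."""

    def __init__(self, default_reference):
        if default_reference <= 0:
            raise ValueError("default reference must be positive")
        self.default_reference = default_reference
        self.reference = default_reference  # reference value storage

    def normalize(self, frame_energy):
        # Detection: the frame energy exceeds the stored reference,
        # so the reference is replaced by the frame energy.
        if frame_energy > self.reference:
            self.reference = frame_energy
        # Division by the (possibly updated) reference value.
        return frame_energy / self.reference

    def end_of_utterance(self):
        # Restore the default once the utterance end frame is estimated.
        self.reference = self.default_reference
```

Starting from a default reference of 4, frame energies 2, 8, 4 normalize to 0.5, 1.0, 0.5; after `end_of_utterance()` a frame of energy 2 again normalizes to 0.5.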

The utterance section detection device according to claim 10 or claim 11, further comprising means for setting the predetermined default value based on an option value given when the utterance section detection device is started.

A computer program for utterance section detection which, when executed by a computer, causes the computer to operate as the utterance section detection device according to any one of claims 1 to 12.