JPS61262798A

JPS61262798A - Voice section detector

Info

Publication number: JPS61262798A
Application number: JP60103973A
Authority: JP
Inventors: 松下　満次; 村田　隆憲; 辻田　和一郎
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1985-05-17
Filing date: 1985-05-17
Publication date: 1986-11-20

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a voice section detection device that, in a speech recognition device, accurately extracts voice sections from an input signal on which noise is superimposed.

(Prior Art) In a voice response recognition system that consists of a voice response device, a speech recognition device, and a line control device, and that provides services to many users over telephone lines through spoken dialogue, one of the most important and troublesome factors is the accurate extraction of voice sections for speech recognition. In the input signal of the speech recognition device, various kinds of ambient noise picked up from the speaker's surroundings, line noise, and so on are superimposed on the speech signal uttered by the speaker, and these fluctuate constantly.

Several methods have been proposed for accurately extracting voice sections under these conditions (see, for example, Japanese Unexamined Patent Application Publication No. S58-130395). In one of them, the feature parameters of an input signal containing only noise are obtained in advance, at a time when the speaker is not speaking, and the difference between these and the feature parameters of the input speech signal produced by the speaker's utterance is then computed. When this difference has stayed at or above a set threshold for a predetermined duration, that point is taken as the voice start; when it has stayed below the threshold for a predetermined duration, that point is taken as the voice end; and the span from the voice start to the voice end determined in this way is taken as the voice section. A voice section detection device in a conventional speech recognition device that applies this method is described below with reference to FIG. 4.

In this example, the difference between the noise feature parameters and the feature parameters of the input speech signal is typically computed as a Euclidean distance, a city-block distance, or the like. The description below covers the case where the Euclidean distance is used as the distance from the noise.
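The two distance measures named above can be sketched as follows. This is a minimal illustration, not code from the publication: the function names and the four-channel sample values are assumptions chosen only to make the arithmetic easy to follow.

```python
import math

def euclidean_distance(inpat_frame, npat):
    """Euclidean distance between one frame of the input pattern and the noise pattern."""
    return math.sqrt(sum((x - n) ** 2 for x, n in zip(inpat_frame, npat)))

def city_block_distance(inpat_frame, npat):
    """City-block (sum of absolute differences) distance for the same two patterns."""
    return sum(abs(x - n) for x, n in zip(inpat_frame, npat))

# Hypothetical 4-channel example: channel differences are 3, 4, 0, 0.
npat = [1.0, 1.0, 1.0, 1.0]
frame = [4.0, 5.0, 1.0, 1.0]
print(euclidean_distance(frame, npat))   # 5.0
print(city_block_distance(frame, npat))  # 7.0
```

Both measures collapse the n per-channel differences into one number per frame, so neither records how many channels contributed to the distance.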

FIG. 4 is a block diagram showing a speech recognition device that includes a conventional voice section detection device. In the figure, 10 is a feature parameter creation section, 11 is a voice section detection device, and 12 is a recognition section. The voice section detection device 11 consists of a noise distance calculation section 11a and a voice section detection section 11b.

The feature parameter creation section 10 performs frequency analysis on the input speech signal, for example with a bank of band-pass filters having n channels, and so represents the signal as a time series of n-dimensional feature parameters along the frequency axis.

First, the feature parameters of the input signal are extracted in advance, at a time when the speaker is not speaking, to obtain noise feature parameters that contain no uttered speech (hereinafter called the noise pattern and written NPAT(CH), CH = 1 to n). Once the noise feature parameters have been obtained, the feature parameters of the input speech signal are computed successively at each time FR at which the speaker is speaking (FR is a frame; one frame is normally several msec to several tens of msec). This time series of feature parameters of the input speech signal is hereinafter called the input pattern and written INPAT(FR, CH).
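The publication does not say how the noise pattern is derived from the frames captured while the speaker is silent. One simple possibility, shown here purely as an assumption, is to average each channel over those frames; the function name `estimate_noise_pattern` is hypothetical.

```python
def estimate_noise_pattern(silent_frames):
    """Per-channel mean over frames captured while nobody is speaking.

    silent_frames: list of frames, each a list of n channel values.
    Returns NPAT as a list of n values. Averaging is an assumption;
    the source only states that NPAT is obtained in advance.
    """
    n_frames = len(silent_frames)
    n_channels = len(silent_frames[0])
    return [
        sum(frame[ch] for frame in silent_frames) / n_frames
        for ch in range(n_channels)
    ]

# Two hypothetical silent frames with two channels each.
print(estimate_noise_pattern([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```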

The input pattern and the noise pattern obtained by the feature parameter creation section 10 are sent to the noise distance calculation section 11a of the voice section detection device 11.

The noise distance calculation section 11a computes the difference between the input pattern and the noise pattern. Because the input pattern naturally contains the noise components in addition to the speech signal, computing, for example, the Euclidean distance between the input pattern and the noise pattern yields only the speech signal component, that is, what remains after the noise component has been subtracted from the input speech signal for each frequency component. Specifically, the Euclidean distance between the noise and the input signal at a given time FR (hereinafter called the noise distance) is computed.

The noise distance NDIST(FR) obtained by the noise distance calculation section 11a is sent to the voice section detection section 11b.

The voice section detection section 11b detects the voice section of the input speech from the noise distance NDIST(FR). That is, when NDIST(FR) has stayed larger than a predetermined threshold for at least a set time, that point is taken as the voice start. After the start has been determined, when NDIST(FR) has stayed smaller than the threshold for at least a set time, that point is taken as the voice end. The span from the voice start to the voice end determined in this way is detected as the voice section.

A voice section signal indicating the voice section detected by the voice section detection section 11b is sent to the recognition section 12, which performs speech recognition.

(Problems to Be Solved by the Invention) However, the conventional voice section detection device configured as described above has the following problems.

Whether the voice section is found from a difference computed with the Euclidean distance, as above, or from a difference computed with the city-block distance, the noise pattern is fixed. In practice, however, noise has many sources of variation, so a fixed noise pattern can hardly be said to represent all the noise at every instant, and it is difficult to extract only the speech signal component from a difference against a fixed noise pattern.

As a result, noise components that cannot be removed by the difference against the fixed noise pattern are output as part of the noise distance. With the conventional speech recognition device, stationary noise picked up from the speaker's surroundings, such as the sound of a fan, can be removed by the noise pattern difference, but signal tones (beeps) emitted by office automation (OA) products such as personal computers, clicks arising in the line, and similar sounds cannot be removed, and such noise is also detected as a voice section.

The object of the present invention is to eliminate this drawback of the prior art, namely that noise components which a fixed noise pattern cannot remove cause voice sections to be detected erroneously, and to provide an excellent voice section detection device that detects voice sections stably and accurately.

(Means for Solving the Problems) The present invention relates to a voice section detection device that analyzes an input speech signal for each frame and each frequency channel and extracts voice sections based on the analysis result. To solve the problems of the prior art described above, it comprises difference detection means, sum-of-squares calculation means, counting means, coefficient creation means, multiplication means, and voice section determination means.

The difference detection means detects, for each channel, the difference between the input speech pattern and a noise pattern obtained in advance in a non-speech section. The sum-of-squares calculation means squares each per-channel difference detected by the difference detection means and sums the squares to give the distance between the input speech pattern and the noise pattern in each frame. The counting means compares each difference detected by the difference detection means with a first threshold and counts the number of channels in which the difference exceeds the threshold.

The coefficient creation means creates a coefficient for the frame concerned based on the count value from the counting means. The multiplication means multiplies the sum of squares obtained by the sum-of-squares calculation means by the coefficient created by the coefficient creation means to compute a corrected distance for each frame. The voice section determination means compares the corrected distance obtained by the multiplication means with a second threshold. It takes as the start point the time when the corrected distance has stayed larger than the second threshold for at least a predetermined time and, after the start point has been determined, takes as the end point the time when the corrected distance has for the first time stayed smaller than the second threshold for at least a predetermined time, thereby determining the voice section.

(Operation) The difference detection means detects the difference between the input speech pattern and the noise pattern for each frame and each frequency channel, and supplies the result to the sum-of-squares calculation means and the counting means. The sum-of-squares calculation means computes the distance between the input speech pattern and the noise pattern from the information supplied by the difference detection means. Meanwhile, the counting means counts the number of channels in which the difference exceeds the first threshold; this count represents the amount of variation of the noise in the frequency direction. The coefficient creation means creates a coefficient that takes this variation into account, and the multiplication means multiplies the distance obtained by the sum-of-squares calculation means by the coefficient to give a corrected distance. The corrected distance therefore reflects the fluctuation of the noise, and voice sections are extracted on the basis of it. The extracted voice sections are accordingly accurate, with the fluctuating part of the noise removed, and the problems of the prior art are solved.

(Embodiment) A voice section detection device according to an embodiment of the present invention is described in detail below.

FIG. 1 is a block diagram showing the configuration of the voice section detection device of this embodiment. The device consists of a noise distance calculation section 1 and a voice section detection section 2.

The noise distance calculation section 1 consists of a difference detection section 3, a counting section 4, a coefficient section 5, a sum-of-squares calculation section 6, and a multiplication section 7. FIG. 2 shows the input waveform of the voice section detection device and the output waveform of each section. In this embodiment, the distance between the input speech signal and the noise is expressed as the Euclidean distance between n-dimensional feature parameters.

The input pattern INPAT(FR, CH) is input to the difference detection section 3 of the noise distance calculation section 1, and the output of the difference detection section 3 is connected to the counting section 4 and the sum-of-squares calculation section 6.

The output of the counting section 4 is connected to the coefficient section 5, and the outputs of the sum-of-squares calculation section 6 and the coefficient section 5 are input to the multiplication section 7.

The output of the multiplication section 7 is input to the voice section detection section 2.

Turning to the operation, the noise pattern NPAT(CH) and the input pattern INPAT(FR, CH) are first input to the difference detection section 3. The difference output for each channel is $\mathrm{DEL}(\mathrm{CH}) = \mathrm{INPAT}(\mathrm{FR}, \mathrm{CH}) - \mathrm{NPAT}(\mathrm{CH})$, which gives n differences for CH = 1 to n.

Each difference output DEL(CH) is input to the counting section 4 and the sum-of-squares calculation section 6. The counting section 4 compares the difference output DEL(CH) with a preset threshold NLEV for each channel, counts the number N(FR) of channels in which the difference output is the larger, and sends N(FR) to the coefficient section 5. The count N(FR) thus represents the amount of variation of the noise in the frequency direction. The sum-of-squares calculation section 6 squares each difference output and computes the sum of the squares, so its output is $\mathrm{ND}(\mathrm{FR}) = \sum_{\mathrm{CH}=1}^{n} \mathrm{DEL}(\mathrm{CH})^{2}$. The coefficient section 5 computes the coefficient C(FR) from the count N(FR) by the formula $\mathrm{C}(\mathrm{FR}) = f(\mathrm{N}(\mathrm{FR}))$ (f: coefficient function) and outputs it. The multiplication section 7 multiplies the sum of squares ND(FR) by the coefficient C(FR) to obtain the noise distance NDIST(FR) and outputs it. That is, the noise distance is $\mathrm{NDIST}(\mathrm{FR}) = \mathrm{C}(\mathrm{FR}) \times \mathrm{ND}(\mathrm{FR})$. Here the coefficient function f is given a nonlinear characteristic, as shown in FIG. 3.

With this arrangement, when the count N(FR) is small, that is, when the input speech signal resembles the noise, the coefficient C(FR) takes a small value and the noise distance NDIST(FR) is output as a small value.
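The per-frame computation described above can be sketched as follows. The publication says only that the coefficient function f is nonlinear (FIG. 3) and gives no formula, so the quadratic `coefficient` below is a hypothetical stand-in. The first threshold is named `nlev` here, and the eight-channel sample frames are invented for illustration.

```python
def coefficient(count, n_channels):
    """Hypothetical nonlinear f: small when few channels exceed the threshold.

    The real shape of f is given only graphically in FIG. 3; the square of
    the channel ratio is an assumption made for this sketch.
    """
    return (count / n_channels) ** 2

def corrected_noise_distance(inpat_frame, npat, nlev):
    """Return NDIST(FR) = C(FR) * ND(FR) for one frame."""
    del_ch = [x - n for x, n in zip(inpat_frame, npat)]  # DEL(CH)
    count = sum(1 for d in del_ch if d > nlev)           # N(FR)
    nd = sum(d * d for d in del_ch)                      # ND(FR), sum of squares
    return coefficient(count, len(del_ch)) * nd          # NDIST(FR)

npat = [1.0] * 8
beep = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0, 1.0]    # one strong narrowband peak
speech = [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0]  # energy spread over 6 channels

print(corrected_noise_distance(beep, npat, nlev=1.0))    # 1.0   (ND = 64, N = 1)
print(corrected_noise_distance(speech, npat, nlev=1.0))  # 30.375 (ND = 54, N = 6)
```

In this made-up example the beep has the larger uncorrected sum of squares (64 against 54), yet after weighting by the channel count its corrected distance falls well below that of the broadband frame, which is the behavior the text attributes to a small N(FR).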

The noise distance NDIST(FR) obtained by the noise distance calculation section 1 is input to the voice section detection section 2, which compares NDIST(FR) with a preset voice section detection threshold VLEV. It outputs ON when NDIST(FR) has stayed larger than VLEV for at least a fixed time, and outputs OFF when NDIST(FR) has stayed below VLEV for at least a fixed time. The voice section is then detected by taking the start point and end point of the ON state as the start and end of the voice section.
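The ON/OFF decision with a minimum duration can be sketched as a small state machine over the per-frame noise distance. Several details are assumptions, since the text does not fix them: the hold time is expressed in frames (`hold_frames`), the start is backdated to the first frame of the qualifying run, the end is the last frame before the qualifying low run, and a section still open when the input ends is closed at the final frame.

```python
def detect_voice_sections(ndist, vlev, hold_frames):
    """Return a list of (start_frame, end_frame) pairs, both inclusive."""
    sections = []
    in_voice = False
    run = 0       # consecutive frames supporting a change of state
    start = None
    for fr, d in enumerate(ndist):
        if not in_voice:
            run = run + 1 if d > vlev else 0
            if run >= hold_frames:          # stayed above VLEV long enough: ON
                in_voice = True
                start = fr - run + 1        # assumption: backdate to run start
                run = 0
        else:
            run = run + 1 if d < vlev else 0
            if run >= hold_frames:          # stayed below VLEV long enough: OFF
                in_voice = False
                sections.append((start, fr - run))
                run = 0
    if in_voice:                            # assumption: close at end of input
        sections.append((start, len(ndist) - 1))
    return sections

# Hypothetical noise distances: a one-frame spike at frame 2, then an
# utterance over frames 5..10 containing a one-frame dip at frame 9.
ndist = [0, 0, 9, 0, 0, 8, 9, 7, 8, 0, 9, 0, 0, 0, 0]
print(detect_voice_sections(ndist, vlev=5, hold_frames=3))  # [(5, 10)]
```

The isolated spike never reaches the hold time and is ignored, and the brief dip inside the utterance does not end the section, which is the purpose of the duration condition.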

(Effects of the Invention) As described above, according to the present invention, the number of channels in which the per-channel difference between the input pattern and a noise pattern obtained in advance in a non-speech section exceeds a predetermined threshold is counted. The distance between the input speech pattern and the noise pattern (the noise distance) is multiplied by a coefficient that depends on this count, for example a coefficient that takes a small value when the count is small, to give a corrected distance, and voice sections are detected on the basis of that corrected distance. This makes it possible to remove the fluctuating part of the noise, in particular single-frequency noise with a strong peak at a specific frequency (such as signal tones emitted by a personal computer). Stable and accurate detection of voice sections can therefore be achieved.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the voice section detection device according to the present invention. FIG. 2 shows the input waveform of the device of that embodiment and the output waveform of each section. FIG. 3 shows an example of the coefficient function. FIG. 4 is a block diagram showing a speech recognition device that includes a conventional voice section detection device.

1: noise distance calculation section; 2: voice section detection section; 3: difference detection section; 4: counting section; 5: coefficient section; 6: sum-of-squares calculation section; 7: multiplication section.

Claims

[Claims] A voice section detection device that analyzes an input speech signal for each frame and each frequency channel and extracts a voice section based on the analysis result, comprising: difference detection means for detecting, for each channel, the difference between an input speech pattern and a noise pattern obtained in advance in a non-speech section; sum-of-squares calculation means for squaring each difference detected by the difference detection means and calculating the sum of the squares as the distance between the input speech pattern and the noise pattern in each frame; counting means for comparing each difference detected by the difference detection means with a first threshold and counting the number of channels in which the difference exceeds the first threshold; coefficient creation means for creating a coefficient based on the count value from the counting means; multiplication means for multiplying the output of the sum-of-squares calculation means by the coefficient created by the coefficient creation means to calculate a corrected distance in each frame; and voice section determination means for comparing the corrected distance obtained by the multiplication means with a second threshold, taking as a start point the time when the corrected distance has stayed larger than the second threshold for at least a predetermined time, and, after the start point has been determined, taking as an end point the time when the corrected distance has for the first time stayed smaller than the second threshold for at least a predetermined time, thereby determining the voice section.