JP2006209069A

JP2006209069A - Voice section detection device and program

Info

Publication number: JP2006209069A
Application number: JP2005211746A
Authority: JP
Inventors: Hiroaki Tagawa; 博章田川
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-12-28
Filing date: 2005-07-21
Publication date: 2006-08-10
Anticipated expiration: 2025-07-21
Also published as: JP4798601B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice section detection device that can efficiently detect a voice section with a comparatively small amount of computation in a noisy environment.

SOLUTION: The voice section detection device smooths the volume of the voice data to compute a first variation (S106 and S108), smooths the variation of the first variation to compute a second variation (S110 and S112), compares the second variation with a threshold value to determine, for each frame, whether the frame is voice (S114), and determines a voice section from the run lengths of the frames determined to be voice or non-voice (S116).

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to the configuration of a speech segment detection device and a speech segment detection program for detecting speech segments in sampled speech data.

One speech processing technology used in, for example, mobile communications is VOX (Voice Operated Transmitter). VOX switches the transmission signal output ON and OFF according to the presence or absence of speech. For example, a signal is transmitted only while speech is detected, and no signal is transmitted while the surroundings of the device are silent, which reduces the power consumed by the transmission unit (see, for example, Patent Document 1).
Patent Document 1: JP 2004-272052 A

However, with conventional methods the amount of computation tends to grow when speech segments are to be detected with high accuracy, and a method that detects speech segments efficiently with a relatively small amount of computation in a noisy environment has not necessarily been established.

The present invention has been made to solve the above problem, and its object is to provide a speech segment detection device and a speech segment detection program that can detect speech segments efficiently with a relatively small amount of computation in a noisy environment.

To achieve this object, the speech segment detection device of the present invention comprises: frame processing means for cutting frames out of sampled speech data; first variation calculation means for calculating the volume of the speech data as a first variation; second variation calculation means for calculating the variation of the first variation as a second variation; frame determination means for determining, for each frame, whether the frame is speech or non-speech by comparing the second variation with a predetermined threshold; and speech segment determination means for determining a speech segment based on the speech and non-speech determination results.

Preferably, the first variation calculation means smooths the volume of the speech data and calculates the result as the first variation.

Preferably, the second variation calculation means smooths the variation of the first variation and calculates the result as the second variation.

Preferably, the speech segment determination means determines the speech segment from the durations of the runs of frames determined to be speech and non-speech.

Preferably, the speech segment determination means excludes from the speech segments any run of frames determined to be speech that does not reach a predetermined duration.

Preferably, the speech segment determination means combines a non-speech segment that lies between two speech segments and is no longer than a predetermined duration with the speech segments on both sides to form a single speech segment.

According to another aspect of the present invention, a speech segment detection program causes a computer having an arithmetic processing unit, a speech input device, and a storage device to perform speech segment detection. The program causes the computer to execute: a step of cutting frames out of speech data sampled by the speech input device and stored in the storage device; a step in which the arithmetic processing unit calculates the volume of the speech data as a first variation; a step in which the arithmetic processing unit calculates the variation of the first variation as a second variation; a step in which the arithmetic processing unit determines, for each frame, whether the frame is speech or non-speech by comparing the second variation with a predetermined threshold; and a step in which the arithmetic processing unit determines a speech segment based on the speech and non-speech determination results.

Preferably, the step of calculating the first variation smooths the volume of the speech data and calculates the result as the first variation.

Preferably, the step of calculating the second variation smooths the variation of the first variation and calculates the result as the second variation.

Preferably, the step of determining the speech segment determines the speech segment from the durations of the runs of frames determined to be speech and non-speech.

Preferably, the step of determining the speech segment excludes from the speech segments any run of frames determined to be speech that does not reach a predetermined duration.

Preferably, the step of determining the speech segment combines a non-speech segment that lies between two speech segments and is no longer than a predetermined duration with the speech segments on both sides to form a single speech segment.

Embodiments of the present invention will be described below with reference to the drawings.
[Embodiment 1]
(System configuration of the present invention)
FIG. 1 is a conceptual diagram showing an example of the configuration of a speech segment detection device 1000 according to the present invention.

Referring to FIG. 1, the speech segment detection device 1000 comprises: a speech data sampling unit 102 that receives speech input, samples it, and converts it into digital data; a temporary storage unit 104 that temporarily stores the speech data sampled by the speech data sampling unit 102 for later processing; a calculation unit 106 that performs the calculations for speech segment detection on the speech data stored in the temporary storage unit 104; and a data storage unit 108 that stores the speech data that the calculation unit 106 has determined to be speech segments.

In the speech segment detection device 1000 shown in FIG. 1, the calculation unit 106 detects speech segments in order to decide whether to store data in the data storage unit 108. The speech segment detection method of the present invention is not limited to this case, however, and the detection result can also serve as a criterion for other processing, for example as preprocessing for speech processing, or to decide whether to transmit a speech signal as described above.

The calculation unit 106 includes a frame processing unit 1062, which performs frame processing on the speech data stored in the temporary storage unit 104 (applying a fixed window successively along the time series of the speech data), and a speech segment detection unit 1064, which determines for each frame whether it is speech or non-speech and detects speech segments.

Although not particularly limited, the speech data sampling unit 102 can be, for example, a well-known speech input system of a computer, and the functions of the calculation unit 106 can be implemented in software executed by the CPU (Central Processing Unit) of the computer.

The functions of the calculation unit 106 can of course also be implemented by dedicated hardware (a semiconductor integrated circuit).

FIG. 2 is a flowchart of the processing performed by the frame processing unit 1062 and the speech segment detection unit 1064 shown in FIG. 1, and FIG. 3 is a conceptual diagram illustrating the processing of the flowchart of FIG. 2.

The operation of the speech segment detection device 1000 of the present invention is described below with reference to FIGS. 2 and 3. In what follows, the speech segment detection algorithm of the present invention is called the "VSD (Variance Speech Detection) algorithm".

As described below, the VSD algorithm determines whether each frame is speech or non-speech by comparing the variation (amount of change) of the variation (power) of the speech signal with a threshold, and then determines the speech segments from the durations of the runs of frames determined to be speech and non-speech.

Referring to FIGS. 2 and 3, the speech data sampling unit 102 first samples the speech data (step S100). [The expression defining the sampled data is not reproduced in this text.]

Next, the frame processing unit 1062 cuts speech frames out of the sampled data (step S102). [The expression defining the frames is not reproduced in this text.]
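The frame extraction of step S102 can be sketched as follows. This is only an illustration: the function name `split_into_frames` and the parameters `frame_len` and `frame_shift` are hypothetical, the text reproduced here does not state the frame length or the frame shift, and dropping a trailing partial frame is an assumption.

```python
def split_into_frames(samples, frame_len, frame_shift):
    """Cut a sample sequence into fixed-length frames (cf. step S102).

    frame_len and frame_shift are given in samples. Both are assumed
    parameters, not values taken from the patent text. A trailing
    partial frame is dropped (also an assumption).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames
```

With `frame_shift` smaller than `frame_len` the frames overlap; with the two equal they tile the signal without overlap.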

The speech segment detection unit 1064 then applies, to each frame, a filtering process that emphasizes the high-frequency components (step S104). The function that performs this filtering is written FILTER(...).
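The text does not say which filter FILTER(...) is. One common way to emphasize high frequencies is a first-order pre-emphasis filter, y[n] = x[n] - a * x[n-1], and the sketch below uses it purely as an assumed stand-in. The name `pre_emphasis` and the default coefficient 0.97 are illustrative choices, not values from the patent.

```python
def pre_emphasis(frame, coeff=0.97):
    """Assumed stand-in for FILTER(...): first-order pre-emphasis.

    y[0] = x[0], y[n] = x[n] - coeff * x[n-1] for n >= 1.
    The filter type and the coefficient are assumptions.
    """
    if not frame:
        return []
    out = [float(frame[0])]
    for prev, cur in zip(frame, frame[1:]):
        out.append(cur - coeff * prev)
    return out
```

A constant (low-frequency) input is strongly attenuated after the first sample, while rapid sample-to-sample changes pass through, which is the high-frequency emphasis the step calls for.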

For each frame that has been high-frequency emphasized in this way, the speech segment detection unit 1064 calculates the first variation ν_f of the speech (step S106). [The defining equation is not reproduced in this text.] The first variation represents the "spread" of the speech data (that is, the loudness (volume) of the sound, corresponding to power): its value is large for a loud sound and small for a quiet sound. The function that computes this variation is written VARIANCE(...).

The first variation is not limited to the quantity described above, which corresponds to the sum of the absolute differences between each sampled speech signal value and the mean value; for example, it may instead correspond to the sum of the squares of those differences. In other words, any other function may be used as long as it yields a large value for the signal sequence of a loud sound and a small value for the signal sequence of a quiet sound.
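Based on the description above (a quantity corresponding to the sum of absolute differences from the mean, or alternatively the sum of squared differences), VARIANCE(...) might be sketched as follows. The function name `first_variation` and the `squared` flag are hypothetical, and whether the sum is normalized by the frame length is not stated in the text; the sketch leaves it unnormalized.

```python
def first_variation(frame, squared=False):
    """Sketch of VARIANCE(...) for one frame.

    Returns the sum of |x - mean| over the frame, or the sum of
    (x - mean)**2 when squared=True. Normalization by the frame
    length is omitted here, which is an assumption.
    """
    n = len(frame)
    if n == 0:
        return 0.0
    mean = sum(frame) / n
    if squared:
        return sum((x - mean) ** 2 for x in frame)
    return sum(abs(x - mean) for x in frame)
```

Either variant grows with the amplitude of the frame, which is the only property the description requires of the function.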

The speech segment detection unit 1064 then applies median smoothing to the first variation ν_f, taking the median over a smoothing window of length M, to obtain the smoothed first variation (step S108). [The defining equation is not reproduced in this text.]
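Median smoothing over a window can be sketched as below. The name `median_smooth` is hypothetical, and two details are assumptions because the equation is not available here: the window is centered on the current frame, and it is truncated at the beginning and end of the sequence. An odd window length is assumed.

```python
from statistics import median


def median_smooth(values, window):
    """Median smoothing with a centered window (cf. steps S108, S112).

    `window` is the smoothing window length (M or L in the text) and
    is assumed to be odd. Centering and edge truncation are
    assumptions of this sketch.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed
```

Unlike a moving average, the median discards an isolated spike entirely instead of spreading it over neighboring frames.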

From the smoothed first variation obtained in this way, the speech segment detection unit 1064 further calculates the variation of the speech variation, that is, the second variation w_f (step S110). [The defining equation is not reproduced in this text.] The second variation represents the "spread" of the loudness (that is, the amount of change in power): its value becomes larger the more the volume rises and falls, and is small when the volume does not change.
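Since the equation for w_f is not available here, the following is only one plausible reading: the same deviation measure used for the first variation is applied to the smoothed first-variation values of the most recent frames. The name `second_variation` and the parameter `span` (the number of frames considered) are hypothetical.

```python
def second_variation(smoothed_first, span):
    """Assumed form of the second variation w_f.

    For each frame f, takes the last `span` smoothed first-variation
    values up to and including f and returns the sum of their absolute
    differences from their mean. The window shape is an assumption.
    """
    if span < 1:
        raise ValueError("span must be at least 1")
    result = []
    for f in range(len(smoothed_first)):
        window = smoothed_first[max(0, f - span + 1):f + 1]
        mean = sum(window) / len(window)
        result.append(sum(abs(v - mean) for v in window))
    return result
```

A constant volume yields zero regardless of how loud it is, while a volume that alternates yields a large value, matching the behavior described for the second variation.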

The speech segment detection unit 1064 then applies median smoothing to the second variation, taking the median over a smoothing window of length L, to obtain the smoothed second variation (step S112). [The defining equation is not reproduced in this text.]

The speech segment detection unit 1064 compares the "smoothed second variation" obtained in this way with a predetermined threshold H to make a speech/non-speech determination for each frame (step S114). [The comparison expression is not reproduced in this text.] An appropriate value of the threshold H is determined in advance by experiment.
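The per-frame decision of step S114 reduces to a comparison with H. In the sketch below, the name `classify_frames` is hypothetical, and treating a value exactly equal to the threshold as speech is an assumption.

```python
def classify_frames(smoothed_second, threshold):
    """Per-frame speech/non-speech decision (cf. step S114).

    Returns True (speech) where the smoothed second variation is at
    or above the threshold H. The handling of equality is assumed.
    """
    return [value >= threshold for value in smoothed_second]
```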

After this preliminary frame-by-frame determination of speech and non-speech, the speech segment detection unit 1064 determines the speech segments from the durations of the speech and non-speech frame runs according to the following conditions (step S116).

That is, the final speech segments are determined by applying the following conditions to the provisional speech segments obtained by the threshold comparison.

Condition (1): A speech segment shorter than the minimum required duration is not accepted as a speech segment. The minimum required duration is not particularly limited; for example, it can be set to 100 msec.

Condition (2): A non-speech segment that lies between speech segments and is short enough to be treated as part of continuous speech is combined with the speech segments on both sides into a single speech segment. The duration limit is not particularly limited; for example, a non-speech segment of 500 msec or less can be treated this way.

Condition (3): A fixed number of frames at the start and end of a speech segment, which were determined to be non-speech because their variation values are small, are added to the speech segment. The fixed number can be, for example, 97 frames.
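Conditions (1) to (3) can be sketched as a post-processing pass over the per-frame decisions. The names `runs_of_speech` and `decide_segments` and the parameters `min_speech`, `max_gap`, and `pad` (all in frames) are hypothetical. The text does not state the order in which the conditions are applied; the sketch applies them in the order listed, which is an assumption.

```python
def runs_of_speech(flags):
    """Return (start, end) frame-index pairs, end exclusive, for runs of True."""
    segments = []
    start = None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments


def decide_segments(flags, min_speech, max_gap, pad):
    """Apply conditions (1)-(3) in the listed order (order is assumed)."""
    segments = runs_of_speech(flags)
    # Condition (1): reject speech runs shorter than the minimum duration.
    segments = [(s, e) for s, e in segments if e - s >= min_speech]
    # Condition (2): bridge non-speech gaps no longer than max_gap.
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # Condition (3): extend both ends by `pad` frames, clipped to the signal.
    # Padded segments that end up overlapping are not re-merged here.
    total = len(flags)
    return [(max(0, s - pad), min(total, e + pad)) for s, e in merged]
```

To use the example durations from the text, `min_speech` and `max_gap` would be the frame counts corresponding to 100 msec and 500 msec for whatever frame shift is in use.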

In the above description, median smoothing has been used as an example of the smoothing process, but other smoothing methods may be used.

The processing described above makes it possible to detect speech segments efficiently with a relatively small amount of computation in a noisy environment.

The feature of speech that the VSD algorithm uses to distinguish speech from non-speech is one characteristic of spoken language: the volume (power) changes continually over relatively short time units. Focusing on this feature, the VSD algorithm uses the variation of the speech variation to extract the amount of change in power.

The VSD algorithm can detect speech segments efficiently under noise because, in silence and in environmental noise, the volume usually varies relatively little: it is nearly constant or changes slowly. This is the opposite of the speech feature that the VSD algorithm looks for, so relatively stationary noise can be distinguished from speech regardless of its volume. Noise whose volume changes sharply, on the other hand, often lasts for a shorter time than speech, and can be separated from speech segments by the duration-based segment decision.

FIGS. 4 to 7 show how the variations calculated by the VSD algorithm change over time for the utterance 「あー」 ("aa").

FIG. 4 shows the first variation, FIG. 5 the smoothed first variation, FIG. 6 the second variation, and FIG. 7 the smoothed second variation. In each figure the vertical axis represents intensity and the horizontal axis represents time.

For the utterance 「あー」, the second variation decays markedly in the steady-state portion of the long vowel. In silence the smoothed second variation is almost 0, so applying a fixed threshold to the second variation identifies the frames that belong to a speech segment.

However, because the second variation decays markedly in the steady-state portion of the long vowel, conditions (1) to (3) above are additionally applied so that the speech segment is detected correctly.

FIGS. 8 to 11 show how the variations calculated by the VSD algorithm change over time for the utterance 「あいかわらず」 ("aikawarazu", the sample used in FIG. 3).

FIG. 8 shows the first variation, FIG. 9 the smoothed first variation, FIG. 10 the second variation, and FIG. 11 the smoothed second variation. In each figure the vertical axis represents intensity and the horizontal axis represents time.

For the utterance 「あいかわらず」, the second variation decays markedly in the steady-state portion of the long vowel. Except near the end of the word, applying a fixed threshold to the smoothed second variation identifies the frames that belong to a speech segment.

Here too, however, the second variation decays near the end of the word, so the speech segment can be detected correctly by applying conditions (1) to (3) above.
[Embodiment 2]
Embodiment 2 describes the configuration of a speech segment analysis device 2000 that uses the configuration of the speech segment detection device 1000 described in Embodiment 1 to display the analysis results of an input speech signal to the user, and that provides an interface through which the user can set the operating parameters and other settings of the speech segment detection device.

FIG. 12 is a functional block diagram illustrating the configuration of the speech segment analysis device 2000 of Embodiment 2.

In FIG. 12, parts identical to those in FIG. 1 are given the same reference numerals.
Referring to FIG. 12, the speech segment analysis device 2000 comprises: a speech data sampling unit 102 that receives speech input from a microphone (not shown) via an input/output interface (hereinafter "input/output I/F") 101, samples it, and converts it into digital data; a temporary storage unit 104 that temporarily stores the speech data sampled by the speech data sampling unit 102 for later processing; a calculation unit 106 that performs the calculations for speech segment detection on the speech data stored in the temporary storage unit 104; a data storage unit 108 that stores the speech data in association with the calculation unit 106's determination results for the speech segments; an operation unit 120 for entering instructions from the user; and a D/A converter 110 that, under the control of the calculation unit 106, converts the speech data stored in the data storage unit 108 into an analog speech signal and outputs it to a speaker (not shown) via the input/output I/F 101. Although not particularly limited, the operation unit 120 includes a keyboard and a mouse.

The calculation unit 106 includes: a control processing unit 1060 that controls the operation of the speech segment analysis device 2000 based on instructions from the operation unit 120; a frame processing unit 1062 that performs frame processing on the speech data stored in the temporary storage unit 104 (applying a fixed window successively along the time series of the speech data); and a speech segment detection unit 1064 that determines for each frame whether it is speech or non-speech, detects speech segments, and stores label information indicating the speech segments in association with the speech data. Based on instructions from the operation unit 120, the control processing unit 1060 starts and stops recording of the speech input signal, starts and stops playback output of a speech signal based on the speech data stored in the data storage unit 108, sets the operating parameters of the frame processing unit 1062 and the speech segment detection unit 1064, and performs similar processing.
(Label file output function)
The function of the frame processing unit 1062 is described further below.

First, in the speech segment analysis device 2000, the frame processing unit 1062 has a function of calculating, for each frame, the time elapsed since the frame processing unit started processing, based on the number of frames processed so far, and outputting it.

Accordingly, the control processing unit 1060 performs the following processing based on the detection results of the speech segment detection unit 1064.

1) The control processing unit 1060 outputs the elapsed time of the frame determined to be the start position of a speech segment as the start time of that speech segment.

2) The control processing unit 1060 outputs the elapsed time of the frame determined to be the end position of a speech segment as the end time of that speech segment.

The control processing unit 1060 stores these speech segment start and end times in the data storage unit 108 as a label file, in association with the speech data file.

Although not particularly limited, the label file can use, for example, the following format:
<start time [msec]> <label indicating that this time interval is a speech segment> <end time [msec]>
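As an illustration of the label file output, the sketch below converts frame indices to elapsed times and formats one line per speech segment in the order start time, label, end time. The names `frame_to_msec` and `label_lines`, the label string "speech", and the assumption that elapsed time equals the frame index multiplied by a fixed frame shift in milliseconds are all hypothetical.

```python
def frame_to_msec(frame_index, frame_shift_msec):
    """Elapsed time of a frame, assuming a fixed frame shift in msec."""
    return frame_index * frame_shift_msec


def label_lines(segments, frame_shift_msec, label="speech"):
    """Format (start_frame, end_frame) pairs as label-file lines.

    Each line is "<start msec> <label> <end msec>". The label text
    and the frame-to-time conversion are assumptions of this sketch.
    """
    lines = []
    for start_frame, end_frame in segments:
        start = frame_to_msec(start_frame, frame_shift_msec)
        end = frame_to_msec(end_frame, frame_shift_msec)
        lines.append(f"{start} {label} {end}")
    return lines
```

For example, with a 10 msec frame shift, the segment from frame 1 to frame 10 is written as `10 speech 100`.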
Although this is also not particularly limited, the functions of the calculation unit 106 can be implemented by application software executed by the CPU (Central Processing Unit) of a computer. In what follows, the software that implements these functions is called the "audio recording and audition application with speech segment detection function". Such application software can be installed and run on a general personal computer or the like, provided that hardware for audio capture and audio output is installed.

In this case, if, for example, the data storage unit 108 is a hard disk and the temporary storage unit 104 is a RAM (Random Access Memory), the application software executed by the calculation unit 106 is stored on a recording medium, read into the personal computer by a drive device (not shown), and stored on the hard disk.
(Audio recording and audition application with speech segment detection function: basic screen)
Next, the "audio recording and audition application with speech segment detection function" mentioned above is described.

FIG. 13 illustrates the basic screen of the "audio recording and audition application with speech segment detection function" output on the display device 140.

In the initial state, nothing is displayed in the speech waveform display window 1410. When the "start recording" button on the screen is clicked in this state by operating the mouse of the operation unit 120, the control processing unit 1060 starts reading speech waveform data from a speech input device such as a microphone.

Then, as shown in FIG. 13, the control processing unit 1060 causes the display unit 140 to show the speech waveform data that has been read in the speech waveform display window 1410. As for the display method, 1) all of the speech waveform data that has been read may be displayed at once after the "stop recording" button is clicked, or 2) the data may be displayed incrementally from the right edge of the window at predetermined intervals, starting as soon as the "start recording" button is clicked and reading begins.

The calculation unit 106 passes the speech waveform data read from the temporary storage unit 104 to the speech segment detection unit 1064. As for the timing, 1) all of the speech waveform data that has been read may be read out of the temporary storage unit 104 and passed at once after the "stop recording" button is clicked, or 2) the data may be passed incrementally at predetermined intervals, starting as soon as the "start recording" button is clicked and reading begins.

音声波形表示窓１４１０中のレベルメータ１４２０には、しきい値と比較されて音声／非音声判定の基準値となるスムージングされた第２変動を可視化して表示する。レベルメータ中の下から１／３程度の箇所に「しきい値バー」が表示される。しきい値以上の場合と以下の場合で表示色が変更される。 The level meter 1420 in the voice waveform display window 1410 visualizes and displays the smoothed second fluctuation that is compared with a threshold value and becomes a reference value for voice / non-voice judgment. A “threshold bar” is displayed at a position about 1/3 from the bottom in the level meter. The display color is changed between the case where the threshold is exceeded and the case where

レベルメータ１４２０は、録音時に音声区間検出部１０６４へ逐次的に音声波形データを伝送し、かつ制御処理部１０６０が音声区間検出部１０６４から逐次的にスムージングされた第２変動値を受け取った場合に有効になる。 The level meter 1420 becomes active when, during recording, audio waveform data is transmitted sequentially to the voice section detection unit 1064 and the control processing unit 1060 sequentially receives the smoothed second variation values from the voice section detection unit 1064.

レベルメータ１４２０は、音声波形データの再生時にも有効になる。再生時に可視化して「レベルメータ」に表示するスムージングされた第２変動値は、１)音声区間検出処理実行時にあらかじめデータ格納部１０８に保持しておいたものを再生と同期して表示しても良いし、２)再生と同期して音声区間検出部１０６４が逐次的に音声区間検出処理を再実行したものを制御処理部１０６０が受け取ったものを表示しても良い。 The level meter 1420 is also effective when audio waveform data is reproduced. The smoothed second variation value that is visualized and displayed on the “level meter” during playback is as follows: 1) The data stored in the data storage unit 108 at the time of executing the voice segment detection process is displayed in synchronization with the playback. Alternatively, 2) what is received by the control processing unit 1060 after the voice section detection unit 1064 sequentially re-executed the voice section detection processing in synchronization with the reproduction may be displayed.

制御処理部１０６０は、データ格納部１０８を経由して音声区間検出部１０６４から音声区間検出結果を受け取る。受け取るタイミングは、１)音声区間検出処理が終了後、全ての音声区間情報を一度に受け取っても良いし、２)フレーム毎に音声／非音声の判定結果を受け取りながら、音声区間の開始／終了情報を逐次的に受け取っても良い。 The control processing unit 1060 receives the voice segment detection result from the voice segment detection unit 1064 via the data storage unit 108. The timing of reception is as follows: 1) After the voice segment detection process is completed, all the voice segment information may be received at once. 2) The voice segment start / end is received while receiving the voice / non-voice judgment result for each frame. Information may be received sequentially.

制御処理部１０６０は、音声区間検出部１０６４から受け取った音声区間情報を、音声波形表示窓１４１０に表示する。表示方法は、１)音声区間の開始／終了位置を表示するだけでも良いし、２)フレーム毎に判定された音声／非音声の情報を背景色を変更するなどの方法で表示しても良い。 The control processing unit 1060 displays the speech segment information received from the speech segment detection unit 1064 in the speech waveform display window 1410. The display method may be 1) display only the start / end position of the speech section, or 2) display the speech / non-speech information determined for each frame by changing the background color. .
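
The relationship between the per-frame speech/non-speech results and the start/end positions of speech sections can be illustrated with a short sketch. The function name `frames_to_sections` and the convention that the end index is exclusive are assumptions made for illustration.

```python
def frames_to_sections(flags):
    """Convert per-frame speech flags into (start_frame, end_frame) pairs.

    `flags` is a sequence of truthy (speech) / falsy (non-speech) values,
    one per frame. The end index is exclusive.
    """
    sections = []
    start = None
    for index, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = index                     # a speech section begins here
        elif not is_speech and start is not None:
            sections.append((start, index))   # the section ended on the previous frame
            start = None
    if start is not None:
        sections.append((start, len(flags)))  # section still open at end of data
    return sections
```

Either representation can drive the display: the pairs give the start/end markers, and the flags give the per-frame background color.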

制御処理部１０６０は、録音停止ボタンがクリックされると、マイクなどの音声入力デバイスから音声波形データの読み込みを停止する。さらに、制御処理部１０６０は、再生ボタンがクリックされると、読み込んだ音声波形データをスピーカなどの音声出力デバイスへ出力して再生する。 When the recording stop button is clicked, the control processing unit 1060 stops reading audio waveform data from an audio input device such as a microphone. Further, when the playback button is clicked, the control processing unit 1060 outputs the read audio waveform data to an audio output device such as a speaker and reproduces it.

なお、制御処理部１０６０は、音声波形データを再生する場合は、動的に波形中の再生されている位置を、音声波形表示窓１４１０に色の変化等により表示する。 Note that when the audio waveform data is reproduced, the control processing unit 1060 dynamically displays the reproduction position in the waveform on the audio waveform display window 1410 by a color change or the like.

また、マウス、あるいは他の指示入力デバイスを用いて、音声波形表示窓１４１０の中で任意の区間を（選択したい区間の先頭でマウスの左ボタンをクリックして選択したい区間の終端までドラッグしたのちリリースするなどの方法で）選択した上で、さらに「再生ボタン」をクリックした場合は、選択区間のみ再生される。音声波形表示窓１４１０中の区間選択は録音が終了（停止）するまで操作することはできない。音声波形表示窓１４１０中で選択区間解除操作（マウスの左ボタンクリックなど）を行うと選択区間を解除できる。 In addition, when an arbitrary section in the audio waveform display window 1410 is selected with the mouse or another pointing device (for example, by pressing the left mouse button at the beginning of the desired section, dragging to its end, and releasing) and the "play button" is then clicked, only the selected section is played back. Sections in the audio waveform display window 1410 cannot be selected until recording has ended (stopped). A selection can be cleared by performing a deselect operation (such as a left mouse click) in the audio waveform display window 1410.

マウス、あるいは他の指示入力デバイスを用いて音声波形表示窓１４１０の中で選択された任意の区間において、マウス等を用いて（マウスの右ボタンをクリックするなどの方法で）「メニュー画面」を呼び出すことで、選択区間に対して再生や保存などの操作ができる。 Within an arbitrary section selected in the audio waveform display window 1410 with the mouse or another pointing device, calling up the "menu screen" (for example, by clicking the right mouse button) allows operations such as playback and saving to be performed on the selected section.

さらに、選択された区間が無い状態の音声波形表示窓１４１０中で「音声区間開始位置」と「音声区間終了位置」で挟まれた音声区間において、マウスなどの指示入力デバイスを用いて「メニュー画面」を呼び出すことで、音声区間に対して再生や保存などの操作ができる。音声波形表示窓１４１０中の音声区間でのメニュー表示は録音が終了（停止）するまで、および音声区間検出処理が終了するまで呼び出すことはできない。 Further, when no section is selected in the audio waveform display window 1410, calling up the "menu screen" with a pointing device such as the mouse inside a speech section bounded by a "speech section start position" and a "speech section end position" allows operations such as playback and saving to be performed on that speech section. The menu for a speech section in the audio waveform display window 1410 cannot be called up until recording has ended (stopped) and the speech section detection processing has finished.

「設定ボタン」がクリックされると、制御処理部１０６０は、音声区間検出部１０６４の各種パラメータの設定と、後に説明する各変動値表示窓の表示／非表示を設定するための「設定画面」を呼び出す。 When the "setting button" is clicked, the control processing unit 1060 calls up a "setting screen" for setting the various parameters of the voice section detection unit 1064 and for choosing whether each variation value display window, described later, is shown or hidden.

また、制御処理部１０６０は、「音声区間検出ボタン」がクリックされると、録音されてデータ格納部１０８に格納された音声波形データを音声区間検出部１０６４に伝送して、音声区間検出処理を再実行する。「音声区間検出ボタン」は録音が終了（停止）するまで操作することはできない。 In addition, when the “voice section detection button” is clicked, the control processing unit 1060 transmits the voice waveform data recorded and stored in the data storage unit 108 to the voice section detection unit 1064 to perform voice section detection processing. Try again. The “voice section detection button” cannot be operated until the recording ends (stops).

制御処理部１０６０は、「時間情報保存ボタン」がクリックされると、音声区間検出部１０６４から受け取った音声区間開始／終了位置情報を、録音の開始時刻を基準とした経過時間に変換して、音声区間の開始／終了時間ファイルとして保存する。「時間情報保存ボタン」は音声区間検出処理が終了するまで操作することはできない。 When the “time information save button” is clicked, the control processing unit 1060 converts the voice segment start / end position information received from the voice segment detection unit 1064 into an elapsed time with reference to the recording start time, Save as voice segment start / end time file. The “time information save button” cannot be operated until the voice section detection process is completed.
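
The conversion from speech section start/end positions to elapsed time from the start of recording can be sketched as below. The description does not state the frame shift, the sampling rate, or the file layout, so the parameters `frame_shift_samples` and `sample_rate_hz` and the tab-separated text format are assumptions chosen for illustration.

```python
def sections_to_elapsed_seconds(sections, frame_shift_samples, sample_rate_hz):
    """Convert (start_frame, end_frame) pairs to (start_sec, end_sec) pairs.

    Times are measured from the first recorded sample, i.e. from the start
    of recording.
    """
    return [
        (start * frame_shift_samples / sample_rate_hz,
         end * frame_shift_samples / sample_rate_hz)
        for start, end in sections
    ]


def format_time_file(times):
    """Render one 'start<TAB>end' line per speech section (hypothetical format)."""
    return "\n".join(f"{start:.3f}\t{end:.3f}" for start, end in times)
```

For example, with a 160-sample frame shift at 16 kHz, a section spanning frames 10 to 50 corresponds to 0.100 s through 0.500 s after the start of recording.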

さらに、制御処理部１０６０は、「音声区間保存ボタン」がクリックされると、検出された全ての音声区間中の音声波形データを保存する。「音声区間保存ボタン」は音声区間検出処理が終了するまで操作することはできない。また、制御処理部１０６０は、「録音音声保存ボタン」がクリックされると、録音された全ての音声波形データを保存する。「録音音声保存ボタン」は録音が終了（停止）するまで操作することはできない。
（音声区間検出機能付き音声収録試聴アプリケーション：設定画面）
図１４は、図１３で説明した基本画面（または各変動値表示画面）の「設定ボタン」がクリックされると、呼び出される設定画面を示す図である。 Furthermore, when the “speech section saving button” is clicked, the control processing unit 1060 stores the speech waveform data in all detected speech sections. The “voice section save button” cannot be operated until the voice section detection process is completed. In addition, when the “recorded sound storage button” is clicked, the control processing unit 1060 stores all recorded sound waveform data. The “recorded audio save button” cannot be operated until the recording ends (stops).
(Audio recording audition application with voice segment detection function: setting screen)
FIG. 14 is a diagram illustrating a setting screen that is called when the “setting button” on the basic screen (or each variation value display screen) described in FIG. 13 is clicked.

図１４に示すとおり、初期状態ではあらかじめ保持する所定の値がデフォルトとして設定されている。 As shown in FIG. 14, in the initial state, a predetermined value stored in advance is set as a default.

ユーザにより、操作部１２０から値が入力変更された後、「ＯＫボタン」がクリックされると、制御処理部１０６０は、保持する設定値を入力された値に変更して、設定画面を閉じて、基本画面（または各変動値表示画面）へ戻る。なお、値の変更があってもなくても、「キャンセル（Cancel）ボタン」がクリックされると、保持する設定値を変更せずに、設定画面を閉じて、基本画面（または各変動値表示画面）へ戻る。
（音声区間検出機能付き音声収録試聴アプリケーション：拡張画面（１））
図１５は、上記設定画面において、「スムージングされた第２変動の表示」を「表示する」に設定された場合、表示装置１４０に表示される第１の拡張画面を示す図である。第１の拡張画面では、「スムージングされた第２変動としきい値の表示」が表示される。なお、第１の各校画面では、「スムージングされた第２変動表示窓」が表示されること以外は基本画面の動作と同様であるので、以下では、相違点を説明する。 When the user changes the input value from the operation unit 120 and then clicks the “OK button”, the control processing unit 1060 changes the setting value to be held to the input value, and closes the setting screen. Return to the basic screen (or each fluctuation value display screen). If the “Cancel” button is clicked regardless of whether the value has been changed, the setting screen is closed without changing the setting value to be held, and the basic screen (or each variable value display) is displayed. Screen).
(Audio recording audition application with audio section detection function: extended screen (1))
FIG. 15 is a diagram showing the first extended screen displayed on the display device 140 when "display of smoothed second variation" is set to "display" on the setting screen. The first extended screen shows the smoothed second variation together with the threshold value. The first extended screen operates in the same way as the basic screen except that the "smoothed second variation display window" is displayed, so only the differences are described below.

制御処理部１０６０は、初期状態では、スムージングされた第２変動表示窓１４３０には「しきい値」のみを表示させる。 In the initial state, the control processing unit 1060 displays only the “threshold value” in the smoothed second variation display window 1430.

制御処理部１０６０は、音声区間検出部１０６４からスムージングされた第２変動値を受け取ると、これをスムージングされた第２変動表示窓１４３０に表示する。表示方法は、１)音声区間検出処理が終了した後で一度に表示しても良いし、２)音声区間検出処理が逐次的に実行されている場合は、音声区間検出処理と同期して逐次的に表示しても良い。なお、音声区間検出処理が再実行された場合は、スムージングされた第２変動表示窓１４３０の表示内容も更新される。
（音声区間検出機能付き音声収録試聴アプリケーション：拡張画面（２））
図１６は、設定画面において、「第１変動の表示」、「スムージングされた第１変動の表示」、「第２変動の表示」、「スムージングされた第２変動の表示」のいずれもが「表示する」に設定された場合の第２の拡張画面を示す図である。つまり、第２の拡張画面では、「全ての変動値の表示」が表示される。 Upon receiving the smoothed second variation value from the speech section detection unit 1064, the control processing unit 1060 displays this on the smoothed second variation display window 1430. The display method may be 1) display at a time after the speech segment detection process is completed, or 2) when the speech segment detection process is sequentially performed, sequentially in synchronization with the speech segment detection process May be displayed. Note that, when the voice section detection process is performed again, the display content of the smoothed second variation display window 1430 is also updated.
(Audio recording audition application with audio section detection function: extended screen (2))
FIG. 16 is a diagram showing the second extended screen displayed when "display of first variation", "display of smoothed first variation", "display of second variation", and "display of smoothed second variation" are all set to "display" on the setting screen. That is, the second extended screen displays all of the variation values.

なお、変動値の表示は、設定画面にも示したとおり、必要なものを任意に選択して表示させることが可能である。「第１変動表示窓」「スムージングされた第１変動表示窓」「第２変動表示窓」「スムージングされた第２変動表示窓」が表示されること以外は、原則として、基本画面の動作と同様である。 As shown on the setting screen, any of the variation values can be freely selected for display. Except that the "first variation display window", "smoothed first variation display window", "second variation display window", and "smoothed second variation display window" are displayed, the operation is in principle the same as that of the basic screen.

つまり、初期状態では「第１変動の表示」「スムージングされた第１変動の表示」「第２変動の表示」には何も表示されていない。「スムージングされた第２変動表示窓」には「しきい値」のみが表示される。 That is, in the initial state, nothing is displayed in “display of first variation”, “display of smoothed first variation”, and “display of second variation”. Only the “threshold value” is displayed in the “smoothed second variation display window”.

さらに、制御処理部１０６０は、音声区間検出部１０６４から受け取った各変動値を各変動表示窓１４３０〜１４６０に表示する。表示方法は、１)音声区間検出処理が終了した後で一度に表示しても良いし、２)音声区間検出処理が逐次的に実行されている場合は、音声区間検出処理と同期して逐次的に表示しても良い。さらに、音声区間検出処理が再実行された場合は、各変動表示窓１４３０〜１４６０の表示内容も更新される。 Further, the control processing unit 1060 displays each variation value received from the voice segment detection unit 1064 in each variation display window 1430 to 1460. The display method may be 1) display at a time after the speech segment detection process is completed, or 2) when the speech segment detection process is sequentially performed, sequentially in synchronization with the speech segment detection process May be displayed. Further, when the voice section detection process is re-executed, the display content of each of the fluctuation display windows 1430 to 1460 is also updated.

このような構成により、実施の形態２の音声区間解析装置２０００は、録音された音声データについて、音声区間の検出処理を柔軟に実行しつつ、音声の解析を行なうことが可能である。
［実施の形態３］
次に、実施の形態３では、実施の形態１で説明した音声区間検出装置を、この音声区間検出装置に後続して接続される後続音声処理装置において利用する形態を説明する。 With such a configuration, the speech segment analysis apparatus 2000 according to Embodiment 2 can perform speech analysis while flexibly executing speech segment detection processing on recorded speech data.
[Embodiment 3]
Next, in the third embodiment, a description will be given of a mode in which the speech segment detection device described in the first embodiment is used in a subsequent speech processing device connected subsequent to the speech segment detection device.

（接続方式１）
まず、図１７は、第１の接続方式を説明するための機能ブロック図である。実施の形態１と同一部分には、同一符号を付す。 (Connection method 1)
First, FIG. 17 is a functional block diagram for explaining the first connection method. The same parts as those in the first embodiment are denoted by the same reference numerals.

図１７では、音声データサンプリング部１０２、一時記憶部１０４、フレーム処理部１０６２については、音声区間検出装置の音声区間検出部１０６４と後続音声処理装置の音声処理部２００とが共有する構成である。 In FIG. 17, the audio data sampling unit 102, the temporary storage unit 104, and the frame processing unit 1062 are configured to be shared by the audio interval detection unit 1064 of the audio interval detection device and the audio processing unit 200 of the subsequent audio processing device.

すなわち、音声区間検出部１０６４が検出したフレーム毎の音声／非音声の情報と、音声区間の開始／終了情報は、音声処理部２００へ伝送される。 That is, the voice / non-voice information for each frame and the voice section start / end information detected by the voice section detection unit 1064 are transmitted to the voice processing unit 200.

続いて、音声処理部２００は音声区間検出部１０６４から伝送されたフレーム毎の音声／非音声の情報と、音声区間の開始／終了情報をもとに、フレーム分割された音声波形データの音声区間に相当するフレーム部分のみに対して音声処理を実行する。 Subsequently, based on the per-frame speech/non-speech information and the speech section start/end information transmitted from the voice section detection unit 1064, the audio processing unit 200 executes audio processing only on those frames of the frame-divided audio waveform data that correspond to speech sections.

ここで、音声処理部２００が実行する「音声処理」とは、特に、限定されないが、たとえば、音声認識の前処理とか、後続音声処理装置から他の機器へ音声信号の送信を行うか否か、という判断をフレーム毎に行なって、伝送処理を選択的に行なう処理などである。 Here, the "audio processing" executed by the audio processing unit 200 is not particularly limited. Examples include preprocessing for speech recognition, and processing that decides frame by frame whether to send the audio signal from the subsequent audio processing device to another device and performs transmission selectively.

図１７に示したような構成では、音声区間検出部１０６４から音声処理部２００へ伝送されるデータは、音声区間の開始／終了情報のみでよいので、これらの間のデータ伝送量を抑制できる。 In the configuration as shown in FIG. 17, the data transmitted from the voice section detection unit 1064 to the voice processing unit 200 may be only the start / end information of the voice section, and the data transmission amount between them can be suppressed.
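
The first connection method can be sketched as follows: both units read the same frame-divided data, and only section boundaries cross from the detector to the processor. The function name `process_speech_frames`, the list-of-frames representation, and the `process` callback are illustrative assumptions.

```python
def process_speech_frames(frames, sections, process):
    """Apply `process` only to frames that fall inside speech sections.

    frames   -- frame-divided waveform data shared by both units
    sections -- (start_frame, end_frame) pairs from the detector, end exclusive
    process  -- hypothetical per-frame audio processing callback
    """
    results = []
    for start, end in sections:
        # Non-speech frames outside [start, end) are never touched.
        for frame in frames[start:end]:
            results.append(process(frame))
    return results
```

Because `sections` holds only a pair of integers per speech section, the amount of data passed between the two units stays small regardless of the length of the recording.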

なお、図１７では、音声区間検出装置と後続音声処理装置が、音声データサンプリング部１０２と一時記憶部１０４とフレーム処理部１０６２とを共有するものとしたが、かならずしも共有する必要はなく、音声区間検出装置と後続音声処理装置がそれぞれ個別に音声データサンプリング部１０２と一時記憶部１０４とフレーム処理部１０６２とを別系統で有するものとしてもよい。この場合は、音声区間検出部１０６４から音声処理部２００への情報の伝送量が少ないので、音声区間検出装置と後続音声処理装置を分離して遠隔地に設置しても、伝送路の伝送速度に影響を受けにくい。もちろん、このとき音声入力から音声データサンプリング部１０２までの間は音声区間検出装置と後続音声処理装置への２分岐されたアナログ音声信号として遠隔地間で伝送することになるものの、音声信号の情報量からすると、これも伝送路の伝送速度にさほど影響は受けない。 In FIG. 17, the voice section detection device and the subsequent audio processing device share the audio data sampling unit 102, the temporary storage unit 104, and the frame processing unit 1062, but sharing is not essential: each device may instead have its own audio data sampling unit 102, temporary storage unit 104, and frame processing unit 1062 as a separate system. In this case, because the amount of information transmitted from the voice section detection unit 1064 to the audio processing unit 200 is small, the arrangement is not easily affected by the transmission speed of the transmission path even if the two devices are separated and installed at remote locations. Of course, the signal between the audio input and the audio data sampling unit 102 must then be carried between the remote locations as an analog audio signal branched in two, one branch to each device, but given the information content of an audio signal, this too is not greatly affected by the transmission speed of the transmission path.

［実施の形態３の変形例１］
（接続方式２）
図１８は、実施の形態３の変形例１である、第２の接続方式を説明するための機能ブロック図である。ここでも、実施の形態１と同一部分には、同一符号を付す。 [Modification 1 of Embodiment 3]
(Connection method 2)
FIG. 18 is a functional block diagram for explaining a second connection method, which is a first modification of the third embodiment. Again, the same parts as those in the first embodiment are denoted by the same reference numerals.

図１８では、音声区間検出装置１０００の音声区間検出部１０６４が検出した音声区間の音声波形データのみを、音声区間毎に後続音声処理装置２０００のフレーム処理部２０１０へ伝送する。 In FIG. 18, only the speech waveform data of the speech segment detected by the speech segment detection unit 1064 of the speech segment detection apparatus 1000 is transmitted to the frame processing unit 2010 of the subsequent speech processing apparatus 2000 for each speech segment.

後続音声処理装置２０００では、音声区間検出部１０６４から伝送された音声区間の音声波形データを、フレーム処理部２０１０において再度フレーム処理してから、音声処理部２００において音声処理を実行する。 In the subsequent audio processing device 2000, the audio waveform data of the speech section transmitted from the voice section detection unit 1064 is divided into frames again by the frame processing unit 2010, and the audio processing unit 200 then executes audio processing on it.
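
In the second connection method the receiving device gets a plain waveform for each speech section and must divide it into frames again. A minimal sketch of that re-framing step is shown below; the function name `split_into_frames` and the policy of dropping a trailing partial frame are assumptions, since this description does not specify how the frame processing unit 2010 handles the end of a section.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Cut a waveform into overlapping frames of `frame_length` samples.

    Consecutive frames start `frame_shift` samples apart. A trailing
    partial frame is dropped (an assumption made for this sketch).
    """
    last_start = len(samples) - frame_length
    return [
        samples[start:start + frame_length]
        for start in range(0, last_start + 1, frame_shift)
    ]
```

The frame length and shift used here need not match those of the voice section detection device, which is what keeps the interface between the two devices a plain audio signal.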

このような構成とすると、音声区間検出装置１０００と後続音声処理装置２０００との間では、音声信号の伝送が行なわれるのみであるので、音声区間検出装置１０００と後続音声処理装置２０００との接続部分の仕組みが単純である。このため、前処理に音声区間検出部１０６４を持たない音声処理装置２０００に対して、当該音声処理装置２０００のフレーム処理部２０１０の直前に、音声区間検出装置１０００をそのまま接続するだけでよい。 With this configuration, only an audio signal is transmitted between the voice section detection device 1000 and the subsequent audio processing device 2000, so the mechanism connecting the two is simple. For an audio processing device 2000 whose preprocessing does not include a voice section detection unit 1064, it is therefore sufficient to connect the voice section detection device 1000 as it is, immediately before the frame processing unit 2010 of that audio processing device 2000.

［実施の形態３の変形例２］
（接続方式３）
図１９は、実施の形態３の変形例２である、第３の接続方式を説明するための機能ブロック図である。ここでも、実施の形態１と同一部分には、同一符号を付す。 [Modification 2 of Embodiment 3]
(Connection method 3)
FIG. 19 is a functional block diagram for explaining a third connection method, which is a second modification of the third embodiment. Again, the same parts as those in the first embodiment are denoted by the same reference numerals.

図１９では、音声区間検出装置１０００の音声区間検出部１０６４が検出した音声区間のフレーム分割した音声波形データを、フレーム毎に音声処理装置２０００の音声処理部２００へ伝送する。 In FIG. 19, the speech waveform data obtained by dividing the speech segment detected by the speech segment detection unit 1064 of the speech segment detection device 1000 is transmitted to the speech processing unit 200 of the speech processing device 2000 for each frame.

音声処理装置２０００の音声処理部２００は、音声区間検出部１０６４から伝送された音声区間のフレーム分割した音声波形データに対して音声処理を実行する。 The voice processing unit 200 of the voice processing device 2000 performs voice processing on the voice waveform data obtained by dividing the voice section transmitted from the voice section detecting unit 1064 into frames.

このような構成とすれば、音声区間検出装置１０００と後続音声処理装置２０００との接続部分の仕組みは、音声信号の伝達のみを担えばよいので比較的単純であり、しかも、音声区間検出装置１０００と後続音声処理装置２０００の間で重複する処理が無く、処理効率が高い。 With this configuration, the mechanism connecting the voice section detection device 1000 and the subsequent audio processing device 2000 is relatively simple because it only has to carry the audio signal. In addition, no processing is duplicated between the voice section detection device 1000 and the subsequent audio processing device 2000, so processing efficiency is high.

［実施の形態３の変形例３］
（接続方式４）
図２０は、実施の形態３の変形例３である、第４の接続方式を説明するための機能ブロック図である。ここでも、実施の形態１と同一部分には、同一符号を付す。 [Modification 3 of Embodiment 3]
(Connection method 4)
FIG. 20 is a functional block diagram for explaining a fourth connection method, which is a third modification of the third embodiment. Again, the same parts as those in the first embodiment are denoted by the same reference numerals.

図２０では、音声区間検出装置１０００の音声区間検出部１０６４が検出したフレーム毎の音声／非音声の情報と、音声区間の開始／終了情報とともに、フレーム分割した音声波形データを、フレーム毎に音声処理装置２０００の音声処理部２００へ伝送する。 In FIG. 20, the frame-divided audio waveform data is transmitted frame by frame to the audio processing unit 200 of the audio processing device 2000, together with the per-frame speech/non-speech information and the speech section start/end information detected by the voice section detection unit 1064 of the voice section detection device 1000.

音声処理装置２０００の音声処理部２００は、音声区間検出部１０６４から伝送されたフレーム毎の音声／非音声の情報と、音声区間の開始／終了情報とに基づいて、処理方法を分別して、同じく音声区間検出部１０６４から伝送されたフレーム毎の音声波形データに対して個別の音声処理を実行する。 The audio processing unit 200 of the audio processing device 2000 selects a processing method based on the per-frame speech/non-speech information and the speech section start/end information transmitted from the voice section detection unit 1064, and executes the corresponding audio processing on the per-frame audio waveform data likewise transmitted from the voice section detection unit 1064.

このような構成とすれば、音声区間情報と音声波形データがフレーム毎に対になって音声処理部２００へ伝送されるので、音声処理部２００は音声区間情報を利用して処理内容を細分できる。 With such a configuration, since the voice section information and the voice waveform data are paired for each frame and transmitted to the voice processing section 200, the voice processing section 200 can subdivide the processing contents using the voice section information. .
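
The pairing of section information with waveform data in the fourth connection method can be sketched as a per-frame packet plus a dispatch function. Everything named here (`FramePacket`, its fields, `choose_processing`, and the returned action labels) is hypothetical; the description only states that the processing method is selected per frame from the accompanying information.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FramePacket:
    """One frame of waveform data paired with its detection results."""
    samples: List[float]
    is_speech: bool
    starts_section: bool = False
    ends_section: bool = False


def choose_processing(packet: FramePacket) -> str:
    """Pick a processing path for one frame (labels are illustrative only)."""
    if not packet.is_speech:
        return "skip"                 # e.g. do not transmit or recognize
    if packet.starts_section:
        return "reset_and_process"    # start takes priority in this sketch
    if packet.ends_section:
        return "process_and_flush"
    return "process"
```

Because each packet carries its own flags, the receiver does not need to keep frame counters in step with the detector in order to know where a section begins or ends.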

今回開示された実施の形態はすべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は上記した説明ではなくて特許請求の範囲によって示され、特許請求の範囲と均等の意味および範囲内でのすべての変更が含まれることが意図される。 The embodiment disclosed this time should be considered as illustrative in all points and not restrictive. The scope of the present invention is defined by the terms of the claims, rather than the description above, and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.

FIG. 1: 本発明の音声区間検出装置１０００の構成の一例を示す概念図である。 A conceptual diagram showing an example of the configuration of the speech section detection device 1000 of the present invention.
FIG. 2: 図１に示したフレーム処理部１０６２と、音声区間検出部１０６４とが行う処理を説明するためのフローチャートである。 A flowchart for explaining the processing performed by the frame processing unit 1062 and the voice section detection unit 1064 shown in FIG. 1.
FIG. 3: 図２のフローチャートの処理を示す概念図である。 A conceptual diagram showing the processing of the flowchart of FIG. 2.
FIG. 4: 発声内容「あー」について、第１変動を表す図である。 A diagram showing the first variation for the utterance 「あー」 ("aa").
FIG. 5: スムージングされた第１変動を表す図である。 A diagram showing the smoothed first variation.
FIG. 6: 第２変動を表す図である。 A diagram showing the second variation.
FIG. 7: スムージングされた第２変動を表わす図である。 A diagram showing the smoothed second variation.
FIG. 8: 発声内容「あいかわらず」について、第１変動を表す図である。 A diagram showing the first variation for the utterance 「あいかわらず」 ("aikawarazu").
FIG. 9: スムージングされた第１変動を表す図である。 A diagram showing the smoothed first variation.
FIG. 10: 第２変動を表す図である。 A diagram showing the second variation.
FIG. 11: スムージングされた第２変動を表わす図である。 A diagram showing the smoothed second variation.
FIG. 12: 実施の形態２の音声区間解析装置２０００の構成を説明するための機能ブロック図である。 A functional block diagram for explaining the configuration of the speech segment analysis apparatus 2000 according to the second embodiment.
FIG. 13: 表示装置１４０上に出力される「音声区間検出機能付き音声収録試聴アプリケーション」の基本画面を説明するための図である。 A diagram for explaining the basic screen of the "audio recording audition application with voice segment detection function" output on the display device 140.
FIG. 14: 図１３で説明した基本画面（または各変動値表示画面）の「設定ボタン」がクリックされると、呼び出される設定画面を示す図である。 A diagram showing the setting screen called up when the "setting button" on the basic screen (or each variation value display screen) described in FIG. 13 is clicked.
FIG. 15: 設定画面において、「スムージングされた第２変動の表示」を「表示する」に設定された場合、表示装置１４０に表示される第１の拡張画面を示す図である。 A diagram showing the first extended screen displayed on the display device 140 when "display of smoothed second variation" is set to "display" on the setting screen.
FIG. 16: 設定画面において、各変動の表示のいずれもが「表示する」に設定された場合の第２の拡張画面を示す図である。 A diagram showing the second extended screen when the display of every variation is set to "display" on the setting screen.
FIG. 17: 第１の接続方式を説明するための機能ブロック図である。 A functional block diagram for explaining the first connection method.
FIG. 18: 第２の接続方式を説明するための機能ブロック図である。 A functional block diagram for explaining the second connection method.
FIG. 19: 第３の接続方式を説明するための機能ブロック図である。 A functional block diagram for explaining the third connection method.
FIG. 20: 第４の接続方式を説明するための機能ブロック図である。 A functional block diagram for explaining the fourth connection method.

Explanation of symbols

１０１入出力Ｉ／Ｆ、１０２音声データサンプリング部、１０４一時記憶部、１０６演算部、１０８データ格納部、１１０Ａ／Ｄ変換器、１０００音声区間検出装置、１０６２フレーム処理部、１０６４音声区間検出部１０６４、２０００音声処理装置。 101: input/output I/F; 102: audio data sampling unit; 104: temporary storage unit; 106: calculation unit; 108: data storage unit; 110: A/D converter; 1000: voice section detection device; 1062: frame processing unit; 1064: voice section detection unit; 2000: audio processing device.

Claims

A speech section detection device comprising:
frame processing means for performing frame cut-out processing on sampled audio data;
first variation calculating means for calculating a volume of the audio data as a first variation;
second variation calculating means for calculating a variation of the first variation as a second variation;
frame determination means for determining, for each frame, whether the frame is speech or non-speech by comparing the second variation with a predetermined threshold; and
speech section determining means for determining a speech section based on the results of the speech or non-speech determination.

The speech section detection device according to claim 1, wherein the first variation calculating means calculates the first variation by smoothing the volume of the audio data.

The speech section detection device according to claim 1, wherein the second variation calculating means calculates the second variation by smoothing the variation of the first variation.

The speech section detection device according to claim 1, wherein the speech section determining means determines a speech section from the continuation length of frames determined to be speech or non-speech.

The speech section detection device according to claim 1, wherein the speech section determining means excludes, from the speech sections, any run of frames determined to be speech that does not satisfy a predetermined continuation length.

The speech section detection device according to claim 1, wherein the speech section determining means combines a non-speech section that is sandwiched between speech sections and has a predetermined continuation length or less with the speech sections on both sides to form one speech section.

A speech section detection program for causing a computer having an arithmetic processing unit, a speech input device, and a storage device to perform speech section detection, the program causing the computer to execute the steps of:
performing frame cut-out processing on audio data sampled by the speech input device and stored in the storage device;
the arithmetic processing unit calculating a volume of the audio data as a first variation;
the arithmetic processing unit calculating a variation of the first variation as a second variation;
the arithmetic processing unit determining, for each frame, whether the frame is speech or non-speech by comparing the second variation with a predetermined threshold; and
the arithmetic processing unit determining a speech section based on the results of the speech or non-speech determination.
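
For orientation, the steps recited in the claims above can be read as the following processing chain. This is a sketch under several assumptions that the claims do not fix: volume is taken as the mean absolute amplitude of a frame, smoothing is a trailing moving average, the variation of the first variation is the absolute difference between consecutive frames, short gaps are merged before short sections are discarded, and all names and default parameter values are hypothetical.

```python
def moving_average(values, width):
    """Trailing moving average over up to `width` most recent values."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def detect_speech_sections(frames, threshold, smooth=3, min_speech=2, max_gap=1):
    """Return (start_frame, end_frame) speech sections, end exclusive."""
    if not frames:
        return []
    # First variation: smoothed per-frame volume (frames must be non-empty).
    volume = [sum(abs(x) for x in frame) / len(frame) for frame in frames]
    first = moving_average(volume, smooth)
    # Second variation: smoothed frame-to-frame change of the first variation.
    change = [0.0] + [abs(b - a) for a, b in zip(first, first[1:])]
    second = moving_average(change, smooth)
    # Per-frame speech/non-speech determination against the threshold.
    flags = [value >= threshold for value in second]

    # Collect runs of consecutive speech frames.
    sections, start = [], None
    for i, flag in enumerate(flags + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            sections.append([start, i])
            start = None

    # Merge sections separated by a non-speech gap of at most `max_gap` frames.
    merged = []
    for section in sections:
        if merged and section[0] - merged[-1][1] <= max_gap:
            merged[-1][1] = section[1]
        else:
            merged.append(section)
    # Discard sections shorter than `min_speech` frames.
    return [(s, e) for s, e in merged if e - s >= min_speech]
```

Each step is a single pass over the frames with only additions, comparisons, and short averages, which is consistent with the stated aim of a comparatively small computation amount.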