JP6794064B2 - Model learning device, speech interval detector, their methods and programs

Info

Publication number: JP6794064B2
Application number: JP2017159288A
Authority: JP
Inventors: 清彰松井; 岡本　学; 学岡本; 山口　義和; 義和山口; 太一浅見; 隆朗福冨; 崇史森谷
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc
Current assignee: Nippon Telegraph and Telephone Corp; NTT Inc
Priority date: 2017-08-22
Filing date: 2017-08-22
Publication date: 2020-12-02
Anticipated expiration: 2037-08-22
Also published as: JP2019039946A

Description

The present invention relates to a speech section detection technique.

One speech section detection technique is a method called voice activity detection (VAD) (see, for example, Non-Patent Document 1). VAD detects speech sections using sound intensity, the zero-crossing count (a measure of how rapidly the waveform oscillates), acoustic features, and the like. However, VAD tends to classify consonants, short pauses between words, and similar intervals as non-speech sections, and therefore detects fragmented speech sections. A technique called hangover is used to deal with this problem (see, for example, Non-Patent Document 1). When the number of frames in the non-speech section between two speech sections obtained by VAD is smaller than a threshold, hangover treats the two speech sections as one continuous speech section.
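
The hangover rule described above can be sketched as follows. This is a minimal illustration, assuming the VAD output is available as one binary flag per frame; the function name `merge_with_hangover` and the parameter `max_gap_frames` are hypothetical and do not come from Non-Patent Document 1.

```python
from typing import List, Optional, Tuple


def merge_with_hangover(vad_flags: List[int], max_gap_frames: int) -> List[Tuple[int, int]]:
    """Return speech sections as (start, end) frame indices, end exclusive.

    Two neighbouring sections are merged when the non-speech gap between
    them is shorter than max_gap_frames (a hand-tuned threshold).
    """
    # Step 1: collect the raw speech sections reported by VAD.
    sections: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, flag in enumerate(vad_flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(vad_flags)))

    # Step 2: merge sections separated by a gap shorter than the threshold.
    merged: List[Tuple[int, int]] = []
    for sec in sections:
        if merged and sec[0] - merged[-1][1] < max_gap_frames:
            merged[-1] = (merged[-1][0], sec[1])
        else:
            merged.append(sec)
    return merged
```

The result depends entirely on the single value chosen for `max_gap_frames`, which is the tuning burden discussed in the following paragraphs.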

Non-Patent Document 1: ITU, “A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70,” ITU-T Recommendation G.729 Annex B, 1996.

However, the conventional hangover-based method is inflexible. The length of the non-speech section between two speech sections obtained by VAD differs depending on the setting, such as the speech task and the domain. The optimum threshold must therefore be set manually for each anticipated setting. Moreover, even within the same setting, an appropriate threshold should ideally be given for each utterance. The conventional method offers no such flexibility.

The present invention has been made in view of these points, and its object is to improve flexibility in speech section detection.

An input speech likelihood sequence, which corresponds to the speech likelihood of each time interval of an input acoustic signal, is applied to a state transition model. The state transition model takes such an input speech likelihood sequence as input and yields an estimate of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal. The estimate obtained in this way is output.

The present invention can improve flexibility in speech section detection.

FIG. 1A is a block diagram showing the functional configuration of the model learning device of the embodiment. FIG. 1B is a block diagram showing the functional configuration of the speech section detection device of the embodiment.
FIG. 2 is a state transition diagram of the state transition model of the embodiment.
FIGS. 3A and 3B are diagrams illustrating acoustic signals. FIG. 3C illustrates speech sections and non-speech sections detected by VAD. FIG. 3D illustrates speech sections and non-speech sections detected by the method of the embodiment.
FIG. 4A is a diagram illustrating an acoustic signal having a plurality of utterance sections. FIG. 4B illustrates speech sections and non-speech sections detected by the method of the embodiment.

Hereinafter, embodiments of the present invention will be described.
[Overview]
First, an outline of each embodiment will be given. In each embodiment, an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal is applied to a state transition model, and an estimate of the state transition in each time interval, with respect to the speech state and the non-speech state of the input acoustic signal, is obtained and output. The "state transition model" is a model that takes as input an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and yields an estimate of the state transition in each time interval with respect to the speech state and the non-speech state of that signal. This state transition model models how actual speech sections and non-speech sections transition (that is, how actual speech sections and non-speech sections appear) for a given input speech likelihood sequence. Therefore, by applying the input speech likelihood sequence to this model, even when the sequence contains time intervals such as consonants or short pauses, those intervals can be properly estimated to be part of a speech section rather than non-speech sections. This solves the problem of classifying consonants, short pauses, and the like as non-speech sections and detecting fragmented speech sections. Moreover, because the state transition model can be applied to diverse input speech likelihood sequences, it is more flexible than hangover. In particular, the state transition model of this embodiment estimates the transitions between speech sections and non-speech sections from the input speech likelihood sequence. The input speech likelihood sequence condenses an input acoustic signal, which fluctuates strongly with the environment, into a sequence corresponding to speech likelihoods, which fluctuate less. The state transition model of this embodiment can therefore adapt flexibly to diverse environments and enables highly accurate estimation. As described above, the method of each embodiment improves flexibility in speech section detection compared with conventional methods. The state transition model can be generated by machine learning using learning data, so its tuning cost is lower than that of hangover, whose threshold is set manually.

The "input acoustic signal" is a time-series digital acoustic signal divided into a plurality of predetermined time intervals (for example, frames or subframes). The "speech likelihood" of a time interval represents the likelihood that the time interval is a speech section. The "speech likelihood" may directly indicate the likelihood that the time interval is a speech section, or it may represent that likelihood indirectly by indicating the likelihood that the time interval is a non-speech section. The "input speech likelihood sequence" is a time series of values corresponding to the "speech likelihood" of each time interval. It may be a time series of the "speech likelihood" itself, or a time series of values representing the "speech likelihood". A value representing the "speech likelihood" may be continuous or binary. For example, it may be a function value of the "speech likelihood" or a binary value obtained by thresholding the "speech likelihood". From the viewpoint of estimation accuracy, however, the "input speech likelihood sequence" is desirably a time series of the "speech likelihood" itself or of continuous values representing it.

The "state transition" in each time interval with respect to the speech state and the non-speech state is, for example, one of the following: a state transition from the speech state to the speech state, a state transition from the non-speech state to the non-speech state, a state transition from the speech state to the non-speech state, or a state transition from the non-speech state to the speech state. A state transition from the speech state to the speech state means that the speech state persists, that is, the "speech state". Similarly, a state transition from the non-speech state to the non-speech state means that the non-speech state persists, that is, the "non-speech state". The estimate of the state transition obtained by applying the input speech likelihood sequence to the state transition model may be the likelihood of each state transition, a function value of those likelihoods, or a value obtained by comparing them (for example, the state transition with the highest likelihood).

The state transition model is learned using learning data that includes a pair consisting of a speech likelihood sequence, which corresponds to the speech likelihood of each time interval of an acoustic signal, and a sequence of correct state transition values in each time interval with respect to the speech state and the non-speech state of that acoustic signal. In this learning data, the element of the speech likelihood sequence for each time interval (for example, the speech likelihood) and the correct state transition value for that time interval are associated with each other. For example, both are associated with an identifier representing the time interval. Normally, the learning data includes such pairs for a plurality of acoustic signals, that is, a plurality of pairs of a speech likelihood sequence and a sequence of correct state transition values. However, the learning data may include only the pair for a single acoustic signal. The "acoustic signal" is a time-series digital acoustic signal divided into a plurality of predetermined time intervals (for example, frames or subframes). The speech likelihood sequence of the learning data need not be identical to the expected input speech likelihood sequence, but it must be of the same type. For example, if the expected input speech likelihood sequence is a time series of the speech likelihood of each time interval, the speech likelihood sequence of the learning data must also be a time series of the speech likelihood of each time interval.
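
As a rough illustration of how one learning-data pair could be held in memory, the sketch below keeps the speech likelihood sequence and the sequence of correct state transition values side by side and indexes both by time interval. The class name `TrainingPair`, its field names, and the use of integers to encode the correct state transitions are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TrainingPair:
    """One acoustic signal's speech likelihood sequence and its correct transitions."""

    likelihoods: List[float]  # one value per time interval
    transitions: List[int]    # one correct state transition value per time interval

    def __post_init__(self) -> None:
        # Both sequences must cover exactly the same time intervals.
        if len(self.likelihoods) != len(self.transitions):
            raise ValueError("sequences must have one entry per time interval")

    def by_interval(self) -> Dict[int, Tuple[float, int]]:
        """Associate each element with its time-interval identifier (here, the index)."""
        return {t: pair for t, pair in enumerate(zip(self.likelihoods, self.transitions))}
```

Learning data in the sense above would then be a list of such pairs, normally one per acoustic signal.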

[First Embodiment]
The first embodiment will be described.
<Model learning device 11>
First, the model learning device 11 of this embodiment will be described. As illustrated in FIG. 1A, the model learning device 11 of this embodiment has a storage unit 111 that stores learning data 111a and a learning unit 112 that learns a state transition model 123a.

The learning data 111a includes a pair consisting of a speech likelihood sequence, which corresponds to the speech likelihood of each time interval of an acoustic signal, and a sequence of correct state transition values in each time interval with respect to the speech state and the non-speech state of that acoustic signal. Using the learning data 111a read from the storage unit 111, the learning unit 112 obtains and outputs the state transition model 123a, which takes as input an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and yields an estimate of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal. The state transition in this embodiment is one of the following: a state transition from the speech state to the speech state, a state transition from the non-speech state to the non-speech state, a state transition from the speech state to the non-speech state, or a state transition from the non-speech state to the speech state.

The speech likelihood sequence included in the learning data 111a is obtained by extracting acoustic information and analysis information from an acoustic signal for learning data and applying a known VAD to them (see, for example, Non-Patent Document 1). Examples of the acoustic information and analysis information used for VAD include changes in sound power, the number of zero crossings of the waveform per fixed period, changes in the characteristics of acoustic features, and combinations of these. The acoustic signal for learning data is a time-series digital acoustic signal divided into predetermined time intervals. It may be obtained by AD-converting, at a predetermined sampling frequency, an analog acoustic signal observed with a microphone or the like, or it may be any digital acoustic signal prepared in advance. Examples of the speech likelihood sequence are as described above: it is a time series of the values obtained for each time interval of the acoustic signal for learning data. When the speech likelihood sequence is a sequence of the speech likelihood of each time interval, or of continuous values representing it, the sequence looks like, for example, 0.1, 0.5, ..., 0.8. When it is a sequence of binary values representing the speech likelihood of each time interval, the sequence looks like, for example, 0, 0, 0, 1, ..., 1, 0.
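
The following sketch shows two of the per-interval quantities mentioned above (power and zero-crossing count) and how a continuous speech likelihood sequence could be turned into a binary one by thresholding. It is not the VAD of Non-Patent Document 1: how a real VAD combines such quantities into a likelihood is not specified here, and the function names and the default threshold of 0.5 are assumptions.

```python
from typing import List, Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of the samples in one time interval."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame: Sequence[float]) -> int:
    """Number of sign changes between consecutive samples in one time interval."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def binarize(likelihoods: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Convert a continuous speech likelihood sequence into a binary sequence."""
    return [1 if p >= threshold else 0 for p in likelihoods]
```

For example, `binarize([0.1, 0.5, 0.8])` gives `[0, 1, 1]`. As noted in the overview, the continuous form is preferable as model input from the viewpoint of estimation accuracy.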

The sequence of correct state transition values included in the learning data 111a is obtained by assigning a correct state transition value, with respect to the speech state and the non-speech state, to each time interval of the speech likelihood sequence obtained from the acoustic signal for learning data described above (hereinafter, the "speech likelihood sequence for learning data"). These correct values are now described in detail. Suppose, for example, that the sentence "Today, the weather is nice" is read aloud. The aim is to estimate the whole of "Today, the weather is nice" as one true speech section from the input speech likelihood sequence corresponding to the input acoustic signal, whether the subject part ("Today") and the predicate part ("the weather is nice") are read out in succession as in FIG. 3A or with a gap between them as in FIG. 3B. When the subject part and the predicate part are read with a gap as in FIG. 3B, the corresponding input speech likelihood sequence indicates a short non-speech section between "Today" and "the weather is nice". For example, if the input speech likelihood sequence is a binary sequence representing speech sections and non-speech sections, it contains non-speech time intervals between "Today" and "the weather is nice" (FIG. 3C). This embodiment properly estimates the true speech section by absorbing the oscillation of the speech likelihood caused by such a short non-speech section sandwiched between speech sections. Conversely, even when sudden noise or the like occurs within a non-speech section, the true non-speech section is properly estimated by absorbing the resulting oscillation of the speech likelihood. Here, a "true speech section" means the time section from when speech is first observed in one utterance section until the last speech is no longer observed. A "true non-speech section" means any time section other than a true speech section. An "utterance section" means a time section in which a coherent utterance, such as "Today, the weather is nice", is made. "Utterance start" means the start of an "utterance section", and "utterance end" means the end of an "utterance section".

To perform such estimation, the state transition model 123a is learned, which takes the input speech likelihood sequence as input and yields an estimate of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal. In this embodiment, to emphasize the state transition from the non-speech state to the speech state (start transition state) and the state transition from the speech state to the non-speech state (end transition state), a dedicated label is assigned not only to the speech state and the non-speech state but also to each of the start transition state and the end transition state. For example, the following labels are assigned to the state transitions.
Label "0": state transition from the non-speech state to the non-speech state (non-speech state)
Label "1": state transition from the speech state to the speech state (speech state)
Label "2": state transition from the non-speech state to the speech state (start transition state)
Label "3": state transition from the speech state to the non-speech state (end transition state)
The speech state is the state corresponding to a true speech section, and the non-speech state is the state corresponding to a true non-speech section.
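
The four labels can be written down as an enumeration, and a label sequence can be turned back into speech sections by looking for the start and end transition labels. The sketch below is illustrative only: the names `Transition` and `speech_sections` are hypothetical, and treating the transition intervals themselves as part of the speech section is an assumption of this example.

```python
from enum import IntEnum
from typing import List, Optional, Sequence, Tuple


class Transition(IntEnum):
    NON_SPEECH = 0  # non-speech -> non-speech
    SPEECH = 1      # speech -> speech
    START = 2       # non-speech -> speech (start transition state)
    END = 3         # speech -> non-speech (end transition state)


def speech_sections(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (first, last) time-interval indices of each speech section, inclusive."""
    sections: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for t, label in enumerate(labels):
        inside = label != Transition.NON_SPEECH
        if inside and start is None:
            start = t  # section opens at START, or at SPEECH if the sequence begins mid-speech
        if start is not None and (label == Transition.END or not inside):
            sections.append((start, t if inside else t - 1))
            start = None
    if start is not None:
        # The sequence ended while still inside a speech section.
        sections.append((start, len(labels) - 1))
    return sections
```

With the label sequence `[0, 0, 2, 1, 1, 3, 0, 0]`, this yields the single section `(2, 5)`.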

FIG. 2 shows the state transition diagram of the state transition model 123a when such labels are given. The state transition model 123a takes the input speech likelihood sequence as input and, while looping through the non-speech state and the speech state, outputs the likelihood of the label corresponding to each of the non-speech state, the speech state, the start transition state, and the end transition state. In this case, the patterns of the learning data can be subdivided according to whether the time intervals corresponding to the speech likelihood sequence for learning data
(1) include a true speech section,
(2) include a true speech section and the true non-speech section immediately before it, and
(3) include a true speech section and the true non-speech section immediately after it.

Pattern 1: The time intervals corresponding to the speech likelihood sequence for learning data include a true speech section and the true non-speech sections immediately before and after it
This is the most common pattern. In this case, the following labels are given to the time intervals corresponding to the speech likelihood sequence for learning data.
- Time intervals from the utterance start to the end of the non-speech state: non-speech state "0"
- Time interval of the transition from the end of the non-speech state to the speech state: start transition state "2"
- Time intervals from the beginning to the end of the speech state: speech state "1"
- Time interval of the transition from the end of the speech state to the non-speech state: end transition state "3"
- Time intervals from the beginning (utterance end) to the end (next utterance start) of the non-speech state: non-speech state "0"
That is, the sequence of correct state transition values in the time intervals corresponding to the speech likelihood sequence is 0, ..., 0, 2, 1, ..., 1, 3, 0, ..., 0. For example, whether the subject part and the predicate part are read out in succession as in FIG. 3A or with a gap between them as in FIG. 3B, the sequence of correct state transition values (the label sequence) is in both cases 0, ..., 0, 2, 1, ..., 1, 3, 0, ..., 0, as illustrated in FIG. 3D. Normally, neither the start transition state nor the end transition state spans two or more consecutive time intervals, although either may do so.

Pattern 2: The time intervals corresponding to the speech likelihood sequence for learning data include a true speech section and the true non-speech section immediately after it, but not the non-speech section immediately before it
In this case, the first time interval is regarded as the time interval of the start transition state. That is, the following labels are given to the time intervals corresponding to the speech likelihood sequence for learning data.
- First time interval: start transition state "2"
- Time intervals from the next time interval to the end of the speech state: speech state "1"
- Time interval of the transition from the end of the speech state to the non-speech state: end transition state "3"
- Time intervals from the beginning (utterance end) to the end (next utterance start) of the non-speech state: non-speech state "0"
That is, the sequence of correct state transition values in the time intervals corresponding to the speech likelihood sequence is 2, 1, ..., 1, 3, 0, ..., 0.

Pattern 3: The time intervals corresponding to the speech likelihood sequence for learning data include a true speech section and the true non-speech section immediately before it, but not the non-speech section immediately after it
In this case, the final time interval is regarded as the time interval of the end transition state. That is, the following labels are given to the time intervals corresponding to the speech likelihood sequence for learning data.
- Time intervals from the utterance start to the end of the non-speech state: non-speech state "0"
- Time interval of the transition from the end of the non-speech state to the speech state: start transition state "2"
- Time intervals from the beginning of the speech state to just before the final time interval: speech state "1"
- Final time interval: end transition state "3"
That is, the sequence of correct state transition values in the time intervals corresponding to the speech likelihood sequence is 0, ..., 0, 2, 1, ..., 1, 3.

Pattern 4: The time intervals corresponding to the speech likelihood sequence for learning data include a true speech section but no non-speech section before or after it
This is the case in which all time intervals corresponding to the speech likelihood sequence for learning data belong to a true speech section. In this case, the following label is given to the time intervals corresponding to the speech likelihood sequence.
- Time intervals from the first to the last: speech state "1"
That is, the sequence of correct state transition values in the time intervals corresponding to the speech likelihood sequence is 1, ..., 1.

Pattern 5: The time intervals corresponding to the speech likelihood sequence include no true speech section
In this case, the following label is given to the time intervals corresponding to the speech likelihood sequence.
- Time intervals from the first to the last: non-speech state "0"
That is, the sequence of correct state transition values in the time intervals corresponding to the speech likelihood sequence is 0, ..., 0.
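
Patterns 1 to 5 can be reproduced by a small labeling routine. The sketch below assumes the true speech section is given as the indices of its first and last time intervals within the sequence, that the start transition label "2" falls on the first of those intervals and the end transition label "3" on the last, and that the section spans at least two intervals. The function name `label_sequence` and its parameters are hypothetical.

```python
from typing import List, Optional


def label_sequence(n: int, first: Optional[int] = None, last: Optional[int] = None) -> List[int]:
    """Build the correct-label sequence for n time intervals.

    first/last are the indices of the first and last intervals of the true
    speech section, or None when the sequence contains no true speech section.
    """
    if first is None or last is None:
        return [0] * n                      # Pattern 5: no true speech section
    if not 0 <= first < last < n:
        raise ValueError("expected 0 <= first < last < n")
    if first == 0 and last == n - 1:
        return [1] * n                      # Pattern 4: speech throughout
    labels = [0] * n                        # non-speech state by default
    for t in range(first, last + 1):
        labels[t] = 1                       # speech state
    labels[first] = 2                       # start transition state
    labels[last] = 3                        # end transition state
    return labels
```

For instance, `label_sequence(8, 2, 5)` gives the Pattern 1 sequence `[0, 0, 2, 1, 1, 3, 0, 0]`, `label_sequence(6, 0, 3)` gives the Pattern 2 sequence `[2, 1, 1, 3, 0, 0]`, and `label_sequence(6, 2, 5)` gives the Pattern 3 sequence `[0, 0, 2, 1, 1, 3]`.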

The learning data 111a may include the sequences of correct state transition values for all of Patterns 1 to 5 above, or only for some of them. The time intervals corresponding to a speech likelihood sequence may also include a plurality of true speech sections. In that case, the learning data 111a consists of a combination of the sequences of correct state transition values of Patterns 1 to 5. For example, as illustrated in FIGS. 4A and 4B, when true non-speech sections exist before and after the first true speech section I and no non-speech section exists after the second true speech section II, the learning data 111a includes a pair consisting of the speech likelihood sequence of the first true speech section I and the corresponding Pattern 1 sequence of correct state transition values 0, ..., 0, 2, 1, ..., 1, 3, 0, ..., 0, and a pair consisting of the speech likelihood sequence of the second true speech section II and the corresponding Pattern 3 sequence of correct state transition values 0, ..., 0, 2, 1, ..., 1, 3.

The labels described above must be assigned based not only on whether each time interval of the acoustic signal for learning data is a speech section or a non-speech section, but also on the speech sections and non-speech sections regarded as utterance sections of that signal (that is, the speech sections and non-speech sections tied to each utterance section). Various methods of assigning such labels are conceivable. In the first method, a person listens to each utterance contained in the acoustic signal for learning data, inspects the waveform, and assigns accurate labels. In the second method, the labels are assigned automatically using a known recognition decoder. The second method may assign incorrect labels to some of the learning data, but this is not a serious problem as long as labels are assigned correctly with high frequency over the learning data as a whole. The third method is a compromise: labels are first assigned by the decoder and then checked manually for errors.

The learning unit 112 uses the learning data 111a generated as described above to learn the state transition model 123a represented by the state transition diagram of FIG. 2. The state transition model 123a can be represented and learned using a known technique (for example, Reference 1: Hochreiter, S., & Schmidhuber, J., “Long Short-Term Memory,” Neural Computation, 9(8), 1735-1780, 1997). The technique of Reference 1 can retain information over very long series, so it can build a state transition model 123a that takes into account states and voice likelihoods at temporally distant positions, which makes it very useful. Other models that can handle time series (such as an RNN or an HMM) may also be used. In the case of an HMM, however, future states cannot be taken into account because of its structure.

<Voice section detection device 12>
Next, the voice section detection device 12 of this embodiment will be described. As illustrated in FIG. 1B, the voice section detection device 12 of this embodiment includes an input unit 121, a voice section detection unit 122, a storage unit 123, an estimation unit 124, and an output unit 126.

<Storage unit 123>
The storage unit 123 stores the state transition model 123a output from the model learning device 11 as described above. The state transition model 123a may be stored in the storage unit 123 before the voice section detection device 12 starts voice section detection, or it may be stored in the storage unit 123 each time a new state transition model 123a is output from the model learning device 11.

<Input unit 121>
The input acoustic signal that is the target of voice section detection is input to the input unit 121. This input acoustic signal is a time-series digital acoustic signal divided into predetermined time intervals. It may be obtained by AD conversion, at a predetermined sampling frequency, of an analog acoustic signal observed with a microphone or the like, or it may be any digital acoustic signal prepared in advance. The length of the time intervals of the input acoustic signal is preferably the same as, or close to, that of the acoustic signal for learning data described above. The input acoustic signal is sent to the voice section detection unit 122.
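Dividing a digital acoustic signal into predetermined time intervals can be sketched as below. The function name `split_into_intervals`, the use of non-overlapping intervals, and the decision to drop a trailing partial interval are assumptions for illustration; the text does not specify them.

```python
def split_into_intervals(samples, interval_len):
    """Split a sequence of samples into consecutive fixed-length intervals.

    Intervals do not overlap, and a trailing run shorter than
    interval_len is dropped (both are assumptions of this sketch).
    """
    if interval_len <= 0:
        raise ValueError("interval_len must be positive")
    last_start = len(samples) - interval_len
    return [samples[i:i + interval_len]
            for i in range(0, last_start + 1, interval_len)]
```

The same `interval_len` would be used for both the learning data and the detection target, in line with the preference stated above that the two interval lengths be the same or close.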

<Voice section detection unit 122>
The voice section detection unit 122 extracts acoustic information and analysis information from the input acoustic signal and applies a known VAD to them (see, for example, Non-Patent Document 1) to obtain and output an input voice likelihood series corresponding to the voice likelihood of each time interval of the input acoustic signal. Examples of the acoustic information and analysis information used for VAD are as described above. Examples of the input voice likelihood series are also as described above; its type is the same as that of the voice likelihood series for learning data. For example, when the voice likelihood series for learning data is a series of the voice likelihoods of the time intervals, the input voice likelihood series is also a series of the voice likelihoods of the time intervals. The input voice likelihood series is sent to the estimation unit 124.
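To make the notion of a per-interval voice likelihood concrete, the sketch below computes two simple quantities for one interval (mean energy and zero-crossing count) and maps the log energy to a value between 0 and 1. It is only a toy stand-in for whichever known VAD is actually used, not the method of Non-Patent Document 1; the names `frame_log_energy`, `zero_crossings`, and `toy_voice_likelihood`, and the parameters `center` and `slope`, are hypothetical.

```python
import math


def frame_log_energy(frame):
    """Log of the mean squared amplitude of one time interval."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(energy + 1e-12)  # small offset avoids log(0)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def toy_voice_likelihood(frame, center=-4.0, slope=1.0):
    """Map log energy to (0, 1) with a logistic curve (illustrative only)."""
    return 1.0 / (1.0 + math.exp(-slope * (frame_log_energy(frame) - center)))
```

Applying `toy_voice_likelihood` to every interval of the input acoustic signal would give one possible input voice likelihood series. A louder interval gets a higher value than a silent one, which is all this sketch is meant to show.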

<Estimation unit 124>
The estimation unit 124 applies the input voice likelihood series it receives to the state transition model 123a read from the storage unit 123, and obtains and outputs an estimation result of the state transition in each time interval for the voice state and non-voice state of the input acoustic signal (post-filter processing). An example of the estimation result is a series of four likelihoods for each time interval: the likelihood of the non-voice state (label “0”), the likelihood of the voice state (label “1”), the likelihood of the start transition state (label “2”), and the likelihood of the end transition state (label “3”). Alternatively, the estimation result may be only the series of non-voice state (label “0”) likelihoods, only the series of voice state (label “1”) likelihoods, or the series of both the non-voice state and voice state likelihoods for each time interval. The estimation result may also be a series of values, each representing the state with the highest likelihood in the corresponding time interval.

<Output unit 126>
The state transition estimation result output from the estimation unit 124 is sent to the output unit 126. The output unit 126 outputs the state transition estimation result as the voice section estimation result.

[Second Embodiment]
The second embodiment is a modification of the first embodiment. In this embodiment, the voice section estimation result is obtained by post-processing the state transition estimation result output from the estimation unit 124. The following description focuses on the differences from the first embodiment; items already described are referred to by the same reference numbers and their description is simplified.

<Model learning device 11>
The model learning device 11 is the same as in the first embodiment.

<Voice section detection device 22>
As illustrated in FIG. 1B, the voice section detection device 22 of this embodiment includes an input unit 121, a voice section detection unit 122, a storage unit 123, an estimation unit 124, a post-processing unit 225, and an output unit 226. The post-processing unit 225 and the output unit 226, which differ from the first embodiment, are described in detail below.

<Post-processing unit 225>
Unlike in the first embodiment, the state transition estimation result output from the estimation unit 124 is sent to the post-processing unit 225. The post-processing unit 225 performs predetermined post-processing on this estimation result to obtain and output the voice section estimation result. For example, when the estimation result consists of the series of likelihoods of the non-voice state (label “0”), the voice state (label “1”), the start transition state (label “2”), and the end transition state (label “3”) for each time interval, the post-processing unit 225 may use these four likelihood series to select a representative state (non-voice state, voice state, start transition state, or end transition state) for each time interval, and output the resulting series of representative states as the voice section estimation result. For example, the post-processing unit 225 may select and output, for each time interval, the state with the highest likelihood among the four states. That is, it may convert the likelihood series of the four states into a maximum likelihood state series and output it. For example, let n = 0, ..., N-1 be the identifier of a time interval, where N is a positive integer, and for time interval n let P_{n,0} be the likelihood of the non-voice state, P_{n,1} the likelihood of the voice state, P_{n,2} the likelihood of the start transition state, and P_{n,3} the likelihood of the end transition state. In this case, for each time interval n, the post-processing unit 225 may convert the likelihood series (P_{n,0}, P_{n,1}, P_{n,2}, P_{n,3}) of the four states into the maximum likelihood state series s_n and output it.
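The conversion from four likelihoods per time interval to a maximum likelihood state series can be sketched as a per-interval argmax. The function name `most_likely_states` and the list-of-tuples input format are assumptions; when two states tie, this sketch picks the lower label, which the text does not specify.

```python
def most_likely_states(likelihoods):
    """Return s_n for each time interval n.

    likelihoods[n] holds (P_n0, P_n1, P_n2, P_n3): the likelihoods of the
    non-voice, voice, start transition, and end transition states.
    """
    return [max(range(len(frame)), key=frame.__getitem__)
            for frame in likelihoods]
```

For example, the intervals (0.7, 0.1, 0.1, 0.1), (0.1, 0.2, 0.6, 0.1), (0.1, 0.8, 0.05, 0.05), and (0.2, 0.2, 0.1, 0.5) map to the state series 0, 2, 1, 3.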

Alternatively, the post-processing unit 225 may emphasize or weaken the likelihood of a specific state in the likelihood series (P_{n,0}, P_{n,1}, P_{n,2}, P_{n,3}) of the four states, convert the resulting likelihood series into the maximum likelihood state series s_n, and output it. For example, the post-processing unit 225 may convert the likelihood series (P_{n,0}, P_{n,1}, P_{n,2}, P_{n,3}) into the following maximum likelihood state series s_n and output it.

s_n = argmax_{i ∈ {0, 1, 2, 3}} α_i P_{n,i}

Here, α_i is the weight given to the likelihood P_{n,i}. For example, α_i is a positive value greater than 0. Alternatively, the weight given to a specific likelihood P_{n,i} may be set to 0. For example, setting α_2 = α_3 = 0 prevents the start transition state and the end transition state from being selected as the voice section estimation result.
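The effect of the weights can be sketched as follows, assuming that the weighted conversion takes the state i that maximizes α_i times P_{n,i} in each time interval. That form is inferred from the description of the weights, and the name `weighted_states` is hypothetical.

```python
def weighted_states(likelihoods, weights):
    """Pick, for each time interval, the state i maximizing weights[i] * P_ni.

    likelihoods[n] holds (P_n0, P_n1, P_n2, P_n3); weights holds
    (alpha_0, alpha_1, alpha_2, alpha_3). A weight of 0 keeps a state from
    being chosen whenever another state has a positive weighted likelihood.
    """
    if any(len(frame) != len(weights) for frame in likelihoods):
        raise ValueError("each interval needs one likelihood per weight")
    return [max(range(len(weights)), key=lambda i: weights[i] * frame[i])
            for frame in likelihoods]
```

With `weights = (1, 1, 0, 0)`, an interval such as (0.1, 0.2, 0.6, 0.1), whose unweighted maximum is the start transition state, is assigned the voice state instead, so the output contains only labels 0 and 1.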

In addition, the post-processing unit 225 may apply a known correction technique commonly used in VAD to the state transition estimation result it receives, and output the result as the voice section estimation result. For example, when only the series of voice state (label “1”) likelihoods for each time interval is sent as the estimation result, the post-processing unit 225 may compare the voice state likelihood of each time interval with a predetermined threshold and output a voice section estimation result corresponding to the series of comparison results. For example, it may output a voice section estimation result in which time intervals whose voice state likelihood is equal to or greater than the threshold are voice sections and the remaining time intervals are non-voice sections. Likewise, when only the series of non-voice state (label “0”) likelihoods is sent, the post-processing unit 225 may compare the non-voice state likelihood of each time interval with a predetermined threshold and output a voice section estimation result corresponding to the series of comparison results. The post-processing unit 225 may also emphasize or weaken the likelihood of a specific state in the estimation result, apply a known correction technique commonly used in VAD to the resulting likelihood series, and output the result as the voice section estimation result. The post-processing unit 225 may use other known techniques for improving the accuracy of voice section detection.
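The threshold comparison for a voice state likelihood series can be sketched as below. The function name `threshold_sections`, the default threshold of 0.5, and the output format (inclusive first and last interval indices of each voice section) are assumptions for illustration.

```python
def threshold_sections(voice_likelihood, threshold=0.5):
    """Return voice sections as (first, last) interval indices, inclusive.

    An interval belongs to a voice section when its voice state likelihood
    is equal to or greater than the threshold; all others are non-voice.
    """
    sections = []
    start = None
    for n, p in enumerate(voice_likelihood):
        if p >= threshold and start is None:
            start = n                        # a voice section begins
        elif p < threshold and start is not None:
            sections.append((start, n - 1))  # the section ended at n - 1
            start = None
    if start is not None:                    # section runs to the last interval
        sections.append((start, len(voice_likelihood) - 1))
    return sections
```

For the series 0.1, 0.6, 0.9, 0.4, 0.2, 0.7, 0.8 and a threshold of 0.5, the sketch returns the sections (1, 2) and (5, 6).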

<Output unit 226>
The voice section estimation result output from the post-processing unit 225 is sent to the output unit 226. The output unit 226 outputs this voice section estimation result.

[Third Embodiment]
The third embodiment is a modification of the first and second embodiments. In this embodiment, the model learning device generates a plurality of state transition models. The voice section detection device applies the input voice likelihood series to these state transition models, obtains for each of them an estimation result of the state transition in each time interval for the voice state and non-voice state of the input acoustic signal, and selects, from the estimation results obtained, the most probable one in each time interval. The following description focuses on the differences from the first and second embodiments; items already described are referred to by the same reference numbers and their description is simplified.

<Model learning device 31>
As illustrated in FIG. 1A, the model learning device 31 of this embodiment includes a storage unit 311 that stores learning data 311a and a learning unit 312 that learns the state transition models 323a.

The learning data 311a of this embodiment includes a plurality of pairs, each consisting of a voice likelihood series corresponding to the voice likelihood of each time interval of one of several types of acoustic signal for learning data, and a series of correct answer values of the state transition in each time interval for the voice state and non-voice state of that acoustic signal. These pairs are divided into classes according to the type of the acoustic signal for learning data. For example, the learning data 311a includes a plurality of pairs of a voice likelihood series and a series of correct answer values corresponding to acoustic signals for learning data obtained in several types of environment, and these pairs are divided into classes by environment. Let the classes be c = 0, ..., C-1 (where C is an integer of 2 or more), and let D(c) denote the set of pairs of a voice likelihood series and a series of correct answer values that belong to class c in the learning data 311a. That is, the learning data 311a includes the sets D(0), ..., D(C-1).

The learning unit 312 uses the learning data 311a read from the storage unit 311 to obtain and output a plurality of state transition models 323a, each of which takes as input the input voice likelihood series corresponding to the voice likelihood of each time interval of an input acoustic signal and obtains an estimation result of the state transition in each time interval for the voice state and non-voice state of the input acoustic signal. Specifically, the learning unit 312 uses the set D(c) included in the learning data 311a to generate a state transition model M(c) for class c by the same method as in the first embodiment, and outputs the state transition models M(0), ..., M(C-1) as the state transition models 323a. The rest is the same as in the first embodiment.
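The per-class training described here can be sketched as grouping the learning pairs into the sets D(c) and training one model per set. The names `split_by_class` and `train_models` are hypothetical, and `train_one` stands for whatever single-model training routine is used (for example, the first embodiment's); it is supplied by the caller and is not defined in this sketch.

```python
from collections import defaultdict


def split_by_class(examples):
    """Group (class_id, likelihood_series, label_series) triples into D(c)."""
    sets = defaultdict(list)
    for class_id, likelihoods, labels in examples:
        sets[class_id].append((likelihoods, labels))
    return dict(sets)


def train_models(examples, train_one):
    """Return {c: M(c)}, where M(c) = train_one(D(c)).

    train_one is a caller-supplied function that trains one state
    transition model from a list of (likelihood_series, label_series) pairs.
    """
    return {c: train_one(pairs)
            for c, pairs in sorted(split_by_class(examples).items())}
```

Keeping the training routine as a parameter reflects the statement above that each M(c) is generated by the same method as in the first embodiment; only the data passed to it changes from class to class.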

<Voice section detection device 32>
As illustrated in FIG. 1B, the voice section detection device 32 of this embodiment includes an input unit 121, a voice section detection unit 122, a storage unit 323, an estimation unit 324, a post-processing unit 325, and an output unit 226. The storage unit 323, the estimation unit 324, and the post-processing unit 325, which differ from the first and second embodiments, are described in detail below.

<Storage unit 323>
The storage unit 323 stores the plurality of state transition models 323a output from the learning unit 312. The state transition models 323a may be stored in the storage unit 323 before the voice section detection device 32 starts voice section detection, or they may be stored in the storage unit 323 each time new state transition models 323a are output from the model learning device 31.

<Estimation unit 324>
The input voice likelihood series output from the voice section detection unit 122 is input to the estimation unit 324. The estimation unit 324 applies this series to the plurality of state transition models 323a read from the storage unit 323 and, for each of them, obtains and outputs an estimation result of the state transition in each time interval for the voice state and non-voice state of the input acoustic signal. That is, the estimation unit 324 applies the input voice likelihood series to each state transition model M(c) (where c = 0, ..., C-1), and obtains and outputs, for each class c, the estimation result R(c) of the state transition in each time interval. The estimation result R(c) for each class c is sent to the post-processing unit 325. The rest is the same as in the first and second embodiments.

<Post-processing unit 325>
The post-processing unit 325 selects, from the results corresponding to the estimation results R(c) sent from the estimation unit 324 for the classes c, the most probable result in each time interval. A known technique for improving the accuracy of voice section detection can be used for this selection. For example, when the voice section detection unit 122 performs VAD on input acoustic signals obtained in several very different environments, the post-processing unit 325 may select the most probable estimation result using the technique described in Reference 2 (R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, 3:79-87, 1991). That is, from the results corresponding to the estimation results R(0), ..., R(C-1), the post-processing unit 325 selects, as the most probable result, the result corresponding to the estimation result R(c') for the environment in which the input acoustic signal was obtained. The result corresponding to an estimation result R(c) may be R(c) itself, or may be obtained by applying the processing of the post-processing unit 225 of the second embodiment to R(c). The post-processing unit 325 may output the selected most probable result as the voice section estimation result, or may output a voice section estimation result obtained by applying the processing of the post-processing unit 225 of the second embodiment to the selected result. The voice section estimation result is sent to the output unit 226. Subsequent processing is the same as in the second embodiment.
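The per-interval selection among R(0), ..., R(C-1) can be sketched as below. This is a simplified hard selection, not the method of Reference 2: the per-interval scores in `gate_scores`, which say how well each class matches each time interval, are a hypothetical input standing in for whatever the chosen selection technique produces, and the name `select_results` is likewise an assumption.

```python
def select_results(results, gate_scores):
    """Pick, for each time interval n, the result of the best-matching class.

    results[c][n]     : result of model M(c) for interval n (any value type).
    gate_scores[n][c] : hypothetical score of class c for interval n;
                        the class with the highest score is taken as c'.
    """
    selected = []
    for n, scores in enumerate(gate_scores):
        best_class = max(range(len(scores)), key=scores.__getitem__)
        selected.append(results[best_class][n])
    return selected
```

If the environment is known to be the same for the whole input acoustic signal, every row of `gate_scores` would favor the same class and the sketch reduces to returning R(c') unchanged.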

[Features of the method of the embodiments]
As described above, each embodiment can improve VAD accuracy in various environments without manual tuning, achieving both cost reduction and improved accuracy. Even in a single environment, the embodiments absorb variation between utterances and detect voice sections more accurately than existing methods. Furthermore, because the estimation unit using the state transition model is a post-filter connected in cascade after the VAD, it can be combined with other techniques for improving VAD accuracy. For example, further performance improvement can be expected by applying this post-filter processing after the conventional hangover technique.
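The conventional hangover technique mentioned here treats two voice sections as one when the non-voice gap between them is shorter than a threshold number of frames. A minimal sketch follows; the function name `hangover`, the parameter name `max_gap`, and the inclusive (first, last) section format are assumptions for illustration.

```python
def hangover(sections, max_gap):
    """Merge voice sections separated by fewer than max_gap non-voice frames.

    sections is a list of (first, last) frame indices, inclusive.
    """
    merged = []
    for first, last in sorted(sections):
        if merged and first - merged[-1][1] - 1 < max_gap:
            # Gap shorter than the threshold: extend the previous section.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged
```

With `max_gap=3`, the sections (1, 2), (5, 6), and (20, 25) become (1, 6) and (20, 25): the two-frame gap is bridged and the thirteen-frame gap is kept. How the merged sections would then be fed to the post-filter depends on the type of voice likelihood series the state transition model was trained on, which this sketch does not address.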

VADs using a state transition model that takes acoustic features as input and outputs a voice state or non-voice state have existed before. However, acoustic features vary greatly with the environment, so such models become very large and require a large amount of learning data. In contrast, the state transition model of each embodiment obtains the state transition estimation result using the voice likelihood series after VAD processing as the input feature. The voice likelihood series varies less with the environment than acoustic features do. The state transition model of each embodiment can therefore be created from a small amount of learning data and operates robustly as a post-filter for various VAD techniques. In addition, each embodiment assigns special labels to the start (start transition state) and end (end transition state) of a voice section. This allows information about the “section” as a unit to be learned more strongly, making the model less susceptible to temporary voice state frames caused by sudden noise and to temporary non-voice states caused by consonants, breaths, and the like.

[Other modifications]
The present invention is not limited to the embodiments described above. For example, in the first, second, and third embodiments, an input acoustic signal is input to the voice section detection devices 12, 22, and 32, and the voice section detection unit 122 obtains the input voice likelihood series corresponding to the voice likelihood of each time interval of that signal. However, the input voice likelihood series may instead be generated from the input acoustic signal outside the voice section detection devices 12, 22, and 32 and then input to them. In this case, the voice section detection unit 122 may be omitted from the voice section detection devices 12, 22, and 32.

The various processes described above need not be executed in chronological order as described; they may be executed in parallel or individually according to the processing capacity of the device executing them or as needed. It goes without saying that other changes may be made as appropriate without departing from the gist of the present invention.

Each of the above devices is configured, for example, by a general-purpose or dedicated computer that includes a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory) executing a predetermined program. The computer may have one processor and one memory, or a plurality of processors and memories. The program may be installed on the computer or recorded in ROM or the like in advance. Some or all of the processing units may also be configured using electronic circuitry that realizes the processing functions without a program, rather than circuitry such as a CPU that realizes a functional configuration by reading a program. The electronic circuitry constituting one device may include a plurality of CPUs.

When the above configuration is realized by a computer, the processing of the functions that each device should have is described by a program. Executing this program on a computer realizes the above processing functions on the computer. The program describing this processing can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer.

Instead of realizing the processing functions of the present device by executing a predetermined program on a computer, at least some of these processing functions may be realized in hardware.

The present invention can be used, for example, for voice section detection performed before voice recognition processing or voice dialogue processing. Applied in this way, it improves the accuracy of voice section detection and enables the subsequent voice recognition or voice dialogue to be performed with higher accuracy.

11, 31 Model learning device
12, 22, 32 Voice section detection device

Claims

A model learning device that learns a state transition model by using learning data including pairs of a voice likelihood series, corresponding to the voice likelihood of each time interval of an acoustic signal, and a series of correct values of the state transitions in each time interval regarding the voice state and non-voice state of the acoustic signal, wherein the state transition model takes as input an input voice likelihood series corresponding to the voice likelihood of each time interval of an input acoustic signal and yields an estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal.

The model learning device of claim 1,
wherein each state transition is one of: a state transition from a voice state to a voice state, a state transition from a non-voice state to a non-voice state, a state transition from a voice state to a non-voice state, and a state transition from a non-voice state to a voice state.

A voice section detection device comprising:
a storage unit that stores a state transition model that takes as input an input voice likelihood series corresponding to the voice likelihood of each time interval of an input acoustic signal and yields an estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal; and
an estimation unit that applies the input voice likelihood series corresponding to the voice likelihood of each time interval of the input acoustic signal to the state transition model, and obtains and outputs the estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal.

A voice section detection device comprising:
a storage unit that stores a plurality of state transition models, each of which takes as input an input voice likelihood series corresponding to the voice likelihood of each time interval of an input acoustic signal and yields an estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal;
an estimation unit that applies the input voice likelihood series corresponding to the voice likelihood of each time interval of the input acoustic signal to the plurality of state transition models, and obtains and outputs, for each of the plurality of state transition models, the estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal; and
a post-processing unit that selects, from among the results corresponding to the state transition estimation results output from the estimation unit, the most probable result in each time interval.

A model learning method of a model learning device, in which learning data including pairs of a voice likelihood series, corresponding to the voice likelihood of each time interval of an acoustic signal, and a series of correct values of the state transitions in each time interval regarding the voice state and non-voice state of the acoustic signal is used to learn a state transition model that takes as input an input voice likelihood series corresponding to the voice likelihood of each time interval of an input acoustic signal and yields an estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal.

A voice section detection method of a voice section detection device, in which an input voice likelihood series corresponding to the voice likelihood of each time interval of an input acoustic signal is applied to a state transition model that takes the input voice likelihood series as input and yields an estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal, and the estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal is obtained and output.

A voice section detection method of a voice section detection device, in which an input voice likelihood series corresponding to the voice likelihood of each time interval of an input acoustic signal is applied to a plurality of state transition models, each of which takes the input voice likelihood series as input and yields an estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal; the estimation result of the state transition in each time interval regarding the voice state and non-voice state of the input acoustic signal is obtained for each of the plurality of state transition models; and the most probable result in each time interval is selected from among the results corresponding to the state transition estimation results.

A program for operating a computer as the model learning device of claim 1 or 2 or the voice section detection device of claim 3 or 4.
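
For readers who want a concrete picture of two ideas in the claims above, the following minimal sketch shows (a) how a series of correct state transition values could be derived from a per-frame voice/non-voice labeling, using the four transition types recited in claim 2, and (b) one possible reading of the per-interval selection in claim 4, in which the highest-scoring transition across several models is chosen for each frame. The label names, the assumed initial non-voice state, the score layout `posteriors[m][t][k]`, and both function names are assumptions for illustration only; the claims do not prescribe them.

```python
from typing import List, Sequence

# Hypothetical names for the four state transitions of claim 2.
TRANSITIONS = ("NN", "NS", "SN", "SS")


def transition_targets(states: Sequence[int], initial_state: int = 0) -> List[str]:
    """Turn a 0/1 (non-voice/voice) frame sequence into transition labels.

    The state preceding the first frame is assumed to be `initial_state`.
    """
    names = "NS"  # index 0 -> non-voice, index 1 -> voice
    labels = []
    prev = initial_state
    for cur in states:
        labels.append(names[prev] + names[cur])
        prev = cur
    return labels


def select_most_probable(
    posteriors: Sequence[Sequence[Sequence[float]]],
) -> List[str]:
    """Pick, for each frame, the transition with the highest score over all models.

    posteriors[m][t][k] is the score model m gives transition k at frame t.
    """
    n_frames = len(posteriors[0])
    result = []
    for t in range(n_frames):
        # Compare (score, class index) pairs across every model and class.
        _, best_k = max(
            (model[t][k], k)
            for model in posteriors
            for k in range(len(TRANSITIONS))
        )
        result.append(TRANSITIONS[best_k])
    return result


targets = transition_targets([0, 1, 1, 0])
print(targets)  # ['NN', 'NS', 'SS', 'SN']

model_a = [[0.7, 0.1, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1]]
model_b = [[0.4, 0.3, 0.2, 0.1], [0.05, 0.05, 0.1, 0.8]]
print(select_most_probable([model_a, model_b]))  # ['NN', 'SS']
```

In this sketch the selection is a plain maximum over model outputs; other notions of "most probable" are equally compatible with the claim wording.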