JP2019039946A

JP2019039946A - Model learning device, voice section detection device, method thereof and program

Info

Publication number: JP2019039946A
Application number: JP2017159288A
Authority: JP
Inventors: Kiyoaki Matsui; Manabu Okamoto; Yoshikazu Yamaguchi; Taichi Asami; Takaaki Fukutomi; Takashi Moriya
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-08-22
Filing date: 2017-08-22
Publication date: 2019-03-14
Anticipated expiration: 2037-08-22
Also published as: JP6794064B2

Abstract

To improve flexibility in speech segment detection. SOLUTION: An input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal is applied to a state transition model. The state transition model takes such an input speech likelihood sequence as input and yields an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal. The estimation result thus obtained is output. SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech segment detection technique.

One speech segment detection technique is known as VAD (voice activity detection) (see, for example, Non-Patent Document 1). VAD detects speech segments using sound intensity, the degree of oscillation of the waveform (number of zero crossings), acoustic features, and the like. However, VAD tends to classify consonants and short pauses between words as non-speech segments, and therefore detects fragmented speech segments. A technique called hangover is used to address this problem (see, for example, Non-Patent Document 1). In hangover, when the number of frames in a non-speech segment between two speech segments obtained by VAD is smaller than a threshold, the two speech segments are regarded as one continuous speech segment.
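The hangover rule described above can be sketched as follows. This is only an illustrative sketch, not the procedure of Non-Patent Document 1: the function name `apply_hangover`, the parameter `max_gap_frames`, and the representation of the VAD output as a list of 0/1 flags per frame are all assumptions made for the example.

```python
from typing import List


def apply_hangover(vad_flags: List[int], max_gap_frames: int) -> List[int]:
    """Merge speech runs separated by a non-speech gap shorter than the threshold.

    vad_flags: per-frame VAD decisions (1 = speech, 0 = non-speech); assumed format.
    max_gap_frames: hand-tuned threshold; gaps with fewer frames than this are filled.
    """
    smoothed = list(vad_flags)
    last_speech = None  # index of the most recent speech frame seen so far
    for i, flag in enumerate(vad_flags):
        if flag == 1:
            if last_speech is not None:
                gap = i - last_speech - 1  # non-speech frames between two speech frames
                if 0 < gap < max_gap_frames:
                    for j in range(last_speech + 1, i):
                        smoothed[j] = 1  # treat the short pause as speech
            last_speech = i
    return smoothed


# A 2-frame pause is bridged, a 4-frame pause is not (threshold of 3 frames).
flags = [0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
merged = apply_hangover(flags, max_gap_frames=3)
```

The single parameter `max_gap_frames` is the threshold that has to be tuned by hand for each situation.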

Non-Patent Document 1: ITU, “A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70,” ITU-T Recommendation G.729 Annex B, 1996.

However, the conventional hangover-based method has low flexibility. The length of the non-speech segment between two speech segments obtained by VAD varies with the situation, such as the speaking task or the domain. An optimal threshold must therefore be set manually for each anticipated situation. Even within the same situation, an appropriate threshold should ideally be given for each utterance. The conventional method does not offer such flexibility.

The present invention has been made in view of the above, and an object thereof is to improve flexibility in speech segment detection.

An input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal is applied to a state transition model, and an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal is obtained and output. The state transition model is a model that takes such an input speech likelihood sequence as input and yields that estimation result.

According to the present invention, flexibility in speech segment detection can be improved.

FIG. 1A is a block diagram illustrating the functional configuration of the model learning device of the embodiment. FIG. 1B is a block diagram illustrating the functional configuration of the speech segment detection device of the embodiment.
FIG. 2 is a state transition diagram of the state transition model of the embodiment.
FIGS. 3A and 3B are diagrams illustrating acoustic signals. FIG. 3C illustrates speech segments and non-speech segments detected by VAD. FIG. 3D illustrates speech segments and non-speech segments detected by the method of the embodiment.
FIG. 4A is a diagram illustrating an acoustic signal having a plurality of utterance segments. FIG. 4B illustrates speech segments and non-speech segments detected by the method of the embodiment.

Embodiments of the present invention will be described below.
[Overview]
First, an overview of the embodiments is given. In each embodiment, an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal is applied to a state transition model, and an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal is obtained and output. The “state transition model” is a model that takes, as input, an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and yields an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of that input acoustic signal. This state transition model models how actual speech segments and non-speech segments transition (that is, how they appear) for a given input speech likelihood sequence. Therefore, by applying an input speech likelihood sequence to the state transition model, even when the sequence covers time intervals such as consonants or short pauses, those intervals can be appropriately estimated to be part of a speech segment rather than a non-speech segment. This solves the problem of consonants, short pauses, and the like being judged as non-speech segments and fragmented speech segments being detected. In addition, because the state transition model can be applied to a wide variety of input speech likelihood sequences, it is more flexible than hangover. In particular, the state transition model of the embodiments estimates the transitions between speech segments and non-speech segments from the input speech likelihood sequence. The input speech likelihood sequence condenses an input acoustic signal, which fluctuates strongly with the environment, into a sequence corresponding to speech likelihoods, which fluctuate less. The state transition model of the embodiments can therefore adapt flexibly to diverse environments and enables highly accurate estimation. As described above, the method of each embodiment improves flexibility in speech segment detection compared with the conventional technique. Furthermore, the state transition model can be generated by machine learning using learning data, so the tuning cost is lower than that of hangover, in which the threshold is set manually.

The “input acoustic signal” is a time-series digital acoustic signal divided into a plurality of predetermined time intervals (for example, frames or subframes). The “speech likelihood” of a time interval represents the likelihood that the time interval is a speech segment. The “speech likelihood” may directly indicate the likelihood that the time interval is a speech segment, or may indirectly express that likelihood by indicating the likelihood that the time interval is a non-speech segment.
The “input speech likelihood sequence” is a time series of values corresponding to the “speech likelihood” of each time interval. It may be a time series of the speech likelihoods themselves or a time series of values representing the speech likelihoods. A value representing the speech likelihood may be continuous or binary. For example, it may be a function value of the speech likelihood, or a binary value obtained by thresholding the speech likelihood. From the viewpoint of estimation accuracy, however, the input speech likelihood sequence is preferably a time series of the speech likelihoods themselves or of continuous values representing them. The “state transition” in each time interval with respect to the speech state and the non-speech state is, for example, one of the following: a transition from the speech state to the speech state, from the non-speech state to the non-speech state, from the speech state to the non-speech state, or from the non-speech state to the speech state. A transition from the speech state to the speech state means that the speech state persists, that is, it denotes the “speech state”. Likewise, a transition from the non-speech state to the non-speech state means that the non-speech state persists, that is, it denotes the “non-speech state”. The estimation result of the state transition obtained by applying the input speech likelihood sequence to the state transition model may be the likelihood of each state transition, a function value of those likelihoods, or a value obtained by comparing them (for example, the state transition with the highest likelihood).
The state transition model is learned using learning data that includes a pair consisting of a speech likelihood sequence corresponding to the speech likelihood of each time interval of an acoustic signal and a sequence of correct values of the state transitions in each time interval with respect to the speech state and the non-speech state of that acoustic signal. In this learning data, the element of the speech likelihood sequence for each time interval (for example, the speech likelihood) and the correct value of the state transition in that time interval are associated with each other. For example, both are associated with an identifier representing the time interval. Normally, the learning data includes such pairs for a plurality of acoustic signals, that is, a plurality of pairs of a speech likelihood sequence and a sequence of correct state transition values. However, the learning data may include only the pair for a single acoustic signal. The “acoustic signal” is a time-series digital acoustic signal divided into a plurality of predetermined time intervals (for example, frames or subframes). The speech likelihood sequence of the learning data need not be identical to the expected input speech likelihood sequence, but it must be of the same type. For example, if the expected input speech likelihood sequence is a time series of the speech likelihood of each time interval, the speech likelihood sequence of the learning data must also be a time series of the speech likelihood of each time interval.
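The different forms that the input speech likelihood sequence and the estimation result may take can be illustrated with a short sketch. The helper names `to_binary` and `most_likely_transition`, the threshold of 0.5, and the numeric scores are hypothetical values chosen for the example and do not come from the embodiments.

```python
from typing import Dict, List


def to_binary(likelihoods: List[float], threshold: float = 0.5) -> List[int]:
    """Threshold continuous speech likelihoods into a binary series (assumed threshold)."""
    return [1 if p >= threshold else 0 for p in likelihoods]


def most_likely_transition(scores: Dict[str, float]) -> str:
    """Return the state transition whose estimated likelihood is highest."""
    return max(scores, key=lambda name: scores[name])


# Continuous form: one speech likelihood per time interval.
continuous_series = [0.1, 0.5, 0.8]
# Binary form: the same series after threshold determination.
binary_series = to_binary(continuous_series)

# One possible estimation result for a single time interval:
# a likelihood for each of the four state transitions (made-up numbers).
frame_scores = {
    "non-speech -> non-speech": 0.05,
    "speech -> speech": 0.15,
    "non-speech -> speech": 0.70,
    "speech -> non-speech": 0.10,
}
best = most_likely_transition(frame_scores)
```

Whichever form is chosen for the input, the learning data and the input at detection time would need to use the same form, as stated above.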

[First Embodiment]
A first embodiment will be described.
<Model learning device 11>
First, the model learning device 11 of this embodiment will be described. As illustrated in FIG. 1A, the model learning device 11 of this embodiment includes a storage unit 111 that stores learning data 111a and a learning unit 112 that learns a state transition model 123a.

The learning data 111a includes a pair consisting of a speech likelihood sequence corresponding to the speech likelihood of each time interval of an acoustic signal and a sequence of correct values of the state transitions in each time interval with respect to the speech state and the non-speech state of that acoustic signal. Using the learning data 111a read from the storage unit 111, the learning unit 112 obtains and outputs the state transition model 123a, which takes as input an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and yields an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal. The state transition in this embodiment is one of the following: a transition from the speech state to the speech state, from the non-speech state to the non-speech state, from the speech state to the non-speech state, or from the non-speech state to the speech state.

The speech likelihood sequence included in the learning data 111a is obtained by extracting acoustic information and analysis information from an acoustic signal for learning data and applying a known VAD to them (see, for example, Non-Patent Document 1). Examples of the acoustic information and analysis information used for VAD include changes in sound power, the number of zero crossings of the waveform per fixed period of time, changes in the characteristics of acoustic features, and combinations thereof. The acoustic signal for learning data is a time-series digital acoustic signal divided into predetermined time intervals. It may be obtained by A/D-converting, at a predetermined sampling frequency, an analog acoustic signal observed with a microphone or the like, or it may be any digital acoustic signal prepared in advance. Examples of the speech likelihood sequence are as described above; it is a time series of the values obtained for each time interval of the acoustic signal for learning data. When the speech likelihood sequence is a sequence of the speech likelihoods of the time intervals or of continuous values representing them, it is, for example, 0.1, 0.5, ..., 0.8. When it is a sequence of binary values representing the speech likelihoods of the time intervals, it is, for example, 0, 0, 0, 1, ..., 1, 0.
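Two of the quantities named above, the sound power and the number of zero crossings per frame, can be computed as in the following sketch. It is only an illustration of those two features, not the VAD of Non-Patent Document 1; the function name `frame_features`, the use of non-overlapping frames, and the sample values are assumptions. Mapping these features to a speech likelihood is left out.

```python
from typing import List, Tuple


def frame_features(samples: List[float], frame_len: int) -> List[Tuple[float, int]]:
    """Return (mean power, zero-crossing count) for each non-overlapping frame.

    Trailing samples that do not fill a whole frame are ignored (assumption).
    """
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # A zero crossing is counted when consecutive samples differ in sign.
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        feats.append((power, zero_crossings))
    return feats


# First frame: an oscillating signal; second frame: silence.
signal = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
features = frame_features(signal, frame_len=4)
```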

The sequence of correct state transition values included in the learning data 111a is obtained by assigning, to each time interval of the speech likelihood sequence obtained from the above acoustic signal for learning data (hereinafter, the “speech likelihood sequence for learning data”), the correct value of the state transition with respect to the speech state and the non-speech state. These correct values are described in detail below. Suppose, for example, that the sentence “Today, the weather is nice” is read aloud. Whether the subject (“Today”) and the predicate (“the weather is nice”) are read without a break as in FIG. 3A, or with a pause between them as in FIG. 3B, the aim is to estimate the whole of “Today, the weather is nice” as a single true speech segment from the input speech likelihood sequence corresponding to the input acoustic signal. When the subject and the predicate are read with a pause between them as in FIG. 3B, the corresponding input speech likelihood sequence indicates a short non-speech segment between “Today” and “the weather is nice”.
For example, if the input speech likelihood sequence is a binary sequence representing speech segments and non-speech segments, it contains non-speech time intervals between “Today” and “the weather is nice” (FIG. 3C). This embodiment appropriately estimates the true speech segment by absorbing such fluctuations in speech likelihood caused by short non-speech segments sandwiched between speech segments. Conversely, even when sudden noise or the like occurs within a non-speech segment, the true non-speech segment is appropriately estimated by absorbing the resulting fluctuation in speech likelihood. Here, a “true speech segment” means the time interval from when speech is first observed in one utterance segment until the last speech is no longer observed. A “true non-speech segment” means any time interval other than a true speech segment. An “utterance segment” means a time interval in which a coherent utterance such as “Today, the weather is nice” is made. “Utterance start” means the start of an utterance segment, and “utterance end” means the end of an utterance segment.

To perform this estimation, the state transition model 123a, which takes the input speech likelihood sequence as input and yields an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal, is learned. In this embodiment, to emphasize the transition from the non-speech state to the speech state (start transition state) and the transition from the speech state to the non-speech state (end transition state), dedicated labels are assigned not only to the speech state and the non-speech state but also to the start transition state and the end transition state. For example, the following labels are assigned to the state transitions.
Label “0”: transition from the non-speech state to the non-speech state (non-speech state)
Label “1”: transition from the speech state to the speech state (speech state)
Label “2”: transition from the non-speech state to the speech state (start transition state)
Label “3”: transition from the speech state to the non-speech state (end transition state)
Here, the speech state is the state corresponding to a true speech segment, and the non-speech state is the state corresponding to a true non-speech segment.
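The four labels above amount to a lookup from a pair (previous state, current state) to a label. A minimal sketch of that lookup follows; the constant names and the use of booleans for "speech" and "non-speech" are assumptions made for illustration.

```python
# Label values as listed above; the constant names are illustrative.
NON_SPEECH = 0   # non-speech -> non-speech
SPEECH = 1       # speech -> speech
START = 2        # non-speech -> speech (start transition state)
END = 3          # speech -> non-speech (end transition state)

# Key: (previous interval is speech, current interval is speech).
TRANSITION_LABEL = {
    (False, False): NON_SPEECH,
    (True, True): SPEECH,
    (False, True): START,
    (True, False): END,
}


def transition_label(prev_is_speech: bool, cur_is_speech: bool) -> int:
    """Map a pair of consecutive true states to its state transition label."""
    return TRANSITION_LABEL[(prev_is_speech, cur_is_speech)]


start_label = transition_label(False, True)
end_label = transition_label(True, False)
```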

FIG. 2 shows the state transition diagram of the state transition model 123a when these labels are given. Taking the input speech likelihood sequence as input, the state transition model 123a loops through the non-speech state and the speech state and outputs the likelihood of the label corresponding to each of the non-speech state, the speech state, the start transition state, and the end transition state. In this case, the learning data can be subdivided into patterns according to whether the time intervals corresponding to the speech likelihood sequence for learning data
(1) include a true speech segment,
(2) include a true speech segment and the true non-speech segment immediately before it, and
(3) include a true speech segment and the true non-speech segment immediately after it.

Pattern 1: The time intervals corresponding to the speech likelihood sequence for learning data include a true speech segment and the true non-speech segments immediately before and after it
This is the most common pattern. In this case, the following labels are given to the time intervals corresponding to the speech likelihood sequence for learning data.
- Time intervals from the utterance start to the end of the non-speech state: non-speech state “0”
- Time interval of the transition from the end of the non-speech state to the speech state: start transition state “2”
- Time intervals from the start to the end of the speech state: speech state “1”
- Time interval of the transition from the end of the speech state to the non-speech state: end transition state “3”
- Time intervals from the start of the non-speech state (utterance end) to its end (next utterance start): non-speech state “0”
That is, the sequence of correct state transition values for the time intervals corresponding to the speech likelihood sequence is 0, ..., 0, 2, 1, ..., 1, 3, 0, ..., 0. For example, whether the subject and the predicate are read without a break as in FIG. 3A or with a pause between them as in FIG. 3B, the sequence of correct state transition values (label sequence) is in both cases 0, ..., 0, 2, 1, ..., 1, 3, 0, ..., 0, as illustrated in FIG. 3D. Normally, neither the start transition state nor the end transition state lasts for two or more consecutive time intervals. However, either may last for two or more consecutive time intervals.

Pattern 2: The time intervals corresponding to the speech likelihood sequence for learning data include a true speech segment and the true non-speech segment immediately after it, but not the non-speech segment immediately before the speech segment
In this case, the first time interval is regarded as the time interval of the start transition state. That is, the following labels are given to the time intervals corresponding to the speech likelihood sequence for learning data.
- First time interval: start transition state “2”
- Time intervals from the next time interval to the end of the speech state: speech state “1”
- Time interval of the transition from the end of the speech state to the non-speech state: end transition state “3”
- Time intervals from the start of the non-speech state (utterance end) to its end (next utterance start): non-speech state “0”
That is, the sequence of correct state transition values for the time intervals corresponding to the speech likelihood sequence is 2, 1, ..., 1, 3, 0, ..., 0.

Pattern 3: The time intervals corresponding to the speech likelihood sequence for learning data include a true speech segment and the true non-speech segment immediately before it, but not the non-speech segment immediately after the speech segment
In this case, the last time interval is regarded as the time interval of the end transition state. That is, the following labels are given to the time intervals corresponding to the speech likelihood sequence for learning data.
- Time intervals from the utterance start to the end of the non-speech state: non-speech state “0”
- Time interval of the transition from the end of the non-speech state to the speech state: start transition state “2”
- Time intervals from the start of the speech state to immediately before the last time interval: speech state “1”
- Last time interval: end transition state “3”
That is, the sequence of correct state transition values for the time intervals corresponding to the speech likelihood sequence is 0, ..., 0, 2, 1, ..., 1, 3.

Pattern 4: The time intervals corresponding to the speech likelihood sequence for learning data include a true speech segment but no non-speech segment before or after it
This is the case in which all the time intervals corresponding to the speech likelihood sequence for learning data belong to a true speech segment. In this case, the following label is given to the time intervals corresponding to the speech likelihood sequence.
- Time intervals from the first to the last: speech state “1”
That is, the sequence of correct state transition values for the time intervals corresponding to the speech likelihood sequence is 1, ..., 1.

Pattern 5: The time intervals corresponding to the speech likelihood sequence include no true speech segment
In this case, the following label is given to the time intervals corresponding to the speech likelihood sequence.
- Time intervals from the first to the last: non-speech state “0”
That is, the sequence of correct state transition values for the time intervals corresponding to the speech likelihood sequence is 0, ..., 0.
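The five patterns can be summarized in one labeling routine, sketched below for a sequence that contains at most one true speech segment. The function name `make_labels` and its arguments are hypothetical. The sketch also assumes that the start transition label is placed on the first time interval of the true speech segment and the end transition label on its last time interval; this is consistent with Patterns 2 and 3, but it is an assumption for Pattern 1, where the text does not fix the exact position.

```python
from typing import List, Optional


def make_labels(n_frames: int,
                speech_start: Optional[int],
                speech_end: Optional[int]) -> List[int]:
    """Build the correct-value label sequence for one sequence of n_frames intervals.

    speech_start / speech_end: inclusive indices of the true speech segment,
    or None when there is no true speech segment (Pattern 5).
    """
    if speech_start is None or speech_end is None:
        return [0] * n_frames                       # Pattern 5: 0, ..., 0
    if not 0 <= speech_start < speech_end < n_frames:
        raise ValueError("speech segment must lie inside the sequence")
    has_before = speech_start > 0                   # non-speech before the segment
    has_after = speech_end < n_frames - 1           # non-speech after the segment
    if not has_before and not has_after:
        return [1] * n_frames                       # Pattern 4: 1, ..., 1
    labels = [0] * n_frames
    for i in range(speech_start, speech_end + 1):
        labels[i] = 1
    labels[speech_start] = 2   # start transition (first interval in Pattern 2)
    labels[speech_end] = 3     # end transition (last interval in Pattern 3)
    return labels


pattern1 = make_labels(8, 2, 5)
pattern2 = make_labels(6, 0, 3)
pattern3 = make_labels(6, 2, 5)
pattern4 = make_labels(4, 0, 3)
pattern5 = make_labels(3, None, None)
```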

The learning data 111a may include sequences of correct state transition values for all of Patterns 1 to 5 above, or for only some of them. The time intervals corresponding to a speech likelihood sequence may also include a plurality of true speech segments. In that case, the learning data 111a consists of a combination of the sequences of correct state transition values of Patterns 1 to 5. For example, as illustrated in FIGS. 4A and 4B, when true non-speech segments exist before and after the first true speech segment I and no non-speech segment exists after the second true speech segment II, the learning data 111a includes a pair consisting of the speech likelihood sequence of the first true speech segment I and the corresponding Pattern 1 sequence of correct state transition values 0, ..., 0, 2, 1, ..., 1, 3, 0, ..., 0, and a pair consisting of the speech likelihood sequence of the second true speech segment II and the corresponding Pattern 3 sequence of correct state transition values 0, ..., 0, 2, 1, ..., 1, 3.

The labels described above must be assigned based not only on whether each time interval of the acoustic signal for learning data is a speech segment or a non-speech segment, but also on the speech segments and non-speech segments of the utterance segments of that acoustic signal (that is, the speech segments and non-speech segments tied to each utterance segment). Various methods of assigning such labels are conceivable. In the first method, a person listens to each utterance contained in the acoustic signal for learning data, inspects the waveform, and assigns accurate labels. In the second method, the labels are assigned automatically by using a known recognition decoder. The second method may assign incorrect labels to some of the learning data. Even so, this poses no major problem as long as labels are assigned correctly with high frequency over the learning data as a whole. The third method is a compromise in which labels are first assigned by the above decoder and then checked manually for errors.
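When labels are assigned automatically and then checked, a simple consistency test can flag label sequences that cannot arise from Patterns 1 to 5. The following sketch is a hypothetical helper, not part of the embodiments: the table `ALLOWED_NEXT` is an assumption derived from the label sequences listed for the patterns (including the remark that a transition label may repeat), and it catches only structural errors, not misplaced boundaries.

```python
from typing import List

# Assumed successor table derived from the pattern sequences:
# 0 may be followed by 0 or 2, 2 by 1 or 2, 1 by 1 or 3, 3 by 0 or 3.
ALLOWED_NEXT = {
    0: {0, 2},
    2: {1, 2},
    1: {1, 3},
    3: {0, 3},
}


def find_label_errors(labels: List[int]) -> List[int]:
    """Return every index i where labels[i] is not an allowed successor of labels[i - 1]."""
    return [
        i for i in range(1, len(labels))
        if labels[i] not in ALLOWED_NEXT[labels[i - 1]]
    ]


ok = find_label_errors([0, 0, 2, 1, 1, 3, 0])
missing_start = find_label_errors([0, 1, 1, 3, 0])   # speech begins without label 2
missing_end = find_label_errors([0, 2, 1, 0])        # speech ends without label 3
```

Indices reported by such a check could be passed to a human annotator, in the spirit of the third method.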

The learning unit 112 uses the learning data 111a generated as described above to learn the state transition model 123a represented by the state transition diagram of FIG. 2. The state transition model 123a can be expressed and learned using a known technique (for example, Reference 1: Hochreiter, S., & Schmidhuber, J., “Long Short-Term Memory,” Neural Computation, 9(8), 1735-1780, 1997). The technique of Reference 1 is very useful because it can retain information over very long sequences, which makes it possible to build a state transition model 123a that takes into account states and speech likelihoods at temporally distant positions. Other models that can handle time series (such as an RNN or an HMM) may be used instead. Note, however, that an HMM, because of its structure, cannot take future states into account.
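To make the shape of such a model concrete, the following is a rough, untrained sketch of a single-layer LSTM that reads one speech likelihood per time interval and emits a distribution over the four labels 0 to 3. Everything here is an assumption for illustration: the class name `TinyLSTMTagger`, the hidden size, the random weights, and the use of a scalar likelihood as the only input feature. Training (for example, minimizing cross-entropy against the correct-value sequences) is omitted.

```python
import math
import random
from typing import List, Sequence

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class TinyLSTMTagger:
    """Untrained LSTM mapping a likelihood sequence to per-interval 4-state scores."""

    def __init__(self, hidden: int = 8, n_states: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.hidden = hidden

        def rows(n_rows: int, n_cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)]
                    for _ in range(n_rows)]

        # Each gate row holds: 1 input weight, `hidden` recurrent weights, 1 bias.
        self.gates = {g: rows(hidden, 1 + hidden + 1) for g in "ifog"}
        # Output layer row holds: `hidden` weights, 1 bias.
        self.out = rows(n_states, hidden + 1)

    @staticmethod
    def _affine(w: List[List[float]], x: List[float]) -> List[float]:
        # The last column of each row is the bias term.
        return [sum(wi * xi for wi, xi in zip(row[:-1], x)) + row[-1] for row in w]

    def forward(self, likelihoods: Sequence[float]) -> List[List[float]]:
        h = [0.0] * self.hidden
        c = [0.0] * self.hidden
        outputs = []
        for p in likelihoods:
            x = [p] + h  # current input concatenated with previous hidden state
            i = [sigmoid(v) for v in self._affine(self.gates["i"], x)]
            f = [sigmoid(v) for v in self._affine(self.gates["f"], x)]
            o = [sigmoid(v) for v in self._affine(self.gates["o"], x)]
            g = [math.tanh(v) for v in self._affine(self.gates["g"], x)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append(softmax(self._affine(self.out, h)))
        return outputs
```

Calling `TinyLSTMTagger().forward([0.1, 0.9, 0.8, 0.2])` returns four lists of four values, one list per time interval, each summing to 1. A unidirectional network like this one only sees past intervals; a model that also uses future context would need, for example, a second pass in the reverse direction.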

<Speech segment detection device 12>
Next, the speech segment detection device 12 of this embodiment is described. As illustrated in FIG. 1B, the speech segment detection device 12 of this embodiment has an input unit 121, a speech segment detection unit 122, a storage unit 123, an estimation unit 124, and an output unit 126.

<Storage unit 123>
The storage unit 123 stores the state transition model 123a output from the model learning device 11 as described above. The state transition model 123a may be stored in the storage unit 123 before the speech segment detection device 12 starts speech segment detection, or it may be stored each time the model learning device 11 outputs a new state transition model 123a.

<Input unit 121>
The input acoustic signal subject to speech segment detection is input to the input unit 121. This signal is a time-series digital acoustic signal divided into predetermined time intervals. It may be obtained by AD conversion, at a predetermined sampling frequency, of an analog acoustic signal observed with a microphone or the like, or it may be any digital acoustic signal prepared in advance. The length of the time intervals of the input acoustic signal is preferably the same as, or close to, that of the acoustic signal for learning data described earlier. The input acoustic signal is sent to the speech segment detection unit 122.

<Speech segment detection unit 122>
The speech segment detection unit 122 extracts acoustic information and analysis information from the input acoustic signal and applies a known VAD to them (see, for example, Non-Patent Document 1), thereby obtaining and outputting an input speech likelihood sequence corresponding to the speech likelihood of each time interval of the input acoustic signal. Examples of the acoustic information and analysis information used for VAD are as described earlier. Examples of the input speech likelihood sequence are also as described earlier; the input speech likelihood sequence is of the same type as the speech likelihood sequence for learning data. For example, if the speech likelihood sequence for learning data is a sequence of per-interval speech likelihoods, the input speech likelihood sequence is also a sequence of per-interval speech likelihoods. The input speech likelihood sequence is sent to the estimation unit 124.

<Estimation unit 124>
The estimation unit 124 applies the received input speech likelihood sequence to the state transition model 123a read from the storage unit 123, and obtains and outputs an estimation result of the state transitions in each time interval for the speech and non-speech states of the input acoustic signal (post-filter processing). One example of the estimation result is a sequence of four likelihoods per time interval: the likelihood of the non-speech state (label “0”), of the speech state (label “1”), of the start transition state (label “2”), and of the end transition state (label “3”). Alternatively, the estimation result may be only the sequence of non-speech state (label “0”) likelihoods, only the sequence of speech state (label “1”) likelihoods, or the sequence of both. As a further alternative, the estimation result may be a sequence of values, each indicating the state with the highest likelihood in the corresponding time interval.

<Output unit 126>
The state transition estimation result output from the estimation unit 124 is sent to the output unit 126, which outputs it as the speech segment estimation result.

[Second Embodiment]
The second embodiment is a modification of the first embodiment. In this embodiment, the speech segment estimation result is obtained by post-processing the state transition estimation result output from the estimation unit 124. The following description focuses on the differences from the first embodiment; items already described are referred to by the same reference numerals and their description is simplified.

<Model learning device 11>
Same as in the first embodiment.

<Speech segment detection device 22>
As illustrated in FIG. 1B, the speech segment detection device 22 of this embodiment has an input unit 121, a speech segment detection unit 122, a storage unit 123, an estimation unit 124, a post-processing unit 225, and an output unit 226. The post-processing unit 225 and the output unit 226, which differ from the first embodiment, are described in detail below.

<Post-processing unit 225>
Unlike in the first embodiment, the state transition estimation result output from the estimation unit 124 is sent to the post-processing unit 225. The post-processing unit 225 applies predetermined post-processing to the received result and obtains and outputs a speech segment estimation result. For example, when the estimation result consists of the per-interval likelihood sequences of the non-speech state (label “0”), the speech state (label “1”), the start transition state (label “2”), and the end transition state (label “3”), the post-processing unit 225 may use these four likelihood sequences to select a representative state (non-speech, speech, start transition, or end transition) for each time interval and output the resulting sequence of representative states as the speech segment estimation result. For example, the post-processing unit 225 may select and output, for each time interval, the state with the highest likelihood among the four; that is, it may convert the four-state likelihood sequence into a maximum likelihood state sequence. Let n = 0, ..., N−1 be the identifier of a time interval, where N is a positive integer, and let P_{n,0}, P_{n,1}, P_{n,2}, and P_{n,3} denote the likelihoods of the non-speech state, the speech state, the start transition state, and the end transition state in time interval n, respectively. The post-processing unit 225 may then convert, for each time interval n, the four-state likelihood sequence (P_{n,0}, P_{n,1}, P_{n,2}, P_{n,3}) into the maximum likelihood state sequence s_n, where s_n is the label i whose likelihood P_{n,i} is largest, and output it.

Alternatively, the post-processing unit 225 may emphasize or attenuate the likelihoods of particular states in the four-state likelihood sequence (P_{n,0}, P_{n,1}, P_{n,2}, P_{n,3}) and convert the resulting likelihood sequence into the maximum likelihood state sequence s_n. For example, the post-processing unit 225 may convert the likelihood sequence (P_{n,0}, P_{n,1}, P_{n,2}, P_{n,3}) into the following maximum likelihood state sequence s_n and output it:

s_n = argmax_{i ∈ {0, 1, 2, 3}} α_i P_{n,i}

Here, α_i is a weight given to the likelihood P_{n,i}. For example, α_i is a positive value greater than 0. Alternatively, the weight given to a particular likelihood P_{n,i} may be set to 0. For example, setting α_2 = α_3 = 0 prevents the start transition state and the end transition state from being selected in the speech segment estimation result.
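The weighted selection with the weights α_i described above can be sketched as follows. The sketch assumes that the weight multiplies the likelihood and that the state with the largest weighted value is chosen; the function name `max_likelihood_states` and the default weights of 1 (which reduce to the unweighted case) are hypothetical.

```python
from typing import List, Sequence

def max_likelihood_states(
    likelihoods: Sequence[Sequence[float]],
    weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0),
) -> List[int]:
    """For each time interval n, pick the label i maximizing weights[i] * P[n][i]."""
    states = []
    for frame in likelihoods:
        scored = [w * p for w, p in zip(weights, frame)]
        # On ties, max() keeps the lowest label index.
        states.append(max(range(len(scored)), key=scored.__getitem__))
    return states
```

For the likelihoods `[[0.7, 0.1, 0.1, 0.1], [0.2, 0.1, 0.6, 0.1], [0.1, 0.8, 0.05, 0.05], [0.1, 0.3, 0.1, 0.5]]`, the default weights give `[0, 2, 1, 3]`, while `weights=(1, 1, 0, 0)` gives `[0, 0, 1, 1]`, so the transition labels 2 and 3 no longer appear in the output.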

In addition, the post-processing unit 225 may apply to the received state transition estimation result a known correction technique commonly used in VAD, and output the outcome as the speech segment estimation result. For example, when only the sequence of speech state (label “1”) likelihoods is sent as the estimation result, the post-processing unit 225 may compare the speech state likelihood of each time interval with a predetermined threshold and output a speech segment estimation result corresponding to the sequence of comparison results. For example, it may output a result in which time intervals whose speech state likelihood is greater than or equal to the threshold are speech segments and all other time intervals are non-speech segments. Likewise, when only the sequence of non-speech state (label “0”) likelihoods is sent, the post-processing unit 225 may compare the non-speech state likelihood of each time interval with a predetermined threshold and output a speech segment estimation result corresponding to the sequence of comparison results. The post-processing unit 225 may also first emphasize or attenuate the likelihood of a particular state in the received estimation result and then apply a known VAD correction technique to the resulting likelihood sequence, outputting the outcome as the speech segment estimation result. The post-processing unit 225 may use other known techniques for improving the accuracy of speech segment detection.
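As an illustration of the threshold comparison, the sketch below turns a sequence of speech state likelihoods into speech segments, and then optionally applies the hangover correction mentioned in the background section (merging two speech segments when the non-speech gap between them is shorter than a threshold). The function names, the half-open `(start, end)` representation, and the default threshold of 0.5 are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # (start, end) interval indices, end exclusive

def threshold_segments(speech_likelihood: Sequence[float],
                       threshold: float = 0.5) -> List[Segment]:
    """Treat intervals with likelihood >= threshold as speech and return the runs."""
    segments: List[Segment] = []
    start = None
    for n, p in enumerate(speech_likelihood):
        if p >= threshold and start is None:
            start = n                      # a speech run begins
        elif p < threshold and start is not None:
            segments.append((start, n))    # the run ended at interval n - 1
            start = None
    if start is not None:
        segments.append((start, len(speech_likelihood)))
    return segments

def hangover(segments: Sequence[Segment], max_gap: int) -> List[Segment]:
    """Merge neighbouring segments separated by fewer than max_gap intervals."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

For `[0.1, 0.8, 0.9, 0.3, 0.7, 0.6, 0.1, 0.1, 0.9]`, `threshold_segments` returns `[(1, 3), (4, 6), (8, 9)]`, and `hangover(..., max_gap=2)` merges the first two into `[(1, 6), (8, 9)]`.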

<Output unit 226>
The speech segment estimation result output from the post-processing unit 225 is sent to the output unit 226, which outputs it.

[Third Embodiment]
The third embodiment is a modification of the first and second embodiments. In this embodiment, the model learning device generates a plurality of state transition models. The speech segment detection device applies the input speech likelihood sequence to these models, obtains for each model an estimation result of the state transitions in each time interval for the speech and non-speech states of the input acoustic signal, and selects, from the obtained estimation results, the most probable one in each time interval. The following description focuses on the differences from the first and second embodiments; items already described are referred to by the same reference numerals and their description is simplified.

<Model learning device 31>
As illustrated in FIG. 1A, the model learning device 31 of this embodiment has a storage unit 311 that stores learning data 311a and a learning unit 312 that learns state transition models 323a.

The learning data 311a of this embodiment covers a plurality of types of acoustic signals for learning data and includes a plurality of pairs, each consisting of a speech likelihood sequence corresponding to the speech likelihood of each time interval of such a signal and a sequence of correct values of the state transitions in each time interval for the speech and non-speech states of that signal. These pairs are divided into classes according to the type of acoustic signal for learning data. For example, the learning data 311a includes a plurality of pairs of a speech likelihood sequence and a sequence of correct state-transition values corresponding to acoustic signals for learning data obtained in a plurality of types of environments, and these pairs are divided into classes by environment. Let the classes be c = 0, ..., C−1 (where C is an integer of 2 or more), and let D(c) denote the set of pairs of a speech likelihood sequence and a sequence of correct state-transition values that belong to class c in the learning data 311a. That is, the learning data 311a includes the sets D(0), ..., D(C−1).

The learning unit 312 uses the learning data 311a read from the storage unit 311 to obtain and output a plurality of state transition models 323a, each of which takes as input an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and yields an estimation result of the state transitions in each time interval for the speech and non-speech states of the input acoustic signal. Specifically, the learning unit 312 uses each set D(c) in the learning data 311a to generate, by the same method as in the first embodiment, a state transition model M(c) corresponding to class c, and outputs the models M(0), ..., M(C−1) as the state transition models 323a. Everything else is the same as in the first embodiment.
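The per-class learning can be pictured with the following sketch, which groups class-tagged pairs into the sets D(c) and trains one model M(c) per set. The training routine is passed in as a callable because the learning method itself is that of the first embodiment and is not reproduced here; `group_by_class`, `train_per_class`, and the tuple layout of the samples are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

# One training pair: (speech likelihood sequence, correct state-transition labels)
Pair = Tuple[Sequence[float], Sequence[int]]
ModelT = TypeVar("ModelT")

def group_by_class(samples: Sequence[Tuple[int, Pair]]) -> Dict[int, List[Pair]]:
    """Build the sets D(c) from (class c, pair) samples."""
    d: Dict[int, List[Pair]] = defaultdict(list)
    for c, pair in samples:
        d[c].append(pair)
    return dict(d)

def train_per_class(
    d: Dict[int, List[Pair]],
    train: Callable[[List[Pair]], ModelT],
) -> Dict[int, ModelT]:
    """Train one model M(c) per class using only the pairs in D(c)."""
    return {c: train(pairs) for c, pairs in sorted(d.items())}
```

Any function that maps a list of pairs to a model object can be supplied as `train`; the result is a dictionary from class index c to its model M(c).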

<Speech segment detection device 32>
As illustrated in FIG. 1B, the speech segment detection device 32 of this embodiment has an input unit 121, a speech segment detection unit 122, a storage unit 323, an estimation unit 324, a post-processing unit 325, and an output unit 226. The storage unit 323, the estimation unit 324, and the post-processing unit 325, which differ from the first and second embodiments, are described in detail below.

<Storage unit 323>
The storage unit 323 stores the plurality of state transition models 323a output from the learning unit 312. The state transition models 323a may be stored in the storage unit 323 before the speech segment detection device 32 starts speech segment detection, or they may be stored each time the model learning device 31 outputs new state transition models 323a.

<Estimation unit 324>
The input speech likelihood sequence output from the speech segment detection unit 122 is input to the estimation unit 324. The estimation unit 324 applies this sequence to the plurality of state transition models 323a read from the storage unit 323 and, for each model, obtains and outputs an estimation result of the state transitions in each time interval for the speech and non-speech states of the input acoustic signal. That is, the estimation unit 324 applies the input speech likelihood sequence to each state transition model M(c) (where c = 0, ..., C−1) and obtains and outputs, for each class c, the estimation result R(c) of the state transitions in each time interval. The estimation result R(c) for each class c is sent to the post-processing unit 325. Everything else is the same as in the first and second embodiments.

<Post-processing unit 325>
From among the results corresponding to the estimation results R(c) sent from the estimation unit 324 for each class c, the post-processing unit 325 selects the most probable result in each time interval. Known techniques for improving the accuracy of speech segment detection can be used for this selection. For example, when the speech segment detection unit 122 performs VAD on input acoustic signals obtained in a plurality of very different environments, the post-processing unit 325 may select the most probable estimation result using the technique described in Reference 2 (R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, 3:79-87, 1991). That is, from among the results corresponding to the received estimation results R(0), ..., R(C−1), the post-processing unit 325 selects as the most probable the result corresponding to the estimation result R(c') for the environment in which the input acoustic signal was obtained. The result corresponding to an estimation result R(c) may be R(c) itself or may be obtained by applying the processing of the post-processing unit 225 of the second embodiment to R(c). The post-processing unit 325 may output the selected most probable result as the speech segment estimation result, or it may output a speech segment estimation result obtained by applying the processing of the post-processing unit 225 of the second embodiment to the selected result. The speech segment estimation result is sent to the output unit 226. The subsequent processing is the same as in the second embodiment.
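The flow through the estimation unit 324 and the post-processing unit 325 can be sketched as below. The text does not say how the class c' matching the input environment is determined, so this sketch simply assumes that a per-class environment score is available (for instance from a gating function of the kind used in mixture-of-experts models) and picks the class with the highest score. `estimate_all`, `select_result`, and the `gate` dictionary are hypothetical, and the sketch selects one whole sequence R(c') and does not choose separately per time interval.

```python
from typing import Callable, Dict, List, Sequence

# A state transition model M(c), viewed as a function from a likelihood
# sequence to a per-interval state sequence.
Model = Callable[[Sequence[float]], List[int]]

def estimate_all(models: Dict[int, Model],
                 likelihoods: Sequence[float]) -> Dict[int, List[int]]:
    """Apply every class model M(c) to the same input sequence, giving R(c)."""
    return {c: model(likelihoods) for c, model in models.items()}

def select_result(results: Dict[int, List[int]],
                  gate: Dict[int, float]) -> List[int]:
    """Return R(c') for the class c' with the highest assumed environment score."""
    best = max(results, key=lambda c: gate[c])
    return results[best]
```

With two stand-in models that threshold at 0.5 and 0.8, the input `[0.6, 0.9]` yields `R(0) = [1, 1]` and `R(1) = [0, 1]`; a gate of `{0: 0.2, 1: 0.8}` then selects `[0, 1]`.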

[Features of the Embodiments]
As described above, each embodiment improves VAD accuracy across a variety of environments without manual tuning, achieving both lower cost and higher accuracy. Even in a single environment, it absorbs variations between utterances and detects speech segments more accurately than existing methods. Furthermore, because the estimation unit using the state transition model is a post-filter cascaded after the VAD, it can be combined with other techniques for improving VAD accuracy. For example, further performance gains can be expected by applying the conventional hangover technique first and then performing this post-filter processing.

VAD methods using a state transition model that takes acoustic features as input and outputs speech/non-speech states have existed previously. However, acoustic features vary greatly with the environment, so such models become very large and require large amounts of learning data. In contrast, the state transition model of each embodiment obtains the state transition estimation result using the speech likelihood sequence produced by VAD processing as its input feature. A speech likelihood sequence varies less with the environment than acoustic features do. The state transition model of each embodiment can therefore be built from a small amount of learning data and operates robustly as a post-filter for various VAD techniques. In addition, each embodiment assigns special labels to the start (start transition state) and the end (end transition state) of a speech segment. This lets the model learn information about the “segment” as a unit more strongly, making it less susceptible to isolated speech-state frames caused by sudden noise and to brief non-speech states caused by consonants, breaths, and the like.

[Other Modifications]
The present invention is not limited to the embodiments described above. For example, in the first, second, and third embodiments, an input acoustic signal is input to the speech segment detection devices 12, 22, and 32, and the speech segment detection unit 122 obtains the input speech likelihood sequence corresponding to the speech likelihood of each time interval of that signal. However, the input speech likelihood sequence may instead be generated from the input acoustic signal outside the speech segment detection devices 12, 22, and 32 and then input to them. In that case, the speech segment detection unit 122 may be omitted from the speech segment detection devices 12, 22, and 32.

The various processes described above need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed. It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.

Each of the above devices is configured, for example, by a general-purpose or dedicated computer that has a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory), and that executes a predetermined program. The computer may have a single processor and memory or a plurality of processors and memories. The program may be installed on the computer or recorded in advance in ROM or the like. Some or all of the processing units may also be configured using electronic circuitry that implements the processing functions without a program, in place of circuitry such as a CPU that implements a functional configuration by loading a program. The electronic circuitry constituting a single device may include a plurality of CPUs.

When the above configuration is implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the above processing functions on the computer. The program describing this processing content can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing processing, the computer reads the program from its own storage device and performs processing according to it. As other modes of execution, the computer may read the program directly from the portable recording medium and perform processing according to it, or it may perform processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be performed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and retrieval of results.

Instead of realizing the processing functions of the device by running a predetermined program on a computer, at least some of these processing functions may be realized in hardware.

The present invention can be used, for example, for speech segment detection in the stage preceding speech recognition processing or spoken dialogue processing. Applied in this way, it improves the accuracy of speech segment detection and allows the subsequent speech recognition or spoken dialogue to be performed more accurately.

11, 31 Model learning device
12, 22, 32 Speech section detection device

Claims

A model learning device that, using learning data including pairs of a speech likelihood sequence corresponding to the speech likelihood of each time interval of an acoustic signal and a sequence of correct values of the state transitions in each time interval with respect to the speech state and the non-speech state of the acoustic signal, learns a state transition model that takes as input an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and obtains an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal.

The model learning device according to claim 1,
wherein the state transition includes any one of a state transition from a speech state to a speech state, a state transition from a non-speech state to a non-speech state, a state transition from a speech state to a non-speech state, and a state transition from a non-speech state to a speech state.

A speech section detection device comprising:
a storage unit that stores a state transition model that takes as input an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and obtains an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal; and
an estimation unit that applies the input speech likelihood sequence corresponding to the speech likelihood of each time interval of the input acoustic signal to the state transition model, and obtains and outputs an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal.

A speech section detection device comprising:
a storage unit that stores a plurality of state transition models, each of which takes as input an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and obtains an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal;
an estimation unit that applies the input speech likelihood sequence corresponding to the speech likelihood of each time interval of the input acoustic signal to the plurality of state transition models, and obtains and outputs, for each of the plurality of state transition models, an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal; and
a post-processing unit that selects, from among the results corresponding to the estimation results of the state transitions output from the estimation unit, the most probable result in each time interval.

A model learning method that, using learning data including pairs of a speech likelihood sequence corresponding to the speech likelihood of each time interval of an acoustic signal and a sequence of correct values of the state transitions in each time interval with respect to the speech state and the non-speech state of the acoustic signal, learns a state transition model that takes as input an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal and obtains an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal.

A speech section detection method that applies an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal to a state transition model, the state transition model taking the input speech likelihood sequence as input and obtaining an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal, and that obtains and outputs an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal.

A speech section detection method that applies an input speech likelihood sequence corresponding to the speech likelihood of each time interval of an input acoustic signal to a plurality of state transition models, each of which takes the input speech likelihood sequence as input and obtains an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal, that obtains, for each of the plurality of state transition models, an estimation result of the state transition in each time interval with respect to the speech state and the non-speech state of the input acoustic signal, and that selects, from among the results corresponding to the estimation results of the state transitions, the most probable result in each time interval.

A program for causing a computer to function as the model learning device according to claim 1 or 2 or the speech section detection device according to claim 3 or 4.
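
For illustration only, the four state transitions named in the claims and the per-interval selection performed by the post-processing unit could be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation described in this publication: the class indices, the function names `transition_labels` and `select_most_probable`, the `initial_state` parameter, and the use of the highest score across models as "most probable" are all hypothetical choices made for the example.

```python
from typing import List, Sequence

# Hypothetical class indices for the four transitions named in the claims.
# Keys are (previous state, current state) with 1 = speech, 0 = non-speech.
TRANSITIONS = {
    (1, 1): 0,  # speech -> speech
    (0, 0): 1,  # non-speech -> non-speech
    (1, 0): 2,  # speech -> non-speech
    (0, 1): 3,  # non-speech -> speech
}


def transition_labels(states: Sequence[int], initial_state: int = 0) -> List[int]:
    """Convert per-interval states (1 = speech, 0 = non-speech) to transition classes.

    Assumption: the state before the first interval is `initial_state`.
    """
    labels = []
    previous = initial_state
    for current in states:
        labels.append(TRANSITIONS[(previous, current)])
        previous = current
    return labels


def select_most_probable(
    model_outputs: Sequence[Sequence[Sequence[float]]],
) -> List[int]:
    """Pick, for each time interval, the transition class with the highest score.

    model_outputs[m][t][k] is the score that model m assigns to transition
    class k in interval t. Assumption: scores are comparable across models.
    """
    num_intervals = len(model_outputs[0])
    selected = []
    for t in range(num_intervals):
        best_class, best_score = -1, float("-inf")
        for output in model_outputs:
            for k, score in enumerate(output[t]):
                if score > best_score:
                    best_class, best_score = k, score
        selected.append(best_class)
    return selected
```

In this sketch, `transition_labels` would produce the kind of correct-value sequence that the learning data pairs with a speech likelihood sequence, and `select_most_probable` stands in for the post-processing step when several state transition models are applied to the same input.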