JP2018180482A - Speech detection apparatus and speech detection program

Info

Publication number: JP2018180482A
Application number: JP2017084682A
Authority: JP
Inventors: Shusaku Ito; Miyuki Shirakawa; Yoshiteru Tsuchinaga; Katsumori Hagiwara
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-04-21
Filing date: 2017-04-21
Publication date: 2018-11-15

Abstract

PROBLEM TO BE SOLVED: To reduce degradation of speech detection accuracy. SOLUTION: A watching terminal 10 includes: a sound signal acquisition unit that acquires a sound signal; a speech candidate extraction unit that extracts, based on the sounded sections of the time waveform of the sound signal, a candidate for a section likely to contain speech as a speech section candidate; a feature quantity calculation unit that calculates a feature quantity indicating the degree of speech likeness of the frequency change in the speech section candidate; a speech speed calculation unit that calculates the speech speed in the speech section candidate; a correction unit that corrects the feature quantity according to the speech speed; and a speech detection unit that detects the presence or absence of speech in the speech section candidate based on the corrected feature quantity. SELECTED DRAWING: Figure 4

Description

The present invention relates to a speech detection apparatus and a speech detection program.

Solutions that use the Internet of Things (IoT) to watch over residents, such as elderly people and people requiring nursing care, are attracting attention. For example, speech detection, which detects spoken voice in the everyday sounds picked up by the microphone of a watching terminal installed in a resident's home, is used to confirm the resident's safety.

In one aspect, such speech detection uses the fluctuation of the frequency of the fundamental tone, the so-called fundamental frequency, which is one of the features that indicate speech likeness. For example, a Fourier transform such as an FFT (Fast Fourier Transform) is applied to the input signal. This converts the input signal from the time domain to the frequency domain and yields a power spectrum. Among the frequencies at which the power value peaks on this power spectrum, the lowest one is extracted as the "fundamental frequency".
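The extraction of the lowest spectral peak can be sketched as follows. This is a minimal illustration, not the implementation of the cited technique: it uses a naive discrete Fourier transform from the Python standard library in place of an FFT, and the function names and the `rel_floor` parameter (a relative power floor that a bin must reach to count as a peak) are assumptions introduced only for this example.

```python
import cmath
import math

def power_spectrum(frame):
    """Power of DFT bins 0..N/2 of one frame (naive O(N^2) DFT, not an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def lowest_peak_frequency(frame, sample_rate, rel_floor=0.1):
    """Return the lowest frequency (Hz) at which the power spectrum peaks.

    A bin counts as a peak when it is a local maximum and its power is at
    least rel_floor times the largest power (rel_floor is an assumption).
    Returns None when no bin qualifies.
    """
    power = power_spectrum(frame)
    floor = rel_floor * max(power)
    for k in range(1, len(power) - 1):
        if power[k] >= floor and power[k] > power[k - 1] and power[k] >= power[k + 1]:
            return k * sample_rate / len(frame)
    return None

# Usage: a 200 Hz tone plus a weaker 400 Hz harmonic, sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 200 * t / 8000)
         + 0.5 * math.sin(2 * math.pi * 400 * t / 8000) for t in range(400)]
print(lowest_peak_frequency(frame, 8000))  # prints 200.0
```

Applying the function to successive frames of the input signal would give the time series of the fundamental frequency used in the following steps.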

A feature quantity is calculated from the time series of the fundamental frequency. For example, a time window of a predetermined analysis length is shifted along the time series, and at each position the fundamental frequency values inside the window are approximated by a linear function to obtain the slope of the fitted line. The slopes obtained at each shift of the time window are then divided into positive slopes and negative slopes, and a statistic such as the mean is calculated for each group.
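The sliding-window slope calculation can be sketched as below. This is only an illustration under stated assumptions: the linear approximation is done with an ordinary least-squares fit, and the window length, hop size, and function names are hypothetical values chosen for the example, not values given in the text.

```python
def fit_slope(times_ms, f0_hz):
    """Least-squares slope (Hz/ms) of a line fitted to (time, F0) points."""
    n = len(times_ms)
    mean_t = sum(times_ms) / n
    mean_f = sum(f0_hz) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in zip(times_ms, f0_hz))
    den = sum((t - mean_t) ** 2 for t in times_ms)
    return num / den

def slope_features(times_ms, f0_hz, window=5, hop=1):
    """Mean positive slope and mean negative slope over shifted windows.

    window (number of points, at least 2) and hop are assumed values.
    A mean is None when no window produced a slope of that sign.
    """
    slopes = [fit_slope(times_ms[i:i + window], f0_hz[i:i + window])
              for i in range(0, len(f0_hz) - window + 1, hop)]
    positive = [s for s in slopes if s > 0]
    negative = [s for s in slopes if s < 0]
    mean_pos = sum(positive) / len(positive) if positive else None
    mean_neg = sum(negative) / len(negative) if negative else None
    return mean_pos, mean_neg

# Usage: F0 sampled every 10 ms, rising at 0.5 Hz/ms.
times = [10 * i for i in range(10)]
rising = [100 + 0.5 * t for t in times]
print(slope_features(times, rising))
```

For the purely rising series in the usage example, every window yields a slope of about 0.5 Hz/ms, so only the positive mean is defined.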

The mean of the positive slopes and the mean of the negative slopes are used as feature quantities of the fluctuation of the fundamental frequency. For example, each mean is compared with thresholds. Specifically, the input signal is detected as containing speech when the mean of the positive slopes falls within the range from a lower limit of the fluctuation, for example 0.37 Hz/ms, to an upper limit, for example 1.09 Hz/ms, and the mean of the negative slopes falls within the range from a lower limit, for example -1.19 Hz/ms, to an upper limit, for example -0.29 Hz/ms.
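The threshold comparison amounts to two range checks, as in the short sketch below. The limit values are the example values quoted above; treating the limits as inclusive and returning "no speech" when a mean is missing are assumptions made for this illustration, and the function name is hypothetical.

```python
# Example limits in Hz/ms quoted in the text: (lower limit, upper limit).
POS_RANGE = (0.37, 1.09)
NEG_RANGE = (-1.19, -0.29)

def is_speech(mean_pos, mean_neg, pos_range=POS_RANGE, neg_range=NEG_RANGE):
    """True when both slope means fall inside their ranges (inclusive, assumed)."""
    if mean_pos is None or mean_neg is None:
        return False  # assumption: a missing feature means no speech detected
    return (pos_range[0] <= mean_pos <= pos_range[1]
            and neg_range[0] <= mean_neg <= neg_range[1])

print(is_speech(0.6, -0.7))  # True: both means are inside their ranges
print(is_speech(0.2, -0.7))  # False: positive mean is below 0.37 Hz/ms
```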

JP 2012-133346 A; JP 2011-145326 A; JP 2014-013302 A

However, with the technique described above, speech detection accuracy may degrade.

That is, the thresholds compared with the feature quantities of the fundamental frequency fluctuation are set only on the assumption that the utterance is made at an average speech speed. It is therefore difficult to detect speech uttered rapidly or speech uttered slowly.

Nevertheless, if the thresholds are corrected so as to widen the range identified as speech, false detections, in which sounds other than speech are detected as speech, increase. For example, if the lower limit compared with the mean of the positive slopes is lowered, or the upper limit compared with the mean of the negative slopes is raised, so that slowly uttered speech is easier to detect, it becomes difficult to distinguish speech from low-frequency sounds such as a car engine, and false detections of non-speech low-frequency sounds increase. Likewise, if the upper limit compared with the mean of the positive slopes is raised, or the lower limit compared with the mean of the negative slopes is lowered, so that rapidly uttered speech is easier to detect, it becomes difficult to distinguish speech from non-speech sounds such as music, and false detections increase.

In one aspect, an object of the present invention is to provide a speech detection apparatus and a speech detection program that can suppress degradation of speech detection accuracy.

In one aspect, a speech detection apparatus includes: a sound signal acquisition unit that acquires a sound signal; a speech candidate extraction unit that extracts, based on the sounded sections of the time waveform of the sound signal, a candidate for a section likely to contain speech as a speech section candidate; a feature quantity calculation unit that calculates a feature quantity indicating the degree of speech likeness of the frequency change in the speech section candidate; a speech speed calculation unit that calculates the speech speed in the speech section candidate; a correction unit that corrects the feature quantity according to the speech speed; and a speech detection unit that detects the presence or absence of speech in the speech section candidate based on the corrected feature quantity.

Degradation of speech detection accuracy can be suppressed.

FIG. 1 is a diagram showing a configuration example of a remote care system according to a first embodiment.
FIG. 2 is a diagram showing an example of threshold correction.
FIG. 3A is a diagram showing an example of a spectrogram.
FIG. 3B is a diagram showing an example of a spectrogram.
FIG. 3C is a diagram showing an example of a spectrogram.
FIG. 4 is a block diagram showing an example of the functional configuration of the watching terminal according to the first embodiment.
FIG. 5 is a diagram showing an example of the time waveform of the fundamental frequency in a speech section candidate.
FIG. 6 is a diagram showing a graph of correlation coefficients in a speech section candidate.
FIG. 7 is a diagram showing an example of a derivation function for the correction coefficient α.
FIG. 8 is a flowchart showing the procedure of the speech detection process according to the first embodiment.
FIG. 9 is a diagram showing an example of a derivation function for the correction coefficient β.
FIG. 10 is a diagram showing an example of a derivation function for the expansion/contraction coefficient γ.
FIG. 11 is a diagram showing an example of a derivation function for the expansion/contraction coefficient Δ.
FIG. 12 is a diagram showing an example of the hardware configuration of a computer that executes the speech detection program according to the first and second embodiments.

A speech detection apparatus and a speech detection program according to the present application are described below with reference to the accompanying drawings. Note that the embodiments do not limit the disclosed technology. The embodiments can be combined as appropriate to the extent that their processing contents do not contradict each other.

[Remote Care System]
FIG. 1 is a diagram showing a configuration example of a remote care system according to a first embodiment. The remote care system 1 shown in FIG. 1 implements a remote care service that confirms the safety of a resident by speech detection, which detects spoken voice in the everyday sounds picked up by the microphone of a watching terminal 10 installed in the resident's home.

As shown in FIG. 1, the remote care system 1 includes a watching terminal 10, a server device 30, and a related party terminal 50. Although FIG. 1 illustrates one watching terminal 10 and one related party terminal 50, the remote care system 1 may accommodate any number of watching terminals 10 and related party terminals 50.

The watching terminal 10, the server device 30, and the related party terminal 50 are connected to one another via a network NW. The network NW can be built from any type of communication network, whether wired or wireless. Wired and wireless communication networks may also be mixed within the network NW. For example, when the related party terminal 50 is implemented as a wireless communication device, the network NW includes relay devices such as the nearest base station or access point corresponding to the cell that accommodates the wireless communication device.

The watching terminal 10 is a terminal device that implements a solution for watching over a resident, for example an elderly person or a person requiring nursing care.

In one embodiment, the watching terminal 10 is installed in the resident's home. For example, the watching terminal 10 has a microphone that converts sound into an electrical signal. The watching terminal 10 executes a speech detection process that detects speech from the sound signal input from this microphone. The watching terminal 10 then uploads the speech detection result to the server device 30. For example, the upload to the server device 30 can be performed at any timing, such as each time the speech detection process is executed, or when speech detection results have been accumulated over a predetermined period or a predetermined number of times.

The server device 30 is a computer that provides the remote care service described above to the related party terminal 50.

In one embodiment, the server device 30 accumulates the speech detection results collected from the watching terminal 10. Each time speech detection results have been accumulated over a predetermined period, for example one hour, the server device 30 executes various kinds of statistical processing on the results accumulated over that period. For example, the server device 30 calculates the proportion of the hour in which speech was detected, or the proportion of the hour in which speech was not detected. The server device 30 then notifies the related party terminal 50 of the proportion of time with speech detected or the proportion of time without speech detected.

The server device 30 can also set conditions on the notification to the related party terminal 50. For example, the server device 30 can output an alert to the related party terminal 50 when the proportion of time with speech detected is below a predetermined threshold, or when the proportion of time without speech detected is at or above a predetermined threshold. In addition, the server device 30 can output an alert to the related party terminal 50 when no speech has been detected over a predetermined period. The case described here, in which an alert is output when the resident's safety is in doubt, is only one aspect; a notification can also be sent when safety has been confirmed. For example, the server device 30 can notify the related party terminal 50 of the resident's activity when the predetermined period includes a result with speech detected, when the proportion of time with speech detected is at or above a predetermined threshold, or when the proportion of time without speech detected is below a predetermined threshold.
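The server-side statistics and alert condition can be sketched as follows. This is a minimal illustration under assumptions: each detection result is taken to cover an equal-length time slice, and the function names and the 10% threshold are hypothetical, since the text only speaks of "a predetermined threshold".

```python
def speech_ratio(results):
    """Proportion of time slices in which speech was detected.

    results is a list of booleans, one per equal-length time slice
    (the equal-length assumption is made for this sketch).
    """
    if not results:
        raise ValueError("no detection results accumulated")
    return sum(results) / len(results)

def should_alert(results, min_speech_ratio=0.1):
    """Alert when no speech was detected at all, or the ratio is below the threshold.

    min_speech_ratio is a hypothetical value for the predetermined threshold.
    """
    return not any(results) or speech_ratio(results) < min_speech_ratio

# Usage: one hour as 60 one-minute slices, speech detected in 3 of them.
hour = [True] * 3 + [False] * 57
print(speech_ratio(hour), should_alert(hour))
```

The same ratio could equally be used for the positive notification case, by checking that it is at or above the threshold instead of below it.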

The related party terminal 50 is a terminal device used by a person related to the resident. By way of example only, the "related parties" referred to here may include, in addition to the resident's family and relatives, staff of a call center involved in operating the remote care service and medical professionals stationed at the call center.

In one embodiment, the related party terminal 50 may be not only a mobile communication terminal such as a smartphone, a mobile phone, or a PHS (Personal Handyphone System) but also a tablet terminal or a slate terminal. The related party terminal 50 is not limited to such handheld portable terminal devices; wearable devices such as head-mounted displays, smart glasses, and smartwatches, as well as any other IoT device, may be used. Although portable terminal devices are given here as examples, the related party terminal 50 may also be a stationary information processing device such as a personal computer.

[One Aspect of Improvement in the Speech Detection Process]
The speech detection process described above uses the fluctuation of the fundamental frequency, which is one of the features that indicate speech likeness. For example, feature quantities calculated from the time series of the fundamental frequency, such as the mean of the positive slopes and the mean of the negative slopes, are compared with thresholds. Specifically, the input signal is detected as containing speech when the mean of the positive slopes falls within the range from a lower limit of the fluctuation, for example 0.37 Hz/ms, to an upper limit, for example 1.09 Hz/ms, and the mean of the negative slopes falls within the range from a lower limit, for example -1.19 Hz/ms, to an upper limit, for example -0.15 Hz/ms.

However, as described in the background art section, the thresholds compared with the feature quantities of the fundamental frequency fluctuation are set only on the assumption that the utterance is made at an average speech speed. It is therefore difficult to detect speech uttered rapidly or speech uttered slowly.

Nevertheless, if the thresholds are corrected so as to widen the range identified as speech, false detections, in which sounds other than speech are detected as speech, increase. FIG. 2 is a diagram showing an example of threshold correction. The top row of the table in FIG. 2 shows the thresholds for an average speech speed, that is, the uncorrected thresholds, together with the speech that can be detected with them, the speech that is missed, and the non-speech sounds that are falsely detected. The middle row of the table in FIG. 2 shows the thresholds corrected for a slow speech speed, together with the speech that can be detected with the corrected thresholds, the speech that is missed, and the non-speech sounds that are falsely detected. The bottom row of the table in FIG. 2 shows the thresholds corrected for a fast speech speed, together with the speech that can be detected with the corrected thresholds, the speech that is missed, and the non-speech sounds that are falsely detected.

For example, when the range of fundamental frequency fluctuation identified as speech is set to match an average speech speed, the thresholds, comprising lower and upper limits, are as shown in the top row of the table in FIG. 2. For the mean of the positive slopes, that is, monotonic increases, the lower limit is set to 0.37 Hz/ms and the upper limit to 1.09 Hz/ms. For the mean of the negative slopes, that is, monotonic decreases, the lower limit is set to -1.19 Hz/ms and the upper limit to -0.29 Hz/ms. With these thresholds, speech at an average speech speed can be detected but, as described above, it is difficult to detect speech uttered rapidly or slowly. While the detection capability is limited in this way, there is little risk of false detection in which sounds other than speech are detected as speech.

When the range of fundamental frequency fluctuation identified as speech is set to match a slow speech speed, the thresholds, comprising lower and upper limits, are as shown in the middle row of the table in FIG. 2. For the mean of the positive slopes, that is, monotonic increases, the lower limit is corrected to 0.15 Hz/ms, while the upper limit is left uncorrected at 1.09 Hz/ms. For the mean of the negative slopes, that is, monotonic decreases, the lower limit is left uncorrected at -1.19 Hz/ms, while the upper limit is corrected to -0.15 Hz/ms. With these thresholds, speech at a slow speech speed and speech at an average speech speed can be detected, but it remains difficult to detect speech uttered rapidly. In addition, it becomes difficult to distinguish speech from low-frequency sounds such as a car engine, so false detections of non-speech low-frequency sounds increase.

When the range of fundamental frequency fluctuation identified as speech is set to match a fast speech speed, the thresholds, comprising lower and upper limits, are as shown in the bottom row of the table in FIG. 2. For the mean of the positive slopes, that is, monotonic increases, the lower limit is left at 0.37 Hz/ms, while the upper limit is corrected to 1.3 Hz/ms. For the mean of the negative slopes, that is, monotonic decreases, the lower limit is corrected to -1.3 Hz/ms, while the upper limit is left at -0.29 Hz/ms. With these thresholds, speech at an average speech speed and speech at a fast speech speed can be detected, but it remains difficult to detect speech uttered at a slow speech speed. In addition, it becomes difficult to distinguish speech from non-speech sounds such as music, songs, and other audio, so false detections increase.
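The three threshold sets described for FIG. 2 can be compared side by side with a small sketch. The limit values are those given in the text; the preset names, the function names, the inclusive comparison, and the sample slope values in the usage lines are assumptions chosen only to illustrate the trade-off.

```python
# (lower limit, upper limit) in Hz/ms for the three rows described for FIG. 2.
PRESETS = {
    "average": {"pos": (0.37, 1.09), "neg": (-1.19, -0.29)},
    "slow":    {"pos": (0.15, 1.09), "neg": (-1.19, -0.15)},
    "fast":    {"pos": (0.37, 1.30), "neg": (-1.30, -0.29)},
}

def passes(preset, mean_pos, mean_neg):
    """True when both slope means fall inside the preset's ranges (inclusive)."""
    lo_p, hi_p = PRESETS[preset]["pos"]
    lo_n, hi_n = PRESETS[preset]["neg"]
    return lo_p <= mean_pos <= hi_p and lo_n <= mean_neg <= hi_n

def accepted_by(mean_pos, mean_neg):
    """Names of the presets that would classify the features as speech."""
    return [name for name in PRESETS if passes(name, mean_pos, mean_neg)]

# Hypothetical feature values, not measurements from the source.
print(accepted_by(0.6, -0.7))    # ['average', 'slow', 'fast']
print(accepted_by(0.2, -0.2))    # ['slow']
print(accepted_by(1.2, -1.25))   # ['fast']
```

Any sound whose slopes are as shallow as the second example is accepted by the widened "slow" thresholds whether or not it is speech, which is the false-detection risk described above.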

Besides correcting the thresholds in this way, it is also conceivable to perform frequency analysis on the sound signal using two time windows of different lengths and to detect speech from the difference between the two analysis results. That is, frequency analysis is performed on a partial waveform of a short section of the time waveform of the sound signal, and also on a partial waveform of a long section that contains the short section. The analysis results for the short section and the long section differ when the sound signal contains speech. For example, when the long section is analyzed, the spectrum varies greatly within the section, so the frequency spectrum appears smooth, whereas when the short section is analyzed, the spectrum varies little, so peaks and valleys appear in the frequency spectrum. The presence or absence of speech is then determined according to whether the difference in frequency spectrum between the short section and the long section is large.

However, utterances are not necessarily made at a constant speech speed, so even if two time windows are used for frequency analysis, the fluctuation of the fundamental frequency does not necessarily appear in a way that fits either the short or the long time window.

FIGS. 3A to 3C are diagrams showing examples of spectrograms. FIG. 3A shows the spectrogram of a sound signal containing speech uttered by a woman at an average speech speed. FIG. 3B shows the spectrogram of a sound signal containing speech uttered by a woman at a slow speech speed. FIG. 3C shows the spectrogram of a sound signal containing the engine sound of a vehicle.

As shown in FIG. 3A, when an utterance is made at an average speech speed, the spectrogram contains a plurality of syllables. As a result, the slope of the fundamental frequency changes from syllable to syllable in the spectrogram. That is, three kinds of sections appear in time series in the spectrogram shown in FIG. 3A: a section in which the fundamental frequency does not fluctuate, a section in which it fluctuates with a monotonic increase, and a section in which it fluctuates with a monotonic decrease. When fluctuation of the fundamental frequency appears in this way within the time window of the sound signal on which frequency analysis is performed, the signal is adequate for speech detection.

In contrast, in the spectrograms shown in FIGS. 3B and 3C, the fundamental frequency does not fluctuate over the entire section. For example, in the spectrogram shown in FIG. 3B, because the speaker is speaking slowly, the section of the sound signal on which frequency analysis was performed contains few syllables, and fluctuations of the fundamental frequency such as monotonic increases or decreases do not appear in the spectrogram.

The spectrogram shown in FIG. 3B is therefore difficult to distinguish from the spectrogram shown in FIG. 3C. If the section of the sound signal from which the spectrogram of FIG. 3B is generated is lengthened, the number of syllables in the section increases, which makes speech easier to detect when the utterance is made at a slow speech speed, but the speech detection response is delayed when the utterance is made at a fast speech speed.

Thus, whether the thresholds compared with the feature quantities of the fundamental frequency fluctuation are fixed to those for a slow, an average, or a fast speech speed, or frequency analysis is performed using a plurality of time windows of different lengths, speech detection accuracy is inherently limited.

Accordingly, as part of the speech detection process described above, the watching terminal 10 according to the present embodiment obtains the speech speed from a speech candidate section of the input sound signal, corrects the feature quantity indicating the fluctuation of the fundamental frequency according to the speech speed, and then determines the presence or absence of speech in the speech candidate section. As a result, even when a change in speech speed changes how the fluctuation of the fundamental frequency appears, speech detection can be performed after normalizing the feature quantity of the fundamental frequency fluctuation for speech speed, so that missed detections and false detections of speech can be suppressed. The watching terminal 10 according to the present embodiment can therefore suppress degradation of speech detection accuracy.
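The idea of normalizing the slope features by speech speed before applying fixed thresholds can be illustrated with the sketch below. It is not the correction defined in this document: the derivation functions for the correction coefficients are given in the figures and are not reproduced here. The sketch simply assumes that slopes scale in proportion to speech speed, and the reference speed, its unit (syllables per second), and the function names are all hypothetical.

```python
# Hypothetical reference speech speed (syllables per second) at which the
# uncorrected thresholds are assumed to have been tuned.
REFERENCE_SPEED = 7.0

def correction_coefficient(speech_speed, reference_speed=REFERENCE_SPEED):
    """Assumed coefficient: ratio of the reference speed to the measured speed."""
    if speech_speed <= 0:
        raise ValueError("speech speed must be positive")
    return reference_speed / speech_speed

def correct_features(mean_pos, mean_neg, speech_speed):
    """Scale both slope means (Hz/ms) so they are comparable at the reference speed."""
    alpha = correction_coefficient(speech_speed)
    return mean_pos * alpha, mean_neg * alpha

# Usage: slow speech at half the reference speed with shallow slopes.
print(correct_features(0.25, -0.2, 3.5))  # (0.5, -0.4)
```

In this example the uncorrected means (0.25, -0.2) fall outside the average-speed ranges of 0.37 to 1.09 Hz/ms and -1.19 to -0.29 Hz/ms, while the corrected means (0.5, -0.4) fall inside them, so the thresholds themselves do not need to be widened.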

[Configuration of Watching Terminal 10]
FIG. 4 is a block diagram showing an example of the functional configuration of the watching terminal 10 according to the first embodiment. As shown in FIG. 4, the watching terminal 10 includes a sound signal acquisition unit 11, a speech candidate extraction unit 12, a feature quantity calculation unit 13, a speech speed calculation unit 14, a correction unit 15, and a speech detection unit 16.

The functional units shown in FIG. 4, such as the sound signal acquisition unit 11, the speech candidate extraction unit 12, the feature quantity calculation unit 13, the speech speed calculation unit 14, the correction unit 15, and the speech detection unit 16, are virtually implemented by a hardware processor such as an MPU (Micro Processing Unit) or a CPU (Central Processing Unit). That is, the processor loads a speech detection program that executes the speech detection process described above as a process onto a memory such as a RAM (Random Access Memory), whereby the functional units are virtually implemented. Although a CPU and an MPU are given here as examples of the processor, the functional units may be implemented by any processor, whether general-purpose or specialized. Alternatively, the functional units may be implemented by hard-wired logic such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The functional units shown in FIG. 4 are merely examples, and the watching terminal 10 is not precluded from having a functional configuration other than the one shown in FIG. 4. That is, the watching terminal 10 may have functional units other than those described above. For example, although not shown in FIG. 4, the watching terminal 10 can have a microphone that converts sound waves into an electric signal, a communication interface that connects to the network NW, and the like. In addition, because the remote care service described above can confirm safety by combining not only speech detection but also human movement, temperature, humidity, and the like, the watching terminal 10 can further have a human presence sensor, a temperature and humidity sensor, and the like. For the remote care service, the watching terminal 10 can also have a call function for talking with a call center.

FIG. 4 also shows solid lines representing the input and output of signals and data to and from the functional units. Each line indicates that data is transmitted at least from one unit to the other; data need not be exchanged in both directions.

[Sound signal acquisition unit 11]
The sound signal acquisition unit 11 is a processing unit that acquires a sound signal.

In one embodiment, the sound signal acquisition unit 11 acquires, as an input signal, a sound signal converted from sound waves by a microphone (not shown). The source from which the sound signal acquisition unit 11 acquires the sound signal may be arbitrary and is not limited to a microphone. For example, the sound signal acquisition unit 11 can acquire a sound signal by reading it from an auxiliary storage device that stores sound data, such as a hard disk or an optical disk, or from a removable medium such as a memory card or a USB (Universal Serial Bus) memory. The sound signal acquisition unit 11 can also acquire a sound signal by receiving it from an external device via the network NW.

[Speech candidate extraction unit 12]
The speech candidate extraction unit 12 is a processing unit that extracts speech section candidates from the sound signal. A "speech section candidate" here refers to a candidate for a section of the time waveform of the sound signal that is likely to be speech. As shown in FIG. 4, the speech candidate extraction unit 12 has a sound section detection unit 12a, a fundamental sound calculation unit 12b, and a determination unit 12c.

Of these, the sound section detection unit 12a is a processing unit that detects sound sections from the time waveform of the sound signal.

In one embodiment, the sound section detection unit 12a calculates a power envelope for each frame every time the sound signal acquisition unit 11 acquires a new frame of the sound signal. For example, where the frame length is T, the sound section detection unit 12a calculates the power envelope of the frame by applying equation (1) to the amplitude x_i of each sampling point contained in the frame length T. The sound section detection unit 12a then determines whether the power envelope of the frame is equal to or greater than a predetermined threshold, and if so, detects the frame as a "sound section". When the sound section detected from the new frame is adjacent to a sound section detected from an earlier frame, the adjacent sound sections can be joined so that two or more sound sections are merged into a single sound section.
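The frame-wise thresholding and merging described above can be sketched as follows. This is only an illustrative sketch: equation (1) is not reproduced in this text, so the mean squared amplitude is assumed as a stand-in for the power envelope, and the function names, the non-overlapping framing, and the threshold value are all hypothetical.

```python
from typing import List, Sequence, Tuple


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (assumed stand-in for equation (1))."""
    return sum(x * x for x in frame) / len(frame)


def detect_sound_sections(samples: Sequence[float], frame_len: int,
                          threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of sound sections; `end` is exclusive.

    Frames are non-overlapping (an assumption). Adjacent frames that both
    pass the threshold are merged into a single sound section.
    """
    sections: List[Tuple[int, int]] = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_power(frame) >= threshold:
            if sections and sections[-1][1] == start:
                # Contiguous with the previous sound section: extend it.
                sections[-1] = (sections[-1][0], start + frame_len)
            else:
                sections.append((start, start + frame_len))
    return sections
```

With a frame length of 4 samples, two consecutive loud frames come back as one merged section rather than two.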

The fundamental sound calculation unit 12b is a processing unit that calculates the fundamental frequency.

In one embodiment, and purely as an example, the fundamental sound calculation unit 12b calculates the fundamental frequency from the autocorrelation of the time waveform of the sound signal. For example, according to equation (2), the fundamental sound calculation unit 12b shifts a duplicate of the waveform of the sound section detected by the sound section detection unit 12a relative to the original waveform, and calculates the correlation coefficient R[x] between the original waveform and the duplicate waveform for each shift width τ. The fundamental sound calculation unit 12b then identifies the shift width τ at which the largest correlation coefficient was obtained as the "fundamental period", and calculates the fundamental frequency as the reciprocal of that shift width τ.
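As an illustration of the autocorrelation approach, the sketch below computes a normalized correlation for each shift width and takes the reciprocal of the best shift. Equation (2) is not reproduced in this text, so the normalization used here is an assumption, as are the function names and the parameters `tau_min` and `tau_max` (the search range in samples).

```python
import math
from typing import List, Sequence, Tuple


def correlation_at_shift(x: Sequence[float], tau: int) -> float:
    """Normalized correlation between x and a copy of x shifted by tau samples."""
    n = len(x) - tau
    a, b = x[:n], x[tau:tau + n]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0


def fundamental_frequency(x: Sequence[float], sample_rate: float,
                          tau_min: int, tau_max: int) -> Tuple[float, List[float]]:
    """Return (f0 in Hz, correlation coefficient per shift width).

    The shift with the largest coefficient is taken as the fundamental
    period; f0 is its reciprocal expressed in Hz. tau_min must be >= 1.
    """
    coeffs = [correlation_at_shift(x, tau) for tau in range(tau_min, tau_max + 1)]
    best_index = max(range(len(coeffs)), key=coeffs.__getitem__)
    best_tau = tau_min + best_index
    return sample_rate / best_tau, coeffs
```

Limiting the search range to shifts that correspond to plausible speech pitch also avoids picking an integer multiple of the true period, which correlates equally well.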

Although the use of the autocorrelation of the time waveform is illustrated here purely as one example, the method of calculating the fundamental frequency is not limited to this. For example, as described in the background art section above, the fundamental frequency can also be obtained from the frequency spectrum. The fundamental frequency may also be calculated by other methods, including the waveform envelope method, the zero-crossing method, and the cepstrum method.

The determination unit 12c is a processing unit that determines whether a sound section corresponds to a speech section candidate.

In one embodiment, the determination unit 12c determines, for each correlation coefficient that the fundamental sound calculation unit 12b calculated per shift width, whether that correlation coefficient is equal to or greater than a predetermined threshold. This yields the ratio of the number of correlation coefficients at or above the threshold to the total number of correlation coefficients calculated by the fundamental sound calculation unit 12b. When this ratio is equal to or greater than a threshold, the determination unit 12c further determines whether the fundamental frequency calculated by the fundamental sound calculation unit 12b is within the range permitted for speech, for example 100 Hz to 400 Hz inclusive. When both conditions are satisfied, that is, when the ratio is at or above the threshold and the fundamental frequency is within the permitted range, the determination unit 12c detects the sound section as a "speech section candidate".
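The two-condition check can be sketched as below. Only the 100 Hz to 400 Hz range comes from the text; the function name and the default values for `corr_threshold` and `ratio_threshold` are hypothetical placeholders, since the text does not state them for this step.

```python
from typing import Sequence


def is_speech_section_candidate(coeffs: Sequence[float], f0_hz: float,
                                corr_threshold: float = 0.7,
                                ratio_threshold: float = 0.5,
                                f0_min_hz: float = 100.0,
                                f0_max_hz: float = 400.0) -> bool:
    """Apply both conditions to one sound section.

    coeffs: correlation coefficients calculated for the sound section.
    corr_threshold and ratio_threshold are assumed values for illustration.
    """
    if not coeffs:
        return False
    # Condition 1: share of coefficients at or above the threshold.
    ratio = sum(1 for c in coeffs if c >= corr_threshold) / len(coeffs)
    # Condition 2: fundamental frequency within the range permitted for speech.
    return ratio >= ratio_threshold and f0_min_hz <= f0_hz <= f0_max_hz
```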

Although the case in which a sound section is detected as a speech section candidate conditionally is illustrated here, a sound section can also be detected as a speech section candidate unconditionally. In that case, the determination of the two conditions above can be omitted.

[Feature amount calculation unit 13]
The feature amount calculation unit 13 is a processing unit that calculates a feature amount of the fluctuation of the fundamental frequency.

In one embodiment, the feature amount calculation unit 13 shifts a time window of a predetermined analysis length along the time waveform of the fundamental frequency in the speech section candidate extracted by the speech candidate extraction unit 12, and obtains the slope of an approximate straight line by, for example, fitting a linear function to the time-series data of the fundamental frequency contained in the time window.

FIG. 5 is a diagram showing an example of the time waveform of the fundamental frequency in a speech section candidate. The vertical axis of the graph in FIG. 5 represents frequency and the horizontal axis represents time. FIG. 5 shows the time windows used when the analysis length for the linear-function fit is 60 msec. As shown in FIG. 5, the feature amount calculation unit 13 shifts a time window with an analysis length of 60 msec by a predetermined shift width along the time waveform of the fundamental frequency in the speech section candidate; that is, the window moves in the order time window (イ), time window (ロ), time window (ハ). Each time the time window is shifted, the feature amount calculation unit 13 fits a linear function to the time-series data of the fundamental frequency contained in the window. For time window (イ), for example, the slope and intercept of the approximate straight line (い) are estimated by a linear regression analysis that minimizes the sum of squared residuals with respect to a first-order model function. The slopes of the approximate straight lines obtained at each shift are divided into those with positive values and those with negative values, and a statistic of the slopes, for example the mean, is calculated for each group.
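The sliding-window fit can be sketched as follows. The 60 msec analysis length comes from the text; the 10 msec shift width, the assumption that the fundamental frequency is sampled at a fixed interval `step_ms`, and the function names are illustrative only.

```python
from typing import List, Optional, Sequence, Tuple


def fit_slope(t: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y against t (minimizes the sum of squared residuals)."""
    n = len(t)
    mean_t = sum(t) / n
    mean_y = sum(y) / n
    sxx = sum((ti - mean_t) ** 2 for ti in t)
    sxy = sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, y))
    return sxy / sxx


def fluctuation_features(f0_hz: Sequence[float], step_ms: float,
                         window_ms: float = 60.0,
                         shift_ms: float = 10.0
                         ) -> Tuple[Optional[float], Optional[float]]:
    """Return (mean positive slope, mean negative slope) in Hz/ms.

    f0_hz holds one fundamental-frequency value every step_ms milliseconds.
    Either element is None when no window produced a slope of that sign.
    """
    win = max(2, int(round(window_ms / step_ms)))   # points per window
    hop = max(1, int(round(shift_ms / step_ms)))    # points per shift
    slopes: List[float] = []
    for start in range(0, len(f0_hz) - win + 1, hop):
        t = [(start + k) * step_ms for k in range(win)]
        slopes.append(fit_slope(t, f0_hz[start:start + win]))
    rising = [s for s in slopes if s > 0.0]
    falling = [s for s in slopes if s < 0.0]
    mean_rise = sum(rising) / len(rising) if rising else None
    mean_fall = sum(falling) / len(falling) if falling else None
    return mean_rise, mean_fall
```

A contour that rises 5 Hz every 10 ms yields a mean positive slope of about 0.5 Hz/ms and no negative slope.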

Although the mean of the positive slopes and the mean of the negative slopes are calculated here purely as an example of the feature amount, the feature amount is not limited to these; other feature amounts, such as the proportion of time during which the fundamental tone fluctuates within the speech section candidate, can also be calculated. In that case, for example, the proportion of time during which the fundamental tone fluctuates can be calculated as the feature amount by calculating the ratio between a first section of the speech section candidate in which a slope of either positive or negative value is detected and a second section in which the slope is zero.

[Speech speed calculation unit 14]
The speech speed calculation unit 14 is a processing unit that calculates the speech speed of a speech section candidate.

In one embodiment, the speech speed calculation unit 14 estimates the number of syllables contained in the speech section candidate extracted by the speech candidate extraction unit 12. As an example, this estimation uses the correlation coefficients that the fundamental sound calculation unit 12b calculated for each shift width τ in the speech section candidate detected by the determination unit 12c. Where a vowel is uttered the waveform is periodic, so the correlation coefficient tends to be high, whereas where a consonant is uttered the waveform is not periodic, so the correlation coefficient tends to be low. To estimate the number of syllables using this property of continuous speech sounds, the number of syllables in the speech section candidate is estimated by counting the sections in which the correlation coefficient stays at or above a predetermined threshold, for example 0.7. Although counting the sections in which the correlation coefficient stays at or above the threshold is illustrated here as an example, the number of syllables can also be estimated by counting the valleys observed in the waveform of the correlation coefficient in the speech section candidate.

FIG. 6 is a diagram showing a graph of the correlation coefficient in a speech section candidate. As an example, FIG. 6 shows the graph of the correlation coefficient in a speech section candidate in which "おにいさん" (oniisan) was uttered. The vertical axis of the graph in FIG. 6 represents the value of the correlation coefficient and the horizontal axis represents the shift width τ. In the speech section candidate 60 shown in FIG. 6, following the property of speech sounds that the correlation coefficient is high where the periodicity of a vowel is observed and low where a syllable breaks, there are four sections in which the correlation coefficient is at or above the threshold 0.7: section (ニ), section (ホ), section (ヘ), and section (ト). The number of syllables in the speech section candidate 60 is therefore estimated to be 4. Although obtaining the number of syllables by counting the sections in which the correlation coefficient is at or above the threshold 0.7 is illustrated here purely as an example, the method of calculating the number of syllables is not limited to this. For example, the number of syllables can also be estimated to be 4 by counting the valleys that appear in the speech section candidate 60, namely the valleys before and after section (ニ), the valley after section (ホ), the valley after section (ヘ), and the valley after section (ト), five in total, and obtaining from them the number of peaks, four, that lie between the five valleys.

After the number of syllables in the speech section candidate has been estimated in this way, the speech speed calculation unit 14 can calculate the speech speed (syllables/second) of the speech section candidate by converting the number of syllables into the number of syllables per unit time, for example per second.
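Counting runs at or above the threshold and converting the count to a rate can be sketched as below. The threshold 0.7 comes from the text; treating the correlation coefficients as an ordered sequence over the speech section candidate, and the function names, are assumptions made for illustration.

```python
from typing import Sequence


def count_syllables(coeffs: Sequence[float], threshold: float = 0.7) -> int:
    """Count maximal runs of consecutive coefficients at or above the threshold."""
    count = 0
    above = False
    for c in coeffs:
        now_above = c >= threshold
        if now_above and not above:
            count += 1  # a new run starts here
        above = now_above
    return count


def speech_speed(coeffs: Sequence[float], duration_s: float,
                 threshold: float = 0.7) -> float:
    """Speech speed in syllables per second for a candidate lasting duration_s."""
    return count_syllables(coeffs, threshold) / duration_s
```

A sequence with four separate runs above 0.7 over a 0.8 second candidate gives a speech speed of about 5 syllables per second.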

[Correction unit 15]
The correction unit 15 is a processing unit that corrects the feature amount calculated by the feature amount calculation unit 13 according to the speech speed calculated by the speech speed calculation unit 14.

In one embodiment, the correction unit 15 calculates, from the speech speed calculated by the speech speed calculation unit 14, a correction coefficient α by which the feature amount calculated by the feature amount calculation unit 13 is multiplied. As an example, the correction coefficient α is calculated according to the derivation function shown in FIG. 7. FIG. 7 is a diagram showing an example of the derivation function of the correction coefficient α, as a graph of the relationship between the speech speed and the correction coefficient α. The vertical axis of the graph in FIG. 7 represents the correction coefficient α and the horizontal axis represents the speech speed [syllables/second]. As shown in FIG. 7, the correction coefficient α is derived with a function that yields a larger value as the speech speed decreases and a smaller value as the speech speed increases. In other words, the slower the utterance in the speech section candidate, the larger the derived correction coefficient α, and the faster the utterance, the smaller the derived correction coefficient α. The magnitude of the correction coefficient α can thus be varied with the speech speed, but as FIG. 7 shows, the derivation function need not be strictly decreasing. Specifically, when the speech speed falls within an average range, the correction coefficient α is derived as 1 regardless of the value of the speech speed within that range. When the correction coefficient α is 1, the feature amount calculated by the feature amount calculation unit 13 is effectively not corrected. When the speech speed is 2.9 [syllables/second] or less, the same correction coefficient α as for a speech speed of 2.9 [syllables/second] is derived. Likewise, when the speech speed is 11 [syllables/second] or more, the same correction coefficient α as for a speech speed of 11 [syllables/second] is derived. Through such a derivation function, the correction unit 15 can calculate the correction coefficient α from the speech speed in the speech section candidate.
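A derivation function with the shape described above can be sketched as a clamped piecewise-linear curve. Only the clamp points 2.9 and 11 [syllables/second] and the flat value 1 in the average range come from the text. The bounds of the average range (`avg_low`, `avg_high`), the end values (`alpha_slow`, `alpha_fast`), and the linear interpolation between breakpoints are all assumptions for illustration; the actual curve is defined by FIG. 7.

```python
def correction_alpha(speed: float,
                     low_speed: float = 2.9, high_speed: float = 11.0,
                     avg_low: float = 5.0, avg_high: float = 8.0,
                     alpha_slow: float = 1.5, alpha_fast: float = 0.5) -> float:
    """Correction coefficient for the feature amount (larger for slow speech).

    avg_low, avg_high, alpha_slow and alpha_fast are hypothetical values.
    """
    if speed <= low_speed:
        return alpha_slow          # clamped at the value for 2.9 syllables/s
    if speed >= high_speed:
        return alpha_fast          # clamped at the value for 11 syllables/s
    if avg_low <= speed <= avg_high:
        return 1.0                 # average range: effectively no correction
    if speed < avg_low:
        frac = (speed - low_speed) / (avg_low - low_speed)
        return alpha_slow + frac * (1.0 - alpha_slow)
    frac = (speed - avg_high) / (high_speed - avg_high)
    return 1.0 + frac * (alpha_fast - 1.0)
```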

After the correction coefficient α has been calculated in this way, the correction unit 15 corrects the feature amount calculated by the feature amount calculation unit 13 by multiplying it by the correction coefficient α. That is, the correction unit 15 multiplies the mean of the positive slopes, that is, the mean monotonic increase [frequency/ms], by the correction coefficient α, and also multiplies the mean of the negative slopes, that is, the mean monotonic decrease [frequency/ms], by the correction coefficient α. This yields the corrected mean monotonic increase and the corrected mean monotonic decrease.

[Speech detection unit 16]
The speech detection unit 16 is a processing unit that detects speech based on the feature amount of the speech section candidate corrected by the correction unit 15.

In one embodiment, the speech detection unit 16 compares the mean monotonic increase and the mean monotonic decrease corrected by the correction unit 15 with thresholds. For example, the speech detection unit 16 determines whether the corrected mean monotonic increase falls within the range between a lower limit of fluctuation, for example 0.37 Hz/ms, and an upper limit, for example 1.09 Hz/ms. If it does, the speech detection unit 16 further determines whether the corrected mean monotonic decrease falls within the range between a lower limit of fluctuation, for example -1.19 Hz/ms, and an upper limit, for example -0.29 Hz/ms. If the corrected mean monotonic decrease also falls within its range, the speech detection unit 16 detects the speech section candidate as "speech present". If, on the other hand, the corrected mean monotonic increase or the corrected mean monotonic decrease falls outside its range between the lower and upper limits of fluctuation, the speech detection unit 16 detects the speech section candidate as "speech absent". The speech detection result is output from the watching terminal 10 to the server device 30 each time a result is obtained, or each time results have been accumulated over a predetermined period, for example one hour.
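The range check can be sketched as below, using the example limits given in the text. The function name is hypothetical, and treating the limits as inclusive is an assumption.

```python
from typing import Tuple


def detect_speech(mean_rise: float, mean_fall: float,
                  rise_range: Tuple[float, float] = (0.37, 1.09),
                  fall_range: Tuple[float, float] = (-1.19, -0.29)) -> bool:
    """Return True ("speech present") when both corrected means are in range.

    mean_rise / mean_fall: corrected mean positive / negative slope in Hz/ms.
    """
    rise_ok = rise_range[0] <= mean_rise <= rise_range[1]
    fall_ok = fall_range[0] <= mean_fall <= fall_range[1]
    return rise_ok and fall_ok
```

For instance, raw means of 0.3 and -0.3 Hz/ms fall outside the increase range, while the same values multiplied by a hypothetical correction coefficient of 1.5 for slow speech (0.45 and -0.45 Hz/ms) fall inside both ranges.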

[Flow of processing]
FIG. 8 is a flowchart illustrating the procedure of the speech detection process according to the first embodiment. As an example, this process starts when the sound signal acquisition unit 11 acquires a new frame of the sound signal.

As shown in FIG. 8, when the sound signal acquisition unit 11 newly acquires a frame of the sound signal (step S101), the sound section detection unit 12a detects the frame acquired in step S101 as a "sound section" if the power envelope of the frame is equal to or greater than a predetermined threshold (step S102). When the sound section detected from the new frame is adjacent to a sound section detected from an earlier frame, the adjacent sound sections can be joined so that two or more sound sections are merged into a single sound section.

Next, as an example, the fundamental sound calculation unit 12b calculates the fundamental frequency from the autocorrelation of the time waveform in the sound section detected by the sound section detection unit 12a (step S103). The determination unit 12c then determines whether the sound section corresponds to a speech section candidate (step S104). In step S104, for example, this is determined by whether the ratio of the number of correlation coefficients at or above the threshold to the total number of correlation coefficients calculated in the sound section, and the fundamental frequency in the sound section, satisfy predetermined conditions.

If the sound section corresponds to a speech section candidate (Yes in step S104), the feature amount calculation unit 13 shifts a time window of a predetermined analysis length along the time waveform of the fundamental frequency in the speech section candidate and obtains the slope of an approximate straight line by, for example, fitting a linear function to the time-series data of the fundamental frequency contained in the time window (step S105).

The feature amount calculation unit 13 then divides the slopes of the approximate straight lines obtained at each shift of the time window into those with positive values and those with negative values, and calculates a statistic of the slopes for each group, for example the mean, as the feature amount of the fluctuation of the fundamental frequency (step S106).

The speech speed calculation unit 14 estimates the number of syllables contained in the speech section candidate and calculates the speech speed (syllables/second) of the speech section candidate by converting the number of syllables into the number of syllables per unit time, for example per second (step S107).

The correction unit 15 then corrects the feature amount calculated in step S106 by multiplying it by the correction coefficient α determined by the speech speed calculated in step S107 (step S108).

The speech detection unit 16 then determines whether the feature amount corrected in step S108 is within the range of fundamental frequency fluctuation that is characteristic of speech (step S109). For example, the speech detection unit 16 determines whether the mean monotonic increase falls within the range between its lower and upper limits of fluctuation and the mean monotonic decrease falls within the range between its lower and upper limits of fluctuation.

The speech detection unit 16 then outputs the determination result of step S109, that is, whether the corrected feature amount is within the range of fundamental frequency fluctuation characteristic of speech, to the server device 30 as the speech detection result for the speech section candidate (step S110), and the process returns to step S101.

[One aspect of the effects]
As described above, the watching terminal 10 according to the present embodiment obtains the speech speed from a speech section candidate of the input sound signal, corrects the feature amount indicating the fluctuation of the fundamental frequency according to the speech speed, and determines whether speech is present in the speech section candidate. As a result, even when a change in speech speed changes how the fluctuation of the fundamental frequency appears, speech detection can be performed with the fluctuation feature amount normalized by speech speed, so that missed detections and false detections of speech are suppressed. The watching terminal 10 according to the present embodiment can therefore suppress degradation of speech detection accuracy.

Although an embodiment of the disclosed apparatus has been described so far, the present invention may be implemented in various forms other than the embodiment described above. Other embodiments included in the present invention are therefore described below.

[Threshold correction]
Although the first embodiment above illustrates the case in which the feature amount of the fluctuation of the fundamental frequency is corrected, the present invention is not limited to this. For example, the correction unit 15 can correct the threshold that is compared with the feature amount. In this case, the correction unit 15 can correct only the threshold, or can correct both the feature amount and the threshold.

Specifically, the correction unit 15 calculates, from the speech speed calculated by the speech speed calculation unit 14, a correction coefficient β by which the threshold to be compared with the feature amount calculated by the feature amount calculation unit 13 is multiplied. As an example, the correction coefficient β is calculated according to the derivation function shown in FIG. 9. FIG. 9 is a diagram showing an example of the derivation function of the correction coefficient β, as a graph of the relationship between the speech speed and β. The vertical axis of the graph indicates the correction coefficient β, and the horizontal axis indicates the speech speed [syllables/second]. As shown in FIG. 9, the function used to derive β yields a smaller value as the speech speed decreases and a larger value as the speech speed increases. In other words, the slower the utterance in the speech section candidate, the smaller the derived β, and the faster the utterance, the larger the derived β.
Although β can thus be varied according to the speech speed, the derivation function need not be strictly monotonically increasing, as shown in FIG. 9. That is, when the speech speed falls within the average range, β is derived as "1" regardless of the value of the speech speed within that range. When β is "1", the threshold to be compared with the feature amount is substantially not corrected. When the speech speed is 2.9 [syllables/second] or less, the same value "0.5" as at 2.9 [syllables/second] is derived as β. Likewise, when the speech speed is 11 [syllables/second] or more, the same value "1.5" as at 11 [syllables/second] is derived as β. Through such a derivation function, the correction unit 15 can calculate β from the speech speed in the speech section candidate.
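The derivation function for β can be sketched as a clamped piecewise-linear curve. This is only an illustration: the clamp points (2.9 and 11 syllables/second) and the values 0.5, 1 and 1.5 come from the text, while the bounds of the "average range" (`AVG_LOW`, `AVG_HIGH`) and the linear ramps between the flat parts are assumptions, since the text does not give them.

```python
# Sketch of a beta derivation function in the spirit of FIG. 9.
SLOW_LIMIT = 2.9   # syllables/s, from the text: beta is held at 0.5 at or below this
FAST_LIMIT = 11.0  # syllables/s, from the text: beta is held at 1.5 at or above this
AVG_LOW = 5.0      # ASSUMPTION: lower edge of the "average range" (not given in the text)
AVG_HIGH = 8.0     # ASSUMPTION: upper edge of the "average range" (not given in the text)


def _lerp(x, x0, y0, x1, y1):
    """Linear interpolation between (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def correction_coefficient_beta(speech_speed):
    """Return beta for a speech speed given in syllables per second."""
    if speech_speed <= SLOW_LIMIT:
        return 0.5
    if speech_speed >= FAST_LIMIT:
        return 1.5
    if AVG_LOW <= speech_speed <= AVG_HIGH:
        return 1.0  # thresholds are effectively left uncorrected
    if speech_speed < AVG_LOW:
        # ASSUMPTION: linear ramp from 0.5 up to 1.0
        return _lerp(speech_speed, SLOW_LIMIT, 0.5, AVG_LOW, 1.0)
    # ASSUMPTION: linear ramp from 1.0 up to 1.5
    return _lerp(speech_speed, AVG_HIGH, 1.0, FAST_LIMIT, 1.5)
```

The flat middle segment is what makes the curve non-decreasing rather than strictly increasing, which matches the remark that the function need not be strictly monotonic.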

After the correction coefficient β is calculated, the correction unit 15 corrects the threshold to be compared with the feature amount calculated by the feature amount calculation unit 13 by multiplying that threshold by β. That is, the correction unit 15 multiplies the thresholds compared with the average of the monotonic increases, namely the lower and upper limits of speech-like fluctuation, by β. With this correction, the lower limit of the threshold compared with the average of the monotonic increases becomes "0.37 × β (Hz/ms)" and the upper limit becomes "1.09 × β (Hz/ms)". The correction unit 15 likewise multiplies the thresholds compared with the average of the monotonic decreases by β, so that the lower limit becomes "-1.19 × β (Hz/ms)" and the upper limit becomes "-0.29 × β (Hz/ms)".

After the thresholds are corrected in this way, the speech detection unit 16 compares the average of the monotonic increases and the average of the monotonic decreases, as corrected by the correction unit 15, with the thresholds corrected by the correction unit 15. For example, the speech detection unit 16 determines whether the corrected average of the monotonic increases falls within the corrected range of speech-like fluctuation, that is, between the lower limit "0.37 × β (Hz/ms)" and the upper limit "1.09 × β (Hz/ms)". If it does, the speech detection unit 16 further determines whether the corrected average of the monotonic decreases falls within the corrected range of speech-like fluctuation, that is, between the lower limit "-1.19 × β (Hz/ms)" and the upper limit "-0.29 × β (Hz/ms)".
If the corrected average of the monotonic decreases also falls within its range, the speech detection unit 16 detects the speech section candidate as "speech present". If either the corrected average of the monotonic increases or the corrected average of the monotonic decreases falls outside its corrected range, the speech section candidate is detected as "no speech". The speech detection result is output from the watching terminal 10 to the server device 30 each time a result is obtained, or each time results have been accumulated over a predetermined period, for example one hour. Although the case where both the feature amount and the threshold are corrected is illustrated here as an example, the correction can also be applied to the threshold alone.
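The threshold scaling and the two-stage range check described above can be sketched as follows. The base limits (0.37, 1.09, -1.19, -0.29 Hz/ms) are taken from the text. The function names, the inclusive comparison at the range edges, and the requirement that β be positive are assumptions made for illustration.

```python
# Base ranges of speech-like fluctuation in Hz/ms, as stated in the text.
INCREASE_RANGE = (0.37, 1.09)    # compared with the average of the monotonic increases
DECREASE_RANGE = (-1.19, -0.29)  # compared with the average of the monotonic decreases


def scaled_range(base_range, beta):
    """Multiply both limits of a range by beta (beta is assumed to be positive)."""
    lower, upper = base_range
    return lower * beta, upper * beta


def detect_speech(mean_increase, mean_decrease, beta):
    """Return True for "speech present", False for "no speech".

    mean_increase and mean_decrease are the (possibly already corrected)
    averages of the positive and negative slopes in Hz/ms.
    """
    inc_lower, inc_upper = scaled_range(INCREASE_RANGE, beta)
    if not (inc_lower <= mean_increase <= inc_upper):  # ASSUMPTION: inclusive edges
        return False
    dec_lower, dec_upper = scaled_range(DECREASE_RANGE, beta)
    return dec_lower <= mean_decrease <= dec_upper
```

For instance, with β = 0.5 the increase range narrows to 0.185 to 0.545 Hz/ms, so an average increase of 1.0 Hz/ms that would pass with β = 1 is rejected.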

[Smoothing of speech speed]
The first embodiment described above illustrates the case of calculating the speech speed over the speech section candidate, but the speech speed can also be calculated over other spans. For example, the speech speed calculation unit 14 can calculate the speech speed in units of subframes, for example every second. In this case, the speech speed calculation unit 14 can smooth the speech speed across subframes according to the following equation (3). When the speech speed calculation unit 14 and the subsequent processing units operate in units of subframes, the subframes to be processed are narrowed down to those that overlap part or all of a speech section candidate.

Av_Speech_speed_tx = 0.2 × Speech_speed_tx + 0.8 × Av_Speech_speed_tx-1 ... (3)

"Av_Speech_speed_tx" in equation (3) above denotes the smoothed speech speed [syllables/second] in the current subframe. "Speech_speed_tx" denotes the speech speed [syllables/second] of the current subframe, which can be calculated in the same way as in the first embodiment by substituting the subframe for the speech section candidate. "Av_Speech_speed_tx-1" denotes the smoothed speech speed [syllables/second] calculated in the immediately preceding subframe.
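Equation (3) is a first-order recursive (exponential) smoothing with weights 0.2 and 0.8. A minimal sketch follows. The function names are hypothetical, and seeding the recursion with the first raw value is an assumption, since the text does not say how the first subframe is initialized.

```python
def smooth_speech_speed(current_speed, previous_smoothed):
    """One step of equation (3): 0.2 * current + 0.8 * previous smoothed value."""
    return 0.2 * current_speed + 0.8 * previous_smoothed


def smooth_sequence(speeds):
    """Apply equation (3) across consecutive subframes.

    ASSUMPTION: the first subframe has no predecessor, so its raw speech
    speed is used as the initial smoothed value.
    """
    smoothed = []
    for speed in speeds:
        if not smoothed:
            smoothed.append(float(speed))
        else:
            smoothed.append(smooth_speech_speed(speed, smoothed[-1]))
    return smoothed
```

Because the previous smoothed value carries 80% of the weight, a sudden jump in the per-subframe speech speed only moves the smoothed value by a fifth of the jump in each subframe.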

[Setting of feature amount calculation section, part 1]
Here, the watching terminal 10 can set the total length of the section used for calculating the feature amount according to the speech speed. Hereinafter, the section over which the feature amount is calculated may be referred to as the "feature amount calculation section". The first embodiment described above illustrates the case where the slope of the first-order approximate straight line is calculated repeatedly while a time window with an analysis length of about 60 msec is slid within the speech section candidate. Here, instead of taking the speech section candidate as the total length of the section over which the feature amount is calculated, the feature amount calculation section is set by expanding or contracting the subframe according to the smoothed speech speed. Although the case of expanding or contracting the subframe using the smoothed speech speed is illustrated here merely as an example, the subframe may instead be expanded or contracted using the speech speed calculated in the first embodiment.

For example, the watching terminal 10 calculates, from the smoothed speech speed calculated by the speech speed calculation unit 14, an expansion coefficient γ by which the time length of the subframe is multiplied. As an example, the expansion coefficient γ is calculated according to the derivation function shown in FIG. 10. FIG. 10 is a diagram showing an example of the derivation function of the expansion coefficient γ, as a graph of the relationship between the speech speed and γ. The vertical axis of the graph indicates the expansion coefficient γ, and the horizontal axis indicates the speech speed [syllables/second]. As shown in FIG. 10, the function used to derive γ yields a larger value as the speech speed decreases and a smaller value as the speech speed increases. In other words, the slower the utterance in the subframe, the larger the derived γ, and the faster the utterance, the smaller the derived γ.
Although γ can thus be varied according to the speech speed, the derivation function need not be strictly monotonically decreasing, as shown in FIG. 10. That is, when the speech speed falls within the average range, γ is derived as "1" regardless of the value of the speech speed within that range. When γ is "1", the total length of the section over which the feature amount is calculated is substantially the same as the time length of the subframe. When the speech speed is 2.9 [syllables/second] or less, the same γ as at 2.9 [syllables/second] is derived. Likewise, when the speech speed is 11 [syllables/second] or more, the same γ as at 11 [syllables/second] is derived. Through such a derivation function, the watching terminal 10 can calculate γ from the speech speed.

After the expansion coefficient γ is calculated, the watching terminal 10 sets the feature amount calculation section by multiplying the time length of the subframe, for example one second, by γ. Taking the feature amount calculation section set in this way as the total length, the feature amount calculation unit 13 calculates the feature amount of the fluctuation of the fundamental frequency by computing a statistic of the slopes, for example the average, separately for slopes with positive values and slopes with negative values. The correction unit 15 then corrects the feature amount using the correction coefficient α determined from the smoothed speech speed, and the speech detection unit 16 determines the presence or absence of speech in the subframe according to whether the corrected feature amount falls within the range of speech-like fluctuation of the fundamental frequency.
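The section length computation can be sketched with a small piecewise-linear lookup for γ. Only the clamp positions (2.9 and 11 syllables/second) and the value 1 in the average range come from the text. The γ values at the clamps (1.5 and 0.5), the bounds of the average range (5.0 and 8.0), and the linear ramps are hypothetical placeholders, because the text does not state them.

```python
from bisect import bisect_right

# (speech speed [syllables/s], gamma) breakpoints.
# ASSUMPTION: every gamma value except the plateau at 1.0, and the plateau
# edges 5.0 and 8.0, are illustrative placeholders and not values from the text.
GAMMA_POINTS = [(2.9, 1.5), (5.0, 1.0), (8.0, 1.0), (11.0, 0.5)]


def piecewise_linear(points, x):
    """Evaluate a piecewise-linear curve, holding the end values outside the range."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)  # index of the first breakpoint strictly greater than x
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def feature_section_length(subframe_sec, smoothed_speed):
    """Length of the feature amount calculation section: subframe length times gamma."""
    return subframe_sec * piecewise_linear(GAMMA_POINTS, smoothed_speed)
```

With these placeholder values, a 1-second subframe is stretched to 1.5 seconds for very slow speech and shortened to 0.5 seconds for very fast speech, while average-speed speech keeps the original 1 second.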

[Setting of feature amount calculation section, part 2]
The case of using the speech speed has been illustrated above as one way of setting the feature amount calculation section, but the section can also be set using other factors. For example, the watching terminal 10 can set the feature amount calculation section by expanding or contracting the subframe according to both the smoothed speech speed and the reliability of the subframe.

In this case, the watching terminal 10 calculates, for each subframe, a reliability that represents the degree to which the data of that subframe can be trusted. As an example, the reliability can be calculated as a ratio whose numerator is the number of frames in the subframe for which a fundamental frequency satisfying the two conditions used by the determination unit 12c could be calculated, and whose denominator is the total number of frames included in the subframe.
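The reliability ratio can be sketched as below. The two conditions checked by the determination unit 12c are not restated in this passage, so the sketch assumes their outcome is already available as one boolean per frame. The function name and the handling of an empty subframe are also assumptions.

```python
def subframe_reliability(frame_has_valid_f0):
    """Fraction of frames in a subframe with a usable fundamental frequency.

    frame_has_valid_f0: iterable of booleans, one per frame, True when a
    fundamental frequency meeting both conditions of the determination
    unit 12c was obtained (ASSUMPTION: the conditions are evaluated elsewhere).
    """
    frames = list(frame_has_valid_f0)
    if not frames:
        return 0.0  # ASSUMPTION: an empty subframe is treated as fully unreliable
    return sum(1 for ok in frames if ok) / len(frames)
```

For example, a subframe in which three of four frames yield a usable fundamental frequency has a reliability of 0.75.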

For example, the watching terminal 10 calculates an expansion coefficient γ and an expansion coefficient Δ by which the time length of the subframe is multiplied. Of these, the expansion coefficient γ is calculated from the smoothed speech speed according to FIG. 10, as described above. The expansion coefficient Δ, on the other hand, is calculated from the reliability of the subframe. As an example, Δ is calculated according to the derivation function shown in FIG. 11. FIG. 11 is a diagram showing an example of the derivation function of the expansion coefficient Δ, as a graph of the relationship between the reliability and Δ. The vertical axis of the graph indicates the expansion coefficient Δ, and the horizontal axis indicates the reliability. As shown in FIG. 11, the function used to derive Δ yields a larger value as the reliability decreases and a smaller value as the reliability increases.
Although Δ can thus be varied according to the reliability, the derivation function need not be strictly monotonically decreasing, as shown in FIG. 11. That is, when the reliability is 0.65 or more, the same value "1" as at a reliability of 0.65 is derived as Δ. Through such a derivation function, the watching terminal 10 can calculate Δ from the reliability. FIG. 11 does not define Δ for reliabilities below 0.3. When the reliability is below 0.3, speech is unlikely to be detected, so the calculation of the feature amount can be stopped.

After the expansion coefficient γ and the expansion coefficient Δ are calculated, the watching terminal 10 can set the feature amount calculation section by multiplying the time length of the subframe, for example one second, by γ and Δ. Taking the feature amount calculation section set in this way as the total length, the feature amount calculation unit 13 calculates the feature amount of the fluctuation of the fundamental frequency by computing a statistic of the slopes, for example the average, separately for slopes with positive values and slopes with negative values. If the reliability is below 0.3 at this point, speech is unlikely to be detected, so the calculation of the feature amount can be stopped. The correction unit 15 then corrects the feature amount using the correction coefficient α determined from the smoothed speech speed, and the speech detection unit 16 determines the presence or absence of speech in the subframe according to whether the corrected feature amount falls within the range of speech-like fluctuation of the fundamental frequency.
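Combining the two coefficients, including the early exit for low reliability, might look like the sketch below. The thresholds 0.3 and 0.65 and the value Δ = 1 at or above 0.65 come from the text. The value of Δ at a reliability of 0.3 (`DELTA_AT_MIN`), the linear ramp between 0.3 and 0.65, and the use of `None` to signal that the feature calculation is skipped are assumptions. γ is passed in as an argument so that the sketch does not depend on any particular γ curve.

```python
from typing import Optional

MIN_RELIABILITY = 0.3   # from the text: below this, feature calculation can be stopped
SATURATION = 0.65       # from the text: delta is 1 at or above this reliability
DELTA_AT_MIN = 2.0      # ASSUMPTION: delta at reliability 0.3 is not given in the text


def expansion_coefficient_delta(reliability: float) -> Optional[float]:
    """Return delta for a subframe reliability, or None when below 0.3."""
    if reliability < MIN_RELIABILITY:
        return None
    if reliability >= SATURATION:
        return 1.0
    # ASSUMPTION: linear decrease from DELTA_AT_MIN down to 1.0
    frac = (reliability - MIN_RELIABILITY) / (SATURATION - MIN_RELIABILITY)
    return DELTA_AT_MIN + (1.0 - DELTA_AT_MIN) * frac


def feature_section_length(subframe_sec: float, gamma: float,
                           reliability: float) -> Optional[float]:
    """Subframe length times gamma times delta, or None to skip the subframe."""
    delta = expansion_coefficient_delta(reliability)
    if delta is None:
        return None  # speech is unlikely to be detected, so no feature is computed
    return subframe_sec * gamma * delta
```

A caller would check for `None` before running the slope analysis, which mirrors the option of stopping the feature amount calculation when the reliability is below 0.3.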

[Feature amount]
The first embodiment described above illustrates the case of calculating a feature amount related to the fluctuation of the fundamental frequency, but the feature amount calculated by the watching terminal 10 is not limited to this. For example, the watching terminal 10 may calculate a feature amount related to the fluctuation of a harmonic of the fundamental frequency. In this case as well, apart from adapting the derivation of the correction coefficient α to the harmonic and adapting the threshold compared with the corrected feature amount to the harmonic, the processing for calculating the speech speed, correcting the feature amount, and detecting speech is unchanged, and the presence or absence of speech in the speech section candidate can be detected in the same way as in the first embodiment.

[Distribution and integration]
The components of each illustrated device need not be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the sound signal acquisition unit 11, the speech candidate extraction unit 12, the feature amount calculation unit 13, the speech speed calculation unit 14, the correction unit 15, or the speech detection unit 16 may be connected via a network as an external device of the watching terminal 10. Alternatively, separate devices may each include the sound signal acquisition unit 11, the speech candidate extraction unit 12, the feature amount calculation unit 13, the speech speed calculation unit 14, the correction unit 15, or the speech detection unit 16, and may be connected via a network and cooperate to realize the functions of the watching terminal 10 described above.

In addition, although the first embodiment describes an example in which the speech detection process shown in FIG. 8 is executed by the watching terminal 10, the entity that executes this process is not limited to the watching terminal 10. For example, the server device 30 can accept an upload of the sound signal collected by the microphone of the watching terminal 10 and execute the speech detection process on the uploaded sound signal.

[Speech detection program]
The various processes described in the above embodiments can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. An example of a computer that executes a speech detection program having the same functions as the above embodiments is therefore described below with reference to FIG. 12.

FIG. 12 is a diagram illustrating an example of the hardware configuration of a computer that executes the speech detection program according to the first and second embodiments. As illustrated in FIG. 12, the computer 100 includes an operation unit 110a, a speaker 110b, a camera 110c, a display 120, and a communication unit 130. The computer 100 further includes a CPU 150, a ROM 160, an HDD 170, and a RAM 180. These units 110 to 180 are connected via a bus 140.

As shown in FIG. 12, the HDD 170 stores a speech detection program 170a that provides the same functions as the sound signal acquisition unit 11, the speech candidate extraction unit 12, the feature amount calculation unit 13, the speech speed calculation unit 14, the correction unit 15, and the speech detection unit 16 described in the first embodiment. Like the components shown in FIG. 4, namely the sound signal acquisition unit 11, the speech candidate extraction unit 12, the feature amount calculation unit 13, the speech speed calculation unit 14, the correction unit 15, and the speech detection unit 16, the speech detection program 170a may be integrated or separated. That is, the HDD 170 need not store all the data described in the first embodiment, and it is sufficient that the data used for processing is stored in the HDD 170.

In this environment, the CPU 150 reads the speech detection program 170a from the HDD 170 and loads it into the RAM 180. As a result, the speech detection program 170a functions as a speech detection process 180a, as shown in FIG. 12. The speech detection process 180a loads various data read from the HDD 170 into the area of the RAM 180 allocated to the speech detection process 180a, and executes various processes using the loaded data. Examples of the processes executed by the speech detection process 180a include the process shown in FIG. 8. Not all the processing units described in the first embodiment need to operate on the CPU 150, and it is sufficient that the processing units corresponding to the processes to be executed are virtually realized.

The speech detection program 170a need not be stored in the HDD 170 or the ROM 160 from the beginning. For example, the speech detection program 170a may be stored on a "portable physical medium" inserted into the computer 100, such as a flexible disk (a so-called FD), a CD-ROM, a DVD, a magneto-optical disk, or an IC card, and the computer 100 may acquire the speech detection program 170a from such a medium and execute it. Alternatively, the speech detection program 170a may be stored in another computer or a server device connected to the computer 100 via a public line, the Internet, a LAN, a WAN, or the like, and the computer 100 may acquire the speech detection program 170a from there and execute it.

1 Remote care system
10 Watching terminal
11 Sound signal acquisition unit
12 Speech candidate extraction unit
12a Sound section detection unit
12b Fundamental tone calculation unit
12c Determination unit
13 Feature amount calculation unit
14 Speech speed calculation unit
15 Correction unit
16 Speech detection unit
30 Server device
50 Related party terminal

Claims

A speech detection apparatus comprising:
a sound signal acquisition unit that acquires a sound signal;
a speech candidate extraction unit that extracts, as a speech section candidate, a candidate for a section likely to contain speech, based on a sound section of the time waveform of the sound signal;
a feature amount calculation unit that calculates a feature amount indicating the degree of speech likeness of the frequency change in the speech section candidate;
a speech speed calculation unit that calculates a speech speed in the speech section candidate;
a correction unit that corrects the feature amount according to the speech speed; and
a speech detection unit that detects the presence or absence of speech in the speech section candidate based on the corrected feature amount.

The speech detection apparatus according to claim 1, wherein the speech speed calculation unit calculates the autocorrelation of the time waveform in the speech section candidate, estimates the number of syllables included in the speech section candidate from the number of continuous sections in which the correlation coefficient is greater than or equal to a predetermined threshold, and calculates the speech speed per unit time from the number of syllables.

The speech detection apparatus according to claim 1 or 2, wherein the correction unit corrects, according to the speech speed, a threshold to be compared with the feature amount, and the speech detection unit detects the presence or absence of speech in the speech section candidate by comparing the corrected feature amount with the corrected threshold.

The speech detection apparatus according to any one of claims 1 to 3, further comprising a section setting unit that expands or contracts, according to the speech speed, the speech section candidate over which the feature amount is calculated.

The speech detection apparatus according to claim 4, wherein the section setting unit expands or contracts the speech section candidate over which the feature amount is calculated according to the reliability of data in the speech section candidate and the speech speed.

A speech detection program that causes a computer to execute a process comprising:
acquiring a sound signal;
extracting, as a speech section candidate, a candidate for a section likely to contain speech, based on a sound section of the time waveform of the sound signal;
calculating a feature amount indicating the degree of speech likeness of the frequency change in the speech section candidate;
calculating a speech speed in the speech section candidate;
correcting the feature amount according to the speech speed; and
detecting the presence or absence of speech in the speech section candidate based on the corrected feature amount.