JP2019132912A

JP2019132912A - Living sound recording device and living sound recording method

Info

Publication number: JP2019132912A
Application number: JP2018013032A
Authority: JP
Inventors: Katsushi Sakai (境 克司); Yuichi Murase (村瀬 有一)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-01-29
Filing date: 2018-01-29
Publication date: 2019-08-08

Abstract

To protect the privacy of a user. SOLUTION: A living sound recording device comprises: an acquisition unit that acquires audio data including sound occurring in a living space of a user; a division unit that divides the audio data acquired by the acquisition unit into frames of a time length corresponding to the duration of living sounds accompanying the user's behavior or actions; and a recording control unit that changes the order of the plurality of frames obtained by the division and records them in a memory. SELECTED DRAWING: Figure 2

Description

The present invention relates to a living sound recording device and a living sound recording method.

In recent years, advances in artificial intelligence technology and the accumulation of huge conversation databases have made it practical to provide input to electronic devices by human voice. Far-field speech recognition, which recognizes speech from a position relatively far from the microphone, has also reached a practical level; for example, speech can be recognized from a position several meters away from the microphone. These technologies are expected to contribute to the realization of a wide variety of services.

On the other hand, the sounds generated in a human living space are not limited to speech produced by utterances; sudden sounds such as coughing, footsteps, and doors opening and closing, as well as sounds beyond the human audible range, may also occur. Such living sounds are considered usable for various services, for example a monitoring service that checks on the safety of elderly people.

Japanese Patent Laying-Open No. 2008-233671
Japanese Patent Laying-Open No. 2005-086707

However, how living sounds should be recorded in order to provide services that use them has not been specifically studied. For example, if all sounds generated in the user's house are simply recorded, the user's speech is recorded as is, and the user's privacy is not protected. One conceivable alternative is to apply real-time processing to the sound input to the electronic device and to detect and record only specific events, but the processing load of such real-time processing is large, which makes it impractical.

In one aspect, an object of the present invention is to provide a living sound recording device and a living sound recording method capable of recording living sounds while protecting the user's privacy.

In one aspect, the living sound recording device includes: an acquisition unit that acquires audio data including sound generated in a user's living space; a dividing unit that divides the audio data acquired by the acquisition unit into frames of a time length corresponding to the duration of living sounds accompanying the user's behavior or actions; and a recording control unit that changes the order of the plurality of frames obtained by the dividing unit and records them in a memory.

Living sounds can be recorded while protecting the user's privacy.

FIG. 1 is a diagram illustrating a configuration example of the information processing system according to the first embodiment.
FIG. 2 is a block diagram illustrating the configuration of the living sound recording device according to the first embodiment.
FIG. 3 is a diagram illustrating specific examples of living sounds.
FIG. 4 is a block diagram illustrating the configuration of the server device according to the first embodiment.
FIG. 5 is a flowchart illustrating the living sound recording method according to the first embodiment.
FIG. 6 is a block diagram illustrating the configuration of the living sound recording device according to the second embodiment.
FIG. 7 is a block diagram illustrating the configuration of the server device according to the second embodiment.
FIG. 8 is a flowchart illustrating the living sound recording method according to the second embodiment.
FIG. 9 is a diagram illustrating a specific example of frame reordering.
FIG. 10 is a diagram illustrating another specific example of frame reordering.
FIG. 11 is a diagram illustrating a configuration example of the information processing system according to the third embodiment.
FIG. 12 is a flowchart illustrating the living sound discrimination method according to the third embodiment.
FIG. 13 is a diagram explaining the discrimination of living sounds.

The living sound recording device and the living sound recording method according to the present application are described below with reference to the accompanying drawings. These embodiments do not limit the disclosed technology. The embodiments can be combined as appropriate to the extent that their processing contents do not contradict each other.

[Information processing system]
FIG. 1 is a diagram illustrating a configuration example of the information processing system according to the first embodiment. In the information processing system shown in FIG. 1, a living sound recording device 100 is installed, for example, in a user's house, and the living sound recording device 100 and a base station device 10 can communicate with each other. The base station device 10 is connected to a server device 200 via a network N such as the Internet. The living sound recording device 100 may be a general-purpose information processing terminal such as a personal computer or smartphone owned by the user, or may be a dedicated IoT (Internet of Things) terminal.

The living sound recording device 100 includes a sound input device such as a microphone and records the ambient sound input from that device. In doing so, the living sound recording device 100 divides the input sound into frames of a time length in which living sounds can be identified (for example, 500 ms (milliseconds)), changes the order of the frames, and then records the audio data of each frame. The living sound recording device 100 then transmits the recorded audio data to the base station device 10. The specific configuration and operation of the living sound recording device 100 are described in detail later.

The base station device 10 is, for example, an access point installed in the same house as the living sound recording device 100, or a base station device of a mobile communication system installed outside the house. It receives the audio data transmitted from the living sound recording device 100 and transfers it to the server device 200. The base station device 10 does not necessarily have to communicate wirelessly with the living sound recording device 100; the two may be connected by wire. When they are connected by wire, the base station device 10 corresponds to, for example, a gateway.

The server device 200 acquires the audio data transferred from the base station device 10 and extracts the living sounds contained in the audio data of each frame. The server device 200 then determines what kind of sound each extracted living sound is and outputs the determination result. Because the living sound recording device 100 has changed the order of the frames, sound that continues over multiple frames, such as the user's speech, cannot be identified by the server device 200, and the user's privacy is protected.

[Configuration of living sound recording device]
FIG. 2 is a block diagram illustrating the configuration of the living sound recording device 100 according to the first embodiment. The living sound recording device 100 shown in FIG. 2 includes a voice input unit 110, a processor 120, a wireless transmission unit 130, and a memory 140.

The voice input unit 110 includes, for example, a microphone and receives the sound around the living sound recording device 100. The voice input unit 110 samples the input sound at high speed, for example at a sampling frequency of 192 kHz, and outputs the resulting audio data to the processor 120. The voice input unit 110 is an example of an acquisition unit that acquires audio data.

Because the living sound recording device 100 is installed, for example, in the user's house, the sound input to the voice input unit 110 includes speech sounds produced when the user speaks and living sounds produced in the course of the user's daily life. Living sounds are sounds generated by the user's behavior and actions other than speech; specific examples are shown in FIG. 3.

FIG. 3 shows, as specific examples of living sounds, the change over time in the level of (A) the sound of using a kitchen knife, (B) the sound of closing a refrigerator door, and (C) the sound of igniting a gas stove. These living sounds are known to have characteristic features in frequency bands above the human audible range, to last roughly 200 to 300 ms in general, and to last 500 ms or less in most cases. For example, in (A) the sound of using a kitchen knife shown in FIG. 3, sounds at or above a predetermined level occur intermittently several times, but each individual sound lasts only about 100 ms. The sound of closing the refrigerator door in (B) lasts about 210 ms, and the sound of igniting the gas stove in (C) lasts about 420 ms.

Thus, living sounds generated by the user's behavior and actions other than speech often last 500 ms or less, and tend to be shorter and more abrupt than the speech sounds generated when the user speaks.

Returning to FIG. 2, the processor 120 includes, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a DSP (Digital Signal Processor), and performs overall control of the living sound recording device 100. Specifically, the processor 120 includes a dividing unit 121, a living sound determination unit 122, a recording control unit 123, and a communication control unit 124.

The dividing unit 121 divides the audio data input from the voice input unit 110 into frames. Specifically, it divides the continuously input audio data into frames of a time length in which living sounds can be identified. As described above, most living sounds last 500 ms or less, so the dividing unit 121 divides the audio data into frames of, for example, 500 ms. With a frame length of about 200 to 500 ms, living sounds can be extracted from individual frames, whereas speech, which lasts longer than living sounds, does not fit within one frame, so it is difficult to determine the content of the user's utterance from any individual frame.
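The division into frames can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_into_frames` and the handling of a trailing partial frame are assumptions, while the 192 kHz sampling frequency and the 500 ms frame length are the example values given in the description. At those values one frame holds 96,000 samples.

```python
SAMPLE_RATE_HZ = 192_000  # example sampling frequency from the description
FRAME_MS = 500            # example frame length from the description


def split_into_frames(samples, sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Split a sequence of samples into consecutive fixed-length frames.

    A trailing partial frame is kept as-is. This is an assumption; the
    description does not say how the end of the input is handled.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


# Small illustrative input: 8 Hz "audio" split into 500 ms frames of 4 samples.
frames = split_into_frames(list(range(10)), sample_rate_hz=8, frame_ms=500)
```

A shorter frame length within the 200 to 500 ms range mentioned in the text can be tried by changing `frame_ms`.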

The living sound determination unit 122 determines whether each frame obtained by the dividing unit 121 contains a living sound. Specifically, the living sound determination unit 122 determines, for example, whether the audio data of each frame contains sound at or above a predetermined threshold level, and judges that a frame containing such sound contains a living sound.
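A minimal sketch of this threshold test is shown below. The description does not define how the "level" is measured, so the peak absolute amplitude used here is an assumption, as is the function name `contains_living_sound`.

```python
def contains_living_sound(frame, threshold):
    """Return True if any sample in the frame reaches the threshold level.

    The level is taken to be the absolute sample amplitude (an assumption;
    an RMS or band-limited level would be equally compatible with the text).
    """
    return any(abs(sample) >= threshold for sample in frame)


loud = contains_living_sound([0.0, 0.6, -0.1], threshold=0.5)
quiet = contains_living_sound([0.1, -0.2], threshold=0.5)
```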

The recording control unit 123 controls the recording of the audio data of each frame based on the determination result of the living sound determination unit 122. Specifically, the recording control unit 123 attaches a tag, such as the sound input time, to each frame determined by the living sound determination unit 122 to contain a living sound, and records the frame in the memory 140. The recording control unit 123 discards and deletes frames determined not to contain a living sound. In other words, the recording control unit 123 thins out the frames that contain no living sound and records only the frames that contain a living sound in the memory 140. As a result, the frames are recorded in the memory 140 in an order that differs from the original frame sequence of the audio data.
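The tagging and thinning described above might look like the following sketch. The `TaggedFrame` type, the `record_frames` function, and the use of a millisecond offset as the input-time tag are all hypothetical names and choices made for illustration; the level test repeats the assumed peak-amplitude comparison.

```python
from dataclasses import dataclass


@dataclass
class TaggedFrame:
    input_time_ms: int  # tag: time at which the frame's sound was input
    samples: list


def record_frames(frames, threshold, frame_ms=500, start_ms=0):
    """Keep only frames that contain a living sound, tagged with input time."""
    memory = []  # stands in for the memory 140
    for index, frame in enumerate(frames):
        if any(abs(sample) >= threshold for sample in frame):
            memory.append(TaggedFrame(start_ms + index * frame_ms, frame))
        # Frames below the threshold are discarded and never stored.
    return memory


recorded = record_frames([[0.0], [0.9], [0.1], [0.7]], threshold=0.5)
recorded_times = [frame.input_time_ms for frame in recorded]
```

Because the discarded frames leave gaps, the stored sequence is no longer contiguous, which is the property the text relies on to make multi-frame speech hard to reconstruct.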

The communication control unit 124 controls the transmission of audio data. Specifically, the communication control unit 124 generates transmission data from the audio data frames recorded in the memory 140, sets the server device 200 as the destination of the transmission data, and outputs the transmission data to the wireless transmission unit 130. The communication control unit 124 may generate the transmission data in real time when audio data is recorded in the memory 140, or may generate it at a predetermined interval from the audio data accumulated in the memory 140. In other words, the communication control unit 124 may transmit the audio data in real time, or may transmit it after a certain amount has accumulated in the memory 140.

The wireless transmission unit 130 applies predetermined wireless transmission processing to the transmission data generated by the communication control unit 124 and transmits it to the base station device 10 via an antenna. After being received by the base station device 10, the transmission data is transferred to the server device 200 set as its destination.

The memory 140 includes, for example, a RAM (Random Access Memory) or a ROM (Read Only Memory), and records the audio data frames output from the recording control unit 123. Because the recording control unit 123 has thinned out frames, the order of the frames recorded in the memory 140 differs from the frame order of the input audio data. This makes it difficult to determine the content of speech that continues over multiple frames, and the user's privacy is protected. The memory 140 also stores various information while the processor 120 executes processing.

[Configuration of server device]
FIG. 4 is a block diagram illustrating the configuration of the server device 200 according to the first embodiment. The server device 200 shown in FIG. 4 includes a communication interface (hereinafter abbreviated as "communication I/F") 210, a processor 220, and a memory 230.

The communication I/F 210 connects to a network N such as the Internet and transmits and receives various data. Specifically, the communication I/F 210 receives the audio data transmitted from the living sound recording device 100.

The processor 220 includes, for example, a CPU, an MPU, an ASIC, an FPGA, or a DSP, and performs overall control of the server device 200. Specifically, the processor 220 includes a living sound extraction unit 221, a living sound discrimination unit 222, and a result output unit 223.

The living sound extraction unit 221 extracts living sounds from each frame of the audio data received by the communication I/F 210. Specifically, the living sound extraction unit 221 extracts from each frame, for example, sound at or above a predetermined threshold level as a living sound. In this embodiment, the living sound recording device 100 records and transmits only frames containing sound at or above the predetermined threshold, so the living sound extraction unit 221 extracts living sounds at or above the predetermined threshold from all frames of the received audio data. However, if the living sound extraction unit 221 uses a threshold different from that of the living sound recording device 100, or a method different from that of the living sound recording device 100, living sounds need not be extracted from every frame.

The living sound discrimination unit 222 determines what kind of sound each living sound extracted by the living sound extraction unit 221 is. Specifically, the living sound discrimination unit 222 reads from the memory 230 living sound pattern information, in which, for example, the time-waveform patterns of various living sounds are categorized, and determines the type of living sound extracted from each frame by pattern matching between the extracted living sound and the living sound pattern information. The discrimination method is not limited to pattern matching of time waveforms; it may be, for example, pattern matching of the frequency spectrum obtained by applying a Fourier transform to the audio data.
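One simple way to realize time-waveform pattern matching is a normalized correlation against stored templates, sketched below. The description does not specify the matching algorithm, so the zero-lag normalized correlation, the `min_score` cutoff, and the names `normalized_correlation` and `classify` are assumptions chosen only to illustrate the idea; the template values are made up.

```python
import math


def normalized_correlation(a, b):
    """Cosine similarity of two waveforms over their common length."""
    n = min(len(a), len(b))
    dot = sum(x * y for x, y in zip(a[:n], b[:n]))
    norm_a = math.sqrt(sum(x * x for x in a[:n]))
    norm_b = math.sqrt(sum(y * y for y in b[:n]))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(sound, patterns, min_score=0.8):
    """Return the label of the best-matching template, or None if no
    template reaches min_score (a hypothetical cutoff)."""
    best_label, best_score = None, min_score
    for label, template in patterns.items():
        score = normalized_correlation(sound, template)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Toy living sound pattern information (illustrative values only).
patterns = {"knife": [1, 0, 0, 0], "door": [0, 1, 1, 0]}
label = classify([0.9, 0.1, 0.0, 0.0], patterns)
```

A practical matcher would also need to handle time shifts within the frame, or compare frequency spectra as the text suggests; this sketch compares aligned waveforms only.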

The result output unit 223 outputs the determination result of the living sound discrimination unit 222. That is, the result output unit 223 outputs the type of living sound for each frame together with the information in the tag attached to that frame. Accordingly, the result output unit 223 displays information associating the sound input time with the sound type, or provides that information to an application that performs further processing.

The memory 230 includes, for example, a RAM or a ROM, and stores in advance the living sound pattern information in which the patterns of various living sounds are categorized. The memory 230 also stores various information while the processor 220 executes processing.

[Living sound recording method]
Next, the living sound recording method according to the first embodiment is described with reference to the flowchart shown in FIG. 5. The living sound recording process described below is executed by the living sound recording device 100.

While the living sound recording device 100 is operating, the sound around the living sound recording device 100 is input via the microphone or the like provided in the voice input unit 110 (step S101). The voice input unit 110 samples the input sound at high speed, for example at a sampling frequency of 192 kHz (step S102), and the resulting audio data is input to the dividing unit 121 of the processor 120.

The audio data is then divided into per-frame audio data by the dividing unit 121 (step S103). That is, the audio data is divided into frames of a time length (for example, 500 ms) in which living sounds can be extracted but the content of speech cannot be determined. The frame length may be set as appropriate within a range of, for example, 200 to 500 ms, and may be set to a length suited to the types of living sounds that occur frequently in the environment where the living sound recording device 100 is installed.

Each frame of audio data generated by the dividing unit 121 is input to the living sound determination unit 122, which determines whether the frame contains a living sound (step S104). Specifically, the living sound determination unit 122 compares the sound level of each frame with a predetermined threshold and determines that a frame containing sound at or above the threshold is a frame containing a living sound.

If the frame is determined to contain a living sound (step S104: Yes), the recording control unit 123 attaches a tag indicating information such as the sound input time to the frame (step S105) and records the frame in the memory 140 (step S106). If the frame is determined not to contain a living sound (step S104: No), the recording control unit 123 discards and deletes the frame (step S107).

The recording control unit 123 then determines whether all frames corresponding to the input sound have been either recorded or deleted (step S108). If frames remain that have been neither recorded nor deleted (step S108: No), the above processing is repeated for the remaining frames. When all frames have been recorded or deleted (step S108: Yes), the living sound recording process ends.

In this way, frames containing living sounds are recorded in the memory 140 while frames containing no living sound are deleted. As a result, the frames recorded in the memory 140 are not contiguous, and their order differs from the order of the frames generated by the dividing unit 121. The content of speech, which lasts longer than living sounds and spans multiple frames, therefore becomes difficult to determine from the frames recorded in the memory 140. Living sounds, in contrast, can be extracted from individual frames, and the type of living sound contained in a frame can be determined. This makes it possible to conceal the content of the user's utterances and protect privacy while still enabling analysis based on living sounds.

As described above, according to this embodiment, the audio data is divided into frames of a time length in which living sounds can be extracted but the content of speech cannot be determined, and frames containing no living sound are thinned out before recording in the memory. The frames corresponding to the input sound are thus recorded in the memory in a changed order, so that the content of the user's utterances is concealed and privacy is protected.

[Modification]
In the embodiment described above, frames determined by the living sound determination unit 122 not to contain a living sound are discarded and deleted. However, not all such frames need to be deleted. That is, even among frames determined not to contain a living sound, some may be tagged and recorded in the memory 140. Recording frames that contain no living sound in this way also makes it possible to identify environmental sounds, which, unlike living sounds and speech, occur continuously. Specifically, for example, the running sound of automobiles or trains passing near the house can be identified as an environmental sound.

When recording frames determined not to contain a living sound, such frames may, for example, be recorded in the memory 140 at intervals of a predetermined number of frames. Even in this case, not all frames are recorded in the memory 140, so the frames corresponding to the input sound are recorded in a changed order. As a result, the content of the user's utterances is concealed and privacy is protected.
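The interval-based recording in this modification can be sketched as follows. The function name `select_frames` and the choice of keeping every fourth frame without a living sound are assumptions for illustration; the description only says "at intervals of a predetermined number".

```python
def select_frames(has_living_sound, keep_every=4):
    """Return the indices of frames to record.

    Frames with a living sound are always kept. Of the frames without one,
    only every `keep_every`-th is kept (hypothetical interval), so that
    continuous environmental sound can still be sampled.
    """
    kept = []
    quiet_seen = 0
    for index, flag in enumerate(has_living_sound):
        if flag:
            kept.append(index)
        else:
            quiet_seen += 1
            if quiet_seen % keep_every == 0:
                kept.append(index)
    return kept


flags = [False, True, False, False, False, False, True, False]
kept_indices = select_frames(flags, keep_every=4)
```

With `keep_every` greater than 1, consecutive quiet frames are never all stored, which preserves the gap structure the privacy argument depends on.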

The second embodiment is characterized in that the order of the frames is changed by rearranging the frames obtained by dividing the audio data. The configuration example of the information processing system according to the second embodiment is the same as that of the first embodiment (FIG. 1), and its description is omitted.

[Configuration of living sound recording device]
FIG. 6 is a block diagram illustrating the configuration of the living sound recording device 100 according to the second embodiment. In FIG. 6, the same parts as in FIG. 2 are denoted by the same reference numerals, and their description is omitted. The living sound recording device 100 shown in FIG. 6 has a recording control unit 151 in place of the recording control unit 123 of the living sound recording device 100 shown in FIG. 2.

The recording control unit 151 controls the recording of the audio data of each frame based on the determination result of the living sound determination unit 122. Specifically, the recording control unit 151 attaches a tag, such as the sound input time, to each frame determined by the living sound determination unit 122 to contain a living sound. Regardless of the determination result, the recording control unit 151 then accumulates a predetermined number of frames output from the living sound determination unit 122 and rearranges the order of the accumulated frames. It then records the rearranged frames in the memory 140. In other words, the recording control unit 151 rearranges the frames output from the living sound determination unit 122 in groups of a predetermined number before recording them in the memory 140. As a result, the frames of the audio data are recorded in the memory 140 in a changed order.

記録制御部１５１が蓄積するフレームの数は、例えば音声から意味のある日常イベントを検出可能な最短時間に応じて設定されれば良い。すなわち、例えば意味のある日常イベントが３秒間の音声から検出可能であり、フレームの時間長が５００ｍｓである場合、記録制御部１５１は、６フレーム（５００ｍｓ×６＝３ｓ）を蓄積してから順序を入れ替える。フレームの順序の入れ替えは、例えば乱数発生器によって乱数を発生させ、ランダムな入れ替えパターンを生成することにより実現可能である。このとき、記録制御部１５１は、生活音が含まれると判定されてタグが付与されたフレームは順序の入れ替えの対象とせず、生活音が含まれないと判定されたフレームのみを順序の入れ替えの対象としても良い。つまり、記録制御部１５１は、生活音が含まれるフレームについては、時間的な位置を維持しても良い。これにより、生活音が発生した時系列を維持したままフレームを記録することができる。 The number of frames accumulated by the recording control unit 151 may be set according to, for example, the shortest time in which a meaningful daily event can be detected from sound. For example, if a meaningful daily event can be detected from 3 seconds of sound and the frame length is 500 ms, the recording control unit 151 accumulates 6 frames (500 ms × 6 = 3 s) and then rearranges their order. The rearrangement can be realized by, for example, generating random numbers with a random number generator and producing a random rearrangement pattern. In doing so, the recording control unit 151 may exclude tagged frames, which were determined to contain a living sound, from the rearrangement and rearrange only the frames determined not to contain a living sound. That is, the recording control unit 151 may keep the temporal position of frames that contain a living sound. This allows frames to be recorded while preserving the time series in which the living sounds occurred.
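The block-wise rearrangement described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dict-based frame representation, and the `"tag"` key are assumptions made for the example, and Python's `random` module stands in for the random number generator.

```python
import random


def frames_per_block(event_seconds: float, frame_ms: int) -> int:
    """Number of frames to buffer before rearranging (e.g. 3 s / 500 ms = 6)."""
    return max(2, round(event_seconds * 1000 / frame_ms))


def shuffle_block(frames, rng=None, keep_tagged=False):
    """Return a reordered copy of one buffered block of frames.

    Each frame is a dict; a frame judged to contain a living sound carries a
    "tag" key (hypothetical representation). With keep_tagged=True, tagged
    frames stay at their original index and only the others are permuted.
    """
    rng = rng or random.Random()
    out = list(frames)
    # Indices that are allowed to move.
    movable = [i for i, f in enumerate(out) if not (keep_tagged and "tag" in f)]
    permuted = movable[:]
    rng.shuffle(permuted)
    for dst, src in zip(movable, permuted):
        out[dst] = frames[src]
    return out
```

A random permutation can occasionally leave the order unchanged; a real device might redraw the pattern in that case, which the text does not specify.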

［サーバ装置の構成］ [Configuration of server device]
図７は、実施例２に係るサーバ装置２００の構成を示すブロック図である。図７において、図４と同じ部分には同じ符号を付し、その説明を省略する。図７に示すサーバ装置２００は、図４に示すサーバ装置２００に環境音抽出部２５１及び環境音判別部２５２を追加した構成を採る。 FIG. 7 is a block diagram illustrating the configuration of the server device 200 according to the second embodiment. In FIG. 7, the same parts as those in FIG. 4 are denoted by the same reference numerals, and their description is omitted. The server device 200 shown in FIG. 7 adds an environmental sound extraction unit 251 and an environmental sound discrimination unit 252 to the server device 200 shown in FIG. 4.

環境音抽出部２５１は、音声データのフレームに含まれる環境音を抽出する。具体的には、環境音抽出部２５１は、音声データの複数のフレームを取得し、これらの複数のフレームに共通して含まれる音声を環境音として抽出する。環境音とは、生活音及び話声音とは異なり継続的に発生する音声であり、例えば住宅の近くを走行する自動車や電車の走行音などが環境音となる。本実施例では、生活音が含まれないと判定されたフレームも含めて生活音記録装置１００のメモリ１４０に記録されるため、サーバ装置２００の通信Ｉ／Ｆ２１０が受信する音声データには、生活音が含まれないフレームも含まれる。また、生活音記録装置１００によってフレームの順序が入れ替えられているものの、時間的な位置が近い所定数のフレームの範囲内で順序が入れ替えられている。このため、環境音抽出部２５１は、連続する複数のフレームに共通して含まれる音声を、継続的に発生する環境音として抽出することができる。 The environmental sound extraction unit 251 extracts environmental sound contained in the frames of sound data. Specifically, the environmental sound extraction unit 251 acquires a plurality of frames of sound data and extracts, as environmental sound, the sound that these frames have in common. Unlike living sounds and speech, environmental sound occurs continuously; examples include the sound of cars and trains running near the house. In this embodiment, frames determined not to contain a living sound are also recorded in the memory 140 of the living sound recording device 100, so the sound data received by the communication I/F 210 of the server device 200 also includes frames without a living sound. In addition, although the living sound recording device 100 rearranges the frames, the rearrangement takes place only within a range of a predetermined number of frames that are close together in time. The environmental sound extraction unit 251 can therefore extract the sound common to a plurality of consecutive frames as continuously occurring environmental sound.

環境音判別部２５２は、環境音抽出部２５１によって抽出された環境音が何の音であるかを判別する。具体的には、環境音判別部２５２は、例えば各種の環境音の周波数スペクトラムのパターンを類型化した環境音パターン情報を用いて、抽出された環境音のパターンマッチングを行い、複数のフレームから抽出された環境音の種類を判別する。なお、環境音判別部２５２による判別方法は、パターンマッチングに限定されず、音声の種類を判別可能な他の方法であっても良い。環境音判別部２５２による判別結果は、結果出力部２２３によって、生活音の判別結果とともに出力される。 The environmental sound discrimination unit 252 determines what kind of sound the environmental sound extracted by the environmental sound extraction unit 251 is. Specifically, the environmental sound discrimination unit 252 performs pattern matching on the extracted environmental sound using, for example, environmental sound pattern information that categorizes the frequency spectrum patterns of various environmental sounds, and thereby determines the type of the environmental sound extracted from the plurality of frames. The discrimination method is not limited to pattern matching and may be any other method capable of determining the type of sound. The result of the environmental sound discrimination unit 252 is output by the result output unit 223 together with the living sound discrimination result.
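As an illustration of the pattern matching mentioned above, the following sketch compares a coarse frequency spectrum against stored templates. The text does not specify a similarity measure; cosine similarity, the four-band spectra, the template values, and the 0.8 threshold are all assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify_environmental_sound(spectrum, patterns, min_score=0.8):
    """Return the best-matching label, or None if nothing is similar enough."""
    best_label, best_score = None, min_score
    for label, template in patterns.items():
        score = cosine_similarity(spectrum, template)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical environmental sound pattern information (band energies).
PATTERNS = {
    "car": [0.9, 0.6, 0.2, 0.1],
    "train": [0.3, 0.8, 0.7, 0.2],
}
```

Returning `None` below the threshold keeps an unknown sound from being forced into the nearest category.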

［生活音記録方法］ [Living sound recording method]
次いで、実施例２に係る生活音記録方法について、図８に示すフロー図を参照しながら説明する。図８において、図５と同じ部分には同じ符号を付し、その詳しい説明を省略する。以下に説明する生活音記録処理は、生活音記録装置１００によって実行される。 Next, a living sound recording method according to the second embodiment will be described with reference to the flowchart shown in FIG. 8. In FIG. 8, the same parts as those in FIG. 5 are denoted by the same reference numerals, and their detailed description is omitted. The living sound recording process described below is executed by the living sound recording device 100.

生活音記録装置１００へ入力された音声は、音声入力部１１０によって高速サンプリングされ、得られた音声データがプロセッサ１２０の分割部１２１へ入力される（ステップＳ１０１〜Ｓ１０２）。そして、音声データは、分割部１２１によってフレームごとの音声データに分割される（ステップＳ１０３）。 The voice input to the living sound recording apparatus 100 is sampled at high speed by the voice input unit 110, and the obtained voice data is input to the dividing unit 121 of the processor 120 (steps S101 to S102). Then, the audio data is divided into audio data for each frame by the dividing unit 121 (step S103).

分割部１２１によって生成された音声データの各フレームは、生活音判定部１２２へ入力され、それぞれのフレームに生活音が含まれるか否かが判定される（ステップＳ１０４）。フレームが生活音を含むと判定された場合には（ステップＳ１０４Ｙｅｓ）、記録制御部１５１によって、このフレームに音声の入力時刻などの情報を示すタグが付与されて蓄積される（ステップＳ１０５）。同様に、生活音を含まないと判定されたフレームについても（ステップＳ１０４Ｎｏ）、記録制御部１５１によって蓄積される。 Each frame of the audio data generated by the dividing unit 121 is input to the living sound determining unit 122, and it is determined whether or not each frame includes a living sound (step S104). If it is determined that the frame includes a living sound (Yes at Step S104), the recording control unit 151 adds a tag indicating information such as the voice input time to the frame and accumulates it (Step S105). Similarly, the recording control unit 151 also accumulates frames determined not to contain a living sound (No in step S104).

そして、記録制御部１５１によって、所定数のフレームが蓄積されたか否かが判定される（ステップＳ２０１）。具体的には、例えば音声から意味のある日常イベントを検出可能な最短時間に対応する数のフレームが記録制御部１５１に蓄積されたか否かが判定される。この判定の結果、また所定数のフレームが蓄積されていなければ（ステップＳ２０１Ｎｏ）、以降のフレームに対して上記の処理が繰り返される。そして、生活音を含むか否かに関わらず、所定数のフレームが記録制御部１５１に蓄積されると（ステップＳ２０１Ｙｅｓ）、蓄積されたフレームは、順序が入れ替えられた上でメモリ１４０に記録される（ステップＳ２０２）。 Then, the recording control unit 151 determines whether or not a predetermined number of frames have been accumulated (step S201). Specifically, for example, it is determined whether or not the number of frames corresponding to the shortest time in which a meaningful daily event can be detected from voice has been accumulated in the recording control unit 151. If the predetermined number of frames are not accumulated as a result of this determination (No in step S201), the above processing is repeated for the subsequent frames. When a predetermined number of frames are accumulated in the recording control unit 151 regardless of whether or not they contain living sounds (Yes in step S201), the accumulated frames are recorded in the memory 140 after the order is changed. (Step S202).

図９は、フレームの順序の入れ替えの具体例を示す図である。図９に示すように、記録制御部１５１には、フレーム＃１〜＃ＮのＮ個（Ｎは２以上の整数）のフレームが蓄積されており、蓄積されたＮ個のフレームの順序が入れ替えられる。図９において、レベルが閾値Ｔｈ以上の音声を含むフレームは、生活音判定部１２２によって生活音を含むと判定されたフレームである。したがって、図９の例では、フレーム＃３が生活音を含むと判定され、他のフレームは生活音を含まないと判定される。これらのフレーム＃１〜＃Ｎは、生活音を含むか否かに関わらず順番に記録制御部１５１に蓄積され（図９上図）、Ｎ個のフレームの範囲内でフレーム＃１〜＃Ｎの順序がランダムに入れ替えられる（図９下図）。これにより、フレームの順番が変更され、個々のフレームから抽出可能な生活音の情報は保持されたまま、複数にフレームにわたる話声音の内容が判別困難となる。 FIG. 9 is a diagram illustrating a specific example of rearranging the order of frames. As shown in FIG. 9, N frames #1 to #N (N is an integer of 2 or more) are accumulated in the recording control unit 151, and the order of the N accumulated frames is rearranged. In FIG. 9, a frame containing sound whose level is at or above the threshold Th is a frame determined by the living sound determination unit 122 to contain a living sound. In the example of FIG. 9, frame #3 is therefore determined to contain a living sound, and the other frames are determined not to. Frames #1 to #N are accumulated in the recording control unit 151 in order regardless of whether they contain a living sound (upper part of FIG. 9), and their order is randomly rearranged within the range of the N frames (lower part of FIG. 9). The order of the frames is thus changed, and the content of speech spanning multiple frames becomes difficult to discern, while the living sound information that can be extracted from each individual frame is retained.

また、記録制御部１５１は、蓄積されたフレームのうち、生活音を含むと判定されたフレームについては、順序の入れ替えの対象としなくても良い。すなわち、例えば図１０に示すように、記録制御部１５１は、フレーム＃１〜＃Ｎのうち生活音を含むと判定されたフレーム＃３については時間的な位置を維持し、他のフレームのみの順序を入れ替えても良い。これにより、生活音が含まれるフレームの時刻が変更されることがなく、生活音が発生した時系列を維持したままフレームを記録することができる。 The recording control unit 151 may also exclude frames determined to contain a living sound from the rearrangement. For example, as shown in FIG. 10, the recording control unit 151 may keep the temporal position of frame #3, which was determined to contain a living sound among frames #1 to #N, and rearrange only the other frames. The time of a frame containing a living sound then remains unchanged, and frames can be recorded while preserving the time series in which the living sounds occurred.

図８に戻って、順序が入れ替えられた所定数のフレームがメモリ１４０に記録されると、記録制御部１５１によって、入力された音声に対応するすべてのフレームの記録が終了したか否かが判定され（ステップＳ１０８）、記録されていないフレームが残っている場合には（ステップＳ１０８Ｎｏ）、残りのフレームに対して上記の処理が繰り返される。また、すべてのフレームが記録された場合には（ステップＳ１０８Ｙｅｓ）、生活音の記録に係る処理が終了する。 Returning to FIG. 8, when the predetermined number of rearranged frames has been recorded in the memory 140, the recording control unit 151 determines whether all frames corresponding to the input sound have been recorded (step S108). If unrecorded frames remain (No in step S108), the above processing is repeated for the remaining frames. If all frames have been recorded (Yes in step S108), the living sound recording process ends.
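The flow of FIG. 8 from determination to recording can be sketched as a simple loop. This is an illustrative outline only: the function name, the `(input_time, samples)` frame format, and the predicate passed in place of the living sound determination unit 122 are hypothetical, and shuffling a final partial block is an assumption the text does not address.

```python
import random


def record_living_sound(frames, block_size, contains_living_sound, rng=None):
    """Sketch of steps S104 to S108: tag, buffer, rearrange, record.

    frames: iterable of (input_time, samples) pairs from the dividing step.
    contains_living_sound: hypothetical predicate standing in for unit 122.
    Returns the list that plays the role of memory 140.
    """
    rng = rng or random.Random()
    memory, buffer = [], []
    for input_time, samples in frames:
        frame = {"samples": samples}
        if contains_living_sound(samples):              # step S104: Yes
            frame["tag"] = {"input_time": input_time}   # step S105
        buffer.append(frame)                            # buffered either way
        if len(buffer) == block_size:                   # step S201: Yes
            rng.shuffle(buffer)                         # step S202
            memory.extend(buffer)
            buffer = []
    if buffer:  # assumption: a final partial block is rearranged and kept too
        rng.shuffle(buffer)
        memory.extend(buffer)
    return memory
```

Because the tag stores the input time, the original timing of a living sound survives even though its frame may move within the block.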

このように、記録制御部１５１に蓄積された所定数のフレームが順序を入れ替えてメモリ１４０に記録される結果、メモリ１４０に記録されるフレームの順番は連続しておらず、分割部１２１によって生成されるフレームの順番とは異なる。結果として、生活音よりも継続時間が長く複数フレームにわたる話声音の内容は、メモリ１４０に記録されたフレームから判別困難となる。これに対して、生活音は、個々のフレームから抽出可能であり、フレームに含まれる生活音の種類を判別することが可能である。このため、生活音を用いた解析を可能としつつ、ユーザの発話の内容を秘匿することができ、プライバシーを保護することができる。さらに、生活音が含まれないフレームを含めて比較的短い時間内のフレームがまとめて記録されるため、これらのフレームから、継続的に発生する環境音を抽出することが可能となる。 Because the predetermined number of frames accumulated in the recording control unit 151 are rearranged before being recorded in the memory 140, the frames recorded in the memory 140 are not in consecutive order, and their order differs from that of the frames generated by the dividing unit 121. As a result, the content of speech, which lasts longer than a living sound and spans multiple frames, is difficult to discern from the frames recorded in the memory 140. Living sounds, by contrast, can be extracted from individual frames, and the type of living sound contained in a frame can be determined. This makes it possible to conceal the content of the user's utterances and protect privacy while still allowing analysis based on living sounds. Furthermore, because frames within a relatively short period, including frames without a living sound, are recorded together, continuously occurring environmental sound can be extracted from them.

以上のように、本実施例によれば、生活音を抽出可能かつ話声音の内容が判別されない時間長のフレームに音声データを分割し、所定数ずつのフレームを順序を入れ替えてメモリに記録する。このため、入力された音声に対応するフレームの順番が変更されてメモリに記録され、ユーザの発話の内容を秘匿してプライバシーを保護することができる。 As described above, according to this embodiment, sound data is divided into frames of a time length at which living sounds can be extracted but the content of speech cannot be discerned, and the frames are rearranged a predetermined number at a time and recorded in the memory. The frames corresponding to the input sound are therefore recorded in the memory in a changed order, which conceals the content of the user's utterances and protects privacy.

実施例３の特徴は、異なる位置に設置された複数の生活音記録装置によって記録された音声データを用いて、生活音の音源を推定する点である。 A feature of the third embodiment is that a sound source of a living sound is estimated using sound data recorded by a plurality of living sound recording devices installed at different positions.

［情報処理システム］ [Information processing system]
図１１は、実施例３に係る情報処理システムの構成例を示す図である。図１１において、図１と同じ部分には同じ符号を付す。図１１に示す情報処理システムにおいては、例えばユーザの住宅内などに複数の生活音記録装置１００−１、１００−２が設置され、生活音記録装置１００−１、１００−２と基地局装置１０とが互いに通信可能となっている。そして、基地局装置１０は、例えばインターネットなどのネットワークＮを介してサーバ装置２００と接続されている。生活音記録装置１００−１、１００−２は、例えばユーザが所持するパーソナルコンピュータやスマートフォンなどの汎用の情報処理端末であっても良いし、専用のＩｏＴ（Internet of Things）端末であっても良い。ただし、生活音記録装置１００−１、１００−２は、時刻同期しているものとし、その設置位置がサーバ装置２００によって既知であるものとする。また、生活音記録装置１００−１、１００−２は、必ずしも同一の基地局装置１０と通信しなくても良い。 FIG. 11 is a diagram illustrating a configuration example of an information processing system according to the third embodiment. In FIG. 11, the same parts as those in FIG. 1 are denoted by the same reference numerals. In the information processing system shown in FIG. 11, a plurality of living sound recording devices 100-1 and 100-2 are installed, for example, in the user's house, and the living sound recording devices 100-1 and 100-2 and the base station device 10 can communicate with each other. The base station device 10 is connected to the server device 200 via a network N such as the Internet. The living sound recording devices 100-1 and 100-2 may be general-purpose information processing terminals such as a personal computer or smartphone owned by the user, or may be dedicated IoT (Internet of Things) terminals. However, the living sound recording devices 100-1 and 100-2 are assumed to be time-synchronized, and their installation positions are assumed to be known to the server device 200. The living sound recording devices 100-1 and 100-2 need not communicate with the same base station device 10.

生活音記録装置１００−１、１００−２は、例えばマイクロフォンなどの音声入力デバイスを備え、音声入力デバイスから入力される周囲の音声を記録する。このとき、生活音記録装置１００−１、１００−２は、入力される音声を生活音が判別可能な時間長（例えば５００ｍｓ）のフレームに分割し、フレームの順番を変更した上で、各フレームの音声データを記録する。そして、生活音記録装置１００−１、１００−２は、記録した音声データを基地局装置１０へ送信する。生活音記録装置１００−１、１００−２が異なる位置に設置されているため、１つの音源で発生した生活音は、音源から生活音記録装置１００−１、１００−２までの距離に応じて異なる時刻に入力される。したがって、生活音記録装置１００−１、１００−２は、同一の生活音を、同じタイミングのフレーム内の異なる時刻又は異なるタイミングのフレームに記録することになる。 The living sound recording devices 100-1 and 100-2 each include a sound input device such as a microphone and record the surrounding sound input from it. They divide the input sound into frames of a time length at which living sounds can be discriminated (for example, 500 ms), change the order of the frames, and record the sound data of each frame. The living sound recording devices 100-1 and 100-2 then transmit the recorded sound data to the base station device 10. Because the living sound recording devices 100-1 and 100-2 are installed at different positions, a living sound generated by a single sound source is input to them at different times that depend on the distance from the source to each device. The living sound recording devices 100-1 and 100-2 therefore record the same living sound either at different times within frames of the same timing or in frames of different timing.
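The division into fixed-length frames can be sketched as follows. The function name, the list-of-samples representation, and the handling of a shorter final frame are assumptions for illustration; the text only specifies the frame length (for example, 500 ms).

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=500):
    """Split a sequence of samples into consecutive frames of frame_ms each.

    The last frame may be shorter if the input does not divide evenly
    (an assumption; the text does not say how a remainder is handled).
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must cover at least one sample")
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
```

For example, 2.5 s of audio sampled at 1 kHz yields five frames of 500 samples each.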

なお、生活音記録装置１００−１、１００−２の構成は、実施例１、２に係る生活音記録装置１００の構成（図２、６）と同様である。 The configuration of the life sound recording devices 100-1 and 100-2 is the same as the configuration of the life sound recording device 100 according to the first and second embodiments (FIGS. 2 and 6).

また、サーバ装置２００は、生活音記録装置１００−１、１００−２の設置場所のレイアウト情報、生活音記録装置１００−１、１００−２の設置位置、及び生活音記録装置１００−１、１００−２から送信された音声データを用いて、生活音の音源を推定する。具体的には、サーバ装置２００は、生活音記録装置１００−１、１００−２から送信された音声データにおける生活音の記録時刻の時間差から、生活音の音源の位置を推定する。そして、サーバ装置２００は、レイアウト情報を参照して、推定された音源の位置にある生活音の音源を特定する。 The server device 200 estimates the sound source of a living sound using layout information on the place where the living sound recording devices 100-1 and 100-2 are installed, the installation positions of the living sound recording devices 100-1 and 100-2, and the sound data transmitted from the living sound recording devices 100-1 and 100-2. Specifically, the server device 200 estimates the position of the sound source from the difference between the times at which the living sound was recorded in the sound data transmitted from the living sound recording devices 100-1 and 100-2. The server device 200 then refers to the layout information and identifies the sound source located at the estimated position.

なお、サーバ装置２００の構成は、実施例１、２に係るサーバ装置２００の構成（図４、７）と同様である。上述した音源の推定は、生活音判別部２２２によって実行される。 The configuration of the server device 200 is the same as the configuration of the server device 200 according to the first and second embodiments (FIGS. 4 and 7). The sound source estimation described above is performed by the living sound discrimination unit 222.

［生活音判別方法］ [Living sound discrimination method]
次いで、実施例３に係る生活音判別方法について、図１２に示すフロー図を参照しながら説明する。以下に説明する生活音判別処理は、サーバ装置２００によって実行される。 Next, a living sound discrimination method according to the third embodiment will be described with reference to the flowchart shown in FIG. 12. The living sound discrimination process described below is executed by the server device 200.

本実施例において、生活音記録装置１００−１、１００−２は、実施例１、２に係る生活音記録装置１００と同様に、生活音を抽出可能かつ話声音の内容が判別されない時間長のフレームに音声データを分割し、フレームの順番を変更して記録及び送信する。このため、サーバ装置２００の通信Ｉ／Ｆ２１０は、生活音記録装置１００−１、１００−２から送信された音声データのフレームをそれぞれ受信する（ステップＳ３０１）。そして、生活音抽出部２２１によって、生活音記録装置１００−１、１００−２それぞれから受信されたフレームから生活音が抽出される（ステップＳ３０２）。 In the present embodiment, the life sound recording devices 100-1 and 100-2 have a length of time during which the life sound can be extracted and the content of the spoken voice is not determined, similar to the life sound recording device 100 according to the first and second embodiments. Audio data is divided into frames, and the order of frames is changed and recorded and transmitted. For this reason, the communication I / F 210 of the server device 200 receives the frames of the audio data transmitted from the life sound recording devices 100-1 and 100-2, respectively (step S301). Then, the living sound extraction unit 221 extracts the living sound from the frames received from the living sound recording devices 100-1 and 100-2 (step S302).

生活音が抽出されたフレームには入力時刻を示すタグが付与されているため、生活音判別部２２２によって、生活音がそれぞれの生活音記録装置１００−１、１００−２へ入力された時刻が特定される。ここで、生活音記録装置１００−１、１００−２は、異なる位置に設置されているため、生活音の音源から生活音記録装置１００−１、１００−２までの距離は異なり、生活音の入力時刻も異なる。そこで、生活音判別部２２２によって、１つの生活音が生活音記録装置１００−１、１００−２へ入力された時刻の時間差が算出される（ステップＳ３０３）。 Because each frame from which a living sound is extracted carries a tag indicating the input time, the living sound discrimination unit 222 identifies the time at which the living sound was input to each of the living sound recording devices 100-1 and 100-2. Since the devices are installed at different positions, the distances from the sound source to the living sound recording devices 100-1 and 100-2 differ, and so do the input times. The living sound discrimination unit 222 therefore calculates the difference between the times at which a single living sound was input to the living sound recording devices 100-1 and 100-2 (step S303).

算出された時間差は、生活音の音源から生活音記録装置１００−１、１００−２それぞれまでの距離の差に対応する。したがって、生活音記録装置１００−１、１００−２の設置位置が既知であれば、生活音の音源の位置を絞り込むことができる。具体的には、２点からの距離の差が一定の位置は、この２点を焦点とする双曲線上であるため、生活音の音源は、生活音記録装置１００−１、１００−２の設置位置を焦点とする双曲線上に位置する。そして、生活音判別部２２２によって、生活音記録装置１００−１、１００−２の設置場所のレイアウト情報が参照され、双曲線上に位置し、生活音を発生させ得る音源が推定される（ステップＳ３０４）。 The calculated time difference corresponds to the difference between the distances from the sound source to the living sound recording devices 100-1 and 100-2. If the installation positions of the devices are known, the position of the sound source can therefore be narrowed down. Specifically, the points whose distances from two given points differ by a constant lie on a hyperbola with those two points as foci, so the sound source lies on a hyperbola whose foci are the installation positions of the living sound recording devices 100-1 and 100-2. The living sound discrimination unit 222 then refers to the layout information of the installation place and estimates a sound source that lies on the hyperbola and can generate the living sound (step S304).
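The relation between the time difference and the hyperbola can be written out as follows. The symbols are introduced here for illustration and do not appear in the original: p is the unknown source position, p_1 and p_2 are the installation positions of the living sound recording devices 100-1 and 100-2, t_1 and t_2 are the input times taken from the tags, and c is the speed of sound (assumed constant).

```latex
\[
  \lVert p - p_1 \rVert - \lVert p - p_2 \rVert \;=\; c\,(t_1 - t_2)
\]
% The right-hand side is a known constant once t_1 and t_2 are measured,
% so the set of admissible p is one branch of a hyperbola with foci p_1 and p_2.
% A solution exists only if |c (t_1 - t_2)| <= ||p_1 - p_2||.
```

The sign of t_1 - t_2 selects the branch: a negative value means the source is closer to device 100-1.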

具体的に例を挙げると、例えば図１３に示すレイアウト情報が用いられることにより、生活音の音源が特定される。図１３に示すレイアウト情報では、生活音記録装置１００−１、１００−２がそれぞれ点ｘ、ｙに設置されており、この設置位置付近には、ドアＡ、ドアＢ、冷蔵庫及びカウンターが配置されることが示されている。生活音が生活音記録装置１００−１、１００−２へ入力される時刻の時間差から、生活音の音源は、点ｘ、ｙを焦点とする双曲線３０１上に位置すると推定される。そして、この双曲線３０１上に位置するのはドアＡであるため、生活音判別部２２２は、生活音の音源がドアＡであると推定する。 As a concrete example, the sound source is identified using the layout information shown in FIG. 13. In this layout information, the living sound recording devices 100-1 and 100-2 are installed at points x and y, respectively, and door A, door B, a refrigerator, and a counter are arranged near these positions. From the difference between the times at which the living sound is input to the living sound recording devices 100-1 and 100-2, the sound source is estimated to lie on the hyperbola 301 with foci at points x and y. Since door A lies on this hyperbola 301, the living sound discrimination unit 222 estimates that the sound source is door A.
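The layout lookup in this example can be sketched as a search over candidate objects. Everything concrete here is assumed for illustration: the function name, the coordinates, the speed of sound of 343 m/s, and the 0.3 m tolerance are not given in the text.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def estimate_source(dev1, dev2, time_diff_s, candidates, tolerance_m=0.3):
    """Pick the layout object whose position best fits the hyperbola.

    time_diff_s = t1 - t2 (input time at device 1 minus input time at device 2).
    candidates: {name: (x, y)} positions in metres taken from layout information.
    Returns None if no candidate lies within tolerance_m of the hyperbola.
    """
    target = SPEED_OF_SOUND_M_S * time_diff_s  # required distance difference
    best, best_err = None, tolerance_m
    for name, pos in candidates.items():
        diff = math.dist(pos, dev1) - math.dist(pos, dev2)
        err = abs(diff - target)
        if err <= best_err:
            best, best_err = name, err
    return best


# Hypothetical layout: devices at points x and y, objects around them.
POINT_X, POINT_Y = (0.0, 0.0), (4.0, 0.0)
LAYOUT = {"door A": (1.0, 3.0), "door B": (5.0, 2.0), "refrigerator": (2.0, -2.0)}
```

The tolerance absorbs timing error and the physical size of objects such as doors; how large it should be depends on the synchronization accuracy of the devices.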

なお、生活音判別部２２２は、生活音の種類を特定した上で、生活音の音源を推定しても良い。すなわち、生活音判別部２２２は、生活音の種類がドアを閉じる音であると特定した後、上記のように音源を推定することにより、生活音がドアＢではなくドアＡを閉じる音であると特定することが可能である。また、ここでは生活音記録装置１００−１、１００−２の２つから音声データが取得されるものとしたため、生活音の入力時刻の時間差から、音源が位置し得る双曲線が求められるにとどまった。しかしながら、３つ以上の生活音記録装置から音声データが取得される場合は、生活音の入力時刻の時間差から、音源が位置する点を特定可能である。 Note that the living sound determination unit 222 may estimate the sound source of the living sound after specifying the type of the living sound. That is, the life sound discriminating unit 222 specifies that the life sound type is a sound that closes the door, and then estimates the sound source as described above, so that the life sound is a sound that closes the door A instead of the door B. It is possible to specify. In addition, since the sound data is acquired from the living sound recording devices 100-1 and 100-2 here, only a hyperbola where the sound source can be located is obtained from the time difference between the input times of the living sounds. . However, when sound data is acquired from three or more living sound recording devices, the point where the sound source is located can be specified from the time difference between the input times of the living sounds.
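With three or more devices, the time differences between device pairs can be combined to estimate a point. The following is a rough sketch using a brute-force grid search over the room; the function name, the grid step, the room bounds, and the speed of sound are assumptions, and a practical system would likely use a proper least-squares solver.

```python
import math
from itertools import combinations

SPEED_OF_SOUND_M_S = 343.0  # assumed value


def locate_source(devices, times, x_range, y_range, step=0.1):
    """Grid search for the point that best explains all pairwise time differences.

    devices: list of (x, y) installation positions in metres.
    times: input time of the same living sound at each device, in seconds.
    """
    nx = round((x_range[1] - x_range[0]) / step)
    ny = round((y_range[1] - y_range[0]) / step)
    best, best_cost = None, math.inf
    for ix in range(nx + 1):
        for iy in range(ny + 1):
            p = (x_range[0] + ix * step, y_range[0] + iy * step)
            cost = 0.0
            for i, j in combinations(range(len(devices)), 2):
                measured = SPEED_OF_SOUND_M_S * (times[i] - times[j])
                predicted = math.dist(p, devices[i]) - math.dist(p, devices[j])
                cost += (predicted - measured) ** 2
            if cost < best_cost:
                best, best_cost = p, cost
    return best
```

Restricting the search to the known room bounds also discards mathematically valid intersections that fall outside the living space.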

生活音判別部２２２によって音源が推定されると、生活音の種類及び音源を示す結果が結果出力部２２３によって出力される（ステップＳ３０５）。この結果は、さらに他の処理を実行するアプリケーションへ提供されても良い。 When the sound source is estimated by the life sound determination unit 222, the result output unit 223 outputs the result indicating the type of sound and the sound source (step S305). This result may be provided to an application that executes another process.

以上のように、本実施例によれば、複数の生活音記録装置からそれぞれフレームの順番が変更された音声データを取得し、複数の生活音記録装置へ入力された生活音の時間差に基づいて生活音の音源を特定する。このため、ユーザのプライバシーを保護しつつ、生活音を用いた詳細な解析をすることができる。 As described above, according to this embodiment, sound data whose frame order has been changed is acquired from each of a plurality of living sound recording devices, and the sound source of a living sound is identified based on the difference between the times at which the living sound was input to the devices. Detailed analysis using living sounds is therefore possible while the user's privacy is protected.

［生活音記録プログラム］ [Living sound recording program]
上記各実施例において説明した生活音記録装置１００、１００−１、１００−２及びサーバ装置２００の処理をそれぞれコンピュータが実行可能なプログラムとして記述することも可能である。この場合、これらのプログラムをコンピュータが読み取り可能な記録媒体に格納し、コンピュータに導入することも可能である。コンピュータが読み取り可能な記録媒体としては、例えばＣＤ−ＲＯＭ、ＤＶＤディスク、ＵＳＢメモリなどの可搬型記録媒体や、例えばフラッシュメモリなどの半導体メモリが挙げられる。 The processes of the living sound recording devices 100, 100-1, and 100-2 and the server device 200 described in the above embodiments can each be written as a program executable by a computer. In that case, the programs can be stored in a computer-readable recording medium and installed on the computer. Examples of computer-readable recording media include portable recording media such as a CD-ROM, a DVD, and a USB memory, and semiconductor memories such as a flash memory.

以上の各実施例に関し、さらに以下の付記を開示する。 The following additional notes are disclosed for each of the above embodiments.

（付記１）ユーザの生活空間において発生する音声が含まれる音声データを取得する取得部と、
前記取得部によって取得された音声データを、ユーザの行動又は動作に伴う生活音の継続時間に対応する時間長のフレームに分割する分割部と、
前記分割部によって分割されて得られた複数のフレームの順番を変更してメモリに記録する記録制御部と
を有することを特徴とする生活音記録装置。
(Supplementary Note 1) A living sound recording device comprising:
an acquisition unit that acquires sound data containing sound generated in a user's living space;
a division unit that divides the sound data acquired by the acquisition unit into frames of a time length corresponding to the duration of a living sound accompanying the user's behavior or action; and
a recording control unit that changes the order of a plurality of frames obtained by the division by the division unit and records them in a memory.

（付記２）前記分割部によって分割されて得られたフレームそれぞれに、生活音が含まれるか否かを判定する判定部をさらに有し、
前記記録制御部は、
前記判定部によって生活音が含まれないと判定されたフレームを削除することにより、前記複数のフレームの順番を変更することを特徴とする付記１記載の生活音記録装置。
(Supplementary Note 2) The living sound recording device according to Supplementary Note 1, further comprising a determination unit that determines whether each of the frames obtained by the division by the division unit contains a living sound,
wherein the recording control unit changes the order of the plurality of frames by deleting frames determined by the determination unit not to contain a living sound.

（付記３）前記記録制御部は、
前記分割部によって分割されて得られたフレームを所定数ずつ蓄積し、蓄積された所定数のフレームの順序を入れ替えることにより、前記複数のフレームの順番を変更することを特徴とする付記１記載の生活音記録装置。
(Supplementary Note 3) The living sound recording device according to Supplementary Note 1, wherein the recording control unit accumulates the frames obtained by the division by the division unit a predetermined number at a time and changes the order of the plurality of frames by rearranging the accumulated predetermined number of frames.

（付記４）前記分割部によって分割されて得られたフレームそれぞれに、生活音が含まれるか否かを判定する判定部をさらに有し、
前記記録制御部は、
前記判定部による判定後のフレームを所定数ずつ蓄積し、蓄積された所定数のフレームのうち前記判定部によって生活音が含まれないと判定されたフレームの順序を入れ替えることにより、前記複数のフレームの順番を変更することを特徴とする付記１記載の生活音記録装置。
(Supplementary Note 4) The living sound recording device according to Supplementary Note 1, further comprising a determination unit that determines whether each of the frames obtained by the division by the division unit contains a living sound,
wherein the recording control unit accumulates the frames after determination by the determination unit a predetermined number at a time and changes the order of the plurality of frames by rearranging, among the accumulated predetermined number of frames, the frames determined by the determination unit not to contain a living sound.

（付記５）前記分割部は、
音声データを２００ｍｓ（ミリ秒）以上５００ｍｓ以下の時間長のフレームに分割することを特徴とする付記１記載の生活音記録装置。
(Supplementary Note 5) The living sound recording device according to Supplementary Note 1, wherein the division unit divides the sound data into frames having a time length of 200 ms (milliseconds) or more and 500 ms or less.

（付記６）前記記録制御部は、
フレームに含まれる音声の入力時刻を示すタグを当該フレームに付与してメモリに記録することを特徴とする付記１記載の生活音記録装置。
(Supplementary Note 6) The living sound recording device according to Supplementary Note 1, wherein the recording control unit attaches to a frame a tag indicating the input time of the sound contained in the frame and records the frame in the memory.

（付記７）前記メモリに記録された順番のフレームからなる音声データを送信する送信部をさらに有することを特徴とする付記１記載の生活音記録装置。 (Supplementary note 7) The life sound recording apparatus according to supplementary note 1, further comprising a transmission unit that transmits audio data composed of frames in the order recorded in the memory.

（付記８）ユーザの行動又は動作に伴う生活音の継続時間に対応する時間長のフレームからなる音声データを取得する取得部と、
前記取得部によって取得された音声データの各フレームから生活音を抽出する抽出部と、
前記抽出部によって抽出された生活音の種類を判別する判別部と
を有することを特徴とする情報処理装置。
(Supplementary Note 8) An information processing apparatus comprising:
an acquisition unit that acquires sound data composed of frames of a time length corresponding to the duration of a living sound accompanying a user's behavior or action;
an extraction unit that extracts a living sound from each frame of the sound data acquired by the acquisition unit; and
a discrimination unit that discriminates the type of the living sound extracted by the extraction unit.

（付記９）前記取得部は、
互いに異なる位置に設置された第１の生活音記録装置及び第２の生活音記録装置からそれぞれ第１の音声データ及び第２の音声データを取得し、
前記判別部は、
前記第１の音声データのフレーム及び前記第２の音声データのフレームから抽出された生活音の時間差を算出し、算出された時間差に基づいて生活音の音源の位置を推定することを特徴とする付記８記載の情報処理装置。
(Supplementary Note 9) The information processing apparatus according to Supplementary Note 8, wherein the acquisition unit acquires first sound data and second sound data from a first living sound recording device and a second living sound recording device, respectively, that are installed at mutually different positions, and
the discrimination unit calculates a time difference between the living sounds extracted from a frame of the first sound data and a frame of the second sound data, and estimates the position of the sound source of the living sound based on the calculated time difference.

（付記１０）ユーザの生活空間において発生する音声が含まれる音声データを取得し、
取得された音声データを、ユーザの行動又は動作に伴う生活音の継続時間に対応する時間長のフレームに分割し、
分割されて得られた複数のフレームの順番を変更してメモリに記録する
処理をコンピュータが実行することを特徴とする生活音記録方法。
(Supplementary Note 10) A living sound recording method in which a computer executes a process comprising:
acquiring sound data containing sound generated in a user's living space;
dividing the acquired sound data into frames of a time length corresponding to the duration of a living sound accompanying the user's behavior or action; and
changing the order of a plurality of frames obtained by the division and recording them in a memory.

（付記１１）ユーザの生活空間において発生する音声が含まれる音声データを取得し、
取得された音声データを、ユーザの行動又は動作に伴う生活音の継続時間に対応する時間長のフレームに分割し、
分割されて得られた複数のフレームの順番を変更してメモリに記録する
処理をコンピュータに実行させることを特徴とする生活音記録プログラム。
(Supplementary Note 11) A living sound recording program that causes a computer to execute a process comprising:
acquiring sound data containing sound generated in a user's living space;
dividing the acquired sound data into frames of a time length corresponding to the duration of a living sound accompanying the user's behavior or action; and
changing the order of a plurality of frames obtained by the division and recording them in a memory.

DESCRIPTION OF SYMBOLS
１１０音声入力部 110 Voice input unit
１２０、２２０プロセッサ 120, 220 Processor
１２１分割部 121 Dividing unit
１２２生活音判定部 122 Living sound determination unit
１２３、１５１記録制御部 123, 151 Recording control unit
１２４通信制御部 124 Communication control unit
１３０無線送信部 130 Wireless transmission unit
１４０、２３０メモリ 140, 230 Memory
２１０通信Ｉ／Ｆ 210 Communication I/F
２２１生活音抽出部 221 Living sound extraction unit
２２２生活音判別部 222 Living sound discrimination unit
２２３結果出力部 223 Result output unit
２５１環境音抽出部 251 Environmental sound extraction unit
２５２環境音判別部 252 Environmental sound discrimination unit

Claims

A living sound recording device comprising:
an acquisition unit that acquires audio data including sound generated in a user's living space;
a division unit that divides the audio data acquired by the acquisition unit into frames of a time length corresponding to the duration of a living sound accompanying the user's behavior or action; and
a recording control unit that changes the order of the plurality of frames obtained by the division by the division unit and records the frames in a memory.

The living sound recording device according to claim 1, further comprising a determination unit that determines whether each of the frames obtained by the division by the division unit contains a living sound,
wherein the recording control unit changes the order of the plurality of frames by deleting any frame that the determination unit has determined not to contain a living sound.
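
As an illustration of the deletion described in this claim, the following sketch drops frames judged not to contain a living sound. The claim does not state how the determination is made; the mean-energy threshold used here is an assumed stand-in, and `contains_living_sound`, `drop_silent_frames`, and the threshold value are hypothetical.

```python
def contains_living_sound(frame, threshold=0.01):
    """Hypothetical determination: mean squared amplitude above a threshold."""
    if not frame:
        return False
    return sum(x * x for x in frame) / len(frame) > threshold


def drop_silent_frames(frames, threshold=0.01):
    """Keep only the frames determined to contain a living sound."""
    return [f for f in frames if contains_living_sound(f, threshold)]
```

Because the retained frames are stored back to back, the recorded sequence no longer reflects the intervals between the sounds.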

The living sound recording device, wherein the recording control unit changes the order of the plurality of frames by accumulating a predetermined number of frames obtained by the division by the division unit and rearranging the accumulated predetermined number of frames.

The living sound recording device according to claim 1, further comprising a determination unit that determines whether each of the frames obtained by the division by the division unit contains a living sound,
wherein the recording control unit changes the order of the plurality of frames by accumulating a predetermined number of frames after the determination by the determination unit and rearranging, among the accumulated predetermined number of frames, the frames that the determination unit has determined not to contain a living sound.
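
Under one reading of this claim, frames that contain a living sound keep their positions in the accumulated buffer, while the remaining frames are permuted among themselves. The sketch below follows that reading, which is an assumption. The per-frame flags are taken as given, a random shuffle is an assumed way of rearranging, and the name `shuffle_flagged_frames` is hypothetical.

```python
import random


def shuffle_flagged_frames(frames, has_living_sound, rng=None):
    """Permute only the frames whose flag says 'no living sound'."""
    rng = rng or random.Random()
    # Buffer positions of frames determined not to contain a living sound.
    positions = [i for i, flag in enumerate(has_living_sound) if not flag]
    shuffled = positions[:]
    rng.shuffle(shuffled)
    result = list(frames)
    for dst, src in zip(positions, shuffled):
        result[dst] = frames[src]
    return result
```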

The living sound recording device according to claim 1, wherein the division unit divides the audio data into frames having a time length of 200 ms (milliseconds) to 500 ms.

The living sound recording device according to claim 1, wherein the recording control unit attaches to each frame a tag indicating the input time of the sound included in the frame and records the frame in the memory.

The living sound recording device according to claim 1, further comprising a transmission unit that transmits audio data composed of the frames in the order recorded in the memory.

An information processing apparatus comprising:
an acquisition unit that acquires audio data composed of frames of a time length corresponding to the duration of a living sound accompanying a user's behavior or action;
an extraction unit that extracts a living sound from each frame of the audio data acquired by the acquisition unit; and
a discrimination unit that discriminates the type of the living sound extracted by the extraction unit.

The information processing apparatus according to claim 8, wherein
the acquisition unit acquires first audio data and second audio data from a first living sound recording device and a second living sound recording device installed at different positions, respectively, and
the discrimination unit calculates a time difference between the living sounds extracted from a frame of the first audio data and a frame of the second audio data, and estimates the position of the sound source of the living sound based on the calculated time difference.

A living sound recording method in which a computer executes a process comprising:
acquiring audio data that includes sound generated in a user's living space;
dividing the acquired audio data into frames of a time length corresponding to the duration of a living sound accompanying the user's behavior or action; and
changing the order of the plurality of frames obtained by the division and recording the frames in a memory.