JP5749213B2

JP5749213B2 - Audio data analysis apparatus, audio data analysis method, and audio data analysis program

Info

Publication number: JP5749213B2
Application number: JP2012096504A
Authority: JP
Inventors: 菜々濱口; 陽子浅野; 大介朝井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-04-20
Filing date: 2012-04-20
Publication date: 2015-07-15
Anticipated expiration: 2032-04-20
Also published as: JP2013225003A

Description

The present invention relates to an audio data analysis apparatus, an audio data analysis method, and an audio data analysis program for detecting conversation intended as social interaction from audio data containing speech.

Social withdrawal (hikikomori) caused by a decline in everyday communication with the people around a person has become a serious social problem. Identifying, at an early stage, people whose communication is becoming sparse helps prevent social withdrawal.

To identify people whose communication is sparse, a technique is needed for tracking the state of an individual's everyday face-to-face communication. Everyday communication includes, for example, non-face-to-face communication carried out via personal computers, mobile phones, and the like, and face-to-face communication carried out in person. The former can be tracked by acquiring logs from the communication devices, but the latter takes place without any communication device and must therefore be tracked by some other means.

To address this need, techniques have been proposed for automatically detecting everyday face-to-face communication using audio data recorded on terminals carried by individuals. Non-Patent Document 1 defines face-to-face communication as a state in which any two people are close to each other and are speaking. Every target user carries a terminal equipped with a microphone, and proximity is identified from the correlation between the audio data acquired from any two terminals. In addition, the average power and the pitch of the audio data acquired from the terminals of users recognized as being in proximity are calculated; the former is used to estimate how likely the speech is to be the terminal owner's, and the latter to estimate how likely the sound is to be a voice. The conversation state is recognized by combining the two.

Non-Patent Document 1: 岡本昌之, 池谷直紀, 西村圭亮, 菊池匡晃, 長健太, 服部正典, 坪井創吾, 芦川平, "Detection of ad hoc conversation based on cross-correlation of terminal speech" (端末音声の相互相関に基づくアドホック会話の検出), Transactions of the Database Society of Japan (日本データベース学会論文誌), Vol. 7, No. 1, pp. 163-168, 2008

The technique of Non-Patent Document 1 has the problem that it mistakes even one-way communication and talking to oneself for face-to-face communication. When face-to-face communication is detected for the purpose of preventing social withdrawal, the challenge is to distinguish one-way communication and talking to oneself from face-to-face communication intended as interaction. An example of one-way communication is the exchange with a clerk when paying at a supermarket checkout. In this situation, the clerk simply states the total and says thank you. In other words, the purpose is the one-way transfer of information, not interaction. Social withdrawal is caused by a drop in the frequency of face-to-face communication, in particular communication intended as interaction. Therefore, to assess a state of social withdrawal, it must be possible to distinguish and detect the face-to-face communication that is intended as interaction. Hereinafter, face-to-face communication in the broad sense, without regard to purpose, is referred to as “utterance”, and face-to-face communication intended as interaction is referred to as “conversation”.

The present invention has been made in view of the above circumstances, and its object is to provide an audio data analysis apparatus, an audio data analysis method, and an audio data analysis program that detect conversation intended as interaction from data in which audio data has been decomposed into morphemes.

To achieve the above object, the present invention is an audio data analysis apparatus comprising: utterance section extraction means that, for data in which audio data of a target user and other people has been classified into morpheme sections and non-speech sections, determines a non-speech section to be a boundary between utterance sections when its length exceeds a predetermined time, and extracts the utterance sections; aizuchi detection means that detects morpheme sections containing aizuchi (short backchannel responses); aizuchi speaker determination means that determines, for each utterance section extracted by the utterance section extraction means, whether the speaker of each aizuchi morpheme section is the target user or another person; and conversation identification means that identifies an utterance section as a conversation section when the speakers of the aizuchi morpheme sections within that one utterance section include both the target user and another person. The aizuchi speaker determination means compares a voice model of the target user with the audio data corresponding to each aizuchi morpheme section, determines that the target user is the speaker for aizuchi morpheme sections whose audio data matches the voice model, and determines that another person is the speaker for aizuchi morpheme sections whose audio data does not match the voice model.

The present invention is also an audio data analysis method performed by an audio data analysis apparatus, comprising: an utterance section extraction step of, for data in which audio data of a target user and other people has been classified into morpheme sections and non-speech sections, determining a non-speech section to be a boundary between utterance sections when its length exceeds a predetermined time, and extracting the utterance sections; an aizuchi detection step of detecting aizuchi morpheme sections; an aizuchi speaker determination step of determining, for each utterance section extracted in the utterance section extraction step, whether the speaker of each aizuchi morpheme section is the target user or another person; and a conversation identification step of identifying an utterance section as a conversation section when the speakers of the aizuchi morpheme sections within that one utterance section include both the target user and another person. The aizuchi speaker determination step compares a voice model of the target user with the audio data corresponding to each aizuchi morpheme section, determines that the target user is the speaker for aizuchi morpheme sections whose audio data matches the voice model, and determines that another person is the speaker for aizuchi morpheme sections whose audio data does not match the voice model.

The present invention is also an audio data analysis program for causing a computer to function as the above audio data analysis apparatus.

According to the present invention, it is possible to provide an audio data analysis apparatus, an audio data analysis method, and an audio data analysis program that detect conversation intended as interaction from data in which audio data has been decomposed into morphemes.

Block diagram of the audio data analysis apparatus according to the first embodiment of the present invention.
Flowchart showing the processing procedure in the first embodiment.
Example of data decomposed into morphemes in the first embodiment.
Example of data with delimiter tags set in the first embodiment.
Example of data with aizuchi tags set in the first embodiment.
Example of data with role tags set in the first embodiment.
Example of data identified as a “conversation” section in the first embodiment.
Explanatory diagram for explaining the data in the first embodiment.
Block diagram of the audio data analysis apparatus according to the second embodiment of the present invention.
Flowchart showing the processing procedure in the second embodiment.
Example of data decomposed into morphemes in the second embodiment.
Example of data with delimiter tags set in the second embodiment.
Example of data with aizuchi tags set in the second embodiment.
Example of data with speaker tags set in the second embodiment.
Example of data identified as a “conversation” section in the second embodiment.

Embodiments of the present invention will be described below with reference to the drawings.

In this embodiment, in order to detect “conversation” sections, that is, face-to-face communication intended as interaction, from audio data containing speech, attention is paid to the following two features specific to “conversation”.

The first is that a “conversation” involves two roles, listener and speaker, and each participant takes on both roles, alternating over time. The second is that a participant gives aizuchi (short backchannel responses) while in the listener role, but not while in the speaker role.

Based on these, this embodiment defines the features of “utterance” and “conversation” as follows and detects “conversation” sections, that is, face-to-face communication intended as interaction, by determining whether these features are present.

An “utterance” is a group of vocalizations that are continuous in time. A “conversation” is an “utterance” that contains both sections in which aizuchi are uttered and sections in which something other than aizuchi is uttered.

<First Embodiment>
FIG. 1 is a block diagram showing the configuration of the audio data analysis apparatus 1 according to the first embodiment. The illustrated audio data analysis apparatus 1 includes an audio data receiving unit 1a, a morphological analysis unit 1b, an utterance section extraction unit 1c, an aizuchi detection unit 1d, an aizuchi dictionary unit 1e, a role classification unit 1f, a conversation identification unit 1g, an output unit 1h, and a storage unit 1i.

The audio data receiving unit 1a receives audio data as input. The audio data input in the first embodiment is assumed to contain little noise such as environmental sound, with only the target user's voice clearly captured. That is, it is audio data from which the voice of a single target user can be obtained. The audio data may also contain voices of people other than the target user, provided they are quiet enough to be ignored in processing, as well as environmental sounds other than human voices.

For example, the target user wears a mobile terminal with a recording function (audio acquisition function) and a microphone (for example, a directional microphone) around the neck at all times. The mobile terminal stores the audio data of the target user's speech captured by the microphone, together with time data (such as timestamps), in a storage unit such as the terminal's memory. At a predetermined timing, the mobile terminal transmits the audio data for a predetermined period (for example, one day or several hours) stored in the storage unit to the audio data analysis apparatus 1. The audio data receiving unit 1a of the audio data analysis apparatus 1 receives the audio data transmitted from the mobile terminal and sends it to the morphological analysis unit 1b. The audio data analysis apparatus 1 processes, as a batch, audio data covering a period long enough to contain multiple “utterances” and “conversations”. For example, the mobile terminal may transfer one day's data to the audio data analysis apparatus 1 while the user is asleep, and the audio data analysis apparatus 1 may then process the transferred audio data.

Alternatively, the audio data analysis apparatus 1 itself may be a mobile terminal with a recording function and a microphone (for example, a directional microphone). In that case, the target user wears the audio data analysis apparatus 1, which is a mobile terminal, around the neck at all times. The audio data receiving unit 1a of the audio data analysis apparatus 1 may then take as input a predetermined period of the target user's audio data that was captured by the microphone and stored, together with time data, in a storage unit (not shown).

The morphological analysis unit 1b analyzes the audio data into morphemes and classifies the data into morpheme sections, which contain morphemes, and the remaining non-speech sections. For the data classified into morpheme sections and non-speech sections, the utterance section extraction unit 1c determines a non-speech section to be a boundary between utterance sections when its length exceeds a predetermined time, and extracts the utterance sections.

The aizuchi detection unit 1d detects aizuchi morpheme sections. Specifically, when the morpheme of a morpheme section matches any of the aizuchi data in the aizuchi dictionary unit 1e, that morpheme section is determined to be an aizuchi morpheme section. The aizuchi dictionary unit 1e (aizuchi storage means) stores a plurality of aizuchi data items. It is the database that the aizuchi detection unit 1d refers to when detecting aizuchi. Interjections frequently used as aizuchi are registered in the aizuchi dictionary unit 1e in advance as text data.

For each utterance section extracted by the utterance section extraction unit 1c, the role classification unit 1f determines that aizuchi morpheme sections are sections in which the target user is in the listener role, and that all other morpheme sections are sections in which the target user is in the speaker role. The conversation identification unit 1g identifies an utterance section as a conversation section when it contains both listener-role sections and speaker-role sections. The output unit 1h (calculation means) calculates the target user's degree of communication based on the total duration or the number of the conversation sections identified by the conversation identification unit 1g. The storage unit 1i stores the morpheme sections and non-speech sections contained in the audio data in time series, together with their start and end times.

The audio data analysis apparatus 1 described above can be implemented using, for example, a general-purpose computer system comprising a CPU, memory, an external storage device such as an HDD, an input device, and an output device. In this computer system, each function of the audio data analysis apparatus 1 is realized by the CPU executing a program for the audio data analysis apparatus 1 loaded into memory. The program for the audio data analysis apparatus 1 can be stored on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD-ROM, or distributed via a network.

Next, the processing of this embodiment will be described.

FIG. 2 is a flowchart showing the processing flow of the audio data analysis apparatus 1 of this embodiment.

In S100, the audio data receiving unit 1a acquires the target user's audio data and sends it to the morphological analysis unit 1b. As described above, time data is attached to the audio data.

In S101, the morphological analysis unit 1b first converts the received series of audio data for the predetermined period into text data by performing speech recognition (see Reference 2). It then divides the text data into a sequence of morphemes by performing morphological analysis (see Reference 1).

Reference 1: “MeCab: Yet Another Part-of-Speech and Morphological Analyzer”, http://mecab.sourceforge.net/#format
Reference 2: “SpeechRec Solution Package”, http://www.ntt-it.co.jp/goods/vcj/v-series/speechrec/solution_package.html
Here, the morphological analysis unit 1b refers to the start time of each morpheme and treats the span from the start time of one morpheme to the start time of the next morpheme as one morpheme section. Portions not included in any morpheme section, such as periods with no speech or pauses within speech, are treated as non-speech sound, and the span from the start time to the end time of non-speech sound lying between two morpheme sections is treated as one non-speech section. The morphological analysis unit 1b then generates a table consisting of the start and end times of each morpheme section and non-speech section, and stores it in the storage unit 1i.

FIG. 3 shows an example of the table generated in S101. In the illustrated table, each morpheme and each non-speech section (shown as “○” in FIG. 3) is stored in time series in association with its start and end times. The start and end times in FIG. 3 are expressed as hours:minutes:seconds:milliseconds (h:m:s:ms). Milliseconds (ms) normally have three digits, but two digits are used here to simplify the explanation. Start and end times are expressed in the same way in FIGS. 4 to 7 and FIGS. 11 to 15.
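The table of morpheme sections and non-speech sections described for S101 could be represented roughly as follows. This is a minimal sketch, not the implementation of the embodiment: it assumes the recognizer supplies a start and an end time for each morpheme, it uses seconds rather than the h:m:s:ms notation of FIG. 3, and the names `Section` and `build_table` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

NON_SPEECH = "○"  # marker used for non-speech sections, as in FIG. 3


@dataclass
class Section:
    text: str     # morpheme text, or NON_SPEECH
    start: float  # seconds (assumed unit for this sketch)
    end: float


def build_table(morphemes: List[Tuple[str, float, float]]) -> List[Section]:
    """Build a time-ordered table, filling gaps between morphemes
    with non-speech rows."""
    table: List[Section] = []
    prev_end = None
    for text, start, end in sorted(morphemes, key=lambda m: m[1]):
        # A gap between the previous morpheme and this one is non-speech.
        if prev_end is not None and start > prev_end:
            table.append(Section(NON_SPEECH, prev_end, start))
        table.append(Section(text, start, end))
        prev_end = end
    return table


table = build_table([("はい", 0.0, 0.4), ("そう", 1.0, 1.3)])
```

In this sketch the gap from 0.4 s to 1.0 s becomes a single non-speech row between the two morpheme rows.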

In S102, in order to judge, for each situation in which speech takes place, whether it is a conversation, the utterance section extraction unit 1c first divides the time during which speech occurs in the audio data input in S100 into separate situations. Here, the span from the start time to the end time of one situation in which speech occurs is called an “utterance” section. Suppose, for example, that from 10:33 to 10:35 the user is asked for directions and gives them, and then from 13:10 to 13:30 meets and talks with a friend. In this example, the time from 10:33 to 13:30 is divided into two situations, giving directions and talking with a friend: 10:33 to 10:35 is the first “utterance” section and 13:10 to 13:30 is the second “utterance” section.

The utterance section extraction unit 1c refers to the non-speech sections in the table (see FIG. 3) generated by the morphological analysis unit 1b and stored in the storage unit 1i, treats any non-speech section longer than a predetermined time (for example, 5 minutes) as a boundary between “utterance” sections, and regards the morpheme sections between two boundaries as one “utterance” section. For example, the utterance section extraction unit 1c updates the table in the storage unit 1i by adding a delimiter tag column. For each non-speech section, it determines whether the section exceeds the predetermined time, sets the delimiter tag to “1” for non-speech sections that do, and sets it to “0” for non-speech sections that do not. The delimiter tag is set to “0” for all morpheme sections.

FIG. 4 shows an example of the table in the storage unit 1i after the utterance section extraction unit 1c has added the delimiter tag column. An “utterance” section delimited by delimiter tags “1” starts at the end time of the preceding non-speech section determined to be a boundary and ends at the start time of the following non-speech section determined to be a boundary. In the example of FIG. 4, the “utterance” section delimited by the non-speech sections 41 and 42 with delimiter tag “1” has a start time of “13:10:00:00”, which is the end time of the preceding non-speech section 41, and an end time of “13:30:00:00”, which is the start time of the following non-speech section 42.
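The extraction of “utterance” sections in S102 can be sketched as follows. This is an illustrative example only: rows are assumed to be `(text, start, end)` tuples with times in seconds, the 300-second default mirrors the 5-minute example in the text, and the name `split_utterances` is hypothetical.

```python
from typing import List, Tuple

NON_SPEECH = "○"  # marker for non-speech rows, as in FIG. 3
Row = Tuple[str, float, float]  # (text, start_sec, end_sec)


def split_utterances(rows: List[Row], max_gap: float = 300.0) -> List[List[Row]]:
    """Split a time-ordered table into utterance sections.

    A non-speech row longer than max_gap corresponds to delimiter
    tag "1" and closes the current section; every other row
    (delimiter tag "0") stays inside the section.
    """
    sections: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        text, start, end = row
        if text == NON_SPEECH and end - start > max_gap:
            if current:
                sections.append(current)
            current = []
        else:
            current.append(row)
    if current:
        sections.append(current)
    return sections


rows = [
    ("道", 0.0, 1.0),
    (NON_SPEECH, 1.0, 2.0),     # short pause: stays in the section
    ("です", 2.0, 3.0),
    (NON_SPEECH, 3.0, 1000.0),  # long silence: section boundary
    ("はい", 1000.0, 1001.0),
]
sections = split_utterances(rows)
```

Here the 997-second silence splits the rows into two utterance sections, while the one-second pause remains inside the first.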

In S103, for each “utterance” section extracted in S102, the aizuchi detection unit 1d refers to the table stored in the storage unit 1i, detects the morphemes in the morpheme sections that are aizuchi, and attaches an aizuchi tag to them. In this embodiment, aizuchi are detected using the aizuchi dictionary unit 1e. That is, interjections frequently used as aizuchi, such as “うん” (un), “ふーん” (fūn), “へぇ” (hē), “はい” (hai), “そうなんだ” (sō nanda), and “なるほど” (naruhodo), are registered in advance as text data in the aizuchi dictionary unit 1e. Using a text matching technique, the aizuchi detection unit 1d compares each morpheme with the interjections registered in the aizuchi dictionary unit 1e, detects as aizuchi any morpheme that matches one of them, and attaches an aizuchi tag.

FIG. 5 shows an example of the table in the storage unit 1i after the aizuchi detection unit 1d has added the aizuchi tag column. The aizuchi detection unit 1d sets the aizuchi tag to “1” for morpheme sections that are aizuchi and to “0” for all other sections, and updates the table in the storage unit 1i. In the illustrated example, the morphemes “はい” (hai) and “なるほど” (naruhodo) are detected as aizuchi.
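The dictionary-based aizuchi tagging of S103 can be sketched as follows. This is a simplified illustration that assumes exact string matching against a small in-memory set; the set contents are the examples given in the text, and the names `AIZUCHI_DICTIONARY` and `tag_aizuchi` are hypothetical.

```python
from typing import Iterable, List, Set

# Example entries taken from the description; a real dictionary
# would be registered in advance as text data.
AIZUCHI_DICTIONARY: Set[str] = {
    "うん", "ふーん", "へぇ", "はい", "そうなんだ", "なるほど",
}


def tag_aizuchi(morphemes: Iterable[str],
                dictionary: Set[str] = AIZUCHI_DICTIONARY) -> List[int]:
    """Return aizuchi tag 1 for rows matching a dictionary entry, else 0.

    Non-speech rows ("○") never match and therefore receive 0.
    """
    return [1 if m in dictionary else 0 for m in morphemes]


tags = tag_aizuchi(["はい", "今日", "○", "なるほど"])
```

Exact matching is the simplest form of text matching; multi-morpheme expressions such as “そうなんだ” would need the morphemes to be joined before lookup, which this sketch does not attempt.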

In S104, the role classification unit 1f determines whether the target user is in the listener role or the speaker role at each point by judging whether what was said is aizuchi. In this embodiment, the role classification unit 1f refers to the table stored in the storage unit 1i, classifies each “utterance” section extracted by the utterance section extraction unit 1c into listener-role sections and speaker-role sections, and updates the table by adding role tags.

Specifically, the role classification unit 1f refers to the delimiter tags and aizuchi tags set in the table and scans the aizuchi tags in order from the start time to the end time of each “utterance” section. The listener role starts when an aizuchi tag “1” is found immediately after the scan begins, or when an aizuchi tag “1” is first found while in the speaker role. Likewise, the speaker role starts when an aizuchi tag “0” is found immediately after the scan begins, or when an aizuchi tag “0” is first found while in the listener role. Within an “utterance” section the target user is always in one of the two roles, and the current role continues until the other role starts. Only the aizuchi tags of morpheme sections are consulted; the aizuchi tag “0” of non-speech sections is ignored. In this way, the role classification unit 1f determines that aizuchi morpheme sections are sections in which the target user is in the listener role and that all other morpheme sections are sections in which the target user is in the speaker role, and writes the results to the table in the storage unit 1i.

FIG. 6 shows an example of the table in the storage unit 1i after the role classification unit 1f has added the role tag column. In the illustrated example, the aizuchi tag “1” (61) is found immediately after the scan of the “utterance” section begins (13:10:00:00), so the role classification unit 1f detects the corresponding morpheme section as the start of the listener role and sets the role tag to listener until the start of the speaker role is detected. The role classification unit 1f then detects the point (62) at which an aizuchi tag “0” is first found while in the listener role as the start of the speaker role, and sets the role tag to speaker until the start of the listener role is detected.
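The role tagging of S104 can be sketched as a single pass over one “utterance” section. This is an illustrative example under stated assumptions: the section is given as parallel lists of row texts and aizuchi tags, non-speech rows are marked “○”, and the names `assign_roles`, `LISTENER`, and `SPEAKER` are hypothetical.

```python
from typing import List, Optional

NON_SPEECH = "○"
LISTENER, SPEAKER = "listener", "speaker"


def assign_roles(texts: List[str], aizuchi_tags: List[int]) -> List[Optional[str]]:
    """Assign a role tag to every row of one utterance section.

    Morpheme rows set the role (aizuchi -> listener, otherwise speaker).
    Non-speech rows are ignored for the decision and simply inherit
    the current role, so a role continues until the other one starts.
    """
    roles: List[Optional[str]] = []
    current: Optional[str] = None  # no role before the first morpheme
    for text, tag in zip(texts, aizuchi_tags):
        if text != NON_SPEECH:
            current = LISTENER if tag == 1 else SPEAKER
        roles.append(current)
    return roles


roles = assign_roles(
    ["はい", "○", "今日", "は", "○", "なるほど"],
    [1, 0, 0, 0, 0, 1],
)
```

In this example the section opens in the listener role, switches to the speaker role at “今日”, and returns to the listener role at “なるほど”; the pauses keep whichever role was active.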

Ｓ１０５において、会話識別部1gは、「発話」区間の中で、話し手になったり聞き手なったりと時間の推移と共に役割が変動していることを「会話」の条件とし、この条件が成立している発話区間を「会話」区間と識別する。会話識別部1gは、記憶部1iに記憶されたテーブルの「発話」区間の区切りタグ、及び、役割タグを参照し、１つの「発話」区間内に聞き手役と話し手役の両方の役割が混在している場合に、当該「発話」区間を「会話」区間と判別する。 In S105, the conversation identification unit 1g takes as the condition for a “conversation” that, within an “utterance” section, the role changes over time between speaker and listener, and identifies an utterance section that satisfies this condition as a “conversation” section. The conversation identification unit 1g refers to the delimiter tags of the “utterance” sections and the role tags in the table stored in the storage unit 1i, and when both the listener role and the speaker role occur within a single “utterance” section, determines that “utterance” section to be a “conversation” section.

図７は、「会話」区間の一例を示すテーブルであって、１つの「発話」区間内に聞き手役と、話し手役の両方の役割が含まれている。 FIG. 7 is a table showing an example of a “conversation” section, in which a single “utterance” section contains both the listener role and the speaker role.
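The S105 condition reduces to a set-membership check per utterance section. The following Python sketch assumes the role tags have already been collected per section; the function name, the dictionary layout, and the labels `"listener"` and `"speaker"` are hypothetical.

```python
from typing import Dict, List


def identify_conversations(roles_by_section: Dict[int, List[str]]) -> Dict[int, bool]:
    """Map each utterance-section id to True if it is a conversation section.

    A section counts as a conversation when its role tags contain both
    the listener role and the speaker role.
    """
    required = {"listener", "speaker"}
    return {
        section_id: required <= set(roles)
        for section_id, roles in roles_by_section.items()
    }
```

A section consisting only of speaker rows (for example a monologue) or only of listener rows is therefore not counted as a conversation.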

Ｓ１０６において、出力部1hは、記憶部1iに記憶されたテーブルを参照し、対象ユーザのコミュニケーション度合いを算出し、算出結果をディスプレイなどの出力装置に出力する。コミュニケーション度合いは、対象ユーザのコミュニケーションの程度を示す数値（指標）である。コミュニケーション度合いの算出については、例えば以下のような様々な算出方法が考えられ、コミュニケーション度合いが高いほど、周囲とのコミュニケーションが活発であると推定され、逆にコミュニケーション度合いが低いと、コミュニケーションが希薄であると推定することができる。 In S106, the output unit 1h refers to the table stored in the storage unit 1i, calculates the degree of communication of the target user, and outputs the result to an output device such as a display. The degree of communication is a numerical value (index) indicating the extent of the target user's communication. Various calculation methods are conceivable, for example those described below. The higher the degree of communication, the more active the target user's communication with those around them is estimated to be; conversely, a low degree of communication suggests that communication is sparse.

例えば、出力部1hは、入力された音声データの時間（例えば１日）の中で、会話をおこなった時間の割合をコミュニケーション度合いとして算出する。すなわち、出力部1hは、会話識別部1gにおいて「会話」区間と識別された区間の時間の総和を、音声データ全体の時間で割ることで算出する。 For example, the output unit 1h calculates, as the degree of communication, the proportion of the duration of the input audio data (for example, one day) that was spent in conversation. That is, the output unit 1h divides the total duration of the sections identified as “conversation” sections by the conversation identification unit 1g by the duration of the entire audio data.
また、出力部1hは、「発話」区間の数に対する、「会話」区間の数の割合をコミュニケーション度合いとして算出する。すなわち、出力部1hは、会話識別部1gで識別した「会話」区間の数を、発話区間切り出し部1cで切り出した「発話」区間の数で割ることでコミュニケーション度合いを算出する。 Alternatively, the output unit 1h calculates, as the degree of communication, the ratio of the number of “conversation” sections to the number of “utterance” sections. That is, the output unit 1h divides the number of “conversation” sections identified by the conversation identification unit 1g by the number of “utterance” sections cut out by the utterance section cutout unit 1c.

図８は、本実施形態の処理におけるデータを説明するための説明図である。 FIG. 8 is an explanatory diagram for explaining the data handled in the processing of this embodiment.
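The two example calculations of the degree of communication given for the output unit 1h are simple ratios. The Python sketch below shows both; the function names and the handling of an empty input (returning 0.0 when there are no utterance sections) are assumptions, since the source does not specify edge cases.

```python
from typing import Iterable


def degree_by_time(conversation_durations: Iterable[float], total_duration: float) -> float:
    """Total duration of conversation sections divided by the duration of all audio data."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    return sum(conversation_durations) / total_duration


def degree_by_count(num_conversation_sections: int, num_utterance_sections: int) -> float:
    """Number of conversation sections divided by the number of utterance sections."""
    if num_utterance_sections == 0:
        return 0.0  # assumption: no utterances means no measurable communication
    return num_conversation_sections / num_utterance_sections
```

For instance, 15 minutes of conversation within one hour of audio gives a time-based degree of 0.25, and 3 conversation sections out of 12 utterance sections gives a count-based degree of 0.25.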

音声データ受信部1aは、音声データを受信（入力）する。形態素解析部1bは、音声認識により音声データをテキストデータに変換し、形態素解析により形態素区間および非発声区間に分解する。発話区間切り出し部1cは、所定の長さを超える非音声音区間を区切りとし、区切りタグを用いて「発話」区間を切り出す。あいづち検出部1dは、形態素区間の中からあいづちを検出する。役割分類部1fは、「発話」区間を聞き手役と話し手役に分類する。会話識別部1gは、「発話」区間に聞き手役と話し手役の両方が含まれる場合に、当該「発話」区間を「会話」区間と判別する。 The audio data receiving unit 1a receives (inputs) audio data. The morphological analysis unit 1b converts the audio data into text data by speech recognition and decomposes it into morpheme sections and non-vocal-sound sections by morphological analysis. The utterance section cutout unit 1c treats each non-vocal-sound section exceeding a predetermined length as a delimiter and cuts out “utterance” sections using delimiter tags. The aizuchi detection unit 1d detects aizuchi among the morpheme sections. The role classification unit 1f classifies each “utterance” section into listener-role and speaker-role portions. The conversation identification unit 1g determines an “utterance” section to be a “conversation” section when it contains both the listener role and the speaker role.
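The cutout step performed by the utterance section cutout unit 1c can be sketched as follows. The row layout `(kind, start_sec, end_sec)`, the function name, and the `max_gap` parameter are assumptions made for illustration; the source only states that a non-vocal-sound section exceeding a predetermined length acts as a delimiter.

```python
from typing import List, Tuple

# (kind, start_sec, end_sec); kind is "morpheme" or "silence" (non-vocal-sound section)
Row = Tuple[str, float, float]


def cut_utterance_sections(rows: List[Row], max_gap: float) -> List[List[Row]]:
    """Split time-ordered rows into utterance sections.

    A silence longer than max_gap seconds is treated as a delimiter and is
    not included in any section; shorter silences stay inside the section.
    """
    sections: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        kind, start, end = row
        if kind == "silence" and end - start > max_gap:
            if current:
                sections.append(current)
            current = []
        else:
            current.append(row)
    if current:
        sections.append(current)
    return sections
```

The choice of `max_gap` determines how readily nearby utterances are merged into one section, so it would need tuning against real recordings.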

＜第２の実施形態＞ <Second Embodiment>
図９は、第２の実施形態における音声データ分析装置２の構成を示す構成図である。図示する音声データ分析装置２は、音声データ受信部2aと、形態素解析部2bと、発話区間切り出し部2cと、あいづち検出部2dと、あいづち辞書部2e（あいづち記憶手段）と、あいづち話者判別部2fと、会話識別部2gと、出力部2h（算出手段）と、記憶部2iと、本人音声モデル記録部2jを備える。 FIG. 9 is a configuration diagram of the audio data analysis apparatus 2 according to the second embodiment. The illustrated audio data analysis apparatus 2 includes an audio data receiving unit 2a, a morphological analysis unit 2b, an utterance section cutout unit 2c, an aizuchi detection unit 2d, an aizuchi dictionary unit 2e (aizuchi storage means), an aizuchi speaker determination unit 2f, a conversation identification unit 2g, an output unit 2h (calculation means), a storage unit 2i, and a personal voice model recording unit 2j.

第２の本実施形態の音声データ分析装置２は、あいづち話者判別部2fと、本人音声モデル記録部2jとを備える点において、第１の実施形態の音声データ分析装置１（図１参照）と異なる。その他の音声データ受信部2a、形態素解析部2b、発話区間切り出し部2c、あいづち検出部2d、あいづち辞書部2e、会話識別部2g、出力部2hおよび記憶部2iについては、第１の実施形態の音声データ分析装置１の音声データ受信部1a、形態素解析部1b、発話区間切り出し部1c、あいづち検出部1d、あいづち辞書部1e、会話識別部1g、出力部1hおよび記憶部1iと同様であるため、ここでは説明を省略する。 The audio data analysis apparatus 2 of the second embodiment differs from the audio data analysis apparatus 1 of the first embodiment (see FIG. 1) in that it includes the aizuchi speaker determination unit 2f and the personal voice model recording unit 2j. The remaining components, namely the audio data receiving unit 2a, morphological analysis unit 2b, utterance section cutout unit 2c, aizuchi detection unit 2d, aizuchi dictionary unit 2e, conversation identification unit 2g, output unit 2h, and storage unit 2i, are the same as the audio data receiving unit 1a, morphological analysis unit 1b, utterance section cutout unit 1c, aizuchi detection unit 1d, aizuchi dictionary unit 1e, conversation identification unit 1g, output unit 1h, and storage unit 1i of the audio data analysis apparatus 1 of the first embodiment, so their description is omitted here.

また、第２の実施形態において、音声データ受信部2aが取得する音声データは、対象ユーザ本人の発声音だけでなく、話し相手の発声音なども含んだ音声データを想定する。例えば、集音マイク等を備えた録音機能を有するモバイル型端末（または、集音マイク等を備えた録音機能を有するモバイル型端末である音声データ分析装置２）を、対象ユーザに常時携帯させることで実現することが考えられる。この音声データには、話者（対象ユーザ、会話相手）の音声以外の環境音も含まれていてもよい。 In the second embodiment, the audio data acquired by the audio data receiving unit 2a is assumed to contain not only the target user's own voice but also the voices of conversation partners and the like. This can be achieved, for example, by having the target user always carry a mobile terminal with a recording function, including a sound-collecting microphone or the like (or the audio data analysis apparatus 2 itself, implemented as such a mobile terminal). The audio data may also contain environmental sounds other than the voices of the speakers (the target user and conversation partners).

あいづち話者判別部2fは、発話区間切り出し部2cが切り出した各発話区間毎に、あいづちの形態素区間の話者が対象ユーザか他者かを判別する。具体的には、あいづち話者判別部2fは、本人音声モデル記録部2jに記録された対象ユーザの音声モデルと、あいづちの形態素区間に対応する音声データとを比較し、音声モデルと照合した音声データのあいづちの形態素区間については対象ユーザが話者であると判別し、前記音声モデルと照合しない音声データのあいづちの形態素区間については他者が話者であると判別する。 For each utterance section cut out by the utterance section cutout unit 2c, the aizuchi speaker determination unit 2f determines whether the speaker of each aizuchi morpheme section is the target user or another person. Specifically, the aizuchi speaker determination unit 2f compares the target user's voice model recorded in the personal voice model recording unit 2j with the audio data corresponding to each aizuchi morpheme section. For an aizuchi morpheme section whose audio data matches the voice model, it determines that the target user is the speaker; for one whose audio data does not match the voice model, it determines that another person is the speaker.

本人音声モデル記録部2jは、あいづち話者判別部2fが話者照合を行う際に、参照されるデータベースである。参照される対象ユーザ本人の音声モデルは、あらかじめ学習により作成し、本人音声モデル記録部2jに登録しておく。 The personal voice model recording unit 2j is a database that the aizuchi speaker determination unit 2f refers to when performing speaker verification. The target user's own voice model is created in advance by training and registered in the personal voice model recording unit 2j.

なお、本実施形態の会話識別部2gは、１つの発話区間の中で、あいづちの形態素区間の話者が対象ユーザと他者の両方が含まれる場合、当該発話区間を会話区間であると識別する。 The conversation identification unit 2g of this embodiment identifies an utterance section as a conversation section when the speakers of the aizuchi morpheme sections within that single utterance section include both the target user and another person.

図１０は、本実施形態の処理を示すフローチャートである。 FIG. 10 is a flowchart showing the processing of this embodiment.

Ｓ２００からＳ２０３については、第１の実施形態のＳ１００からＳ１０３と同様に処理を行う。すなわち、Ｓ２００において、音声データ受信部2aは、音声データを取得する。Ｓ２０１において、形態素解析部2bは、音声データをテキストデータに変換し、変換したテキストデータを形態素に分解し、処理結果を記憶部2iに記憶する。図１１は、形態素解析部2bにより生成され、記憶部2iに記憶されるテーブルの一例である。図示するテーブルには、形態素区間および非発声区間と、その開始時刻および終了時刻とが対応付けて記憶されている。 S200 to S203 are processed in the same manner as S100 to S103 of the first embodiment. That is, in S200, the audio data receiving unit 2a acquires audio data. In S201, the morphological analysis unit 2b converts the audio data into text data, decomposes the converted text data into morphemes, and stores the result in the storage unit 2i. FIG. 11 is an example of the table generated by the morphological analysis unit 2b and stored in the storage unit 2i. In the illustrated table, morpheme sections and non-vocal-sound sections are stored in association with their start and end times.

Ｓ２０２において、発話区間切り出し部2cは、記憶部2iのテーブルを参照し、非発声音区間の長さに基づいて区切りタグを設定し、「発話」区間を切り出す。図１２は、発話区間切り出し部2cにより更新されたテーブルの一例であって、区切りタグが付加されている。 In S202, the utterance section cutout unit 2c refers to the table in the storage unit 2i, sets delimiter tags based on the lengths of the non-vocal-sound sections, and cuts out “utterance” sections. FIG. 12 is an example of the table as updated by the utterance section cutout unit 2c, with delimiter tags added.

Ｓ２０３において、あいづち検出部2dは、あいづち辞書部2eを用いて、記憶部2iのテーブルの形態素区間の中からあいづちを検出し、あいづちタグを設定する。図１３は、あいづち検出部2dにより更新されたテーブルの一例であって、あいづちタグが付加されている。 In S203, the aizuchi detection unit 2d uses the aizuchi dictionary unit 2e to detect aizuchi among the morpheme sections in the table of the storage unit 2i and sets the aizuchi tags. FIG. 13 is an example of the table as updated by the aizuchi detection unit 2d, with aizuchi tags added.
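Dictionary-based aizuchi detection as in S203 amounts to a lookup per morpheme. The Python sketch below is illustrative only: the dictionary entries are hypothetical examples of common Japanese backchannel words and are not taken from the aizuchi dictionary unit described in this document.

```python
from typing import FrozenSet, List

# Hypothetical dictionary entries; the actual contents of the aizuchi
# dictionary unit are not reproduced here.
AIZUCHI_DICTIONARY: FrozenSet[str] = frozenset({"はい", "うん", "ええ", "へえ", "なるほど"})


def tag_aizuchi(morphemes: List[str],
                dictionary: FrozenSet[str] = AIZUCHI_DICTIONARY) -> List[int]:
    """Return the aizuchi tag (1 or 0) for each morpheme, in order."""
    return [1 if morpheme in dictionary else 0 for morpheme in morphemes]
```

Exact matching keeps the step cheap, at the cost of missing variants that the dictionary does not list.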

Ｓ２０４において、あいづち話者判別部2fは、「発話」区間に含まれる、あいづち（あいづちタグ“1”）の話者が、対象ユーザ本人かそれ以外の他者かを判別することで、そのとき対象ユーザ本人と他者のどちらが聞き手役となっているのかを判別する。あらかじめ対象ユーザ本人の音声データを学習して音声モデルを作成し、本人音声モデル記録部2jに登録しておく。あいづち話者判別部2fは、あいづち検出部2dが検出したあいづちの形態素区間に対応する音声データと、本人音声モデル記録部2jに登録された対象ユーザ本人の音声モデルとの話者照合を行う（文献３参照）。なお、音声データは、Ｓ２００で入力されたものであって、図示しないメモリなどの記憶部に格納されている。 In S204, the aizuchi speaker determination unit 2f determines whether the speaker of each aizuchi (aizuchi tag “1”) contained in an “utterance” section is the target user or another person, and thereby determines which of the two is in the listener role at that time. The target user's voice data is learned in advance to create a voice model, which is registered in the personal voice model recording unit 2j. The aizuchi speaker determination unit 2f performs speaker verification between the audio data corresponding to each aizuchi morpheme section detected by the aizuchi detection unit 2d and the target user's voice model registered in the personal voice model recording unit 2j (see Reference 3). The audio data is the data input in S200 and is held in a storage unit such as a memory (not shown).

文献３：松井, 田邉, “dPLRMを用いた話者識別”, 統計数理研究所特集「計算推論-モデリング・数理・アルゴリズム-」, 第53巻, 第2号, pp.201-210, 2005. Reference 3: Matsui, Tanabe, “Speaker Identification Using dPLRM”, Institute of Statistical Mathematics, special issue “Computational Inference: Modeling, Mathematics, Algorithms”, Vol. 53, No. 2, pp. 201-210, 2005.

あいづちの音声データが、本人音声モデル記録部2jの音声モデルと照合した場合（照合が取れた場合）、あいづち話者判別部2fは、当該あいづちの話者を対象ユーザ本人であると判別し、照合しなかった場合（照合が取れなかった場合）は、当該あいづちの話者を他者と判別する。そして、判別結果である「本人」または「他者」をタグ付けし、記憶部2iに記憶されたテーブルを更新する。 When the aizuchi audio data matches the voice model in the personal voice model recording unit 2j (verification succeeds), the aizuchi speaker determination unit 2f determines that the speaker of that aizuchi is the target user; when it does not match (verification fails), it determines that the speaker is another person. It then tags the aizuchi with the result, “self” or “other”, and updates the table stored in the storage unit 2i.
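The match/no-match decision in S204 can be illustrated with a deliberately simple stand-in. The sketch below does not implement the dPLRM-based method of Reference 3; it assumes, purely for illustration, that each aizuchi audio segment and the target user's voice model have already been reduced to fixed-length feature vectors, and it compares them by cosine similarity against a hypothetical threshold. The labels `"self"` and `"other"` stand for the 本人 and 他者 speaker tags.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tag_aizuchi_speaker(segment_features: Sequence[float],
                        user_model: Sequence[float],
                        threshold: float = 0.8) -> str:
    """Return "self" if the segment matches the target user's model, else "other".

    The feature representation and the threshold value are assumptions.
    """
    score = cosine_similarity(segment_features, user_model)
    return "self" if score >= threshold else "other"
```

Any speaker verification method that yields an accept/reject decision against the target user's model could be substituted for the similarity test.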

図１４は、あいづち話者判別部2fにより、記憶部2iのテーブルに話者タグの欄が設けられたテーブルの一例である。あいづちタグ“1”が設定されたデータの話者タグには“本人”または“他者”が設定されている。 FIG. 14 is an example of the table in the storage unit 2i after the aizuchi speaker determination unit 2f has added a speaker tag column. For each row whose aizuchi tag is “1”, the speaker tag is set to “self” or “other”.

Ｓ２０５において、会話識別部2gは、「発話」区間の中で、あいづちを打っている人物が対象ユーザ本人になったり他者になったりと、時間の推移と共にあいづち話者が変動していることを「会話」の条件とし、この条件が成立している区間を「会話」区間と識別する。 In S205, the conversation identification unit 2g takes as the condition for a “conversation” that, within an “utterance” section, the aizuchi speaker changes over time, with the person giving aizuchi being sometimes the target user and sometimes another person, and identifies a section that satisfies this condition as a “conversation” section.

１つの「発話」区間において対象ユーザ本人と他者で聞き手と話し手の役割が入れ替わり、両者が聞き手役を担った、すなわち両者があいづちを発声した場合に、その「発話」区間を「会話」区間と判別する。会話識別部2gは、Ｓ２０２で付加された区切りタグと、Ｓ２０４で付加された話者タグを参照し、１つの「発話」区間内でのあいづち話者に、対象ユーザ本人と他者の両方が混在している場合において、「会話」区間と判別する。 When, within a single “utterance” section, the listener and speaker roles alternate between the target user and another person so that both have taken the listener role, that is, both have uttered aizuchi, that “utterance” section is determined to be a “conversation” section. The conversation identification unit 2g refers to the delimiter tags added in S202 and the speaker tags added in S204, and determines an “utterance” section to be a “conversation” section when the aizuchi speakers within it include both the target user and another person.
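The S205 decision can be sketched over a flattened table. In the Python example below, each row is assumed to carry a section id (derived from the delimiter tags), an aizuchi tag, and a speaker tag; this row layout, the function name, and the labels `"self"` and `"other"` (standing for 本人 and 他者) are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Optional, Set, Tuple

# (section_id, aizuchi_tag, speaker_tag); speaker_tag is None for non-aizuchi rows
Row = Tuple[int, int, Optional[str]]


def conversation_sections(rows: Iterable[Row]) -> Set[int]:
    """Return ids of utterance sections whose aizuchi speakers include both parties."""
    speakers: DefaultDict[int, Set[str]] = defaultdict(set)
    for section_id, aizuchi_tag, speaker_tag in rows:
        # Only aizuchi rows carry a meaningful speaker tag.
        if aizuchi_tag == 1 and speaker_tag is not None:
            speakers[section_id].add(speaker_tag)
    return {
        section_id
        for section_id, tags in speakers.items()
        if {"self", "other"} <= tags
    }
```

A section in which only one party ever gives aizuchi, or in which no aizuchi occurs at all, is left out of the result.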

図１５は、「会話」区間の一例を示すテーブルであって、１つの「発話」区間内に本人と他者の両方の話者タグが含まれている。 FIG. 15 is a table showing an example of a “conversation” section, in which a single “utterance” section contains both “self” and “other” speaker tags.

Ｓ２０６において、出力部2hは、第１の実施形態のＳ１０６と同様に、記憶部2iに記憶されたテーブルを参照し、識別された「会話」区間を用いて対象ユーザのコミュニケーション度合いを算出し、算出結果をディスプレイなどの出力装置に出力する。 In S206, as in S106 of the first embodiment, the output unit 2h refers to the table stored in the storage unit 2i, calculates the communication level of the target user using the identified “conversation” section, The calculation result is output to an output device such as a display.

以上説明した第１および第２の本実施形態では、「発話」と「会話」を区別して、「会話」区間を自動で抽出することが可能となる。すなわち、第１の実施形態では、対象ユーザが「話し手」と「聞き手」の双方の役割を果たしている「発話」区間を「会話」区間として識別し、第２の実施形態では、あいづちの話者に対象ユーザ本人と他者の両方が含まれている「発話」区間を「会話」区間として識別する。 In the first and second embodiments described above, “utterances” and “conversations” can be distinguished and “conversation” sections extracted automatically. Specifically, the first embodiment identifies as a “conversation” section an “utterance” section in which the target user plays both the “speaker” and “listener” roles, and the second embodiment identifies as a “conversation” section an “utterance” section whose aizuchi speakers include both the target user and another person.

これにより、本実施形態では、音声により様々なコミュニケーション（一方通行のコミュニケーション、独り言、交流を目的としたコミュニケーションなど）の中から、交流を目的としたコミュニケーションの区間のみを抽出することができる。言い換えると、一方通行のコミュニケーションや独り言などの交流を目的としないコミュニケーションを排除することで、より精度の高いコミュニケーションの状況を把握することができる。例えば、個々のユーザの「会話」状況を自治体等が日常的に管理することで、周囲との日常的なコミュニケーションが希薄な人物を把握でき、引きこもりの防止につながる。 As a result, from among the various kinds of spoken communication (one-way communication, talking to oneself, communication for the purpose of social interaction, and so on), this embodiment can extract only the sections of communication aimed at social interaction. In other words, by excluding communication that is not aimed at interaction, such as one-way communication and talking to oneself, the state of communication can be grasped more accurately. For example, if a local government or similar body routinely monitors the “conversation” status of individual users, it can identify people whose everyday communication with those around them is sparse, which helps prevent social withdrawal.

また、本実施形態では、対象ユーザ本人を含む３名以上で行われる対面型コミュニケーションに対しても、「発話」と「会話」の判別が可能である。例えば、３名で「会話」をしている場面において、対象ユーザ本人は、話に入っていけず、対象ユーザ本人以外のメンバで「会話」が構成されている場合が考えられる。従来技術では、一対一で処理を行い全員が会話状態にあると認識するが、本実施形態によれば、対象ユーザ本人はあいづちしか発声していないため、これを「会話」とはみなさない。 This embodiment can also distinguish “utterances” from “conversations” in face-to-face communication among three or more people including the target user. For example, in a scene where three people are having a “conversation”, the target user may be unable to join in, so that the “conversation” is actually carried on by the members other than the target user. The prior art processes participants pairwise and recognizes all of them as being in a conversation state, whereas in this embodiment the target user has uttered only aizuchi, so this is not regarded as a “conversation”.

また、本実施形態では、１つの音声取得機能を有する端末から取得できる音声データを入力として処理を行うため、対象ユーザ本人のみが端末を保持するだけで実現可能である。したがって、対象ユーザだけでなく、全員に端末を保持させる必要のある従来技術に比べ、導入が容易である。 Furthermore, because this embodiment processes audio data obtainable from a single terminal with an audio acquisition function, it can be implemented with only the target user carrying a terminal. It is therefore easier to deploy than the prior art, which requires not only the target user but everyone involved to carry a terminal.

また、本実施形態は、抽出された「会話」区間を用いて対象ユーザのコミュニケーション度合いを算出する。これにより、対象ユーザのコミュニケーションの程度を推測することができる。 This embodiment also calculates the target user's degree of communication using the extracted “conversation” sections, which makes it possible to estimate the extent of the target user's communication.

なお、本発明は上記実施形態に限定されるものではなく、その要旨の範囲内で数々の変形が可能である。 The present invention is not limited to the above embodiments, and various modifications are possible within the scope of its gist.

１：音声データ分析装置 1: Audio data analysis apparatus
1a ：音声データ受信部 1a: Audio data receiving unit
1b ：形態素解析部 1b: Morphological analysis unit
1c ：発話区間切り出し部 1c: Utterance section cutout unit
1d ：あいづち検出部 1d: Aizuchi detection unit
1e ：あいづち辞書部 1e: Aizuchi dictionary unit
1f ：役割分類部 1f: Role classification unit
1g ：会話識別部 1g: Conversation identification unit
1h ：出力部 1h: Output unit
1i ：記憶部 1i: Storage unit
２：音声データ分析装置 2: Audio data analysis apparatus
2a ：音声データ受信部 2a: Audio data receiving unit
2b ：形態素解析部 2b: Morphological analysis unit
2c ：発話区間切り出し部 2c: Utterance section cutout unit
2d ：あいづち検出部 2d: Aizuchi detection unit
2e ：あいづち辞書部 2e: Aizuchi dictionary unit
2f ：あいづち話者判別部 2f: Aizuchi speaker determination unit
2g ：会話識別部 2g: Conversation identification unit
2h ：出力部 2h: Output unit
2i ：記憶部 2i: Storage unit
2j ：本人音声モデル記憶部 2j: Personal voice model recording unit

Claims

An audio data analysis apparatus comprising:
utterance section cutout means for, with respect to data in which audio data containing the voices of a target user and another person has been classified into morpheme sections and non-vocal-sound sections, determining a non-vocal-sound section whose length exceeds a predetermined time to be an utterance delimiter and cutting out utterance sections;
aizuchi detection means for detecting aizuchi morpheme sections;
aizuchi speaker determination means for determining, for each utterance section cut out by the utterance section cutout means, whether the speaker of each aizuchi morpheme section is the target user or another person; and
conversation identification means for identifying an utterance section as a conversation section when the speakers of the aizuchi morpheme sections within that single utterance section include both the target user and another person,
wherein the aizuchi speaker determination means compares a voice model of the target user with the audio data corresponding to each aizuchi morpheme section, determines that the target user is the speaker for an aizuchi morpheme section whose audio data matches the voice model, and determines that another person is the speaker for an aizuchi morpheme section whose audio data does not match the voice model.

The audio data analysis apparatus according to claim 1,
further comprising calculation means for calculating a degree of communication of the target user based on the total duration or the number of the conversation sections identified by the conversation identification means.

The audio data analysis apparatus according to claim 1 or 2,
further comprising morphological analysis means for analyzing the audio data into morphemes and classifying it into the morpheme sections, which contain the morphemes, and the remaining non-vocal-sound sections.

The audio data analysis apparatus according to any one of claims 1 to 3,
further comprising aizuchi storage means storing a plurality of aizuchi data,
wherein the aizuchi detection means determines a morpheme section to be an aizuchi morpheme section when the morpheme of that morpheme section matches any of the aizuchi data in the aizuchi storage means.

An audio data analysis method performed by an audio data analysis apparatus, the method comprising:
an utterance section cutout step of, with respect to data in which audio data containing the voices of a target user and another person has been classified into morpheme sections and non-vocal-sound sections, determining a non-vocal-sound section whose length exceeds a predetermined time to be an utterance delimiter and cutting out utterance sections;
an aizuchi detection step of detecting aizuchi morpheme sections;
an aizuchi speaker determination step of determining, for each utterance section cut out in the utterance section cutout step, whether the speaker of each aizuchi morpheme section is the target user or another person; and
a conversation identification step of identifying an utterance section as a conversation section when the speakers of the aizuchi morpheme sections within that single utterance section include both the target user and another person,
wherein, in the aizuchi speaker determination step, a voice model of the target user is compared with the audio data corresponding to each aizuchi morpheme section, the target user is determined to be the speaker for an aizuchi morpheme section whose audio data matches the voice model, and another person is determined to be the speaker for an aizuchi morpheme section whose audio data does not match the voice model.

The audio data analysis method according to claim 5,
further comprising a calculation step of calculating a degree of communication of the target user based on the total duration or the number of the conversation sections identified in the conversation identification step.

The audio data analysis method according to claim 5 or 6,
wherein the audio data analysis apparatus further includes aizuchi storage means storing a plurality of aizuchi data, and
in the aizuchi detection step, a morpheme section is determined to be an aizuchi morpheme section when the morpheme of that morpheme section matches any of the aizuchi data in the aizuchi storage means.

An audio data analysis program for causing a computer to function as the audio data analysis apparatus according to any one of claims 1 to 4.