JP5713782B2 - Information processing apparatus, information processing method, and program

Info

Publication number: JP5713782B2
Application number: JP2011095056A
Authority: JP
Inventors: 友範田中; 深田　俊明; 俊明深田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2011-04-21
Filing date: 2011-04-21
Publication date: 2015-05-07
Anticipated expiration: 2031-04-21
Also published as: JP2012226651A

Description

The present invention relates to a technique for identifying the audio related to a specific passage in, for example, the minutes of a meeting.

Conventionally, a technique is known in which keyword matching is used to identify, within meeting minutes transcribed by speech recognition, the passages related to an input keyword (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent Laid-Open No. 2002-99530 (JP 2002-99530 A)

To check the details of what is written in the minutes, the audio of the meeting is recorded and then listened to. Because listening to the entire recording takes a great deal of time, the user needs to listen only to the audio related to a specific passage in the minutes. However, with keyword matching as disclosed in Patent Document 1, it is difficult to identify the related audio when the input keyword appears over a wide range of the minutes.

An object of the present invention is therefore to identify, with high accuracy, the audio related to a specific passage in, for example, the minutes of a meeting.

An information processing apparatus according to the present invention comprises: conversion means for treating each utterance unit of audio data containing a plurality of utterances by a plurality of speakers as a speech section, and converting each of the plurality of speech sections into a plurality of first text data; first specifying means for specifying, for each speech section, an utterance start time and a speaker; first receiving means for receiving input of second text data representing a summary sentence of the utterance unit and of information indicating the speaker corresponding to the second text data; second specifying means for specifying the input time of the second text data; matching means for performing text matching between each of the first text data and the second text data and identifying corresponding parts; and third specifying means for specifying the speech section corresponding to the second text data based on the corresponding parts, the input time, the utterance start time, the information indicating the speaker, and the speaker specified by the first specifying means.

According to the present invention, the audio related to a specific passage in, for example, the minutes of a meeting can be identified with high accuracy.

FIG. 1 is a diagram showing an overview of an information processing system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the functional configuration of the information processing system according to the embodiment.
FIG. 3 is a diagram showing the hardware configuration of the information processing system according to the embodiment.
FIG. 4 is a flowchart showing the process of identifying the speech section related to a summary sentence of the minutes (the related speech section).
FIG. 5A is a diagram showing the content of the utterances in the audio recorded during a meeting.
FIG. 5B is a diagram showing the result of identifying utterance units and utterance start times from the recorded audio in step S401.
FIG. 5C is a diagram showing the result of identifying the speaker of each utterance unit in step S402.
FIG. 5D is a diagram showing the number of matching parts in each utterance unit.
FIG. 6 is a flowchart showing the details of step S403.
FIG. 7 is a diagram showing an example of the minutes created by the minutes clerk, and of the summary sentences, speakers, and summary input times identified from the minutes.
FIG. 8 is a diagram showing an example in which the stroke time holding unit holds the stroke times for the summary sentence "It is worth presenting at the meeting." shown at 701 in FIG. 7(b).
FIG. 9 is a diagram showing the result of morphological analysis of the summary sentence "It is worth presenting at the meeting." shown at 701 in FIG. 7.
FIG. 10 is a flowchart showing the details of the process in step S405.
FIG. 11 is a diagram for explaining an example of playing back utterance units.
FIG. 12 is a flowchart showing the process in the second embodiment.
FIG. 13 is a flowchart showing the process in the third embodiment.
FIG. 14 is a flowchart showing the process in the fourth embodiment.

Preferred embodiments to which the present invention is applied are described in detail below with reference to the accompanying drawings.

First, a first embodiment of the present invention will be described. FIG. 1 is a diagram showing an overview of the information processing system according to the first embodiment. The information processing system according to this embodiment has a voice conversion function and a speaker specifying function. The system achieves its effect through the cooperation of separate processing units, but the present invention is not limited to this arrangement and is also applicable to an information processing apparatus that integrates all of the processing units.

The information processing system shown in FIG. 1 consists mainly of a microphone 101 and a PC 102. When speakers 103 to 107 hold a meeting, the microphone 101 records what they say. A minutes clerk 108 uses the PC 102 to enter summary sentences of the utterances and thereby creates the minutes of the meeting. Here, the set of summary sentences constitutes the minutes. For convenience of explanation, the speakers 103 to 107 are hereinafter named "Sato", "Tanaka", "Suzuki", "Ito", and "Okawa", respectively.

FIG. 2 is a diagram showing the functional configuration of the information processing system according to this embodiment. In FIG. 2, a voice conversion unit 201 converts the utterances recorded by the microphone 101 into text. A speaker specifying unit 202 identifies the speaker of each utterance recorded by the microphone 101. An input unit 203 corresponds to the PC 102 and is used by the minutes clerk 108 to enter summary sentences of the meeting's utterances together with their speakers. A matching part specifying unit 204 performs text matching between the utterance text and a summary sentence and identifies the parts that match each other (hereinafter, matching parts). A speech section specifying unit 205 identifies the speech section related to a summary sentence using the utterance times, the summary input time, the matching parts, and the speaker information. The input unit 203 includes a text input unit 206 for entering a summary sentence as text, a stroke time holding unit 207 that holds the stroke times of the summary sentence, and an input time specifying unit 208 that uses the stroke times to identify the time at which input of the summary sentence began. In the following description, the speech section related to a summary sentence of the minutes may be called the related speech section.

FIG. 3 is a diagram showing the hardware configuration of the information processing system according to this embodiment. A CPU 301 operates according to a program to carry out each operating procedure of this embodiment. A RAM 302 provides the storage area needed for the program to run. A ROM 303 holds the program that implements these operating procedures, along with databases and the like. A voice input device 304 corresponds to the microphone 101 and records the speech of the speakers 103 to 107. A text input device 306 corresponds to the PC 102 and inputs summary sentences of the utterances in response to operations by the minutes clerk 108. An audio playback device 305 outputs the audio corresponding to the related speech section identified by the speech section specifying unit 205. Each summary sentence input through the text input device 306 is tagged with the time at which it was input. These components exchange data via a bus 307.

FIG. 4 is a flowchart showing the process of identifying the speech section related to a summary sentence of the minutes (the related speech section). The processing of the information processing system according to this embodiment is described below with reference to FIG. 4.

In step S401, the speech section specifying unit 205 detects, from the recorded audio, the portions in which someone actually spoke (hereinafter, utterance units) and identifies the time at which each utterance unit began (hereinafter, the utterance start time). The recorded audio is, for example, WAVE data sampled at 22.05 kHz and stored in PCM format in an external storage device. In this embodiment, utterance units are detected using a voice activity detection technique: the audio corresponding to each detected speech section is treated as one utterance unit. Voice activity detection is a known technique, so a detailed description is omitted.

The time of recording is written in the header of the WAVE data, and the speech section specifying unit 205 identifies the utterance start time from this information. The voice conversion unit 201 converts each utterance unit into text using a speech recognition technique. In this embodiment, speech recognition is performed by storing in the RAM 302, in advance, audio data of vocabulary related to various meetings as a model. Speech recognition is a known technique, so a detailed description is omitted. The text generated by the voice conversion unit 201 is an example of the first text data.
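As a rough illustration of step S401, the sketch below splits a recording into utterance units with a simple per-frame energy threshold and derives each utterance start time from the recording start time. The energy-threshold method, the function name `detect_utterances`, and all sample values are assumptions made for illustration; the description only states that a known voice activity detection technique is used.

```python
from datetime import datetime, timedelta


def detect_utterances(frame_energies, threshold, frame_sec, recording_start):
    """Return (start_time, end_time) for each run of frames at or above threshold."""
    runs = []
    run_start = None
    for i, energy in enumerate(frame_energies):
        if energy >= threshold and run_start is None:
            run_start = i                      # an utterance begins at frame i
        elif energy < threshold and run_start is not None:
            runs.append((run_start, i))        # the utterance ended before frame i
            run_start = None
    if run_start is not None:                  # recording ended mid-utterance
        runs.append((run_start, len(frame_energies)))
    return [
        (recording_start + timedelta(seconds=s * frame_sec),
         recording_start + timedelta(seconds=e * frame_sec))
        for s, e in runs
    ]


# Hypothetical data: one energy value per 1-second frame, recording started at 8:00:00.
recording_start = datetime(2011, 4, 21, 8, 0, 0)
utterances = detect_utterances([0, 0, 5, 6, 0, 7], threshold=1,
                               frame_sec=1.0, recording_start=recording_start)
```

With these made-up values, two utterance units are found, starting 2 seconds and 5 seconds after the recording start time.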

FIG. 5A is a diagram showing the content of the utterances in the audio recorded during the meeting. FIG. 5B is a diagram showing the result of identifying utterance units and utterance start times from the recorded audio in step S401. The "utterance unit" column of FIG. 5B lists the identified utterance units, the "utterance start time" column lists the start time of each utterance unit, and the "utterance content conversion result" column lists the result of converting each utterance unit into text. Current speech recognition technology is not accurate enough to convert speech into text perfectly, so recognition errors occur, as the "utterance content conversion result" column of FIG. 5B shows. Entries are left blank for audio that has no candidate text. In the following description, the result of converting an utterance unit into text may be called the utterance content conversion result.

In step S402, the speaker specifying unit 202 identifies the speaker of each utterance unit detected in step S401. In this embodiment, speakers are identified using a speaker recognition technique: the voice characteristics of the speakers 103 to 107 are stored in the RAM 302 in advance as models, and the speaker is recognized by comparing the audio features obtained from an utterance unit against the models. Speaker recognition is a known technique, so a detailed description is omitted. FIG. 5C is a diagram showing the result of identifying the speaker of each utterance unit in step S402; its "speaker" column lists the speakers identified in that step. The identification of the utterance start time in step S401 and of the speaker in step S402 are examples of processing by the first specifying means.

In step S403, the minutes clerk 108 enters summary sentences and speakers through the input unit 203. Step S403 is described in detail below with reference to FIG. 6, which is a flowchart showing the details of step S403.

In step S601, the text input unit 206 of the input unit 203 inputs, as text and in response to operations by the minutes clerk 108, the summary sentences of the utterances and their speakers as the minutes. FIG. 7(a) shows an example of minutes created by the minutes clerk 108 entering text. In this embodiment, the sentence following the symbol "・" is identified as a summary sentence, and the character string enclosed by the symbols "（" and "）" at the end of the summary sentence is identified as the speaker of that summary sentence. FIG. 7(b) shows the summary sentences and speakers identified from the created minutes. The text input unit 206 is an example of the first input means, and the text input by the text input unit 206 is the second text data.
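The parsing convention described for step S601 could be sketched as follows. This is only an illustration under assumptions: the regular expression, the function name `parse_minutes`, and the second sample line are hypothetical, and the description does not say how the actual system parses the minutes.

```python
import re

# A summary line starts with "・" and ends with the speaker in full-width parentheses.
SUMMARY_LINE = re.compile(r"^・(?P<summary>.+?)（(?P<speaker>[^（）]+)）\s*$")


def parse_minutes(text):
    """Return a list of (summary_sentence, speaker) pairs found in the minutes text."""
    entries = []
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            entries.append((match.group("summary"), match.group("speaker")))
    return entries


# The first summary line is sentence 701 from the description; the second is made up.
minutes = "議事録\n・会議で発表をする価値はある。（佐藤）\n・次回までに資料を準備する。（田中）"
entries = parse_minutes(minutes)
```

Lines that do not follow the convention, such as the title line in this sample, are simply skipped.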

In step S602, the stroke time holding unit 207 of the input unit 203 holds the stroke times of the summary sentence input in step S601. In this embodiment, the stroke time holding unit 207 treats each character of the summary sentence as one stroke unit and records the time at which input of that character began (the stroke time). FIG. 8 is a diagram showing an example in which the stroke time holding unit 207 holds the stroke times for the summary sentence "It is worth presenting at the meeting." shown at 701 in FIG. 7(b).

In step S603, the input time specifying unit 208 of the input unit 203 identifies, from the result of step S602, the time at which input of the summary sentence began (hereinafter, the summary input time). The stroke time of the first stroke unit of each summary sentence is its summary input time. For the summary sentence "It is worth presenting at the meeting.", the summary input time is 8:04:50, the stroke time of the stroke unit "会" (the first character of the original Japanese sentence). The "summary input time" column of FIG. 7(c) lists the summary input time identified for each summary sentence input in step S601. Thus, in step S403, when the minutes shown in FIG. 7(a) are input, for example, the summary input times, summary sentences, and speakers are identified as shown in FIG. 7(c). The input time specifying unit 208 is an example of the second specifying means.
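Steps S602 and S603 can be pictured with a minimal sketch. The data layout (a list of character and time pairs) and the function name `summary_input_time` are assumptions for illustration. Only the 8:04:50 stroke time of the first character comes from the description; the later stroke times are made up.

```python
from datetime import time


def summary_input_time(strokes):
    """Return the stroke time of the first stroke unit (character) of a summary sentence."""
    if not strokes:
        raise ValueError("summary sentence has no strokes")
    return strokes[0][1]


# Stroke units in input order; times after the first are hypothetical.
strokes = [("会", time(8, 4, 50)), ("議", time(8, 4, 51)), ("で", time(8, 4, 52))]
input_time = summary_input_time(strokes)
```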

In step S404, the matching part specifying unit 204 performs text matching between the text converted in step S401 and the summary sentences input in step S403, and identifies the matching parts. Text matching works as follows. First, the matching part specifying unit 204 performs morphological analysis on the summary sentences of FIG. 7(b). FIG. 9 is a diagram showing the result of morphological analysis of the summary sentence "It is worth presenting at the meeting." shown at 701 in FIG. 7(b). The summary sentence 701 is split into words 1 to 8, and the part of speech of each word is identified. The matching part specifying unit 204 then searches the utterance content conversion results of FIG. 5C for the words whose part of speech was identified as a noun: "meeting" (会議), "presentation" (発表), and "value" (価値). The locations where these words are found are the matching parts. As an alternative to text matching, a concept dictionary or the like may be used so that locations of words with similar meanings are treated as matching parts. FIG. 5D is a diagram showing the number of matching parts in each utterance unit.
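The counting part of step S404 might look like the sketch below. It assumes that the nouns have already been extracted from the summary sentence by some morphological analyzer (not shown), and that each occurrence of a noun counts as one matching part. The function name and the sample converted texts are hypothetical.

```python
def count_matching_parts(nouns, converted_text):
    """Count how many times the summary sentence's nouns occur in one utterance's text."""
    return sum(converted_text.count(noun) for noun in nouns)


# Nouns of summary sentence 701, as listed in the description.
nouns = ["会議", "発表", "価値"]

# Hypothetical utterance content conversion results.
counts = [
    count_matching_parts(nouns, "会議で発表する価値はあると思う"),
    count_matching_parts(nouns, "発表の準備は来週"),
    count_matching_parts(nouns, ""),  # blank result: no text candidate
]
```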

In step S405, the speech section specifying unit 205 uses the utterance start times, the summary input time, the matching parts, and the speaker information to identify the speech section related to the summary sentence input in step S403 (the related speech section). Step S405 is described in detail below with reference to FIG. 10, which is a flowchart showing the details of the process in step S405. The following example identifies the related speech section for the summary sentence 701 of FIG. 7(b). The speech section specifying unit 205 is an example of the third specifying means.

In step S1001, the speech section specifying unit 205 identifies the time information target section from the results of steps S401 and S403. Here, the utterance units that fall within a predetermined period (2 minutes in this example) before the summary input time of the summary sentence 701 of FIG. 7(b) (8:04:50) form the time information target section. That is, in FIG. 5C, of the utterance units 501 to 517, the utterance units 511 to 515, which lie between 8:02:50 and 8:04:50, form the time information target section. In step S1002, the speech section specifying unit 205 identifies the speaker information target section from the results of steps S402 and S403. Here, the utterance units spoken by the speaker entered for the summary sentence 701 (Sato) form the speaker information target section. That is, in FIG. 5C, of the utterance units 501 to 517, the utterance units 501, 505, 513, and 515 spoken by Sato form the speaker information target section.
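Steps S1001 and S1002 amount to two independent filters over the utterance units, which could be sketched as follows. The dictionary layout, the function names, the inclusive window boundaries, and the individual start times and non-Sato speakers are all assumptions chosen to be consistent with the example; only the 2-minute window, the 8:04:50 summary input time, and Sato's units 513 and 515 come from the description.

```python
from datetime import datetime, timedelta


def time_target_units(units, summary_time, window=timedelta(minutes=2)):
    """S1001: utterance units that start within `window` before the summary input time."""
    return [u["id"] for u in units
            if summary_time - window <= u["start"] <= summary_time]


def speaker_target_units(units, summary_speaker):
    """S1002: utterance units spoken by the speaker entered for the summary sentence."""
    return [u["id"] for u in units if u["speaker"] == summary_speaker]


day = (2011, 4, 21)  # hypothetical date
units = [
    {"id": 510, "start": datetime(*day, 8, 2, 30), "speaker": "Tanaka"},
    {"id": 511, "start": datetime(*day, 8, 3, 0), "speaker": "Suzuki"},
    {"id": 513, "start": datetime(*day, 8, 3, 40), "speaker": "Sato"},
    {"id": 515, "start": datetime(*day, 8, 4, 20), "speaker": "Sato"},
    {"id": 516, "start": datetime(*day, 8, 5, 0), "speaker": "Ito"},
]
summary_time = datetime(*day, 8, 4, 50)

time_ids = time_target_units(units, summary_time)
speaker_ids = speaker_target_units(units, "Sato")
```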

In step S1003, the speech section specifying unit 205 identifies the matching part information target section from the result of step S404. Here, the utterance units in which the number of matching parts is at or above a first threshold (2 in this example) form the matching part information target section. That is, in FIG. 5D, of the utterance units 501 to 517, the utterance units 503, 505, 507, 513, 515, and 516, whose number of matching parts is at or above the first threshold, are identified as the matching part information target section.
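The threshold test of step S1003 is a one-line filter. In the sketch below, the function name and the specific counts are hypothetical. The description only states that units 513, 515, and 516 reach the first threshold of 2 and, in a later embodiment, that unit 514 has one matching part.

```python
def matching_target_units(match_counts, threshold=2):
    """S1003: utterance units whose number of matching parts is at or above the threshold."""
    return sorted(uid for uid, n in match_counts.items() if n >= threshold)


# Hypothetical counts for a few utterance units (unit id -> number of matching parts).
match_counts = {512: 0, 513: 2, 514: 1, 515: 3, 516: 2}
matching_ids = matching_target_units(match_counts)
```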

In step S1004, the speech section specifying unit 205 identifies the related speech section from the results of steps S1001 to S1003. Here, the utterance units that belong to all three of the time information target section, the speaker information target section, and the matching part information target section (their logical AND) form the related speech section. That is, in step S405, the utterance units 513 and 515 are identified as the related speech section.
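The AND combination in step S1004 corresponds to a set intersection. The sketch below plugs in the three target sections stated in the example for summary sentence 701; the function name `related_units` is a hypothetical label.

```python
def related_units(time_units, speaker_units, matching_units):
    """S1004: utterance units that belong to all three target sections."""
    return sorted(set(time_units) & set(speaker_units) & set(matching_units))


time_units = range(511, 516)                     # S1001: units 511 to 515
speaker_units = [501, 505, 513, 515]             # S1002: units spoken by Sato
matching_units = [503, 505, 507, 513, 515, 516]  # S1003: at least 2 matching parts

related = related_units(time_units, speaker_units, matching_units)
```

Unit 505 satisfies the speaker and matching conditions but falls outside the time window, and unit 516 satisfies only the matching condition, so neither is selected.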

Thus, through the processing of steps S401 to S405, using time information and speaker information in addition to keyword matching makes it possible to identify, with high accuracy, the speech section related to a summary sentence in the minutes. One concrete use of the information processing system is to check the details of a summary sentence written in the minutes after the meeting by playing back the audio. For example, in FIG. 11, when the minutes are displayed on a screen 1102 of a PC 1101 and a location 1104 corresponding to the summary sentence 701 of FIG. 7 is clicked with a mouse 1103, the utterance units 513 and 515 are played back. Although the information processing system has been described using a meeting as an example, it can be applied to any setting, such as a lecture or a class, as long as the speakers' utterances can be recorded.

Next, a second embodiment of the present invention will be described. In the second embodiment, the process shown in FIG. 12 is executed after step S1004 of FIG. 10 to determine whether an utterance unit adjacent to the related speech section should be included in the related speech section. The process shown in FIG. 12 is described below. The configuration of the information processing system according to the second embodiment is the same as that of the first embodiment, so the same reference numerals are used.

In step S1201, the speech section specifying unit 205 determines whether, in an utterance unit adjacent to the identified related speech section (hereinafter, an adjacent utterance unit), the number of matching parts obtained in step S404 is at or above a set second threshold (2 in this example). If the number of matching parts is at or above the second threshold, the process proceeds to step S1202. If it is below the second threshold, the process ends. In step S1202, the speech section specifying unit 205 includes the adjacent utterance unit in the identified related speech section. In this embodiment, the adjacent utterance unit 516 is included in the related speech section if its number of matching parts is 2 or more. Consequently, an adjacent utterance unit that was not identified as part of the related speech section in step S1004 is still included if its number of matching parts is at or above the second threshold, since such a unit is likely to be highly relevant.
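The adjacency check of the second embodiment could be sketched as follows. This is a single-pass illustration under assumptions: "adjacent" is taken to mean the immediately preceding or following utterance unit, newly added units are not themselves expanded, and the match counts are hypothetical apart from unit 516 reaching the threshold of 2.

```python
def expand_with_adjacent(ordered_ids, related, match_counts, threshold=2):
    """S1201-S1202: add neighbors of related units whose match count meets the threshold."""
    related = set(related)
    expanded = set(related)
    for i, uid in enumerate(ordered_ids):
        if uid not in related:
            continue
        for j in (i - 1, i + 1):               # previous and next utterance unit
            if 0 <= j < len(ordered_ids):
                neighbor = ordered_ids[j]
                if match_counts.get(neighbor, 0) >= threshold:
                    expanded.add(neighbor)
    return sorted(expanded)


ordered_ids = [511, 512, 513, 514, 515, 516, 517]
# Hypothetical counts (unit id -> number of matching parts).
match_counts = {511: 0, 512: 1, 513: 2, 514: 1, 515: 3, 516: 2, 517: 0}

expanded = expand_with_adjacent(ordered_ids, [513, 515], match_counts)
```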

Next, a third embodiment of the present invention will be described. In the third embodiment, when step S405 has identified a plurality of related speech sections for the summary sentence 701 of FIG. 7(b), the process shown in FIG. 13 is executed after step S1004. The process shown in FIG. 13 is described below. The configuration of the information processing system according to the third embodiment is the same as that of the first embodiment, so the same reference numerals are used.

In step S1301, the speech section specifying unit 205 determines whether a plurality of related speech sections have been identified. If so, the process proceeds to step S1302. If not, the process ends. For the summary sentence 701, the utterance units 513 and 515 have been identified as related speech sections, so the process proceeds to step S1302.

In step S1302, the speech section specifying unit 205 lowers the second threshold set for any adjacent utterance unit located between the identified related speech sections. Here, the second threshold for the adjacent utterance unit 514 is lowered from 2 to 1. In step S1303, the speech section specifying unit 205 determines whether, in each adjacent utterance unit, the number of matching parts obtained in step S404 is at or above the second threshold set in step S1302. If the number of matching parts is at or above the second threshold, the process proceeds to step S1304. If it is below the second threshold, the process ends. Because the adjacent utterance unit 514 lies between the identified related speech sections, its second threshold is set to 1. Its number of matching parts is 1, so the process proceeds to step S1304. For the adjacent utterance unit 512, the number of matching parts is below the second threshold, so the process ends. For the adjacent utterance unit 516, the number of matching parts is at or above the second threshold, so the process proceeds to step S1304.

In step S1304, the speech section specifying unit 205 includes the adjacent speech unit in the related speech sections specified in step S1004. As a result, an adjacent speech unit that lies between related speech sections is more likely to be included in the related speech sections, even if it was not specified as a related speech section in step S1004.
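The flow of steps S1301 to S1304 can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name, the representation of speech units as consecutive integers in chronological order, and the matching counts for units 512 and 516 are assumptions made for the example. Following the examples given for units 512 and 516, the sketch also checks adjacent units that do not lie between related sections against the unlowered second threshold.

```python
def extend_between_related(related, match_counts,
                           second_threshold=2, lowered_threshold=1):
    """Return the related speech units extended by qualifying adjacent units.

    related      : ids of units specified as related in step S1004
    match_counts : unit id -> number of matching locations from step S404
    Unit ids are assumed to be consecutive integers in chronological order.
    """
    related = set(related)
    if not related:
        return []
    result = set(related)
    has_multiple = len(related) > 1              # S1301: several related units?
    first, last = min(related), max(related)
    for unit in related:
        for neighbor in (unit - 1, unit + 1):    # adjacent speech units
            if neighbor in related or neighbor not in match_counts:
                continue
            # S1302: lower the threshold only for units between related units
            between = has_multiple and first < neighbor < last
            threshold = lowered_threshold if between else second_threshold
            if match_counts[neighbor] >= threshold:  # S1303
                result.add(neighbor)                 # S1304
    return sorted(result)


# The count of 1 for unit 514 comes from the description; the counts for
# units 512 and 516 are hypothetical values consistent with it.
counts = {512: 0, 514: 1, 516: 2}
print(extend_between_related([513, 515], counts))  # [513, 514, 515, 516]
```

With the default second threshold of 2, unit 514 would be rejected; it is picked up only because it sits between the related units 513 and 515 and is therefore checked against the lowered value of 1.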

Next, a fourth embodiment of the present invention will be described. In the fourth embodiment, when the summary sentence 701 of FIG. 7(b) input in step S403 has a plurality of speakers, the process shown in FIG. 14 is executed after step S1004 of FIG. 10. The process shown in FIG. 14 is described below. The configuration of the information processing system according to the fourth embodiment is the same as that of the first embodiment, so the same reference numerals as in the first embodiment are used.

In step S1401, the speech section specifying unit 205 determines whether the summary sentence input in step S403 has a plurality of speakers. If so, the process proceeds to step S1402; otherwise, the process ends. In step S1402, the speech section specifying unit 205 determines whether the speaker of an adjacent speech unit of a specified related speech section is one of the speakers of the summary sentence input in step S403. If so, the process proceeds to step S1403; otherwise, the process ends. In step S1403, the speech section specifying unit 205 includes the adjacent speech unit in the related speech sections specified in step S1004.
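Steps S1401 to S1403 can be sketched in the same spirit. Again this is a minimal sketch under stated assumptions: the function name and the integer unit ids are illustrative, and the speaker name "Tanaka" assigned to units 512 and 514 is hypothetical (the description only implies that those units were spoken by someone other than Sato or Suzuki).

```python
def extend_by_speaker(related, unit_speakers, summary_speakers):
    """Add adjacent units whose speaker is one of the summary's speakers.

    related          : ids of units specified as related in step S1004
    unit_speakers    : unit id -> speaker specified for that unit
    summary_speakers : speakers entered for the summary sentence
    """
    summary_speakers = set(summary_speakers)
    if len(summary_speakers) < 2:                # S1401: single speaker, stop
        return sorted(related)
    result = set(related)
    for unit in related:
        for neighbor in (unit - 1, unit + 1):    # adjacent speech units
            # S1402: is the neighbor's speaker listed for the summary?
            if unit_speakers.get(neighbor) in summary_speakers:
                result.add(neighbor)             # S1403
    return sorted(result)


# "Tanaka" is a hypothetical speaker for the units not spoken by Sato or Suzuki.
speakers = {512: "Tanaka", 513: "Sato", 514: "Tanaka",
            515: "Sato", 516: "Suzuki"}
print(extend_by_speaker([513, 515], speakers, ["Sato", "Suzuki"]))  # [513, 515, 516]
```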

For example, suppose that in the minutes shown in FIG. 7(a), the entry "- It is worth presenting at the conference. (Sato)" had instead been written as "- It is worth presenting at the conference. (Sato, Suzuki)". The speakers of the summary sentence 701 are then Sato and Suzuki. The character strings enclosed between "(" and ")" and separated by "," are specified as the speakers of the summary sentence. The speaker information target sections specified in step S1002 become speech units 501, 503, 505, 506, 508, 510, 513, 515, and 516, but the result of step S1004 (speech units 513 and 515) does not change. However, because the speaker of the adjacent speech unit 516 is Suzuki, the adjacent speech unit 516 is included in the related speech sections. Thus, even when an adjacent speech unit is not specified as a related speech section in step S1004, it is included in the related speech sections if its speaker is one entered by the minute-taker, since such a unit is likely to be relevant. This allows the speech sections related to the speakers entered by the minute-taker to be specified with high accuracy.
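The speaker notation described above could be parsed along the following lines. This is a hedged sketch: the function name is hypothetical, and it assumes that the speaker list is the last parenthesized group on the line and that both ASCII and full-width Japanese brackets and commas may appear.

```python
import re

# Trailing group in ASCII or full-width parentheses, e.g. "(Sato, Suzuki)".
_SPEAKERS = re.compile(r"[（(]([^（()）]*)[)）]\s*$")


def parse_summary_line(line):
    """Split a minutes entry into (summary text, list of speakers)."""
    match = _SPEAKERS.search(line)
    if match is None:
        return line.strip(), []          # no speaker notation found
    # Names are separated by "," or the Japanese comma "、".
    names = [n.strip() for n in re.split(r"[、,]", match.group(1))]
    speakers = [n for n in names if n]
    return line[:match.start()].strip(), speakers


text, speakers = parse_summary_line(
    "It is worth presenting at the conference. (Sato, Suzuki)")
print(text)      # It is worth presenting at the conference.
print(speakers)  # ['Sato', 'Suzuki']
```

A list of two or more names returned here corresponds to the case in which step S1401 proceeds to step S1402.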

The present invention can also be realized by executing the following process: software (a program) that implements the functions of the above-described embodiments is supplied to a system or apparatus via a network or any of various storage media, and a computer (or a CPU, MPU, or the like) of the system or apparatus reads out and executes the program.

201: speech conversion unit, 202: speaker specifying unit, 203: input unit, 204: matching location specifying unit, 205: speech section specifying unit, 206: text input unit, 207: stroke time holding unit, 208: input time specifying unit

Claims

An information processing apparatus comprising:
conversion means for converting each of a plurality of speech sections into a plurality of first text data, with each speech unit of speech data including a plurality of utterances by a plurality of speakers treated as a speech section;
first specifying means for specifying a speech start time and a speaker for each speech section;
first accepting means for accepting input of second text data representing a summary sentence of the speech unit, and information indicating a speaker corresponding to the second text data;
second specifying means for specifying an input time of the second text data;
matching means for performing text matching between each of the first text data and the second text data, and identifying a corresponding portion; and
third specifying means for specifying the speech section corresponding to the second text data based on the corresponding portion, the input time, the speech start time, the information indicating the speaker, and the speaker specified by the first specifying means.

The information processing apparatus according to claim 1, further comprising:
second accepting means for accepting designation of the summary sentence of the second text data; and
output means for outputting, when the designation is accepted, the speech data of the speech section specified by the third specifying means for the second text data.

The information processing apparatus according to claim 1, wherein the third specifying means targets speech sections for which a speech start time within a predetermined time from the input time has been specified, and specifies, from among the targets, the speech section corresponding to the second text data.

The information processing apparatus according to claim 1, wherein the third specifying means targets speech sections for which the speaker indicated by the information indicating the speaker has been specified, and specifies, from among the targets, the speech section corresponding to the second text data.

The information processing apparatus according to claim 1, wherein the third specifying means targets speech sections corresponding to first text data for which the number of corresponding portions identified by the matching means is equal to or greater than a first threshold, and specifies, from among the targets, the speech section corresponding to the second text data.

The information processing apparatus according to claim 1, wherein the third specifying means specifies, as the speech section corresponding to the second text data, a speech section for which a speech start time within a predetermined time from the input time has been specified, for which a speaker matching the information indicating the speaker has been specified, and which corresponds to first text data for which the number of corresponding portions is equal to or greater than a first threshold.

The information processing apparatus according to claim 1, wherein the third specifying means specifies the speech section and, when the number of corresponding portions between the second text data and the first text data corresponding to an adjacent speech section adjacent to the specified speech section is equal to or greater than a second threshold, includes the adjacent speech section in the speech section corresponding to the second text data.

The information processing apparatus according to claim 7, wherein, when a plurality of speech sections are specified for the second text data, the third specifying means treats as a processing target a speech section that is adjacent to at least one of the plurality of speech sections and is located between the plurality of speech sections, and includes the processing-target speech section in the speech section corresponding to the second text data when the number of corresponding portions between the second text data and the first text data corresponding to the processing-target speech section is equal to or greater than a third threshold that is smaller than the second threshold.

The information processing apparatus according to claim 1, wherein the first accepting means accepts input of the information indicating the speaker in which a plurality of speakers are indicated, and
the third specifying means specifies the speech section and, when the speaker specified for an adjacent speech section adjacent to the specified speech section is included in the information indicating the speaker, includes the adjacent speech section in the speech section corresponding to the second text data.

The information processing apparatus according to any one of claims 1 to 9, wherein the second specifying means specifies the input time based on information on a stroke time relating to the input of the first text data.

An information processing method executed by an information processing apparatus, the method comprising:
a conversion step of converting each of a plurality of speech sections into a plurality of first text data, with each speech unit of speech data including a plurality of utterances by a plurality of speakers treated as a speech section;
a first specifying step of specifying a speech start time and a speaker for each speech section;
a first accepting step of accepting input of second text data representing a summary sentence of the speech unit, and information indicating a speaker corresponding to the second text data;
a second specifying step of specifying an input time of the second text data;
a matching step of performing text matching between each of the first text data and the second text data, and identifying a corresponding portion; and
a third specifying step of specifying the speech section corresponding to the second text data based on the corresponding portion, the input time, the speech start time, the information indicating the speaker, and the speaker specified in the first specifying step.

A program for causing a computer to function as:
conversion means for converting each of a plurality of speech sections into a plurality of first text data, with each speech unit of speech data including a plurality of utterances by a plurality of speakers treated as a speech section;
first specifying means for specifying a speech start time and a speaker for each speech section;
first accepting means for accepting input of second text data representing a summary sentence of the speech unit, and information indicating a speaker corresponding to the second text data;
second specifying means for specifying an input time of the second text data;
matching means for performing text matching between each of the first text data and the second text data, and identifying a corresponding portion; and
third specifying means for specifying the speech section corresponding to the second text data based on the corresponding portion, the input time, the speech start time, the information indicating the speaker, and the speaker specified by the first specifying means.

A computer-readable recording medium storing the program according to claim 12.