JP2020074020A

JP2020074020A - Summary generation device, summary generating method and summary generation program

Info

Publication number: JP2020074020A
Application number: JP2020001449A
Authority: JP
Inventors: Mitsuo Nunome (布目 光生); Taira Ashikawa (芦川 平); Masayuki Ashikawa (芦川 将之); Takashi Masuko (益子 貴史)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2020-01-08
Filing date: 2020-01-08
Publication date: 2020-05-14
Anticipated expiration: 2036-03-17
Also published as: JP6818916B2

Abstract

PROBLEM TO BE SOLVED: To reduce advance preparation work. SOLUTION: A summary generation device includes a feature extraction unit, a segment candidate generation unit, and a structuring estimation unit. The feature extraction unit extracts feature information of words included in text information. The segment candidate generation unit generates, based on the extracted feature information, segment candidates, each of which is a constituent unit for display. The structuring estimation unit estimates structure information, ranging from global structure information to local structure information, based on the generated segment candidates and an estimation model for structuring. SELECTED DRAWING: Figure 3

Description

Embodiments of the present invention relate to a summary generation device, a summary generation method, and a summary generation program.

As the accuracy of voice recognition technology has improved, systems have been proposed that use voice recognition to create records of what is said at meetings and the like. In this context, there are technologies that support the creation of meeting minutes, which is time-consuming when done by hand. For example, one technique prepares and analyzes a minutes template and then creates minutes that follow that template.

Japanese Unexamined Patent Application Publication No. 2015-138103 (JP 2015-138103 A)

However, the conventional technology has the problem that the advance preparation for creating minutes is cumbersome. Specifically, because the conventional technology uses a minutes template prepared in advance, preparing to create minutes takes considerable effort.

The problem to be solved by the present invention is to provide a summary generation device, a summary generation method, and a summary generation program that can reduce the effort of advance preparation.

The summary generation device according to the embodiment includes a feature extraction unit, a segment candidate generation unit, and a structuring estimation unit. The feature extraction unit extracts feature information of words included in text information. The segment candidate generation unit generates, based on the extracted feature information, segment candidates, each of which is a constituent unit for display. The structuring estimation unit estimates structure information, ranging from global structure information to local structure information, based on the generated segment candidates and an estimation model for structuring.

FIG. 1 is a diagram showing the system configuration of a summary generation system according to the embodiment.
FIG. 2 is a block diagram showing the hardware configuration of the summary generation device according to the embodiment.
FIG. 3 is a block diagram showing the functional configuration of the summary generation device according to the embodiment.
FIG. 4 is a diagram showing a voice recognition processing result according to the embodiment.
FIG. 5 is a diagram showing a morphological analysis processing result according to the embodiment.
FIG. 6 is a diagram showing command expressions according to the embodiment.
FIG. 7 is a diagram showing part-of-speech property information and semantic classes according to the embodiment.
FIG. 8 is a diagram showing system-derived information according to the embodiment.
FIG. 9 is a diagram showing a segmentation result according to the embodiment.
FIG. 10 is a diagram explaining candidate extraction in structuring estimation according to the embodiment.
FIG. 11 is a diagram explaining display format candidates according to the embodiment.
FIG. 12 is a flowchart showing the flow of processing by the feature extraction unit according to the embodiment.
FIG. 13 is a flowchart showing the flow of processing by the segment candidate generation unit according to the embodiment.
FIG. 14 is a flowchart showing the flow of processing by the structuring estimation unit according to the embodiment.
FIG. 15 is a flowchart showing the flow of processing by the display format conversion unit according to the embodiment.

(Embodiment)
FIG. 1 is a diagram showing an example of the system configuration of a summary generation system 1 according to the embodiment. As shown in FIG. 1, the summary generation system 1 includes a summary generation device 100 and a terminal device 200. In the summary generation system 1, the devices can communicate with each other wirelessly or by wire. The summary generation system 1 may include a plurality of terminal devices 200. The summary generation system 1 creates, from voice data recorded at a meeting or the like, a summary document visualized as text with a format structure.

In the configuration described above, the terminal device 200 acquires voice data at a meeting or the like and transmits the acquired voice data to the summary generation device 100 via a network. The voice data is acquired from a microphone or the like connected to the terminal device 200. A meeting may use one microphone or a plurality of microphones. Because a meeting may be held across different sites, the summary generation system 1 may include a plurality of terminal devices 200. The terminal device 200 is an information device such as a PC (Personal Computer) or a tablet terminal.

The summary generation device 100 acquires voice data from the terminal device 200, detects a speaker's explicit summary instructions and expressions of structuring requests contained in the utterances, and estimates appropriate display units (segments). Then, in response to a termination instruction or the like from the speaker, the summary generation device 100 rearranges the segments according to their content, converts them into various display formats, and outputs the result. The summary generation device 100 is an information processing device such as a server device.

FIG. 2 is a block diagram showing an example of the hardware configuration of the summary generation device 100 according to the embodiment. As shown in FIG. 2, the summary generation device 100 includes a CPU (Central Processing Unit) 12, a ROM (Read Only Memory) 13, a RAM (Random Access Memory) 14, and a communication unit 15. These units are connected to one another by a bus 11.

The CPU 12 controls the overall operation of the summary generation device 100 by executing a program stored in the ROM 13 or the like, using the RAM 14 or the like as a work area. The RAM 14 temporarily stores information related to various processes and is used as a work area when a program stored in the ROM 13 or the like is executed. The ROM 13 stores a program for realizing the processing performed by the summary generation device 100. The communication unit 15 communicates with external devices such as the terminal device 200 via a wireless or wired network. The hardware configuration shown in FIG. 2 is an example; in addition to the above, the device may include a display unit that outputs processing results, an operation unit for inputting various kinds of information, and the like.

FIG. 3 is a block diagram showing an example of the functional configuration of the summary generation device 100 according to the embodiment. As shown in FIG. 3, the summary generation device 100 includes a voice recognition unit 110, a feature extraction unit 120, a segment candidate generation unit 130, a structuring estimation unit 140, an instruction unit 160, and a display format conversion unit 170. Some or all of these units may be realized by software (a program) or by hardware circuits. The summary generation device 100 also stores a structure estimation model 150 in the ROM 13 or the like.

The voice recognition unit 110 performs voice recognition processing on voice data. More specifically, the voice recognition unit 110 receives the voice data transmitted from the terminal device 200, performs voice recognition processing, and generates text information that includes the character data of each utterance and the time at which the utterance was made. FIG. 4 is a diagram showing an example of a voice recognition processing result according to the embodiment. As shown in FIG. 4, the text information resulting from voice recognition includes the character data of each utterance and the time at which it was uttered.
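As an illustration of the text information described above, the following minimal Python sketch pairs each utterance's text with its time. The `Utterance` class, the `to_text_information` helper, and the zero-padded "HH:MM:SS" timestamp format are assumptions made for this example; the embodiment does not specify a data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One recognition result: the time of the utterance and its text."""
    time: str  # assumed zero-padded "HH:MM:SS" string (hypothetical format)
    text: str


def to_text_information(results):
    """Wrap raw (time, text) pairs as Utterance records in chronological order.

    Sorting the tuples works here only because the assumed timestamp format
    is zero-padded, so string order matches time order.
    """
    return [Utterance(time=t, text=s) for t, s in sorted(results)]


if __name__ == "__main__":
    for utterance in to_text_information([("10:00:05", "以上です"),
                                          ("10:00:01", "TODOをまとめます")]):
        print(utterance.time, utterance.text)
```

Later stages (feature extraction, segmentation) could then attach metadata to each record rather than to raw strings.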

The voice recognition unit 110 also identifies utterance sections, silent sections, and the like as acoustic features of the voice data, and detects the duration of each section. Note that the voice recognition unit 110 need not be included in the summary generation device 100; the configuration may be such that the downstream feature extraction unit 120 operates on the results of the voice recognition processing and the acoustic feature extraction processing.

The feature extraction unit 120 extracts feature information included in the text information. More specifically, the feature extraction unit 120 applies morphological analysis to the text information generated by the voice recognition unit 110. FIG. 5 is a diagram showing an example of a morphological analysis processing result according to the embodiment. As shown in FIG. 5, morphological analysis divides the text into the smallest meaningful units of language. The feature extraction unit 120 then identifies the part-of-speech information in the text information and performs semantic class analysis. For a noun, more detailed information based on its property information (for example, that it is a person's name or a date) is extracted as feature information. In semantic class analysis, for example, the presence or absence of quantity or unit expressions and keywords such as person names, organization names, event names, and technical terms are extracted as feature information. Next, the feature extraction unit 120 determines logical elements in the text information, for example whether the text contains ordered bullet-list expressions or command expressions for structuring. If the text information contains logical elements, the feature extraction unit 120 attaches the corresponding metadata.

FIG. 6 is a diagram showing an example of command expressions according to the embodiment. As shown in FIG. 6, the command expressions include "まとめます" (conclusion), "TODO" (todo), "以上" (terminate), and "終わります" (terminate), each shown with its associated command in parentheses. FIG. 7 is a diagram showing an example of part-of-speech property information and semantic classes according to the embodiment. As shown in FIG. 7, the part-of-speech property information or semantic class is, for example, "DATE" for "10日" (the 10th) and "PERSON" for "西口" (the surname Nishiguchi).
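The lookups described for FIG. 6 and FIG. 7 can be sketched as simple dictionaries. This is a minimal sketch, assuming the input has already been tokenized by a morphological analyzer; the table contents come from the examples in the text, while the names `COMMAND_EXPRESSIONS`, `SEMANTIC_CLASSES`, and `extract_features` are hypothetical.

```python
# Hypothetical lookup tables modelled on the examples given for FIG. 6 and FIG. 7.
COMMAND_EXPRESSIONS = {
    "まとめます": "conclusion",
    "TODO": "todo",
    "以上": "terminate",
    "終わります": "terminate",
}
SEMANTIC_CLASSES = {
    "10日": "DATE",
    "西口": "PERSON",
}


def extract_features(tokens):
    """Attach a command label and a semantic class to each token, if known.

    `tokens` is assumed to be the output of a morphological analyzer;
    no analyzer is invoked here.
    """
    return [
        {
            "surface": token,
            "command": COMMAND_EXPRESSIONS.get(token),
            "semantic_class": SEMANTIC_CLASSES.get(token),
        }
        for token in tokens
    ]


if __name__ == "__main__":
    for feature in extract_features(["西口", "10日", "以上"]):
        print(feature)
```

A real system would presumably derive semantic classes from a named-entity recognizer rather than a fixed table; the dictionary only stands in for that step.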

After that, the feature extraction unit 120 determines segment labels for the text information. A segment label is a name representing the role of a segment (display unit). It is metadata assigned according to, for example, the semantic class or noun property information extracted in the preceding stage, whether the utterance text lacks such elements, and whether the text contains a structuring instruction command. A structuring instruction command is a command that signals the start of structuring, such as "箇条書きはじまり" ("bullet list starts"), "ここから表" ("table from here"), or "ここから表形式" ("table format from here"). The feature extraction unit 120 also attaches the utterance sections and silent sections detected by the voice recognition unit 110 as peripheral information.
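One way to picture segment label determination is a small rule-based function. The sketch below is an assumption-laden illustration: the label names "START", "END", and "TEXT", the function `segment_label`, and the use of plain substring matching are all hypothetical; only the command phrases themselves come from the text.

```python
# Command phrases taken from the examples in the text; the tuple names are hypothetical.
START_COMMANDS = ("箇条書きはじまり", "ここから表形式", "ここから表")
END_COMMANDS = ("以上", "終わります")


def segment_label(text, semantic_classes=()):
    """Return a label describing the role of one utterance.

    Substring matching is a deliberate simplification: a phrase such as
    "以上" can also occur in ordinary sentences, which this sketch ignores.
    """
    if any(command in text for command in START_COMMANDS):
        return "START"
    if any(command in text for command in END_COMMANDS):
        return "END"
    if semantic_classes:
        # Label by the semantic classes found, in a stable order.
        return "+".join(sorted(set(semantic_classes)))
    return "TEXT"


if __name__ == "__main__":
    print(segment_label("ここから表形式でお願いします"))
    print(segment_label("10日に西口さんが対応", ["PERSON", "DATE"]))
```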

Information derived from the summary generation system 1 can also be used as feature information. For example, as system-derived information, the feature extraction unit 120 detects a speaker ID based on the microphone or on the login user of the connected terminal device 200. If detailed meeting information is available, such as meeting information (meeting title, meeting time, participants, meeting room) that can be looked up through the meeting room usage time or a scheduler, or information on the individual speakers providing voice input during the meeting, the feature extraction unit 120 acquires it as feature information. FIG. 8 is a diagram showing an example of system-derived information according to the embodiment. As shown in FIG. 8, the system-derived information is, for example, "A" for "speaker ID" and "2015/10/23" for "meeting date and time".

The segment candidate generation unit 130 generates variations of candidates for the minimum constituent unit used in structuring. The candidates for the minimum constituent unit are character strings delimited by units such as, in descending order of granularity, speaker, paragraph, phrase, run of the same character type (such as kanji or katakana), semantic class, word, and part of speech. More specifically, the segment candidate generation unit 130 reads the text information generated by the voice recognition unit 110 and the feature information extracted by the feature extraction unit 120. The segment candidate generation unit 130 then detects the segment labels present in each piece of feature information. In segment label detection, for example, start instructions, end instructions, and other labels that can serve as clues for structuring are detected.

Next, the segment candidate generation unit 130 groups the feature information that has been read and accumulated so far. In grouping, for example, regular repetitions of similar elements and appearance patterns of different types of feature information are detected, and each unit of repetition is treated as one group. An example of similar elements is a set of three elements, such as "date, place, arbitrary text", that recurs regularly.
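The grouping of regularly repeating elements can be illustrated with a simple period search. This is a sketch under a strong assumption, namely that the whole sequence is an exact repetition of one unit; the embodiment does not state which detection method is used, and the names `find_repeating_unit` and `group_by_unit` are hypothetical.

```python
def find_repeating_unit(types):
    """Return the shortest prefix whose exact repetition reproduces `types`.

    Returns None when the sequence does not repeat at least twice.
    """
    n = len(types)
    for period in range(1, n // 2 + 1):
        if n % period == 0 and all(types[i] == types[i % period] for i in range(n)):
            return types[:period]
    return None


def group_by_unit(features):
    """Split (type, value) pairs into groups, one group per repetition."""
    unit = find_repeating_unit([feature_type for feature_type, _ in features])
    if unit is None:
        return [features]  # no regular repetition: keep everything together
    size = len(unit)
    return [features[i:i + size] for i in range(0, len(features), size)]


if __name__ == "__main__":
    features = [
        ("DATE", "10日"), ("PLACE", "会議室A"), ("TEXT", "進捗確認"),
        ("DATE", "12日"), ("PLACE", "会議室B"), ("TEXT", "レビュー"),
    ]
    for group in group_by_unit(features):
        print(group)
```

Real transcripts would rarely repeat this cleanly, so a practical detector would need to tolerate missing or extra elements; the exact-match version only shows the idea of treating one repetition as one group.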

If a segment label contains a structuring end instruction, the segment candidate generation unit 130 orders the feature information grouped up to that point. Examples of ordering methods include: defining an order over the types of feature information in advance and applying that fixed order; ordering by the character length (average character length) of the concrete values of each extracted feature; and ordering by the number of occurrences of a specific element (semantic class).
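Two of the ordering methods listed above can be sketched as sort keys. The sketch assumes grouped feature information is held as a dictionary from feature type to a non-empty list of values; that layout, the `PREDEFINED_ORDER` list, the ascending direction of the length-based sort, and the function names are all assumptions for illustration.

```python
# Hypothetical fixed ordering over feature types, defined in advance.
PREDEFINED_ORDER = ["DATE", "PERSON", "TEXT"]


def order_fixed(grouped):
    """Order feature types by the predefined list; unknown types go last."""
    rank = {name: index for index, name in enumerate(PREDEFINED_ORDER)}
    return sorted(grouped, key=lambda name: rank.get(name, len(rank)))


def order_by_average_length(grouped):
    """Order feature types by the average character length of their values.

    Shortest first is an arbitrary choice made for this sketch; each list
    of values is assumed to be non-empty.
    """
    def average_length(name):
        values = grouped[name]
        return sum(len(value) for value in values) / len(values)

    return sorted(grouped, key=average_length)


if __name__ == "__main__":
    grouped = {
        "TEXT": ["進捗を確認する", "資料をレビューする"],
        "PERSON": ["西口", "東"],
        "DATE": ["10日", "12日"],
    }
    print(order_fixed(grouped))
    print(order_by_average_length(grouped))
```

Ordering by the count of a specific semantic class would follow the same pattern with a different key function.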

FIG. 9 is a diagram showing an example of a segmentation result according to the embodiment. As shown in FIG. 9, segmentation produces text information in which the text at each time is annotated with the command expressions it contains, part-of-speech property information, semantic classes, system-derived information, and the like.

The structuring estimation unit 140 estimates structure information based on the segment information. More specifically, the structuring estimation unit 140 reads the segment information generated by the segment candidate generation unit 130 and then reads the structure estimation model 150. The structure estimation model is trained using, as learning data, examples of formats suitable for display and results that were edited and confirmed in the past. Given the combinations and patterns in which feature information appears, the structuring estimation unit 140 uses this model to present suitable structuring candidates in ranked order. In the initial presentation of structure information, the structuring result with the highest likelihood for the ordered segment pattern is presented.

Next, the structuring estimation unit 140 accepts a confirmation instruction from the user. Instructions from the user are received via the instruction unit 160. For example, if the user finds no problem with the currently presented candidate, that candidate is adopted as the confirmed structuring result. If no confirmation instruction is obtained from the user (that is, a request to present the next candidate is received), the next structuring result is presented. When presenting the next structuring result, the unit may change not only the combination of segments but also, going back a step, the way the segments themselves are extracted. The structuring result may be output from the terminal device 200 or from the summary generation device 100.
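The present-then-confirm interaction described above amounts to walking a ranked candidate list until the user accepts one. The following is a minimal sketch, assuming candidates arrive as (structure, score) pairs and that user confirmation is modelled by a callback; `choose_candidate` and `is_confirmed` are hypothetical names, and returning None when every candidate is rejected is an assumption, since the text does not cover that case.

```python
def choose_candidate(candidates, is_confirmed):
    """Present candidates from highest to lowest score until one is confirmed.

    `candidates` is an iterable of (structure, score) pairs and
    `is_confirmed` is a callback standing in for the user's decision.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    for structure, _score in ranked:
        if is_confirmed(structure):
            return structure
    return None  # assumption: nothing is confirmed if all candidates are rejected


if __name__ == "__main__":
    candidates = [("bullet list", 0.2), ("table", 0.7), ("by person", 0.1)]
    # Simulated user who rejects the table and accepts the bullet list.
    print(choose_candidate(candidates, lambda s: s == "bullet list"))
```

The same loop shape would fit the display format conversion stage, which also presents the next-best candidate when the user does not confirm.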

FIG. 10 is a diagram explaining an example of candidate extraction in structuring estimation according to the embodiment. As shown in FIG. 10, the structuring result derived from the segment information covers everything from global structure information to local structure information.

The display format conversion unit 170 converts the confirmed structuring result into a display format for the user to view. More specifically, the display format conversion unit 170 reads the structuring result confirmed by the structuring estimation unit 140 and then reads a display format conversion model. The display format conversion model describes definition patterns that specify the display format in which each structuring result is presented, and it can be written in CSS (Cascading Style Sheets), XSLT (XSL Transformations), or the like.

Next, the display format conversion unit 170 presents an initial conversion result based on the structuring result and the display format conversion model. If a confirmation instruction from the user is received via the instruction unit 160 in response to this presentation, the conversion result is output as a summary document. If no confirmation instruction is obtained from the user (that is, a request to present the next candidate is received), the conversion result with the next highest likelihood is presented. The conversion result may be output from the terminal device 200 or from the summary generation device 100.

FIG. 11 is a diagram explaining an example of display format candidates according to the embodiment. As shown in FIG. 11, when the structuring result is presented using one of its structures, the content may be displayed in date order, grouped by person name, or reduced to items and people only.
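The alternative display formats (by date, by person) can be thought of as different groupings of the same structured items. The sketch below renders a plain-text outline in Python; it does not use CSS or XSLT as the embodiment suggests, and the item fields `date`, `person`, and `task` as well as the `render` function are hypothetical.

```python
from collections import defaultdict


def render(items, key):
    """Group structured items by `key` and render them as an indented outline."""
    buckets = defaultdict(list)
    for item in items:
        buckets[item[key]].append(item)
    lines = []
    for heading in sorted(buckets):
        lines.append(heading)
        for item in buckets[heading]:
            lines.append("  - " + item["task"])
    return "\n".join(lines)


if __name__ == "__main__":
    items = [
        {"date": "2015/10/23", "person": "B", "task": "review"},
        {"date": "2015/10/20", "person": "A", "task": "draft"},
    ]
    print(render(items, "date"))    # date-ordered view
    print(render(items, "person"))  # per-person view
```

Switching the `key` argument plays the role of selecting a different display format candidate for the same confirmed structure.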

FIG. 12 is a flowchart showing an example of the flow of processing by the feature extraction unit 120 according to the embodiment. As shown in FIG. 12, the feature extraction unit 120 acquires the text information generated by the voice recognition unit 110 (step S101). The feature extraction unit 120 then applies morphological analysis to the acquired text information (step S102). Next, the feature extraction unit 120 identifies the part-of-speech information in the text information and performs semantic class analysis (step S103). After that, the feature extraction unit 120 determines logical elements in the text information (step S104).

The feature extraction unit 120 then determines segment labels for the text information (step S105). Next, the feature extraction unit 120 attaches the utterance sections and silent sections detected by the voice recognition unit 110 as peripheral information (step S106). After that, as system-derived information, the feature extraction unit 120 detects a speaker ID based on the microphone or on the login user of the terminal device 200 (step S107). The feature extraction unit 120 then detects detailed meeting information and the like managed by an external device (step S108).

FIG. 13 is a flowchart showing an example of the flow of processing by the segment candidate generation unit 130 according to the embodiment. As shown in FIG. 13, the segment candidate generation unit 130 reads the text information generated by the voice recognition unit 110 and the feature information extracted by the feature extraction unit 120 (step S201). The segment candidate generation unit 130 then reads the segment labels included in each piece of feature information (step S202). Next, the segment candidate generation unit 130 groups the feature information accumulated through the reading so far (step S203). After that, the segment candidate generation unit 130 orders the feature information grouped at this point (step S204).

The segment candidate generation unit 130 then determines whether a segment label contains a structuring end instruction (step S205). If a segment label contains a structuring end instruction (step S205: Yes), the segment candidate generation unit 130 orders the grouped feature information (step S206). If no segment label contains a structuring end instruction (step S205: No), the segment candidate generation unit 130 returns to step S201.
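The read-accumulate-check loop of steps S201 to S205 can be sketched as buffering records until an end label appears. This is an illustrative simplification: grouping and ordering (steps S203, S204, S206) are omitted, the `label` field and the "END" value are hypothetical, and the function emits every completed batch rather than stopping at the first one.

```python
def split_on_terminate(records):
    """Accumulate records and close a batch each time an END label is read.

    Returns (batches, pending), where `pending` holds records that have
    not yet been followed by an END label.
    """
    batches, buffer = [], []
    for record in records:              # corresponds to reading in S201/S202
        buffer.append(record)           # accumulation before grouping (S203)
        if record["label"] == "END":    # end-instruction check (S205)
            batches.append(buffer)
            buffer = []
    return batches, buffer


if __name__ == "__main__":
    records = [
        {"label": "START", "text": "ここから表"},
        {"label": "DATE", "text": "10日"},
        {"label": "END", "text": "以上"},
        {"label": "TEXT", "text": "次の議題"},
    ]
    batches, pending = split_on_terminate(records)
    print(len(batches), len(pending))
```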

FIG. 14 is a flowchart showing an example of the flow of processing by the structuring estimation unit 140 according to the embodiment. As shown in FIG. 14, the structuring estimation unit 140 reads the ordered segment information generated by the segment candidate generation unit 130 (step S301). The structuring estimation unit 140 then reads the structure estimation model 150 (step S302). Next, based on the read segment information and the structure estimation model, the structuring estimation unit 140 supplies the combinations and patterns in which feature information appears and presents initial structure information, that is, a structuring candidate (step S303).

After that, if the structuring estimation unit 140 receives a confirmation instruction from the user in response to the presented structure information (step S304: Yes), it assigns the presented candidate as the confirmed structure information (step S305). If no confirmation instruction is received from the user (that is, a request to present the next candidate is received) (step S304: No), the structuring estimation unit 140 presents the structure information candidate with the next highest score (step S306). After presenting the candidate, the process returns to step S304 and waits for a confirmation instruction from the user.

FIG. 15 is a flowchart showing an example of the flow of processing by the display format conversion unit 170 according to the embodiment. As shown in FIG. 15, the display format conversion unit 170 reads the structuring result confirmed by the structuring estimation unit 140 (step S401). The display format conversion unit 170 then reads the display format conversion model (step S402). Next, the display format conversion unit 170 presents an initial conversion result based on the structuring result and the display format conversion model (step S403).

After that, if the display format conversion unit 170 receives a confirmation instruction from the user in response to the presented conversion result (step S404: Yes), it outputs the conversion result as a summary document (step S405). If no confirmation instruction is received from the user (that is, a request to present the next candidate is received) (step S404: No), the display format conversion unit 170 presents the conversion result candidate with the next highest score (step S406). After presenting the candidate, the process returns to step S404 and waits for a confirmation instruction from the user.

According to the embodiment, segments are estimated from the result of voice recognition processing on voice data, based on the speaker's explicit instructions and on expressions that request structuring. The segments are then rearranged according to their content, converted into various display formats, and presented. This reduces the effort of advance preparation.

The summary generation device 100 according to the above embodiment can be realized, for example, by using a general-purpose computer as its basic hardware. The program to be executed has a module configuration that includes each of the functions described above. The program may be provided as a file in an installable or executable format recorded on a computer-readable recording medium such as a CD-ROM, CD-R, or DVD, or it may be provided pre-installed in a ROM or the like.

The embodiments described above are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. The embodiments can also be combined as appropriate to the extent that their contents do not contradict one another. The embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

100 summary generation device
110 voice recognition unit
120 feature extraction unit
130 segment candidate generation unit
140 structured estimation unit
150 structure estimation model
160 instruction unit
170 display format conversion unit

Claims

A summary generation device comprising:
a feature extraction unit that extracts feature information of words included in text information;
a segment candidate generation unit that generates, based on the extracted feature information, segment candidates that are constituent units for display;
a structured estimation unit that estimates, based on the generated segment candidates and an estimation model for structuring, a plurality of pieces of structure information, each being a result of structuring the text information into a mutually different structure; and
an output control unit that outputs, as a summary of the text information, the instructed structure information from among the plurality of pieces of structure information.

The summary generation device according to claim 1, further comprising a voice recognition unit that executes voice recognition processing on voice data to generate the text information,
wherein the feature extraction unit extracts feature information of words included in the generated text information.

The summary generation device according to claim 2, wherein the voice recognition unit generates the text information in which character information of an utterance is associated with the time of the utterance, and extracts acoustic features of the voice data, and
the feature extraction unit extracts the time for the character information and the acoustic features as peripheral information for the word.

The summary generation device according to claim 1, wherein the output control unit converts the instructed structure information into a display format for browsing and outputs the converted structure information.

The summary generation device according to claim 1, wherein the structured estimation unit presents, based on the estimation model, which is learning data based on past processing results, structure information having a higher likelihood with respect to the estimation model than the other structure information, and
the output control unit outputs the presented structure information as a summary of the text information when a confirmation instruction is accepted for the presented structure information.

The summary generation device according to claim 4, wherein the output control unit changes the display format according to a user instruction.

The summary generation device according to claim 1, wherein the structured estimation unit presents the plurality of pieces of structure information in order of likelihood with respect to the estimation model, and
the output control unit outputs the presented structure information as a summary of the text information when a confirmation instruction is accepted for the presented structure information.

A summary generation method executed by a summary generation device, the method comprising:
a step of extracting feature information of words included in text information;
a step of generating, based on the extracted feature information, segment candidates that are constituent units for display;
a structured estimation step of estimating, based on the generated segment candidates and an estimation model for structuring, a plurality of pieces of structure information, each being a result of structuring the text information into a mutually different structure; and
an output control step of outputting, as a summary of the text information, the instructed structure information from among the plurality of pieces of structure information.

A summary generation program that causes a summary generation device to execute:
a step of extracting feature information of words included in text information;
a step of generating, based on the extracted feature information, segment candidates that are constituent units for display;
a structured estimation step of estimating, based on the generated segment candidates and an estimation model for structuring, a plurality of pieces of structure information, each being a result of structuring the text information into a mutually different structure; and
an output control step of outputting, as a summary of the text information, the instructed structure information from among the plurality of pieces of structure information.