JP4600828B2

JP4600828B2 - Document association apparatus and document association method

Info

Publication number: JP4600828B2
Application number: JP2005517060A
Authority: JP
Inventors: Kyoji Hirata (平田 恭二)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-01-14
Filing date: 2005-01-14
Publication date: 2010-12-22
Anticipated expiration: 2025-01-14
Also published as: WO2005069171A1; JPWO2005069171A1

Description

The present invention relates to a document association apparatus and a document association method, and in particular to a document association apparatus and a document association method that derive the correspondence between content such as video or audio and document information related to that content.

Methods are known for automatically mapping document data onto the corresponding parts of an audio recording or an audio-accompanied video recording. For example, Japanese Patent Application Laid-Open No. 7-199379 proposes a method in which the speech in an audio recording or audio-accompanied video recording is converted into text by speech recognition, the text is compared with document information stored in order in a document storage device, and the two are regarded as identical when they contain the same sequence of characters. In this method, an automatic speech recognizer decodes the speech, and the decoded text is matched against the document information through the identification of similar words or clusters of words.

Japanese Patent Application Laid-Open No. 2000-270263 proposes a system for broadcast programs in which the announcement manuscript and the subtitle content are very similar. Speech recognition processing is performed for the announcement manuscript, and the correspondence between the recognition result and the subtitle text, arranged in order of presentation time, is derived so that the timing information of the start and end points is detected and recorded as synchronization points.

Furthermore, Japanese Patent Application Laid-Open No. 8-212190 proposes a system that, when associating scenario text with audio-accompanied moving images, predicts the silent sections that would occur if the scenario text were spoken, and associates speech with text by comparing the prediction with the silent sections of the audio signal in the moving images.

The first problem with these conventional document association methods, which associate content such as video or audio with a document, is that the accuracy of the correspondence between audio information and document data depends heavily on the accuracy of speech recognition. When speech recognition is not sufficiently accurate, the correspondence between the audio information and the document data cannot be derived accurately.

In the conventional methods described in the above-mentioned Japanese Patent Application Laid-Open Nos. 7-199379 and 2000-270263, speech is first converted into text by speech recognition, and the converted text is then synchronized with the document data. As a result, when the text output by speech recognition contains many errors, the correspondence itself contains many errors: no correspondence with the document data can be found, or the text is matched to an entirely different part of the document. In general, it is known that recognition accuracy drops markedly when background sound other than the speech is loud, for example when BGM is superimposed on the speech or when a conversation is recorded in a noisy environment such as outdoors. Even in ordinary conversation, there are many cases in which high recognition accuracy cannot be expected because of the positional relationship between the microphone and the speaker, or the speaker's way of speaking, conversation style, and characteristics. When the conversation is limited to a specific topic, recognition accuracy can be improved by measures such as selecting the most suitable recognition dictionary for the estimated topic. In many cases, however, the topic cannot be estimated in advance, and using the wrong dictionary lowers recognition accuracy further. When an audio recording or audio-accompanied video recording is associated with document information on the basis of such error-laden recognition results, association errors increase, and it becomes difficult to use the result for simultaneous text display or for cueing by keyword search.

The second problem with the conventional methods is that document information and audio information cannot be aligned correctly when the document is not a faithful reproduction of the speech but a brief summary of its content. For example, when the audio of a lecture is associated with explanatory materials or a summary written by the lecturer, the document contains no part that corresponds directly to the text generated from the audio, so the two cannot be aligned correctly.

The third problem with the conventional methods is that matching based on speech recognition operates on individual words. When the document content and the audio information do not match completely, the appearance of the same word elsewhere can shift the correspondence substantially.

As a related technique, Japanese Patent Application Laid-Open No. 2000-348064 (priority claim number: 09/288724, priority claim country: US) discloses a method and apparatus for retrieving audio information using content information and speaker information. The method retrieves audio information from one or more audio sources. It includes the steps of receiving a user query that specifies at least one content constraint and one speaker constraint, and comparing the user query with a content index and a speaker index of the audio sources to identify audio information that satisfies the query.

As a related technique, Japanese Patent Application Laid-Open No. 2002-189728 discloses a multimedia information editing apparatus, a method therefor, a recording medium, and a multimedia information distribution system. This multimedia information editing apparatus edits multimedia information and is characterized by comprising storage means, audio discrimination means, document conversion means, and multimedia structuring means. The storage means stores multimedia information such as audio and moving images. The audio discrimination means determines whether audio is attached to the multimedia information stored in the storage means. When the audio discrimination means finds attached audio, the document conversion means converts that audio information into document information. The multimedia structuring means performs linguistic analysis on the document produced by the document conversion means, and structures and associates the document with the multimedia information.

As a related technique, Japanese Patent Application Laid-Open No. 2002-236494 discloses a speech segment discrimination device, a speech recognition device, a program, and a recording medium. This speech segment discrimination device is characterized by comprising acoustic analysis means, standard pattern storage means, matching means, determination means, and speech segment discrimination means. The acoustic analysis means acoustically analyzes externally input speech at a predetermined period and obtains acoustic features from the analysis result. The standard pattern storage means stores standard patterns corresponding to single-speaker speech and to mixed speech of multiple speakers, on the premise that the speech of multiple speakers may be mixed in the input. The matching means matches the standard patterns stored in the standard pattern storage means against the acoustic features obtained by the acoustic analysis means. The determination means determines, at each predetermined period and on the basis of the matching result, which standard pattern the input speech resembles. The speech segment discrimination means determines the speech segments of each speaker on the basis of the determination results.

As a related technique, Japanese Patent Application Laid-Open No. 2002-366552 (priority claim number: 09/962659, priority claim country: US) discloses a method and system for searching recorded speech and retrieving relevant segments. This is a method for searching recorded speech in a database, and it includes the steps of: a) converting the recorded speech into text using a speech recognition system; b) creating a full-text index of the recorded speech using an information extender, the full-text index including a plurality of timestamps that point to the occurrences of words in the recorded speech; c) searching for text with a full-text server using the full-text index; and d) storing the searched text, the full-text index, and the recorded speech in the database. Specific content of the recorded speech is played back using the full-text index without listening to the entire recording.

As a related technique, Japanese Patent Application Laid-Open No. 11-242669 discloses a document processing apparatus. This document processing apparatus is characterized by comprising voice input means, extraction means, attribute generation means, document storage means, instruction means, output means, and attachment means. The voice input means inputs voice. The extraction means extracts, from the voice input by the voice input means, information for identifying the speaker. The attribute generation means generates speaker attribute information by comparing the extracted information with predetermined reference information. The document storage means stores a document. The instruction means indicates the position in the document to which the input voice is to be attached. The output means outputs the document. The attachment means stores in the document storage means a set of information consisting of the position in the document indicated by the instruction means, the input voice, and the speaker attribute information generated by the attribute generation means.

An object of the present invention is to provide a document association apparatus and a document association method that accurately associate significant sections defined in content such as audio and video with sections in a document.

Another object of the present invention is to provide a document association apparatus and a document association method that accurately associate significant sections in content with sections in a document without being affected by the condition of the content.

Another object of the present invention is to provide a document association apparatus and a document association method that accurately associate significant sections in content with sections in a document without being affected by the type of document.

These and other objects and advantages of the present invention can be readily confirmed from the following description and the accompanying drawings.

To solve the above problems, the document association method of the present invention comprises the steps of: (a) preparing content that includes at least one of audio information and video information in which a plurality of speakers appear as utterers, together with a document that describes the content; and (b) deriving the correspondence between the content and the document in units of speakers.

In the above document association method, step (b) comprises the steps of: (b1) dividing the content in units of speakers into a plurality of content sections; (b2) dividing the document in units of speakers into a plurality of document sections; and (b3) associating the plurality of content sections with the plurality of document sections.

In the above document association method, step (b2) includes the steps of: (b21) extracting from the content the points in time at which the utterer changes from one of the plurality of speakers to another; and (b22) dividing the content in units of speakers on the basis of the points in time at which the utterer changed.
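The division in step (b22) can be illustrated with a minimal Python sketch. It assumes that the speaker change times from step (b21) are already available as seconds from the start of the content; the function name `split_at_change_points` and the time representation are illustrative assumptions, not part of the disclosure.

```python
from typing import List, Tuple


def split_at_change_points(duration: float,
                           change_points: List[float]) -> List[Tuple[float, float]]:
    """Split [0, duration] into content sections at speaker change times.

    Hypothetical helper: times are seconds from the start of the content.
    """
    # Keep only change points strictly inside the content, sorted and unique.
    cuts = sorted({t for t in change_points if 0.0 < t < duration})
    bounds = [0.0] + cuts + [duration]
    # Each consecutive pair of boundaries is one speaker-unit content section.
    return list(zip(bounds[:-1], bounds[1:]))


sections = split_at_change_points(60.0, [12.5, 40.0])
```

With two change points inside a 60-second recording, the sketch yields three content sections, one per uninterrupted speaker turn.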

In the above document association method, step (b21) includes the step of: (b211) where the content is the audio information, extracting the change points of the utterer's voice from the audio information.

In the above document association method, step (b21) includes the step of: (b212) where the content is the video information, extracting the change points of the utterer's image from the video information.

In the above document association method, the content is audio-video information in which the audio information and the video information are synchronized.

In the above document association method, step (b21) includes the step of: (b213) performing a change point analysis of the sound features of the audio information and deriving the points in time at which the utterer changed.

In the above document association method, step (b21) includes the step of: (b214) performing a change point analysis of the visual features of the video information and deriving the points in time at which the utterer changed.

In the above document association method, step (b21) includes the step of: (b215) performing a change point analysis of the visual features of the video information and a change point analysis of the sound features of the audio information, integrating both results, and deriving the points in time at which the utterer changed.
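The text does not state how the two analysis results in step (b215) are integrated. One possible reading, shown here only as a hedged Python sketch, is to accept a change point when an audio candidate and a video candidate agree within a tolerance and to report their midpoint. The function name `integrate_change_points`, the tolerance value, and the agreement rule are all assumptions made for illustration.

```python
from typing import List


def integrate_change_points(audio: List[float], video: List[float],
                            tolerance: float = 0.5) -> List[float]:
    """Keep change points supported by both modalities (assumed rule)."""
    merged = []
    used = set()  # indices of video candidates already paired
    for a in sorted(audio):
        best = None
        for i, v in enumerate(video):
            if i in used or abs(v - a) > tolerance:
                continue
            # Prefer the closest unused video candidate.
            if best is None or abs(v - a) < abs(video[best] - a):
                best = i
        if best is not None:
            used.add(best)
            merged.append((a + video[best]) / 2.0)
    return merged


points = integrate_change_points([10.0, 25.0, 40.2], [10.4, 33.0, 40.0])
```

In this example the audio candidate at 25.0 s and the video candidate at 33.0 s have no counterpart and are dropped, so two integrated change points remain. A union of candidates, or a weighted vote, would be an equally plausible integration rule.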

In the above document association method, step (b) comprises the step of: (b4) analyzing the structure of the document and dividing the document in units of speakers.

To solve the above problems, the computer program product of the present invention has program code means that, when used on a computer, executes all the steps described in any one of the above items.

The computer program product having the above program code means is stored in computer-readable storage means.

To solve the above problems, the document association apparatus of the present invention comprises a content section extraction unit, a document section extraction unit, and a section correspondence derivation unit. For content that includes at least one of audio information and video information in which a plurality of speakers appear as utterers, the content section extraction unit divides the content in units of speakers and extracts a plurality of content sections. The document section extraction unit divides a document describing the content in units of speakers and extracts a plurality of document sections. The section correspondence derivation unit derives the correspondence between the plurality of content sections and the plurality of document sections.

In the above document association apparatus, the content is the audio information. The content section extraction unit analyzes the sound features of the audio information and extracts the plurality of content sections.

In the above document association apparatus, the content is the video information. The content section extraction unit analyzes the visual features of the video information and extracts the plurality of content sections.

In the above document association apparatus, the content is audio-video information in which the audio information and the video information are synchronized. The content section extraction unit integrates the result of analyzing the sound features of the audio information with the result of analyzing the visual features of the video information and extracts the plurality of content sections.

In the above document association apparatus, the content extraction unit includes an audio section extraction unit, a video section extraction unit, and an audio-video section integration unit. The audio section extraction unit analyzes the sound features of the audio information, divides the audio information in units of speakers, and extracts a plurality of audio sections. The video section extraction unit analyzes the visual features of the video information, divides the video information in units of speakers, and extracts a plurality of video sections. The audio-video section integration unit extracts the plurality of content sections on the basis of a plurality of pieces of audio section information relating to the audio sections and a plurality of pieces of video section information relating to the video sections.

In the above document association apparatus, the content section extraction unit extracts speaker change points, that is, the points in time in the content at which the utterer changes from one of the plurality of speakers to another, and thereby extracts the plurality of content sections.

In the above document association apparatus, the content includes the audio information. The content section extraction unit extracts the speaker change points on the basis of a change in the features of at least one item of prosodic information among the pitch, the speaking rate, and the loudness of the speech in the audio information.
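A prosody-based change point detector of the kind described above could be sketched as follows. This Python example assumes a per-frame pitch track in Hz has already been computed by some external analysis; it compares the mean pitch in a window before and after each frame and keeps local peaks of the difference. The function name `prosody_change_points`, the window size, and the threshold are illustrative assumptions, and speaking rate or loudness could be handled in the same way.

```python
from statistics import mean
from typing import List, Sequence


def prosody_change_points(pitch: Sequence[float], window: int = 3,
                          threshold: float = 30.0) -> List[int]:
    """Return frame indices where mean pitch shifts sharply (assumed rule)."""
    scores = {}
    for i in range(window, len(pitch) - window + 1):
        left = mean(pitch[i - window:i])    # frames just before frame i
        right = mean(pitch[i:i + window])   # frames starting at frame i
        scores[i] = abs(right - left)
    changes = []
    for i, s in scores.items():
        # Keep only local peaks of the difference that exceed the threshold.
        if s > threshold and s >= scores.get(i - 1, 0.0) and s > scores.get(i + 1, 0.0):
            changes.append(i)
    return changes


# Six frames around 120 Hz followed by six frames around 210 Hz.
frames = [120] * 6 + [210] * 6
changes = prosody_change_points(frames)
```

For this synthetic track the sketch reports a single change at frame 6, the first frame of the higher-pitched voice. Real pitch tracks are noisy, so smoothing and unvoiced-frame handling would be needed in practice.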

In the above document association apparatus, the content includes the audio information. The content section extraction unit extracts the speaker change points on the basis of a change in the conversation form in the audio information.

In the above document association apparatus, the content includes the video information. The content section extraction unit extracts the speaker change points on the basis of a change in the visual features of a person in the video information.

In the above document association apparatus, the content includes the video information. The content section extraction unit extracts the speaker change points on the basis of a change in the facial features of a person in the video information.

In the above document association apparatus, the content includes the video information. The content section extraction unit extracts the speaker change points on the basis of a change in the visual features of a person's clothing in the video information.

In the above document association apparatus, the document section extraction unit extracts the plurality of document sections on the basis of the format information of the document.

In the above document association apparatus, the document section extraction unit extracts the plurality of document sections on the basis of descriptions of the utterers written in the document.

In the above document association apparatus, the document section extraction unit extracts the plurality of document sections on the basis of the tag information of a structured document in the document.

In the above document association apparatus, the document section extraction unit extracts the plurality of document sections on the basis of a change in the conversation features in the document.
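As an illustration of document section extraction based on utterer descriptions, the following Python sketch splits a transcript whose turns begin with a speaker label followed by a colon. The `Name: text` convention, the 30-character label limit, and the function name `split_document_by_speaker` are assumptions made for the example; the text does not fix a document format.

```python
import re
from typing import List, Tuple

# Assumed convention: a turn starts with "Label:" (ASCII or full-width colon).
SPEAKER_LINE = re.compile(r"^\s*([^:：]{1,30})[:：]\s*(.*)$")


def split_document_by_speaker(text: str) -> List[Tuple[str, str]]:
    """Return (speaker, utterance text) pairs, one per document section."""
    sections: List[List[str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        m = SPEAKER_LINE.match(line)
        if m:
            # A new speaker label opens a new document section.
            sections.append([m.group(1).strip(), m.group(2).strip()])
        elif sections:
            # An unlabeled line continues the previous speaker's section.
            sections[-1][1] = (sections[-1][1] + " " + line.strip()).strip()
        else:
            # Text before any label is kept with an empty speaker name.
            sections.append(["", line.strip()])
    return [(speaker, body) for speaker, body in sections]


minutes = ("Chair: Let us begin.\n"
           "Tanaka: I have a question\n"
           "about the budget.\n"
           "\n"
           "Chair: Go ahead.")
doc_sections = split_document_by_speaker(minutes)
```

A continuation line that itself contains a colon (for example a time such as 10:30) would be misread as a new label by this simple pattern, which is one reason format information or structured-document tags may be preferable when they are available.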

In the above document association apparatus, the section correspondence derivation unit associates the plurality of content sections with the plurality of document sections on the basis of a comparison between the section lengths of the content sections and the document amounts of the document sections.
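One simple way to compare section lengths with document amounts is to normalize both to the range 0 to 1 and assign each content section to the document section that covers the same relative position. The Python sketch below does this using durations in seconds and character counts; both measures, and the function name `match_by_length`, are assumptions made for illustration rather than the method defined by the text.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def match_by_length(durations: Sequence[float],
                    char_counts: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair each content section index with a document section index."""
    total_d = float(sum(durations))
    total_c = float(sum(char_counts))
    # Normalized end position of each document section within the document.
    doc_ends = [c / total_c for c in accumulate(char_counts)]
    pairs = []
    elapsed = 0.0
    for i, d in enumerate(durations):
        # Normalized midpoint of content section i within the content.
        mid = (elapsed + d / 2.0) / total_d
        j = min(bisect_right(doc_ends, mid), len(char_counts) - 1)
        pairs.append((i, j))
        elapsed += d
    return pairs


pairs = match_by_length([10, 10, 40], [100, 200])
```

Here the two short content sections both fall inside the first third of the document and are paired with document section 0, while the long third section is paired with document section 1. The sketch assumes that speaking time is roughly proportional to the amount of text.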

In the above document association apparatus, the section correspondence derivation unit performs the association on the basis of the result of dynamic programming matching applied to the plurality of content sections and the plurality of document sections.
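Dynamic programming matching between the two section sequences can be sketched as a standard dynamic time warping alignment. The Python example below uses the difference between normalized section lengths as the local cost; that cost function and the name `dp_match` are assumptions for illustration, since the text does not specify the cost.

```python
from typing import List, Sequence, Tuple


def dp_match(content_lengths: Sequence[float],
             doc_lengths: Sequence[float]) -> List[Tuple[int, int]]:
    """Align two section sequences by dynamic programming (DTW-style)."""
    n, m = len(content_lengths), len(doc_lengths)
    total_c, total_d = float(sum(content_lengths)), float(sum(doc_lengths))
    a = [x / total_c for x in content_lengths]  # normalized content lengths
    b = [y / total_d for y in doc_lengths]      # normalized document amounts
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed moves: advance both, advance content only, advance document only.
            cost[i][j] = local + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Trace the cheapest path back from the last pair to the first.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


path = dp_match([10, 30, 20], [50, 150, 100])
```

Because the alignment path is monotonic, the order of sections is preserved, and one document section may be matched to several consecutive content sections or the reverse.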

In the above document association apparatus, the section correspondence derivation unit includes a content speaker identification unit, a document speaker information extraction unit, and a section matching unit. The content speaker identification unit identifies the utterer in at least one of the plurality of content sections. The document speaker information extraction unit identifies the utterer in at least one of the plurality of document sections and obtains speaker information as information about that utterer. The section matching unit matches the plurality of content sections with the plurality of document sections on the basis of the speaker information.

In the above document association apparatus, the content speaker identification unit includes a content feature extraction unit, a speaker information storage unit, and a feature matching identification unit. The content feature extraction unit extracts features from at least one of the plurality of content sections. The speaker information storage unit stores features in association with utterers. The feature matching identification unit identifies the utterer on the basis of a comparison between the stored features and the extracted features.
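The comparison between stored and extracted features can be illustrated with a nearest-neighbor lookup. In the Python sketch below, the speaker information storage unit is modeled as a dictionary from speaker name to a feature vector, and the extracted feature is assigned to the closest stored vector by Euclidean distance. The feature contents (mean pitch in Hz and speaking rate), the distance measure, and the name `identify_speaker` are assumptions for illustration.

```python
import math
from typing import Dict, Sequence


def identify_speaker(feature: Sequence[float],
                     speaker_db: Dict[str, Sequence[float]]) -> str:
    """Return the stored speaker whose feature vector is closest."""
    return min(speaker_db, key=lambda name: math.dist(feature, speaker_db[name]))


# Hypothetical stored features: (mean pitch in Hz, syllables per second).
speaker_db = {
    "Tanaka": [120.0, 4.5],
    "Suzuki": [210.0, 6.0],
}
who = identify_speaker([205.0, 5.8], speaker_db)
```

Because the two dimensions have different scales, a practical system would normalize the features or use a learned speaker model instead of raw Euclidean distance.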

In the above document association apparatus, the content speaker identification unit identifies the utterer on the basis of the features of at least one item of prosodic information among the pitch, the length, and the intensity of the voice in the audio information.

In the above document association apparatus, the content speaker identification unit identifies the utterer on the basis of features that represent the conversation form in the audio information.

In the above document association apparatus, the content speaker identification unit identifies the utterer on the basis of the visual features of a person in the video information.

In the above document association apparatus, the content speaker identification unit uses the facial features of the person as the visual features of the person.

In the above document association apparatus, the document speaker information extraction unit identifies the utterer on the basis of descriptions of the utterers written in the document.

In the above document association apparatus, the document speaker information extraction unit identifies the speaker on the basis of the metadata of a structured document in the document.

In the above document association apparatus, the section matching unit associates the plurality of content sections with the plurality of document sections so that the utterer in each content section matches the utterer in each document section.

In the above document association apparatus, the section matching unit associates the plurality of content sections with the plurality of document sections on the basis of the result of dynamic programming matching applied to the content sections and the document sections.

In the above document association apparatus, the content includes audio information. The document association apparatus further comprises a speech recognition unit that extracts the utterance content in the plurality of content sections and outputs utterance text information. The section correspondence derivation unit associates the plurality of content sections with the plurality of document sections on the basis of the similarity between the utterance text information and the document information of the document.

In the above document association apparatus, the section correspondence derivation unit aligns the utterance text information with the document information on the basis of the result of dynamic programming matching between the words that appear in the utterance text information and the words that appear in the document information.

In the above document association apparatus, the section correspondence derivation unit includes a basic word extraction unit and a basic word group similarity derivation unit. The basic word extraction unit extracts one or more first basic words used in each of the plurality of content sections in the utterance text information, and one or more second basic words used in each of the plurality of document sections. The basic word group similarity derivation unit measures the similarity between the first basic words and the second basic words. The section correspondence derivation unit derives the correspondence on the basis of that similarity.
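The similarity between two groups of basic words can be illustrated with a set-overlap measure. The Python sketch below treats lowercase words outside a small stop list as basic words and uses the Jaccard coefficient as the similarity. Both choices, and the names `basic_words` and `word_group_similarity`, are assumptions for illustration; for Japanese text a morphological analyzer would be needed to obtain base forms.

```python
import re
from typing import Set

# Assumed stop list; a real system would use a fuller list or part-of-speech filtering.
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "and", "is", "we", "in", "for"})


def basic_words(text: str) -> Set[str]:
    """Extract the set of basic words from English text (simplified)."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def word_group_similarity(first_words: Set[str], second_words: Set[str]) -> float:
    """Jaccard coefficient between two basic word groups."""
    if not first_words and not second_words:
        return 0.0
    return len(first_words & second_words) / len(first_words | second_words)


spoken = basic_words("We increase the budget for research")
written = basic_words("Research budget: increase approved")
other = basic_words("Schedule of the next meeting")
similarity = word_group_similarity(spoken, written)
```

A set-based measure of this kind tolerates recognition errors on individual words, because a content section and a document section can still score highly when only part of their vocabulary overlaps.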

上記の文書対応付け装置において、その区間対応関係導出部は、その類似度を、ダイナミックプログラミングマッチングにより対応付けることによって対応関係を導出する。 In the document association apparatus, the section correspondence relationship deriving unit derives the correspondence relationship by associating the similarity by dynamic programming matching.
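As a rough illustration of dynamic programming matching over section similarities, the following sketch finds a monotonic alignment path through a similarity matrix. It is a generic dynamic-time-warping style formulation, not the procedure claimed here: the cost `1 - similarity`, the three allowed moves, and the function name `dp_align` are assumptions.

```python
def dp_align(sim):
    """Align content sections (rows) with document sections (columns).

    sim[i][j] is a similarity in [0, 1]. The path starts at (0, 0), ends at
    the last cell, and never moves backwards in either sequence. It minimizes
    the summed cost 1 - sim along the path.
    """
    n, m = len(sim), len(sim[0])
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    cost[0][0] = 1.0 - sim[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))  # advance both
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))  # same document section
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))  # same content section
            best, prev = min(candidates, key=lambda c: c[0])
            cost[i][j] = best + (1.0 - sim[i][j])
            back[i][j] = prev
    # Trace back from the final cell to recover the alignment path.
    path, cell = [], (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return path[::-1]
```

Because the path may stay in the same column for several rows, several content sections can map to one document section, which matches the case where the document summarizes a longer exchange.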

The document association apparatus described above further includes a content input unit that inputs the content, a content storage unit that stores the content, a document input unit that inputs the document information, a document storage unit that stores the document, and an output unit that outputs information on the correspondence relationship.

According to the present invention, meaningful sections of content can be accurately associated with sections of a document even when speech recognition is not sufficiently accurate because of background music, noise, the speaker's speaking style, the sound collection environment, and the like. This is because content such as audio or video is matched with document sections on the basis of speaker units (the points at which the speaker changes), which are easier to obtain than speech recognition results. Detecting the point at which the speaker changes only requires recognizing a difference, rather than recognizing what the speaker is saying, so it is robust against noise and sound collection conditions. In addition, because the association focuses on the speaker rather than on the content of the speech, visual information can also be used; when speaker change points are extracted from visual information, the association does not depend on the sound collection conditions. Further, according to the present invention, the association can be performed even when the document to be associated does not faithfully represent the conversation in the audio or video. This is because matching is not performed at the word level: association over relatively long sections, one per speaker or topic, is sufficient, and the content of individual utterances need not be associated in detail.

FIG. 1 is a diagram showing the configuration of an embodiment of the document association apparatus of the present invention.
FIG. 2 is a block diagram showing an example of the configuration of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention.
FIG. 3 is a flowchart showing an example of the operation of the content section extraction means 5 in the embodiment of the document association method of the present invention.
FIG. 4 is a block diagram showing another example of the configuration of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention.
FIG. 5 is a flowchart showing another example of the operation of the content section extraction means 5 in the embodiment of the document association method of the present invention.
FIG. 6 is a block diagram showing still another example of the configuration of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention.
FIG. 7 is a flowchart showing still another example of the operation of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention.
FIG. 8 is a block diagram showing a further example of the configuration of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention.
FIG. 9 is a flowchart showing a further example of the operation of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention.
FIG. 10 is a flowchart showing an example of the operation of the document section extraction means 6 in the embodiment of the document association apparatus of the present invention.
FIGS. 11A to 11D are diagrams showing an example of a method of using document format information in the embodiment of the document association method of the present invention.
FIG. 11B is a diagram showing an example of a method of using document format information in the embodiment of the document association method of the present invention.
FIG. 11C is a diagram showing an example of a method of using document format information in the embodiment of the document association method of the present invention.
FIG. 11D is a diagram showing an example of a method of using document format information in the embodiment of the document association method of the present invention.
FIG. 12A is a diagram showing another example of a method of using document format information in the embodiment of the document association method of the present invention.
FIG. 12B is a diagram showing another example of a method of using document format information in the embodiment of the document association method of the present invention.
FIG. 12C is a diagram showing another example of a method of using document format information in the embodiment of the document association method of the present invention.
FIG. 13 is a diagram showing still another example of a method of using document format information in the embodiment of the document association method of the present invention.
FIG. 14 is a block diagram showing an example of the configuration of the section correspondence relationship deriving means 7 in the embodiment of the document association apparatus of the present invention.
FIG. 15 is a flowchart showing an example of the correspondence derivation method executed by the section correspondence relationship deriving means 7 in the embodiment of the document association method of the present invention.
FIG. 16A is a diagram showing the correspondence between content information and document information in the correspondence derivation method.
FIG. 16B is a diagram showing the correspondence between content information and document information in the correspondence derivation method.
FIG. 17 is a diagram explaining normalization in the correspondence derivation method.
FIG. 18A is a diagram showing the correspondence between content information and document information in the correspondence derivation method.
FIG. 18B is a diagram showing the correspondence between content information and document information in the correspondence derivation method.
FIG. 19 is a block diagram showing another example of the configuration of the section correspondence relationship deriving means 7 in the embodiment of the document association apparatus of the present invention.
FIG. 20 is a flowchart showing another example of the correspondence derivation method executed by the section correspondence relationship deriving means 7 in the embodiment of the document association method of the present invention.
FIG. 21 is a diagram showing the correspondence between content information and document information in the correspondence derivation method.
FIG. 22 is a diagram showing the correspondence between content information and document information in the correspondence derivation method.
FIG. 23 is a block diagram showing a further example of the configuration of the section correspondence relationship deriving means 7 in the embodiment of the document association apparatus of the present invention.
FIG. 24 is a block diagram showing an example of the configuration of the candidate text document corresponding unit 62.
FIG. 25 is a flowchart showing a further example of the correspondence derivation method executed by the section correspondence relationship deriving means 7 in the embodiment of the document association method of the present invention.
FIG. 26 is a diagram showing the correspondence between content information and document information in the correspondence derivation method.
FIG. 27 is a diagram showing the correspondence between content information and document information in the correspondence derivation method.

以下、本発明の文書対応付け装置、および文書対応付け方法の実施の形態について添付図面を参照して詳細に説明する。 Embodiments of a document association apparatus and a document association method of the present invention will be described below in detail with reference to the accompanying drawings.

The configuration of an embodiment of the document association apparatus of the present invention will be described.
FIG. 1 is a diagram showing the configuration of an embodiment of the document association apparatus of the present invention. The document association apparatus 10 includes content input means (content input unit) 1, document input means (document input unit) 2, content storage means (content storage unit) 3, document storage means (document storage unit) 4, content section extraction means (content section extraction unit) 5, document section extraction means (document section extraction unit) 6, section correspondence relationship deriving means (section correspondence relationship deriving unit) 7, and output means (output unit) 8. The content input unit 1 inputs content including information (data) such as audio and video. The document input unit 2 inputs a document related to the content. The content storage unit 3 stores the content obtained from the content input unit 1. The document storage unit 4 stores the document obtained from the document input unit 2. The content section extraction means 5 extracts single speaker sections from the content. The document section extraction means 6 extracts single speaker sections from the document. The section correspondence relationship deriving unit 7 derives the correspondence relationship between the content sections extracted by the content section extraction means 5 and the document sections extracted by the document section extraction means 6. The output unit 8 outputs the correspondence relationship derived by the section correspondence relationship deriving unit 7.

コンテンツ入力手段１は、対象となるコンテンツを入力するためのものである。コンテンツ入力手段１は、例えば、ビデオカメラやマイクロフォンである。ここで、コンテンツは、映像情報、音声情報または音声情報が付随した映像情報に例示される。コンテンツ入力手段１は、ビデオテープのような記録媒体に記録された映像情報または音声情報を読み込んで出力する映像再生機や録音再生機のようなものであってもよい。 The content input means 1 is for inputting target content. The content input unit 1 is, for example, a video camera or a microphone. Here, the content is exemplified by video information, audio information, or video information accompanied by audio information. The content input unit 1 may be a video player or a recording / reproducing device that reads and outputs video information or audio information recorded on a recording medium such as a video tape.

文書入力手段２は、コンテンツに関連する文書を入力するためのものである。文書入力部２は、例えば、キーボードやペン入力デバイス、スキャナのようなテキスト入力機器である。文書入力部２は、文書作成ソフトウェアを用いて作成した文書データを読み込む入力機器であってもよい。 The document input means 2 is for inputting a document related to the content. The document input unit 2 is a text input device such as a keyboard, a pen input device, or a scanner. The document input unit 2 may be an input device that reads document data created using document creation software.

コンテンツ記憶手段３は、例えば、コンテンツ入力手段１からのコンテンツを記録する内部記憶装置または外部記憶装置である。コンテンツ記憶手段３で用いられる記憶媒体は、ＲＡＭ、ＣＤ−ＲＯＭ、ＤＶＤ、フラッシュメモリ、ハードディスクに例示される。 The content storage unit 3 is, for example, an internal storage device or an external storage device that records content from the content input unit 1. The storage medium used in the content storage unit 3 is exemplified by a RAM, a CD-ROM, a DVD, a flash memory, and a hard disk.

文書記憶手段４は、文書入力手段２からの文書を記録する内部記憶装置または外部記憶装置である。文書記憶手段４で用いられる記録媒体は、ＲＡＭ、ＣＤ−ＲＯＭ、ＤＶＤ、フラッシュメモリ、ハードディスクに例示される。 The document storage unit 4 is an internal storage device or an external storage device that records a document from the document input unit 2. The recording medium used in the document storage unit 4 is exemplified by a RAM, a CD-ROM, a DVD, a flash memory, and a hard disk.

コンテンツ区間抽出手段５は、コンテンツ記憶手段３に記憶されたコンテンツ（情報）を話者毎に区間分割し、単一話者によるコンテンツ区間の抽出を行う。単一話者によるコンテンツ区間（以下、「単一話者区間」ともいう）は、話者が交替した時点から次に話者が交替するまでの区間である。単一話者区間は、区間内では発話者が単一でありかつ隣接する区間での発話者が異なるように抽出される。コンテンツ区間抽出手段５が抽出する単一話者区間は、誤りを含まないことが望ましいが、コンテンツ区間抽出の自動化を行ったために誤りを含んでしまっても構わない。 The content section extraction means 5 divides the content (information) stored in the content storage means 3 into sections for each speaker, and extracts a content section by a single speaker. A content section by a single speaker (hereinafter, also referred to as “single speaker section”) is a section from when the speaker changes until the next speaker changes. The single speaker section is extracted so that there is a single speaker in the section and the speakers in the adjacent sections are different. The single speaker section extracted by the content section extraction unit 5 preferably does not include an error, but may include an error because the content section extraction is automated.

The document section extraction means 6 extracts, from the document stored in the document storage unit 4, a section (document section) corresponding to each speaker. Each extracted document section describes the document information corresponding to the utterance of a single speaker. The document section extraction means 6 extracts the document sections by using, for example, a method that uses the format information of the document, a method that uses descriptions of the speaker written in the document, or a method that uses metadata in a structured document.
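Of the three approaches listed, the one that relies on speaker descriptions written in the document can be sketched as follows. This is only an illustration under assumed conventions: it supposes that each utterance starts with a `Name:` label at the beginning of a line, and the pattern `LABEL` and the function `extract_document_sections` are hypothetical.

```python
import re

# Assumed convention: a speaker label of up to 30 characters followed by a colon
# (ASCII or full-width) at the start of a line.
LABEL = re.compile(r"^\s*([^:：]{1,30})[:：]\s*(.*)$")


def extract_document_sections(text):
    """Split a transcript-like document into (speaker, text) sections.

    Consecutive lines by the same speaker are merged, and unlabeled lines are
    treated as a continuation of the current section. Unlabeled lines that
    appear before the first label are ignored in this sketch.
    """
    sections = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = LABEL.match(line)
        if match:
            speaker, body = match.group(1).strip(), match.group(2).strip()
            if sections and sections[-1][0] == speaker:
                sections[-1][1].append(body)
            else:
                sections.append((speaker, [body]))
        elif sections:
            sections[-1][1].append(line.strip())
    return [(speaker, " ".join(parts)) for speaker, parts in sections]
```

A real document would need a stricter label pattern, since any line containing a colon near its start is treated as a new speaker here.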

区間対応関係導出手段７は、コンテンツ区間抽出手段５が抽出したコンテンツ区間と文書区間抽出手段６が抽出した文書区間との対応関係を導出して、出力手段８に出力する。出力手段８は、その対応関係を表示装置、プリンタ、内部記憶装置、外部記憶装置などに表示、出力、格納する。 The section correspondence relationship deriving unit 7 derives a correspondence relationship between the content section extracted by the content section extracting unit 5 and the document section extracted by the document section extracting unit 6 and outputs it to the output unit 8. The output unit 8 displays, outputs, and stores the correspondence relationship on a display device, a printer, an internal storage device, an external storage device, or the like.

When the document association apparatus 10 is implemented on a computer, the content section extraction means 5, the document section extraction means 6, and the section correspondence relationship deriving means 7 can be realized by the arithmetic processing unit of the computer (for example, a CPU) together with programs that implement the functions of the means 5, 6, and 7.

図２は、本発明の文書対応付け装置の実施の形態におけるコンテンツ区間抽出手段５の構成の一例を示すブロック図である。コンテンツ区間抽出手段５は、音声分割部２１と、音声特徴量導出部２２と、一次記憶部２３と、音声特徴量整合部２４と、出力部２５とを含む。音声分割部２１は、コンテンツ記憶手段３から読み出されたコンテンツから無音区間を抽出して音声の第一の分割を行う。音声特徴量導出部２２は、第一の分割によって得られた第一の音声区間に関して音声特徴量を導出する。一次記憶部２３は、第一の音声区間の開始時間と音声特徴量を記憶する。音声特徴量整合部２４は、音声特徴量導出部２２が導出した音声特徴量と、一次記憶部２３に記憶されていた音声特徴量との比較を行う。出力部２５は、音声特徴量整合部２４の処理結果を区間対応関係導出手段７に出力する。 FIG. 2 is a block diagram showing an example of the configuration of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention. The content section extracting unit 5 includes an audio dividing unit 21, an audio feature amount deriving unit 22, a primary storage unit 23, an audio feature amount matching unit 24, and an output unit 25. The audio dividing unit 21 extracts a silent section from the content read from the content storage unit 3 and performs first audio division. The speech feature amount deriving unit 22 derives a speech feature amount for the first speech section obtained by the first division. The primary storage unit 23 stores the start time of the first speech section and the speech feature amount. The audio feature amount matching unit 24 compares the audio feature amount derived by the audio feature amount deriving unit 22 with the audio feature amount stored in the primary storage unit 23. The output unit 25 outputs the processing result of the voice feature amount matching unit 24 to the section correspondence relationship deriving unit 7.

An example of the operation of the content section extraction means 5 in the embodiment of the document association method of the present invention will be described. FIG. 3 is a flowchart showing an example of the operation of the content section extraction means 5 in the embodiment of the document association method of the present invention. FIG. 3 corresponds to the configuration shown in FIG. 2. Here, a case where the content is video that includes audio and audio analysis is used for content section extraction will be described as an example.

The audio dividing unit 21 performs a first division of the audio (step S101). That is, as the first division, the audio dividing unit 21 extracts the silent sections of the input video and detects the speech section between two silent sections. The silent sections are extracted by measuring the audio power of the audio track of the input video or of the input audio. The voice feature amount deriving unit 22 derives a voice feature amount for the first speech section obtained by the first division (step S102). Examples of the voice feature amount include the average fundamental frequency, the average utterance duration, and the average audio power of the speech in the section. When the voice feature amount deriving unit 22 has derived the voice feature amount, the primary storage unit 23 determines whether the start time and voice feature amount of a first speech section are already stored (step S103). If they are not stored, the primary storage unit 23 stores the start time and voice feature amount of the first speech section (step S104).
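The first division by audio power (step S101) might look like the following minimal sketch. The frame length, the power threshold, and the names `frame_power` and `voiced_segments` are assumptions; the text only states that silent sections are found by measuring audio power.

```python
def frame_power(samples, frame_len):
    """Mean power of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def voiced_segments(samples, sample_rate, frame_len=400, power_threshold=1e-4):
    """Return (start_sec, end_sec) of speech sections lying between silences.

    A frame counts as silent when its mean power is below power_threshold
    (an assumed, untuned value).
    """
    powers = frame_power(samples, frame_len)
    segments, start = [], None
    for idx, power in enumerate(powers):
        if power >= power_threshold and start is None:
            start = idx  # silence -> speech
        elif power < power_threshold and start is not None:
            segments.append((start * frame_len / sample_rate,
                             idx * frame_len / sample_rate))  # speech -> silence
            start = None
    if start is not None:  # audio ended while still in a speech section
        segments.append((start * frame_len / sample_rate,
                         len(powers) * frame_len / sample_rate))
    return segments
```

In practice a minimum silence duration would usually be required so that short pauses within one utterance do not split it; that refinement is omitted here.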

If the start time and voice feature amount of a first speech section are already stored, the voice feature amount matching unit 24 compares the new voice feature amount derived by the voice feature amount deriving unit 22 with the voice feature amount stored in the primary storage unit 23 (step S105). If the difference between the voice feature amounts of the two sections is smaller than a preset threshold (that is, they are similar), the voice feature amount matching unit 24 determines that the utterance by the same person is continuing (step S106: YES). If the audio data has not ended (step S109: NO), the audio dividing unit 21 extracts the audio information up to the next silent section (step S101).
If the voice feature amounts of the two sections differ (step S106: NO), the voice feature amount matching unit 24 determines that the speaker has changed. The output unit 25 outputs the section between the start time stored in the primary storage unit 23 and the start time of the current speech section as the utterance section of a single speaker (step S107). That is, the utterance section of a single speaker is detected by change point analysis of the audio features. At the same time, the primary storage unit 23 updates the stored voice feature amount and start time to the newly obtained ones (step S108). If the audio data has not ended (step S109: NO), the audio dividing unit 21 goes on to extract the next silent section of the audio (step S101).
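The loop of steps S103 to S108 can be sketched as below. This is a simplified reading of the flowchart rather than a definitive implementation: the Euclidean distance, the threshold value, and the function name `single_speaker_sections` are assumptions, and closing the final open section at the end of the data is also an assumption, since the steps above do not spell out how the last section is emitted.

```python
import math


def single_speaker_sections(segments, end_time, threshold):
    """Group speech segments into single speaker sections.

    segments: list of (start_time, feature_vector) in time order, one entry
              per first speech section (for example pitch, duration, power).
    Returns a list of (start_time, end_time) single speaker sections.
    """
    sections = []
    stored = None  # (start_time, features) held in the primary storage
    for start, features in segments:
        if stored is None:
            stored = (start, features)  # S103: nothing stored -> S104: store
            continue
        distance = math.dist(stored[1], features)  # S105: compare features
        if distance < threshold:  # S106: YES, same speaker continues
            continue
        sections.append((stored[0], start))  # S107: output the finished section
        stored = (start, features)  # S108: update stored values
    if stored is not None:
        sections.append((stored[0], end_time))  # assumption: close last section
    return sections
```

When several features with different units are combined into one vector, they would normally be normalized before the distance is computed; a single pitch value is enough to exercise the logic.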

The above processing continues until the audio data ends. Here, the average fundamental frequency, the average utterance duration, and the average audio power are used as the voice feature amounts in order to capture changes in prosodic features such as the pitch, length, and loudness of the voice (an example of a change in the voice feature amount). However, other measures that represent prosodic information may be used. Feature amounts of conversational style, such as turns of phrase and verbal habits, may also be used. In that case, it is sufficient to use a change in at least one prosodic feature.

また、ここでは、コンテンツ区間抽出手段５は、音声区間における音声特徴量の類似度をもとに発話者の変化点を検出して話者区間を特定する。話者の識別を行っているのではなく、話者の変化した点を検出していることで、話者識別や音声認識と比較して高精度に話者区間を検出できる。もちろん、コンテンツ区間抽出手段５は、各時間における音声特徴量から、発話者の特定を行い、話者識別結果から話者区間を抽出してもよい。 Further, here, the content section extraction means 5 identifies the speaker section by detecting the change point of the speaker based on the similarity of the speech feature amount in the speech section. By detecting the point where the speaker has changed rather than identifying the speaker, it is possible to detect the speaker section with higher accuracy than speaker identification or speech recognition. Of course, the content section extraction means 5 may identify the speaker from the voice feature amount at each time and extract the speaker section from the speaker identification result.

図４は、本発明の文書対応付け装置の実施の形態におけるコンテンツ区間抽出手段５の構成の他の一例を示すブロック図である。コンテンツ区間抽出手段５は、シーン分割部３１と、人物抽出および人物特徴量導出部３２と、一次記憶部３３と、人物特徴量整合部３４と、出力部３５とを含む。シーン分割部３１は、コンテンツ記憶手段３から読み出されたコンテンツからシーンチェンジを検出することによって連続したフレームで構成される第一の映像区間を抽出する。人物抽出および人物特徴量導出部３２は、第一の映像区間に関して人物特徴量を導出する。一次記憶部３３は、第一の映像区間の開始時間と人物特徴量を記憶する。人物特徴量整合部３４は、人物特徴量導出部３２が導出した人物特徴量と人物特徴量および開始時間記憶部３３に記憶されている人物特徴量との比較を行う。出力部３５は、人物特徴量整合部３４の処理結果を区間対応関係導出手段７に出力する。 FIG. 4 is a block diagram showing another example of the configuration of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention. The content section extracting unit 5 includes a scene dividing unit 31, a person extracting / person feature amount deriving unit 32, a primary storage unit 33, a person feature amount matching unit 34, and an output unit 35. The scene division unit 31 extracts a first video section composed of continuous frames by detecting a scene change from the content read from the content storage unit 3. The person extraction and person feature quantity deriving unit 32 derives a person feature quantity for the first video section. The primary storage unit 33 stores the start time of the first video section and the person feature amount. The person feature amount matching unit 34 compares the person feature amount derived by the person feature amount deriving unit 32 with the person feature amount stored in the person feature amount and start time storage unit 33. The output unit 35 outputs the processing result of the person feature amount matching unit 34 to the section correspondence relationship deriving unit 7.

Another example of the operation of the content section extraction means 5 in the embodiment of the document association method of the present invention will be described. FIG. 5 is a flowchart showing another example of the operation of the content section extraction means 5 in the embodiment of the document association method of the present invention. Here, video information is assumed as the input, and the speaker sections are derived on the assumption that the person who is speaking appears in the video.

The scene dividing unit 31 measures the difference between frames of the input video to detect the points at which the video information changes greatly and, based on the detection result, extracts a first video section composed of visually continuous frames (step S201). The person extraction and person feature amount deriving unit 32 extracts the person region shown in the video and applies video processing to the person region to derive a person feature amount (step S202). One example of a person region extraction method, applicable when the only moving objects in the video are people, is to take as the person region the area whose difference from the previous frame is equal to or greater than a specified value; this is widely used in the surveillance field as the background difference method. Examples of the person feature amount include a face feature amount that describes the shape of the face and the like in detail, and a low-order visual feature amount that describes the color distribution, pattern, and contour shape of the person as a whole. Using the color distribution and pattern makes it possible to take into account the characteristics of the clothes the person is wearing (the visual features of the person's clothing), so this approach is fully applicable to extracting speaker changes in simple settings such as meetings.
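A toy version of the frame-difference person region and a low-order visual feature is sketched below. It assumes grayscale frames stored as nested lists of integers in the range 0 to 255 and uses an intensity histogram as a stand-in for the color distribution mentioned above; the threshold, the bin count, and the names `moving_region` and `intensity_histogram` are illustrative assumptions.

```python
def moving_region(prev_frame, frame, diff_threshold=30):
    """Pixels whose difference from the previous frame is at least the threshold.

    Treating this region as the person region assumes that people are the
    only moving objects in the video.
    """
    return [
        (y, x)
        for y, row in enumerate(frame)
        for x, value in enumerate(row)
        if abs(value - prev_frame[y][x]) >= diff_threshold
    ]


def intensity_histogram(frame, region, bins=4, max_value=256):
    """Normalized intensity histogram over the given region of the frame."""
    hist = [0] * bins
    for y, x in region:
        hist[frame[y][x] * bins // max_value] += 1
    total = len(region)
    return [count / total for count in hist] if total else hist
```

Histograms from two video sections could then be compared against a threshold to decide whether the same person is still on screen, in the same way the audio features are compared.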

When the person extraction and person feature amount deriving unit 32 has derived the person feature amount, the person feature amount and start time storage unit 33 (the primary storage unit 33) determines whether the start time and person feature amount of a first video section are already stored (step S203). If they are not stored (step S203: NO), it stores the start time and person feature amount of the first video section (step S204). That is, the first video section is detected by change point analysis of the visual features in the video. If the start time and person feature amount of a first video section are already stored (step S203: YES), the person feature amount matching unit 34 compares the new person feature amount derived by the person extraction and person feature amount deriving unit 32 with the person feature amount stored in the person feature amount and start time storage unit 33 (step S205). If the person feature amounts of the two sections are similar to a degree that satisfies a preset threshold, the person feature amount matching unit 34 determines that the utterance by the same person is continuing (step S206: YES). If the video data has not ended (step S209: NO), the scene dividing unit 31 extracts the next point at which the video information changes greatly (step S201).

人物特徴量整合部３４は、両区間の人物特徴量が異なっている場合、映像中の発話者が変化したと判定する（ステップＳ２０６：ＮＯ）。出力部３５は、一次記憶部３３に記憶されている開始時間と、現在の映像区間の開始時間との間の区間を単一話者の発話区間として出力する（ステップＳ２０７）。同時に、一次記憶部３３は、人物特徴量と開始時間を新規に得られたものに更新する（ステップＳ２０８）。シーン分割部３１は、映像データが終了していない場合（ステップＳ２０９：ＮＯ）、次の映像情報が大きく変化した部分を抽出する（ステップＳ２０１）。 The person feature amount matching unit 34 determines that the speaker in the video has changed when the person feature amounts in the two sections are different (step S206: NO). The output unit 35 outputs a section between the start time stored in the primary storage unit 33 and the start time of the current video section as an utterance section of a single speaker (step S207). At the same time, the primary storage unit 33 updates the person feature amount and the start time to the newly obtained one (step S208). When the video data has not ended (step S209: NO), the scene division unit 31 extracts a portion where the next video information has changed significantly (step S201).

以上の処理が、映像データが終了するまで継続される。なお、映像特徴量としては、色分布、形状、エッジヒストグラムなどの低次の特徴量や、目のカテゴリ、目，鼻，口の配置等の高次の特徴量が例示される。また、特徴量として、適切な一つを採用してもよいし、複数を組み合わせてもよい。また、人物が大きく動かないという仮定を導入すれば、人物領域を抽出せず、背景の情報も含めて視覚特徴量とすることも可能である。 The above processing is continued until the video data is finished. Examples of the video feature amount include low-order feature amounts such as color distribution, shape, and edge histogram, and higher-order feature amounts such as eye category, eye, nose, and mouth arrangement. Further, as the feature amount, an appropriate one may be adopted, or a plurality may be combined. If the assumption that the person does not move greatly is introduced, it is possible to use the visual feature amount including the background information without extracting the person region.

FIG. 6 is a block diagram showing still another example of the configuration of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention. FIG. 6 shows content section extraction means 5 that performs both audio-based and video-based section extraction. The audio section extraction unit 81 includes, for example, the audio division unit 21, the audio feature amount derivation unit 22, the primary storage unit 23, the audio feature amount matching unit 24, and the output unit 25 shown in FIG. 2. The video section extraction unit 82 includes, for example, the scene division unit 31, the person extraction and person feature amount derivation unit 32, the primary storage unit 33, the person feature amount matching unit 34, and the output unit 35 shown in FIG. 4. The audio/video section extraction unit (audio/video section integration means) 83 determines the content sections from the output of the audio section extraction unit 81 and the output of the video section extraction unit 82. For example, the audio/video section extraction unit 83 determines the content sections by adopting only those time points at which the outputs of both the audio section extraction unit 81 and the video section extraction unit 82 indicate that the speaker has changed.

FIG. 7 is a flowchart showing still another example of the operation of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention. The audio section extraction unit 81 divides the input video into a plurality of audio sections based on its audio (step S121), for example by executing the operation shown in FIG. 3. The video section extraction unit 82 divides the input video into a plurality of video sections based on its images (step S122), for example by executing the operation shown in FIG. 5. Steps S121 and S122 may be performed simultaneously, or step S122 may be performed first. Next, the audio/video section extraction unit (audio/video section integration means) 83 determines the content sections based on the output of the audio section extraction unit 81 and the output of the video section extraction unit 82 (step S123). For example, the audio/video section extraction unit 83 determines the content sections by adopting only those time points at which the outputs of both the audio section extraction unit 81 and the video section extraction unit 82 indicate that the speaker has changed.
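Step S123 can be illustrated with a small sketch that keeps only the change points reported by both extractors. The `tolerance` parameter is an assumption added here, since the description does not say how closely the audio and video time points must agree; the function names are likewise hypothetical.

```python
from typing import List, Tuple

def integrate_change_points(
    audio_points: List[float],
    video_points: List[float],
    tolerance: float = 0.5,
) -> List[float]:
    """Keep audio change points that have a video change point nearby."""
    agreed = []
    for t_audio in sorted(audio_points):
        # A point is adopted only if the video analysis also reports a change
        if any(abs(t_audio - t_video) <= tolerance for t_video in video_points):
            agreed.append(t_audio)
    return agreed

def to_sections(
    change_points: List[float], start: float, end: float
) -> List[Tuple[float, float]]:
    """Turn change points into consecutive (start, end) content sections."""
    bounds = [start] + sorted(change_points) + [end]
    return list(zip(bounds[:-1], bounds[1:]))
```

With audio change points at 10.0, 25.0, and 40.2 seconds and video change points at 10.3, 33.0, and 40.0 seconds, only 10.0 and 40.2 are adopted, giving three content sections.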

FIG. 8 is a block diagram showing another example of the configuration of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention. FIG. 8 shows content section extraction means 5 that extracts single-speaker sections of the content using both audio analysis and video analysis.

The scene division unit 91 analyzes the feature amounts of the content and divides it into scenes. The scene division unit 91 may use audio feature amounts, like the audio division unit 21 shown in FIG. 2, or visual feature amounts, like the person extraction and person feature amount derivation unit 32 shown in FIG. 4. The audio feature amount and the person feature amount may also be combined (summed). That is, to derive the time points at which the speaker changes, change point analysis of the visual features in the video and change point analysis of the sound features in the audio may both be performed and their results integrated. The audio feature amount derivation unit 92 derives the audio feature amount of each extracted scene. The visual feature amount derivation unit 93 derives the visual feature amount of each extracted scene. If no audio feature amount and visual feature amount are yet stored, the primary storage unit 94 stores the extracted audio feature amount and visual feature amount together with their start time. If an audio feature amount and a visual feature amount are already stored, the audio feature amount matching unit 95 compares the audio feature amount input from the audio feature amount derivation unit 92 with the audio feature amount stored in the primary storage unit 94. Similarly, the visual feature amount matching unit 96 compares the visual feature amount input from the visual feature amount derivation unit 93 with the visual feature amount stored in the primary storage unit 94.

If the difference between the audio feature amount input from the audio feature amount derivation unit 92 and the audio feature amount stored in the primary storage unit 94 is larger than a predetermined threshold, or if the difference between the visual feature amount input from the visual feature amount derivation unit 93 and the visual feature amount stored in the primary storage unit 94 is larger than a predetermined threshold, the audio feature amount and visual feature amount stored in the primary storage unit 94 are cleared, and the current time and the start time are sent to the output unit 97. The output unit 97 outputs them to the section correspondence relationship deriving means 7. Alternatively, the current time and the start time may be sent to the output unit 97 only when both the audio feature difference and the visual feature difference are larger than their predetermined thresholds.

FIG. 9 is a flowchart showing another example of the operation of the content section extraction means 5 in the embodiment of the document association apparatus of the present invention.

The scene division unit 91 analyzes the feature amounts of the content and divides it into scenes (step S141). The scene division unit 91 may use audio feature amounts, like the audio division unit 21 shown in FIG. 2, or visual feature amounts, like the person extraction and person feature amount derivation unit 32 shown in FIG. 4. The audio feature amount and the person feature amount may also be combined (summed). That is, to derive the time points at which the speaker changes, change point analysis of the visual features in the video and change point analysis of the sound features in the audio may both be performed and their results integrated. The audio feature amount derivation unit 92 derives the audio feature amount of each extracted scene (step S142). The visual feature amount derivation unit 93 derives the visual feature amount of each extracted scene (step S143). Steps S142 and S143 may be performed simultaneously, or step S143 may be performed first. The primary storage unit 94 determines whether an audio feature amount and a visual feature amount are already stored (step S144). If they are not stored (step S144: NO), the primary storage unit 94 stores the extracted audio feature amount and visual feature amount together with their start time (step S145).

If an audio feature amount and a visual feature amount are already stored (step S144: YES), the audio feature amount matching unit 95 compares the audio feature amount input from the audio feature amount derivation unit 92 with the audio feature amount stored in the primary storage unit 94. Similarly, the visual feature amount matching unit 96 compares the visual feature amount input from the visual feature amount derivation unit 93 with the visual feature amount stored in the primary storage unit 94 (step S146).

If the difference between the audio feature amount input from the audio feature amount derivation unit 92 and the audio feature amount stored in the primary storage unit 94 is smaller than a predetermined threshold (that is, they are similar), and the difference between the visual feature amount input from the visual feature amount derivation unit 93 and the visual feature amount stored in the primary storage unit 94 is also smaller than a predetermined threshold (that is, they are similar), the audio feature amount derivation unit 92 and the visual feature amount derivation unit 93 determine that the utterance by the same person is continuing (step S147: YES). If the data has not ended (step S150: NO), the scene division unit 91 continues the scene division (step S141).

If the difference between the audio feature amount input from the audio feature amount derivation unit 92 and the audio feature amount stored in the primary storage unit 94 is larger than a predetermined threshold, or if the difference between the visual feature amount input from the visual feature amount derivation unit 93 and the visual feature amount stored in the primary storage unit 94 is larger than a predetermined threshold, the audio feature amount derivation unit 92 or the visual feature amount derivation unit 93 determines that the utterance by the same person has ended (step S147: NO). The primary storage unit 94 clears the stored audio feature amount and visual feature amount and sends the current time and the start time to the output unit 97 (step S148). The output unit 97 outputs them to the section correspondence relationship deriving means 7 (step S149).

Alternatively, it may be determined that the utterance by the same person has ended, and the current time and the start time may be sent to the output unit 97, only when both the difference between the audio feature amount input from the audio feature amount derivation unit 92 and the audio feature amount stored in the primary storage unit 94 and the difference between the visual feature amount input from the visual feature amount derivation unit 93 and the visual feature amount stored in the primary storage unit 94 are larger than their predetermined thresholds.
In that case, if either the audio feature difference or the visual feature difference is smaller than its predetermined threshold, it is determined that the utterance by the same person is continuing.

In this way, speaker sections that cannot be distinguished from the audio can be identified from the video, and speaker sections that are hard to detect from the video because visual features such as faces or clothes are similar can be extracted from the audio features. As a result, content sections can be detected with high accuracy.
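The two decision rules described for step S147 (a change in either feature, or a change in both features) can be written as one small function. The names and the `require_both` flag are hypothetical and serve only to illustrate the logic.

```python
def speaker_changed(
    audio_diff: float,
    visual_diff: float,
    audio_threshold: float,
    visual_threshold: float,
    require_both: bool = False,
) -> bool:
    """Decide whether the current single-speaker section has ended."""
    audio_changed = audio_diff > audio_threshold
    visual_changed = visual_diff > visual_threshold
    if require_both:
        # Variant: end the section only when both features changed
        return audio_changed and visual_changed
    # Default rule: a change in either feature ends the section
    return audio_changed or visual_changed
```

The default rule detects more boundaries, while the `require_both` variant is more conservative; which one suits a given kind of content is left open by the description.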

The document section extraction means 6 shown in FIG. 1 extracts, from the document information stored in the document storage means 4, the sections (document sections) corresponding to the individual speakers in the document. Each extracted document section contains the document information corresponding to the utterance of a single speaker. Methods for extracting speaker-specific document sections from the document information include, for example, using the format information of the document, using descriptions of the speakers written in the document, and using metadata in a structured document.

FIG. 10 is a flowchart showing an example of the operation of the document section extraction means 6 in the embodiment of the document association apparatus of the present invention. The document section extraction means 6 extracts information indicating document breaks (hereinafter "document delimiter information") from the document information stored in the document storage means 4 (step S161). Examples of document delimiter information include line breaks (blank lines) in the document, differences in character font, differences in character color, character layout, and written speaker names. Next, the document section extraction means 6 selects the most suitable document section extraction method based on the document delimiter information (step S162). The correspondence (table) between document delimiter information and document section extraction methods is stored in a storage unit (not shown). Methods for extracting speaker-specific document sections from the document information include, for example, using the format information of the document, using descriptions of the speakers written in the document, and using metadata in a structured document. The document section extraction means 6 then extracts the sections (document sections) corresponding to the individual speakers in the document. Each extracted document section contains the document information corresponding to the utterance of a single speaker. If the kind of document information is known in advance, steps S161 and S162 may be omitted and the document section extraction method corresponding to that document information may be executed directly.

Specific examples of the document section extraction methods executed by the document section extraction means 6 are described below.
FIGS. 11A to 11D are diagrams showing an example of a method that uses the format information of a document in the embodiment of the document association method of the present invention. In the example shown in FIG. 11A, a blank line is inserted between the remarks of different speakers, so the document section extraction means 6 can extract document sections based on the blank lines. The example shown in FIG. 11B is the text of a dialogue in which the host's remarks are set in italics, so the document section extraction means 6 can distinguish the guest's remarks from the host's remarks and extract the document sections. In the example shown in FIG. 11C, a different color is used for each speaker, a convention often used to distinguish multiple speakers, so the document section extraction means 6 can extract document sections using the color information. In the example shown in FIG. 11D, the remarks of each speaker are laid out in a separate location. When the layout is organized by speaker in this way, the document section extraction means 6 can extract sections presumed to belong to a single speaker even if the speakers' names are not written. The sections extracted here are only candidates. It is desirable that they be delimited at single-speaker boundaries, but they need not correspond strictly to the remarks of a single speaker. The methods described with reference to FIGS. 11A to 11D are examples of document structure analysis.

FIGS. 12A to 12C are diagrams showing another example of a method that uses the format information of a document in the embodiment of the document association method of the present invention. FIGS. 12A to 12C show methods of extracting document sections using descriptions of the speakers written in the document. In the example shown in FIG. 12A, the speaker is written before each remark in the form "Name:", and the document section extraction means 6 can extract document sections based on "Name:". In the example shown in FIG. 12B, expressions such as "Question" and "Answer" are used instead of names, and the document section extraction means 6 can extract document sections based on "Question" and "Answer". In the example shown in FIG. 12C, the speakers' names appear in a separate column, a layout widely used in drama scripts and meeting minutes. Using such information, the document section extraction means 6 can easily extract information on the speakers and the speaker sections from the document. The methods described with reference to FIGS. 12A to 12C are also examples of document structure analysis.
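The "Name:" convention of FIG. 12A lends itself to a short sketch. The regular expression, the 30-character limit on a speaker label, and the treatment of unlabeled lines as continuations are assumptions made for illustration; they are not taken from the description.

```python
import re
from typing import List, Tuple

# A speaker label is assumed to be a short run of text followed by an
# ASCII or full-width colon at the start of a line.
SPEAKER_LINE = re.compile(r"^\s*([^:：]{1,30})[:：]\s*(.*)$")

def extract_document_sections(text: str) -> List[Tuple[str, str]]:
    """Return (speaker, utterance) pairs in document order."""
    sections: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no text
        match = SPEAKER_LINE.match(line)
        if match:
            # A new "Name:" label starts a new document section
            sections.append((match.group(1).strip(), match.group(2).strip()))
        elif sections:
            # An unlabeled line continues the current section
            speaker, body = sections[-1]
            sections[-1] = (speaker, (body + " " + line.strip()).strip())
    return sections
```

A line that merely contains a colon near its start would be mistaken for a speaker label, which is one reason the description treats extracted sections as candidates rather than exact results.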

FIG. 13 is a diagram showing still another example of a method that uses the format information of a document in the embodiment of the document association method of the present invention. FIG. 13 shows a method of extracting document sections using tags in a structured document. The document section extraction means 6 can extract document sections using, for example, a "Speaker" tag. Document sections can also be extracted from format information or speaker descriptions in ways other than those illustrated in FIGS. 11A to 13. The document section extraction means 6 can also combine these methods to extract speaker sections with higher accuracy. Furthermore, as with audio, the document section extraction means 6 may derive document sections from changes in conversational characteristics, such as habitual phrases and turns of phrase, in the dialogue portions of the document. The method described with reference to FIG. 13 is also an example of document structure analysis.

Next, the section correspondence relationship deriving means 7 in the embodiment of the document association apparatus of the present invention is described. FIG. 14 is a block diagram showing an example of the configuration of the section correspondence relationship deriving means 7. In the example shown in FIG. 14, the section correspondence relationship deriving means 7 includes a content length normalization unit 41, a document length normalization unit 42, a section matching degree deriving unit (section matching means) 43, a section correspondence storage unit 44, a section integration unit 45, and an output unit 46. The content length normalization unit 41 normalizes the content length of each extracted section. The document length normalization unit 42 normalizes the length of each document section. The section matching degree deriving unit (section matching means) 43 derives the correspondence between content sections and document sections. The section correspondence storage unit 44 stores the correspondence for each section. The section integration unit 45 integrates adjacent sections so that content and document correspond one-to-one. The output unit 46 outputs the correspondence.

Next, the correspondence derivation method executed by the section correspondence relationship deriving means 7 in the embodiment of the document association method of the present invention is described. FIG. 15 is a flowchart showing an example of this correspondence derivation method. FIGS. 16A and 16B are diagrams showing the correspondence between content information and document information in the correspondence derivation method. FIG. 17 is a diagram explaining the normalization used in the correspondence derivation method. In the example shown in FIGS. 16A and 16B, for simplicity, the content section extraction means 5 is assumed to have extracted six speaker sections ([a] to [f]) and the document section extraction means 6 to have extracted seven document sections ([1] to [7]).

The content length normalization unit 41 normalizes the content length of each extracted section (step S301). When the content includes audio, as shown in FIG. 17(a), the silent parts in each section are first extracted and then removed from the section. The sections are then normalized so that the length of each section is proportional to the length of its audio part and the lengths sum to 1.0. This state is shown in FIG. 17(b). The content information shown in FIG. 16A(a) and FIG. 17(a) is assumed to include silent parts. Alternatively, as shown in FIG. 17(c), the sections may be normalized in proportion to their plain section lengths without removing the silent parts. When the content does not include audio, persons may be detected from the video information, the parts that contain no person removed from each section, and the sections normalized so that the length of each section is proportional to the length of the part in which a person appears and the lengths sum to 1.0. Here too, the sections may instead be normalized in proportion to their plain section lengths without removing the parts that contain no person.

The document length normalization unit 42 normalizes the length of each document section (step S302). For example, the length of each section is made proportional to the amount of text (or number of characters) it contains. An example of the result of normalizing both and placing them side by side is shown in FIG. 16A. FIG. 16A(a) shows the content information and FIG. 16A(b) shows the document information.
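Steps S301 and S302 can be sketched as below. The helper names are hypothetical, and the sketch assumes that the silent intervals have already been detected and do not overlap one another.

```python
from typing import List, Sequence, Tuple

def normalize(lengths: Sequence[float]) -> List[float]:
    """Scale lengths so that they sum to 1.0."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("total length must be positive")
    return [length / total for length in lengths]

def speech_lengths(
    sections: List[Tuple[float, float]],
    silences: List[Tuple[float, float]],
) -> List[float]:
    """Length of each (start, end) section with its silent parts removed."""
    result = []
    for start, end in sections:
        # Sum the overlap between this section and every silent interval
        silent = sum(
            max(0.0, min(end, s_end) - max(start, s_start))
            for s_start, s_end in silences
        )
        result.append((end - start) - silent)
    return result

# Content sections: normalize(speech_lengths(sections, silences))
# Document sections: normalize([len(text) for text in section_texts])
```

Passing an empty list of silences reproduces the variant of FIG. 17(c), in which the sections are normalized by their plain lengths.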

The section matching degree deriving unit 43 derives the individual correspondences between content sections and document sections (step S303). For example, it examines the overlap on the normalized axis and treats each section as corresponding to the section with which it overlaps most. In the example shown in FIG. 16A, the correspondences seen from the document information are [1]→[a], [2]→[a], [3]→[b], [4]→[c], [5]→[d], [6]→[f], [7]→[f]. Seen from the content information, they are [a]→[2], [b]→[3], [c]→[4], [d]→[5], [e]→[5], [f]→[7]. The section correspondence storage unit 44 stores the per-section correspondences derived by the section matching degree deriving unit 43.
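The largest-overlap rule of step S303 can be sketched as follows, assuming both sequences of section lengths have already been normalized to sum to 1.0. The function names are hypothetical, and ties are resolved in favor of the earlier section, a choice the description does not specify.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]

def to_intervals(normalized_lengths: Sequence[float]) -> List[Interval]:
    """Lay normalized lengths end to end on the axis from 0.0 to 1.0."""
    ends = list(accumulate(normalized_lengths))
    starts = [0.0] + ends[:-1]
    return list(zip(starts, ends))

def overlap(a: Interval, b: Interval) -> float:
    """Length of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def best_matches(sources: List[Interval], targets: List[Interval]) -> List[int]:
    """For each source interval, the index of the most overlapping target."""
    return [
        max(range(len(targets)), key=lambda j: overlap(src, targets[j]))
        for src in sources
    ]
```

Calling `best_matches` once with the document intervals as sources and once with the content intervals as sources gives the two directions of correspondence listed above.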

The section integration unit 45 determines whether the content and the document correspond completely one-to-one (step S304). If they do not (step S304: NO), the section integration unit 45 integrates adjacent sections, based on the per-section correspondences stored in the section correspondence storage unit 44, until the content and the document correspond completely one-to-one (steps S304 and S305). For example, a one-to-one correspondence between content and document can be obtained by repeatedly integrating adjacent sections that correspond to the same section (for instance, when [1]→[a] and [2]→[a], [1] and [2] are integrated). Once the content and the document correspond completely one-to-one (step S304: YES), the output unit 46 treats each section integrated by the section integration unit 45 as a single section and outputs the correspondence (step S306).

In the example shown in FIG. 16A, the above processing yields the correspondences and sections [[1][2]⇔[a]], [[3]⇔[b]], [[4]⇔[c]], [[5]⇔[d][e]], [[6][7]⇔[f]], as shown in FIG. 16B. In this way, the section correspondence relationship deriving means 7 performs the association by comparing the section lengths of the extracted content sections with the amounts of text in the extracted document sections.
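One way to arrive at the grouping of this example is to treat every correspondence as a link and collect the linked sections into groups. The sketch below uses a union-find structure for that purpose. This is a substitute technique chosen for brevity, not the repeated adjacent-section merging described for steps S304 and S305, and it assumes the correspondences never link non-adjacent sections.

```python
from typing import Dict, Hashable, List, Tuple

def merge_to_one_to_one(
    doc_to_content: Dict[Hashable, Hashable],
    content_to_doc: Dict[Hashable, Hashable],
) -> List[Tuple[list, list]]:
    """Group document and content sections that are linked to each other."""
    parent: Dict[tuple, tuple] = {}

    def find(node: tuple) -> tuple:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a: tuple, b: tuple) -> None:
        parent[find(a)] = find(b)

    for doc, content in doc_to_content.items():
        union(("doc", doc), ("content", content))
    for content, doc in content_to_doc.items():
        union(("content", content), ("doc", doc))

    groups: Dict[tuple, Tuple[list, list]] = {}
    for kind, label in list(parent):
        docs, contents = groups.setdefault(find((kind, label)), ([], []))
        (docs if kind == "doc" else contents).append(label)
    return sorted((sorted(d), sorted(c)) for d, c in groups.values())
```

Applied to the correspondences of FIG. 16A, with document sections numbered 1 to 7 and content sections labeled "a" to "f", the function returns the five groups listed above.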

The section correspondence relationship deriving means 7 can also derive the correspondence by taking into account the confidence of each content change. That is, in addition to the derived section information, the confidence of the change point extraction used for section extraction is input from the content section extraction means 5 as a score, and the correspondence is derived using that confidence. For example, where the confidence of a change is high, the section integration unit 45 does not integrate the two sections on either side of the high-confidence change point with each other, but instead integrates one of them with a different section. FIGS. 18A and 18B are diagrams showing the correspondence between content information and document information in this correspondence derivation method. In the example shown in FIG. 18A, when the confidence of the change [d]→[e] is 0.90 (high) and the confidence of the change [e]→[f] is 0.40 (low), the short section [e] is integrated with [f] to derive the correspondence. As a result, a correspondence that reflects the confidence can be derived, as shown in FIG. 18B.
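The choice made in this example can be expressed as a small helper. The function name and the tie-breaking rule (prefer the preceding section when both confidences are equal) are assumptions for illustration only.

```python
def choose_merge_neighbor(confidence_before: float, confidence_after: float) -> str:
    """Pick the neighbor a short section should be integrated with.

    confidence_before: confidence of the change point at the section start.
    confidence_after: confidence of the change point at the section end.
    The section is merged across the less certain boundary.
    """
    if confidence_after < confidence_before:
        return "following"
    return "preceding"

# Section [e]: change [d]->[e] has confidence 0.90, [e]->[f] has 0.40
print(choose_merge_neighbor(0.90, 0.40))  # prints "following"
```

Here [e] is merged with the following section [f], because the boundary between [e] and [f] is the less certain one.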

The same processing is also possible when the certainty of document section extraction is used instead of the certainty of content section extraction, or when certainty is used for both the content sections and the document sections.

FIG. 19 is a block diagram showing another example of the configuration of the section correspondence relationship deriving means 7 in the embodiment of the document association apparatus of the present invention. The section correspondence relationship deriving means 7 includes a speaker information storage unit 51, a speaker identification unit 52, a document speaker information extraction unit 53, and a section matching degree deriving unit 54. The speaker information storage unit 51 stores the correspondence between persons and the feature amounts used to identify them. The speaker identification unit 52 identifies speakers. The document speaker information extraction unit 53 extracts information about speakers from the document. The section matching degree deriving unit 54 matches sections based on the speaker information.

The speaker information storage unit 51 records in advance the correspondence between persons and the feature amounts used to identify them (including voice feature amounts or visual feature amounts). The feature amounts are set in advance for person identification. For example, when voice feature amounts are used, speaker-specific feature amounts that differ from speaker to speaker, such as the tone height or pitch of specific phonemes or words, are used. Information such as turns of phrase and verbal habits may also be used. When visual feature amounts are used, features of the speaker's face, such as the shapes and positional relationships of the eyes, nose, and mouth, are used. Known feature amounts used in face recognition or speaker identification techniques can also be used.

The speaker identification unit 52 receives from the content section extraction means 5 the content section information and the feature amounts included in each section, and compares them with the feature amounts stored in the speaker information storage unit 51 to identify the speaker in one or more sections. In this way, the speaker identification unit 52, serving as feature amount matching identification means, identifies the speaker by comparing the feature amounts stored in the speaker information storage unit 51 with the feature amounts extracted by the content feature amount extraction means (specifically, the content section extraction means 5). For example, the speaker identification unit 52 extracts the person in the speaker information storage unit 51 whose feature amounts are closest to the input feature amounts. When the people appearing are limited in advance, as in a conference or a television program, identification may take that constraint into account, or all candidate speakers may be listed. The document speaker information extraction unit 53 extracts information about speakers (speaker information) from the document by identifying the speaker in one or more document sections. The section matching degree deriving unit 54 matches sections based on the speaker information; that is, it associates speaker sections with document sections.
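Selecting the registered person whose feature amounts are closest to the input can be sketched as follows. The sketch assumes that feature amounts are fixed-length numeric vectors and that closeness is Euclidean distance; the text does not specify either. The names `rank_speakers`, `identify_speaker`, `registry`, and `allowed` are hypothetical.

```python
import math


def rank_speakers(segment_feature, registry, allowed=None):
    """List registered speakers from closest to farthest.

    registry: dict mapping a person's name to a stored feature vector
        (a stand-in for the speaker information storage unit 51).
    allowed: optional set of names, for cases where the people who can
        appear are known in advance (for example, a conference).
    """
    names = [name for name in registry if allowed is None or name in allowed]
    return sorted(names, key=lambda name: math.dist(segment_feature, registry[name]))


def identify_speaker(segment_feature, registry, allowed=None):
    """Return the closest registered speaker, or None if there is no candidate."""
    ranked = rank_speakers(segment_feature, registry, allowed)
    return ranked[0] if ranked else None
```

`rank_speakers` covers the case where all candidate speakers are listed, and `identify_speaker` covers the case where only the closest person is extracted.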

Next, another correspondence derivation method executed by the section correspondence relationship deriving means 7 in the embodiment of the document association method of the present invention will be described. FIG. 20 is a flowchart showing another example of the correspondence derivation method executed by the section correspondence relationship deriving means 7. FIGS. 21 and 22 are diagrams showing the correspondence between content information and document information in this correspondence derivation method. This example is effective when speaker information is described in the document and can be extracted, as shown in FIGS. 12A to 13.

The speaker identification unit 52 compares the content section information and the feature amounts included in each section, both input from the content section extraction means 5, with the feature amounts stored in the speaker information storage unit 51, and thereby identifies the speaker in one or more sections (the speaker sections) (step S321). Meanwhile, the document speaker information extraction unit 53 extracts information about speakers (speaker information) from the document by identifying the speaker in one or more document sections (step S322). Steps S321 and S322 may be performed simultaneously, or step S322 may be performed first. Next, the section matching degree deriving unit 54 matches sections based on the speaker information; that is, it associates the speaker sections with the document sections (step S323).

In the example of section matching processing by the section matching degree deriving unit 54 shown in FIG. 21 ((a) content information, (b) document information), sections are associated according to person identification information, which is the result of the speaker identification unit 52 identifying speakers from the content information (content sections) using the feature amounts stored in the speaker information storage unit 51. Dynamic programming matching (DP matching) may be used to associate the sections. When the accuracy of speaker identification based on the content information is low and "Tanaka" is not extracted, as illustrated in FIG. 21, the association can be made by skipping "Tanaka".

FIG. 22 ((a) content information, (b) document information) is an explanatory diagram of an example of section matching processing by the section matching degree deriving unit 54 when the speaker identification unit 52 extracts a plurality of persons as candidates. In this case, the person information obtained from the document information allows region [f] to be associated with section [7] of the document information. It is assumed that "Takagi" and "Yamashita" do not appear in the document. Section [a] is a section of either "Yamamoto" or "Tanaka", but because both names appear in the document information, it is associated with [1] and [2].
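Order-preserving matching of speaker sequences, including skipping a speaker that was not extracted from the content, can be sketched with a longest-common-subsequence style dynamic program. This is one possible reading of DP matching, not necessarily the variant intended here, and it produces only one-to-one pairs, so it does not reproduce the one-to-many association of [a] with [1] and [2]. The function name `align_speakers` and the sample data are hypothetical.

```python
def align_speakers(content_candidates, document_speakers):
    """Pair content sections with document sections in order.

    content_candidates: one set of candidate speaker names per content section.
    document_speakers: one speaker name per document section.
    Returns (content_index, document_index) pairs; unmatched sections on
    either side are skipped.
    """
    n, m = len(content_candidates), len(document_speakers)
    # score[i][j]: most pairs obtainable from content[i:] and document[j:].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if document_speakers[j] in content_candidates[i]:
                score[i][j] = score[i + 1][j + 1] + 1
            else:
                score[i][j] = max(score[i + 1][j], score[i][j + 1])
    # Trace the table from the start to recover the pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if document_speakers[j] in content_candidates[i]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif score[i + 1][j] >= score[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs
```

For example, if the document lists Yamamoto, Tanaka, and Takagi but only Yamamoto and Takagi are identified in the content, the document section for Tanaka is skipped and the other two sections are paired.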

FIG. 23 is a block diagram showing yet another example of the configuration of the section correspondence relationship deriving means 7 in the embodiment of the document association apparatus of the present invention. The section correspondence relationship deriving means 7 includes a speech recognition unit 61, which performs speech recognition and generates candidate text for the input speech, and a candidate text document correspondence unit 62, which associates the candidate text with the document in the document storage means 4.

FIG. 24 is a block diagram showing an example of the configuration of the candidate text document correspondence unit 62. The candidate text document correspondence unit 62 includes a candidate text word extraction unit 71, a document section word extraction unit 72, a candidate text/document section correspondence unit 74, and a candidate text/document section word similarity calculation unit 73. The candidate text word extraction unit 71 extracts one or more words from the candidate text of each section. The document section word extraction unit 72 extracts one or more words from each document section. The candidate text/document section correspondence unit 74 associates the sections with each other. The candidate text/document section word similarity calculation unit 73 calculates the intra-section distance.

Next, yet another correspondence derivation method executed by the section correspondence relationship deriving means 7 in the embodiment of the document association method of the present invention will be described. FIG. 25 is a flowchart showing this further example of the correspondence derivation method executed by the section correspondence relationship deriving means 7. FIGS. 26 and 27 are diagrams showing the correspondence between content information and document information in this correspondence derivation method. The content is assumed to include audio information.

The speech recognition unit 61 receives information about the content sections from the content section extraction means 5 and receives the content information from the content storage means 3. It then extracts the audio information from the content information, performs speech recognition, and generates candidate text for the input speech (step S341). Various speech recognition methods exist, such as recognition using phonemes, recognition that uses word templates directly, and conversion of templates to suit the speaker; any of these may be used in this embodiment.

The candidate text document correspondence unit 62 receives the candidate text for each content section from the speech recognition unit 61 and associates the candidate text with the document in the document storage means 4.

The candidate text document correspondence unit 62 compares the words in the candidate text with the words in the document sections, and associates a content section with a document section when they contain matching or similar words. Specifically, the candidate text word extraction unit 71 extracts, from the candidate text of each content section, one or more words used in that section (step S342). The document section word extraction unit 72 extracts one or more words from each document section (step S343). Steps S342 and S343 may be performed simultaneously, or step S343 may be performed first. Next, the candidate text/document section word similarity calculation unit 73 calculates the intra-section distance used to judge the similarity between the words in a content section and the words in a document section (step S344). Based on the intra-section distance, the candidate text/document section correspondence unit 74 compares the extracted word sets, associates the content sections with the document sections, and outputs the result (step S345).

FIG. 26 shows an example of the association between the candidate text and the document in the document storage means 4 by the candidate text/document section correspondence unit 74. In the figure, (a) is the content section, (b) is the start time of the content section, (c) is the candidate text words, (d) is the words in the document section, (e) is the document section, and (f) is the document. In the example shown in FIG. 26, the words that are important in each document section (the basic words that characterize the content of the section) have been extracted as (information communication, speech recognition, semantic information, ...), (security, video camera, moving object, ...), (experiment, ...), and (research, ...). From the audio-video sections, that is, the content sections (13:41, 15:41), (15:41, 16:50), (16:50, 20:15), (20:15, 21:13), ..., the words (speech recognition, semantic information, ...), (information communication, semantic information, ...), (security, ...), and (research, ...) have been extracted, respectively. Such words may be obtained simply by extracting only the nouns from the document, or by registering important words in a dictionary in advance and matching against the dictionary. Importance may also be determined by analyzing how frequently each word is used.

FIG. 27 shows an example of the association between the candidate text and the document in the document storage means 4 by the candidate text/document section correspondence unit 74. In the figure, (a) is the content section, (b) is the time of the content section, (c) is the document section, (d) is the document, and (e) is the correspondence table. By measuring the similarity (degree of overlap) of the word sequences, the candidate text/document section correspondence unit 74 can derive the correspondence of each section, as illustrated by the correspondence table in FIG. 27(e). As illustrated in FIG. 26, when no correspondence can be found, the section may simply be treated as having no correspondence. Dynamic programming matching (DP matching) may also be used to derive the correspondence between content sections and document sections.
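Measuring the overlap of word sets and reporting "no correspondence" when nothing overlaps enough can be sketched as follows. The text does not define the intra-section distance, so this sketch substitutes the Jaccard distance as an assumption; the threshold `max_distance` and the function names are likewise hypothetical.

```python
def jaccard_distance(words_a, words_b):
    """1 minus (shared words / all words); 0.0 means identical word sets."""
    a, b = set(words_a), set(words_b)
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def associate(content_words, document_words, max_distance=0.9):
    """For each content section, return the index of the closest document
    section, or None when even the closest one is farther than max_distance.

    content_words: one list of candidate-text words per content section.
    document_words: one list of words per document section (non-empty list).
    """
    result = []
    for words in content_words:
        distances = [jaccard_distance(words, doc) for doc in document_words]
        best = min(range(len(distances)), key=distances.__getitem__)
        result.append(best if distances[best] <= max_distance else None)
    return result
```

This greedy choice treats each content section independently; an order-preserving alternative would feed the same distances into DP matching.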

In this way, the association between content sections and document sections is realized. The association may also be realized by combining the configurations of the section correspondence relationship deriving means 7 described above (FIGS. 14, 19, and 23).

The output means 8 shown in FIG. 1 outputs the correspondence, derived by the section correspondence relationship deriving means 7, between the audio or video and the document sections. One example of an output form is a correspondence table in which the time within the content is attached to the head of each document section, as shown in FIG. 27(e). Any other output form may be used as long as it expresses the correspondence between the time information of the content and the document sections.
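A correspondence table that puts the content time at the head of each document section can be sketched as follows. The layout (a bracketed mm:ss prefix) is an assumption for illustration, since any form expressing the correspondence is acceptable; `format_time` and `correspondence_table` are hypothetical names.

```python
def format_time(seconds):
    """Render a time offset in seconds as mm:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def correspondence_table(start_times, document_sections, pairs):
    """Prefix each associated document section with its content start time.

    start_times: start time in seconds of each content section.
    document_sections: text of each document section.
    pairs: (content_index, document_index) pairs from the association step.
    """
    return [
        f"[{format_time(start_times[c])}] {document_sections[d]}"
        for c, d in pairs
    ]
```

Each output line can then serve as a cue point: selecting a document section gives the playback position of the associated content section.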

The present invention is applicable to uses such as an information presentation apparatus that displays content and document information in synchronization by automatically associating them, a multimedia display apparatus that uses text information to search for or cue the relevant part of the content, and a multimedia search apparatus. It is also applicable to uses such as an assembly video browsing apparatus for checking the actual content while referring to minutes, a lecture support system for referring to lecture materials together with lecture content, and an education support system.

Claims

A document association method using a document association apparatus comprising a content section extraction unit, a document section extraction unit, and a section correspondence relationship deriving unit, the method comprising:
(b) a step in which the content section extraction unit divides content including at least one of audio information and video information into a plurality of content sections;
(c) a step in which the document section extraction unit divides a document describing the contents of the content into a plurality of document sections; and
(d) a step in which the section correspondence relationship deriving unit compares a normalized first change pattern, obtained by dividing a first change pattern indicating how the section lengths of the plurality of content sections change within the content by the total section length of the plurality of content sections, with a normalized second change pattern, obtained by dividing a second change pattern indicating how the document amounts of the plurality of document sections change within the document by the total document amount of the plurality of document sections, and, based on the comparison, derives a correspondence relationship between the plurality of content sections and the plurality of document sections from the positions of the plurality of content sections in the normalized first change pattern and the positions of the plurality of document sections in the normalized second change pattern,
wherein the step (d) includes:
(d1) a step in which the section correspondence relationship deriving unit executes at least one of integrating two adjacent content sections of the plurality of content sections into one content section and integrating two adjacent document sections of the plurality of document sections into one document section, so that the number of the plurality of content sections and the number of the plurality of document sections correspond one-to-one,
and wherein the step (d1) includes:
(d11) a step in which the section correspondence relationship deriving unit, when integrating the two content sections into the one content section, selects the two content sections for which the certainty of change point extraction is relatively low in the change point analysis performed to extract change points as boundaries dividing the plurality of content sections, and, when integrating the two document sections into the one document section, selects the two document sections for which the certainty of change point extraction is relatively low in the change point analysis performed to extract change points as boundaries dividing the plurality of document sections.

The document association method according to claim 1,
wherein the step (d) includes:
(d2) a step in which the section correspondence relationship deriving unit identifies the speaker of each of the plurality of content sections in the normalized first pattern based on a feature amount that identifies a person included in the content;
(d3) a step in which the section correspondence relationship deriving unit identifies the speaker of each of the plurality of document sections in the normalized second pattern based on speaker information, which is information about a speaker that identifies a person included in the document; and
(d4) a step in which the section correspondence relationship deriving unit associates the normalized first change pattern with the normalized second change pattern based on the speakers identified in the normalized first pattern and the normalized second pattern.

The document association method according to claim 1,
wherein the step (d) includes:
(d5) a step in which the section correspondence relationship deriving unit deletes silent sections in each of the plurality of content sections;
(d6) a step in which the section correspondence relationship deriving unit obtains the section length of each of the plurality of content sections from which the silent sections have been deleted; and
(d7) a step in which the section correspondence relationship deriving unit obtains the normalized first change pattern based on the section lengths of the plurality of content sections from which the silent sections have been deleted.

The document association method according to claim 1,
wherein the step (b) includes:
(b1) a step in which the content section extraction unit extracts a feature amount that identifies a person included in the content; and
(b2) a step in which the content section extraction unit estimates a section in which the feature amounts are similar beyond a preset threshold value to be one content section in which one speaker's utterance continues, and extracts the plurality of content sections such that adjacent content sections have different speakers.

The document association method according to claim 1,
wherein the step (c) includes:
(c1) a step in which the document section extraction unit extracts document delimiter information indicating delimiters of the document included in the document; and
(c2) a step in which the document section extraction unit estimates each section between items of the document delimiter information to be one document section in which one speaker's utterance continues, and extracts the plurality of document sections in which adjacent content sections have different speakers.

The document association method according to claim 1,
wherein the step (d) includes:
(d8) a step in which the section correspondence relationship deriving unit performs the association by the dynamic programming matching method in pattern recognition, using one of the first change pattern of the plurality of content sections and the second change pattern of the plurality of document sections as a standard pattern and the other as an input pattern.

The document association method according to claim 1,
wherein the step (b) includes:
(b3) a step of extracting from the content, based on a feature amount that identifies a person included in the stored content, a point in time at which the speaker changes from one of a plurality of speakers to another of the plurality of speakers, as a point in time at which the feature amount changes; and
(b4) a step in which the content section extraction unit divides the content into units of speakers based on the point in time at which the speaker changes.

The document association method according to claim 7,
wherein the step (b3) includes:
(b31) a step in which, the content being the audio information, the content section extraction unit extracts from the audio information, based on a voice feature amount as the feature amount, a change point of the speaker's voice as the point in time at which the voice feature amount changes.

The document association method according to claim 7,
wherein the step (b3) includes:
(b32) a step in which, the content being the video information, the content section extraction unit extracts from the video information, based on a person feature amount as the feature amount, a change point of the video of the speaker as the point in time at which the person feature amount changes.

The document association method according to any one of claims 1 to 7,
wherein the content is audio-video information in which the audio information and the video information are synchronized.

The document association method according to claim 8,
wherein the step (b3) includes:
(b33) a step in which the content section extraction unit performs change point analysis of the voice feature amount, which is a sound feature of the audio information, and derives the point in time at which the speaker changes as the point in time at which the voice feature amount changes.

The document association method according to claim 9,
wherein the step (b3) includes:
(b34) a step in which the content section extraction unit performs change point analysis of the person feature amount, which is a visual feature of the video information, and derives the point in time at which the speaker changes as the point in time at which the person feature amount changes.

The document association method according to claim 7,
wherein the step (b3) includes:
(b35) a step in which, the content being audio-video information including the audio information and the video information, the content section extraction unit performs change point analysis of a person feature amount as the feature amount, which is a visual feature of the video information, and change point analysis of a voice feature amount as the feature amount, which is a sound feature of the audio information, integrates the two results, and derives the point in time at which the speaker changes as the point in time at which the feature amount changes.

The document association method according to claim 5,
wherein the step (c2) includes:
(c21) a step in which the document section extraction unit performs structural analysis of the document based on at least one of a blank line, a difference in font, a difference in character color, a character layout, and a speaker name as the document delimiter information, and divides the document into units of speakers.

A program that causes a computer to execute the method according to any one of claims 1 to 14 .

A storage medium storing the program according to claim 15 readable by a computer.

A document association apparatus comprising:
a content section extraction unit that divides content including at least one of audio information and video information and extracts a plurality of content sections;
a document section extraction unit that divides a document describing the contents of the content and extracts a plurality of document sections; and
a section correspondence relationship deriving unit that compares a normalized first change pattern, obtained by dividing a first change pattern indicating how the section lengths of the plurality of content sections change within the content by the total section length of the plurality of content sections, with a normalized second change pattern, obtained by dividing a second change pattern of the document amounts of the plurality of document sections by the total document amount of the plurality of document sections, and, based on the comparison, derives a correspondence relationship between the plurality of content sections and the plurality of document sections from the positions of the plurality of content sections in the normalized first change pattern and the positions of the plurality of document sections in the normalized second change pattern,
wherein the section correspondence relationship deriving unit executes at least one of integrating two adjacent content sections of the plurality of content sections into one content section and integrating two adjacent document sections of the plurality of document sections into one document section, so that the number of the plurality of content sections and the number of the plurality of document sections correspond one-to-one, and
wherein the section correspondence relationship deriving unit, when integrating the two content sections into the one content section, selects the two content sections for which the certainty of change point extraction is relatively low in the change point analysis performed to extract change points as boundaries dividing the plurality of content sections, and, when integrating the two document sections into the one document section, selects the two document sections for which the certainty of change point extraction is relatively low in the change point analysis performed to extract change points as boundaries dividing the plurality of document sections.

The document association apparatus according to claim 17,
wherein the section correspondence relationship deriving unit identifies the speaker of each of the plurality of content sections in the normalized first pattern based on a feature amount that identifies a person included in the content, identifies the speaker of each of the plurality of document sections in the normalized second pattern based on speaker information, which is information about a speaker that identifies a person included in the document, and associates the normalized first change pattern with the normalized second change pattern based on the speakers identified in the normalized first pattern and the normalized second pattern.

The document association apparatus according to claim 17,
wherein the section correspondence relationship deriving unit deletes silent sections in each of the plurality of content sections, obtains the section length of each of the plurality of content sections from which the silent sections have been deleted, and obtains the normalized first change pattern based on the section lengths of the plurality of content sections from which the silent sections have been deleted.

The document association apparatus according to any one of claims 17 to 19,
wherein the content section extraction unit extracts a feature amount that identifies a person included in the content, estimates a section in which the feature amounts are similar beyond a preset threshold value to be one content section in which one speaker's utterance continues, and extracts the plurality of content sections such that adjacent content sections have different speakers.

The document association apparatus according to any one of claims 17 to 20,
wherein the document section extraction unit extracts document delimiter information indicating delimiters of the document included in the document, estimates each section between items of the document delimiter information to be one document section in which one speaker's utterance continues, and extracts the plurality of document sections in which adjacent content sections have different speakers.

The document association apparatus according to any one of claims 17 to 21, wherein
the section correspondence relation deriving unit performs the association by dynamic programming matching as used in pattern recognition, with one of the first change pattern of the plurality of content sections and the second change pattern of the plurality of document sections as a standard pattern and the other as an input pattern.
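
A minimal sketch of dynamic programming matching between two change patterns is given below for illustration. It assumes each pattern is a list of normalized section lengths and uses the absolute difference as the local distance; the function name and these choices are assumptions of this sketch, not the specific formulation of the claims.

```python
from typing import List, Sequence, Tuple


def dp_match(standard: Sequence[float],
             inp: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two patterns by dynamic programming matching.

    Returns the accumulated distance and the alignment path as a list of
    (standard_index, input_index) pairs.
    """
    n, m = len(standard), len(inp)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i and j items.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(standard[i - 1] - inp[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Trace back from the end to recover which sections were associated.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    total, path = dp_match([0.1, 0.6, 0.3], [0.1, 0.6, 0.3])
    print(total, path)
```

Because a path step may repeat an index on either side, one section of the standard pattern can be associated with several sections of the input pattern, and vice versa.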

The document association apparatus according to any one of claims 17 to 19, wherein
the content is the audio information, and
the content section extraction unit analyzes a sound feature of the audio information and extracts the plurality of content sections based on points where the sound feature changes.

The document association apparatus according to any one of claims 17 to 19, wherein
the content is the video information, and
the content section extraction unit analyzes a visual feature of the video information and extracts the plurality of content sections based on points where the visual feature changes.

The document association apparatus according to any one of claims 17 to 19, wherein
the content is audio-video information in which the audio information and the video information are synchronized, and
the content section extraction unit extracts the plurality of content sections by integrating the change points of the sound feature obtained by analyzing the sound feature of the audio information with the change points of the visual feature obtained by analyzing the visual feature of the video information.

The document association apparatus according to claim 25, wherein
the content section extraction unit comprises:
an audio section extraction unit that analyzes a sound feature of the audio information and, based on points where the sound feature changes, divides the audio information into speaker units to extract a plurality of audio sections;
a video section extraction unit that analyzes a visual feature of the video information and, based on points where the visual feature changes, divides the video information into speaker units to extract a plurality of video sections; and
an audio/video section integration unit that extracts the plurality of content sections based on a plurality of pieces of audio section information about the plurality of audio sections and a plurality of pieces of video section information about the plurality of video sections.
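
One possible way to integrate audio and video change points is sketched below for illustration. The claims do not specify the integration rule; the rule used here (treat change points closer than a tolerance as one change point) and the names `integrate_change_points`, `to_sections`, and `tolerance` are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def integrate_change_points(audio_points: Sequence[float],
                            video_points: Sequence[float],
                            tolerance: float = 0.5) -> List[float]:
    """Merge audio and video change points (in seconds) into one list.

    Points that follow the previously kept point by no more than
    `tolerance` seconds are treated as the same change point.
    """
    merged: List[float] = []
    for t in sorted(list(audio_points) + list(video_points)):
        if merged and t - merged[-1] <= tolerance:
            continue  # same change point as the one already kept
        merged.append(t)
    return merged


def to_sections(points: Sequence[float],
                duration: float) -> List[Tuple[float, float]]:
    """Turn change points into consecutive (start, end) content sections."""
    bounds = [0.0] + [p for p in points if 0.0 < p < duration] + [duration]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    points = integrate_change_points([10.0, 25.0], [10.3, 40.0])
    print(to_sections(points, duration=50.0))
```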

The document association apparatus according to any one of claims 17 to 19, wherein
the content section extraction unit, based on a feature amount that identifies a person included in the stored content, extracts as a speaker change point the time at which the feature amount changes, that is, the time at which the speaker changes from one of a plurality of speakers to another, and extracts the plurality of content sections obtained by dividing the content into speaker units based on the speaker change points.

The document association apparatus according to claim 27, wherein
the content includes the audio information, and
the content section extraction unit extracts the speaker change point based on a change in at least one prosodic feature of the audio information, among voice pitch, speaking rate, and voice volume, as the feature amount.

The document association apparatus according to claim 27, wherein
the content includes the audio information, and
the content section extraction unit extracts the speaker change point based on a change in the conversation form in the audio information as the feature amount.

The document association apparatus according to claim 27, wherein
the content includes the video information, and
the content section extraction unit extracts the speaker change point based on a change in a visual feature of a person in the video information as the feature amount.

The document association apparatus according to claim 27, wherein
the content includes the video information, and
the content section extraction unit extracts the speaker change point based on a change in a facial feature of a person in the video information as the feature amount.

The document association apparatus according to claim 27, wherein
the content includes the video information, and
the content section extraction unit extracts the speaker change point based on a change in a visual feature of the clothes of a person in the video information as the feature amount.

The document association apparatus according to any one of claims 17 to 32, wherein
the document section extraction unit estimates, based on format information of the document, a section having the same format information to be one document section in which one speaker's utterance continues, and extracts the plurality of document sections whose speakers differ between adjacent content sections.

The document association apparatus according to any one of claims 17 to 32, wherein
the document section extraction unit estimates, based on a description about the speaker entered in the document, a section in which the speaker is the same to be one document section in which one speaker's utterance continues, and extracts the plurality of document sections whose speakers differ between adjacent content sections.

The document association apparatus according to any one of claims 17 to 32, wherein
the document section extraction unit estimates, based on tag information of a structured document in the document, a section having the same tag information to be one document section in which one speaker's utterance continues, and extracts the plurality of document sections whose speakers differ between adjacent content sections.
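
For illustration, extraction of document sections from tag information can be sketched as follows. The XML layout, the `utterance` element, and the `speaker` attribute are hypothetical and chosen only for this sketch; the claims do not prescribe a tag vocabulary.

```python
import xml.etree.ElementTree as ET
from itertools import groupby
from typing import List, Tuple


def document_sections(xml_text: str) -> List[Tuple[str, str]]:
    """Group consecutive utterances carrying the same speaker tag.

    Returns one (speaker, text) pair per document section, where a
    section is a run of adjacent <utterance> elements whose `speaker`
    attribute is identical.
    """
    root = ET.fromstring(xml_text)
    utterances = root.findall("utterance")
    sections = []
    for speaker, run in groupby(utterances, key=lambda u: u.get("speaker")):
        text = " ".join((u.text or "").strip() for u in run)
        sections.append((speaker, text))
    return sections


if __name__ == "__main__":
    doc = (
        "<minutes>"
        "<utterance speaker='A'>Hello.</utterance>"
        "<utterance speaker='A'>Let us begin.</utterance>"
        "<utterance speaker='B'>Agreed.</utterance>"
        "</minutes>"
    )
    print(document_sections(doc))
```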

The document association apparatus according to any one of claims 17 to 32, wherein
the document section extraction unit estimates, based on a change in conversation feature in the document, a section having the same conversation feature to be one document section in which one speaker's utterance continues, and extracts the plurality of document sections whose speakers differ between adjacent content sections.

The document association apparatus according to claim 18, wherein
the section correspondence relation deriving unit comprises:
a content speaker identification unit that identifies the speaker of each of the plurality of content sections based on the feature amount;
a document speaker information extraction unit that identifies the speaker of each of the plurality of document sections based on the speaker information; and
a section matching unit that matches the plurality of content sections with the plurality of document sections based on the identified speaker of each of the plurality of content sections and the identified speaker of each of the plurality of document sections.

The document association apparatus according to claim 37, wherein
the content speaker identification unit comprises:
a content feature amount extraction unit that extracts a feature amount of each of the plurality of content sections;
a speaker information storage unit that stores the feature amount and the speaker in association with each other; and
a feature amount matching identification unit that identifies the speaker based on a comparison between the stored feature amount and the extracted feature amount.
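
The comparison between stored and extracted feature amounts can be illustrated with a nearest-neighbor sketch. The dictionary used as the speaker information store, the Euclidean distance, and the `max_distance` rejection threshold are assumptions of this sketch, not details of the claims.

```python
import math
from typing import Dict, Optional, Sequence


def identify_speaker(extracted: Sequence[float],
                     stored: Dict[str, Sequence[float]],
                     max_distance: float) -> Optional[str]:
    """Return the stored speaker whose feature amount is closest.

    Returns None when no stored feature amount lies within
    `max_distance` of the extracted one.
    """
    best_name: Optional[str] = None
    best_dist = float("inf")
    for name, feature in stored.items():
        dist = math.dist(feature, extracted)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


if __name__ == "__main__":
    store = {"Alice": (1.0, 0.0), "Bob": (0.0, 1.0)}
    print(identify_speaker((0.9, 0.1), store, max_distance=0.5))
```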

The document association apparatus according to claim 37 or 38, wherein
the content speaker identification unit identifies the speaker based on at least one prosodic feature of the audio information, among voice pitch, voice length, and voice strength, as the feature amount.

The document association apparatus according to claim 37 or 38, wherein
the content speaker identification unit identifies the speaker based on a feature amount representing the conversation form in the audio information as the feature amount.

The document association apparatus according to claim 37 or 38, wherein
the content speaker identification unit identifies the speaker based on a visual feature amount of a person in the video information as the feature amount.

The document association apparatus according to claim 41, wherein
the content speaker identification unit uses a facial feature of a person as the visual feature of the person.

The document association apparatus according to any one of claims 37 to 42, wherein
the document speaker information extraction unit identifies the speaker based on a description about the speaker entered in the document as the speaker information.

The document association apparatus according to any one of claims 37 to 42, wherein
the document speaker information extraction unit identifies the speaker based on metadata of a structured document in the document as the speaker information.

The document association apparatus according to any one of claims 37 to 44, wherein
the section matching unit associates the plurality of content sections with the plurality of document sections so that the speaker of each of the plurality of content sections matches the speaker of each of the plurality of document sections.

The document association apparatus according to claim 45, wherein
the section matching unit associates the plurality of content sections with the plurality of document sections based on a result of executing dynamic programming matching as used in pattern recognition, with one of the first change pattern of the plurality of content sections and the second change pattern of the plurality of document sections as a standard pattern and the other as an input pattern.

The document association apparatus according to any one of claims 17 to 46, wherein
the content includes audio information,
the apparatus further comprises a speech recognition unit that extracts the utterance contents in the plurality of content sections and outputs utterance text information, and
the section correspondence relation deriving unit associates the plurality of content sections with the plurality of document sections based on the similarity between the utterance text information and the document information of the document.

The document association apparatus according to claim 47, wherein
the section correspondence relation deriving unit associates the utterance text information with the document information based on a result of executing dynamic programming matching as used in pattern recognition, with one of a word pattern appearing in the utterance text information and a word pattern appearing in the document information as a standard pattern and the other as an input pattern.

The document association apparatus according to claim 47 or 48, wherein
the section correspondence relation deriving unit comprises:
a basic word extraction unit that extracts one or more first basic words used in each of the plurality of content sections in the utterance text information and one or more second basic words used in each of the plurality of document sections; and
a basic word group similarity deriving unit that measures the similarity between the plurality of first basic words and the plurality of second basic words,
and derives the correspondence relationship based on the similarity.
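
As an illustration of measuring similarity between basic word groups, the sketch below extracts words from two texts and compares them. The stop word list, the regular-expression tokenizer, and the use of the Jaccard coefficient are all assumptions of this sketch; the claims do not specify how basic words are chosen or which similarity measure is used.

```python
import re
from typing import Set

# Hypothetical stop words excluded from the basic word groups.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and"}


def basic_words(text: str) -> Set[str]:
    """Extract a set of basic words from recognized speech or document text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {token for token in tokens if token not in STOP_WORDS}


def word_group_similarity(first: Set[str], second: Set[str]) -> float:
    """Jaccard coefficient between two basic word groups (0.0 to 1.0)."""
    union = first | second
    if not union:
        return 0.0
    return len(first & second) / len(union)


if __name__ == "__main__":
    speech = basic_words("The budget is approved")
    document = basic_words("Budget approved today")
    print(word_group_similarity(speech, document))
```

A similarity value computed this way for every pair of content section and document section could then serve as the local score in a matching step.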

The document association apparatus according to claim 47 or 48, wherein
the section correspondence relation deriving unit derives the correspondence relationship by performing the association on the similarity using dynamic programming matching.

The document association apparatus according to any one of claims 17 to 50, further comprising:
a content input unit for inputting the content;
a content storage unit for storing the content;
a document input unit for inputting the document information;
a document storage unit for storing the document; and
an output unit that outputs information related to the correspondence relationship.