JP5296598B2 - Voice information extraction device

Info

Publication number: JP5296598B2
Application number: JP2009111587A
Authority: JP
Inventors: 彰夫小林; 亨今井; 貴裕奥; 庄衛佐藤; 真一本間
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2009-04-30
Filing date: 2009-04-30
Publication date: 2013-09-25
Anticipated expiration: 2029-04-30
Also published as: JP2010262413A

Abstract

PROBLEM TO BE SOLVED: To provide an audio information extraction device that presents, as a search result, the speech recognition result of a program or the like together with a topic related to the utterance content. SOLUTION: The audio information extraction device includes: a video/audio recording unit that acquires video and audio; a speech recognition unit that performs speech recognition on the audio using an acoustic model and a language model; a text data acquisition unit that acquires, from an external source, text data related to the acquired video and audio; a topic extraction unit that extracts a topic by comparing the acquired text data with the speech recognition result; an audio information integration unit that writes audio information, formed by integrating the speech recognition result with the topic, into an audio information storage unit; a search index creation unit that creates a search index from the speech recognition result; and a search server unit that, in response to a search request containing a search term, searches the search index and the audio information, then reads and presents the audio information associated with the matching video and audio. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to an audio information extraction device that extracts audio information from input video and audio and makes the extracted audio information searchable and browsable.

Patent Document 1, in particular claim 7, describes an information processing apparatus that displays on a display device the video signal contained in television broadcast program data and obtains text information by performing speech recognition on the audio data contained in that broadcast program data. At a predetermined time, the apparatus extracts keywords by morphological analysis of the obtained text information and stores the extracted keywords in a storage device together with information on that time. It presents these entries in chronological order as a history and, when one of the time entries is selected, displays the list of keywords stored in the storage device for the selected time. Using these keywords, related detailed information can also be obtained from outside via a communication line such as the Internet.

Non-Patent Document 1 and Non-Patent Document 2 describe techniques for expanding and compressing the lattice data of speech recognition results.
The method described in Non-Patent Document 1 examines, when aggregating a lattice, the overlap of utterance times and the similarity in pronunciation of the word notations. For example, it finds that "Lincoln" (rinkaan) and "inkan" (印鑑, a personal seal) are similar in pronunciation. This makes it possible to identify competing word hypotheses (candidates for the correct word).
In the method described in Non-Patent Document 2, a graph (the maximum likelihood sequence) that serves as the basis of the compressed lattice is first selected from the lattice. Nodes and edges are then added to the compressed lattice while the order in which the lattice is traversed is changed.

Patent Document 1: JP 2009-077166 A

Non-Patent Document 1: L. Mangu, E. Brill, A. Stolcke, “Finding consensus in speech recognition: word error minimization and other applications of confusion networks”, Computer Speech and Language, vol. 14, no. 4, pp. 373-400, 2000.
Non-Patent Document 2: D. Hakkani-Tur, F. Bechet, G. Riccardi, G. Tur, “Beyond ASR 1-best: Using word confusion networks in spoken language understanding”, Computer Speech and Language, vol. 20, no. 4, pp. 495-514, 2006.

However, although the conventional technology described above (in particular, the technology of Patent Document 1) can recognize the audio of a program or the like, extract keywords, and obtain information related to those keywords from outside, it cannot make the programs themselves the target of a search.
In addition, with the conventional technology it is hard for the user to see how the externally obtained information relates to the audio of the program itself.
Furthermore, the accuracy of speech recognition may fall as the language used in television and radio broadcasts changes.

The lattice data processing method described in Non-Patent Document 1 clusters edges based on the edit distance between word hypotheses converted into phoneme strings, so compressing the lattice takes time. In other words, because pronunciation similarity is compared for each word notation, the compression procedure is slow.
The lattice data processing method described in Non-Patent Document 2 can compress a lattice faster than the method of Non-Patent Document 1, but because it does not cluster word hypotheses, its compression ratio is low. In other words, because pronunciation similarity is not compared, the accuracy of the compressed lattice is poor.

The present invention has been made in view of the above problems. Its object is to provide an audio information extraction device that stores video and audio, allows the video and audio to be searched by their utterance content, and presents audio information, such as the topic at the matching point in the audio and information about the speaker, to the user in an easily understood form as the search result.

Another object of the present invention is to provide an audio information extraction device configured so that the accuracy of speech recognition does not fall even when language usage changes.

A further object of the present invention is to provide an audio information extraction device that can compress the lattice data obtained from speech recognition at high speed and with a high compression ratio, and make use of it.

[1] To solve the above problems, an audio information extraction device according to one aspect of the present invention comprises: a video/audio storage unit that stores video and audio; a search index storage unit that stores a search index containing correspondences between words and utterance times in the audio; an audio information storage unit that stores audio information in which an utterance time, utterance content that is a sequence of words, a topic, and at least one of a speaker name and a speaker attribute are associated with one another; an acoustic model storage unit that stores an acoustic model statistically representing the acoustic features of speech; a language model storage unit that stores a language model statistically representing the frequency of occurrence of words; a video/audio recording unit that acquires video and audio from outside and writes them to the video/audio storage unit; a speech recognition unit that performs speech recognition on the audio acquired by the video/audio recording unit, using the acoustic model read from the acoustic model storage unit and the language model read from the language model storage unit, and outputs a speech recognition result; a speaker data storage unit that stores in advance speaker data statistically representing the acoustic features of each speaker or each speaker attribute; a speaker identification unit that uses the speaker data read from the speaker data storage unit to determine and output the speaker name or speaker attribute corresponding to the audio acquired by the video/audio recording unit; a text data acquisition unit that acquires from outside text data related to the video and audio acquired by the video/audio recording unit; a topic extraction unit that extracts a topic by comparing the text data acquired by the text data acquisition unit with the speech recognition result output by the speech recognition unit; an audio information integration unit that writes to the audio information storage unit audio information formed by integrating the speech recognition result, the topic, and at least one of the speaker name and the speaker attribute; a search index creation unit that creates search index data from the speech recognition result and writes it to the search index storage unit; and a search server unit that searches the search index storage unit and the audio information storage unit in response to a search request containing a search term, reads from the audio information storage unit the audio information associated with the video and audio matching the search term, presents it to the requester as a search result, and makes the corresponding video and audio stored in the video/audio storage unit available for playback.
Here, video and audio are electrical signals or data representing video and audio, respectively, and can be processed by a computer or the like.
The utterance time is information expressed as a combination of a program ID (identification information, such as a unique number determined from the broadcast channel and the program name) and an utterance start time. The utterance start time is expressed either as a time relative to the start of the program or as an actual date and time (for example, Japan Standard Time).
Audio information is information about the audio; its details are described later.
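The utterance time and the audio information record defined in aspect [1] can be pictured as two small record types. The following is only an illustrative sketch: the class and field names (`UtteranceTime`, `AudioInfo`, `program_id`, `start_seconds`, and so on) are assumptions made for this example and do not appear in the patent, and the start time is assumed to be stored relative to the start of the program.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UtteranceTime:
    """Utterance time: a program ID combined with an utterance start time."""
    program_id: int       # hypothetical unique number derived from channel and program name
    start_seconds: float  # assumed here to be relative to the start of the program


@dataclass
class AudioInfo:
    """One audio information record associating time, content, topic, and speaker."""
    time: UtteranceTime
    words: List[str]                          # utterance content as a sequence of words
    topic: Optional[str] = None
    speaker_name: Optional[str] = None        # at least one of the two speaker
    speaker_attribute: Optional[str] = None   # fields is expected to be set


record = AudioInfo(
    time=UtteranceTime(program_id=101, start_seconds=12.5),
    words=["today", "weather"],
    topic="weather forecast",
    speaker_attribute="female",
)
```

Making `UtteranceTime` immutable lets it serve as a dictionary key, which is convenient when the same value is also referenced from a search index.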

With the above configuration, the speech recognition result and the extracted topic are integrated and stored as audio information, and a search index is also stored. In response to a search request containing a search term, the device can therefore present to the user, as search results, the programs whose utterance content (speech recognition result) matches the term, together with the topics related to that utterance content. The video and audio of a program selected from the search results can then be played back.
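The role of the search index, which maps words to the utterance times at which they were spoken, can be sketched as follows. This is a minimal illustration under assumptions: the function names `build_index` and `search`, the tuple layout, and the sample data are hypothetical, and the index is built from plain word lists rather than from the recognition lattice.

```python
from collections import defaultdict


def build_index(utterances):
    """Build a word -> [(program_id, start_seconds), ...] mapping.

    `utterances` is an iterable of (program_id, start_seconds, words) tuples.
    """
    index = defaultdict(list)
    for program_id, start_seconds, words in utterances:
        # dict.fromkeys removes repeated words while keeping their order,
        # so each utterance is listed at most once per word.
        for word in dict.fromkeys(words):
            index[word].append((program_id, start_seconds))
    return index


def search(index, term):
    """Return the utterance times at which `term` was recognized."""
    return index.get(term, [])


utterances = [
    (101, 12.5, ["rain", "in", "tokyo", "rain"]),
    (102, 3.0, ["heavy", "rain"]),
]
index = build_index(utterances)
hits = search(index, "rain")
```

Each hit identifies a program and a position within it, which is what allows the matching audio information to be looked up and the video to be played back from that point.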

[2] In one aspect of the present invention, in the above audio information extraction device, the topic extraction unit calculates a similarity from the number of word sets, out of a predetermined number of word sets contained in the text data, that appear within a predetermined number of words of the speech recognition result, and extracts the topic from the text data by associating the text data with the speech recognition result on the basis of this similarity.
With this configuration, topics can be extracted and topic boundaries can be located in the speech recognition result.
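The similarity described in aspect [2] counts how many word sets from the text data also occur in a window of recognized words. A rough sketch is given below. Several details are assumptions made for illustration: word sets are taken to be pairs of adjacent words, the count is normalized by the number of word sets in the text, and the names `word_sets`, `similarity`, and `best_topic` are hypothetical.

```python
def word_sets(words, size=2):
    """Return the set of runs of `size` adjacent words (assumed form of a word set)."""
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_words, recognized_window, size=2):
    """Fraction of the text's word sets that also occur in the recognized window."""
    text_sets = word_sets(text_words, size)
    if not text_sets:
        return 0.0
    window_sets = word_sets(recognized_window, size)
    return len(text_sets & window_sets) / len(text_sets)


def best_topic(articles, recognized_window):
    """Pick the topic whose article text is most similar to the window.

    `articles` maps a topic label to the word list of its text.
    """
    return max(articles, key=lambda topic: similarity(articles[topic], recognized_window))


articles = {
    "weather": ["heavy", "rain", "in", "tokyo"],
    "sports": ["final", "match", "in", "osaka"],
}
window = ["today", "heavy", "rain", "in", "tokyo", "area"]
topic = best_topic(articles, window)
```

Sliding the window over the recognition result and watching where the best-matching topic changes is one way such a score could also indicate topic boundaries.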

[3] In one aspect of the present invention, the above audio information extraction device further comprises a language model learning unit that updates the language model stored in the language model storage unit by calculating the frequency of occurrence of words in the text data acquired by the text data acquisition unit.
This allows the language model to be updated from the latest broadcast content, which improves the speech recognition rate.
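Updating a language model from newly collected text amounts to folding new word counts into the existing statistics. The sketch below uses a unigram model for brevity; that choice, the class name `UnigramModel`, and the absence of smoothing are assumptions for illustration and are not taken from the patent.

```python
from collections import Counter


class UnigramModel:
    """A toy language model that tracks word frequencies."""

    def __init__(self):
        self.counts = Counter()

    def update(self, words):
        """Fold the word counts of newly collected text into the model."""
        self.counts.update(words)

    def probability(self, word):
        """Relative frequency of `word` (no smoothing in this sketch)."""
        total = sum(self.counts.values())
        return self.counts[word] / total if total else 0.0


model = UnigramModel()
model.update(["rain", "rain", "sun"])  # earlier text
model.update(["sun"])                  # text collected later
```

Because only counts are stored, each new batch of text can be added without reprocessing the older text.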

[4] In one aspect of the present invention, the above audio information extraction device further comprises a lattice compression unit that compresses lattice data representing a directed acyclic graph of word hypotheses, which the speech recognition unit outputs as the speech recognition result, and the search index creation unit creates the search index from the lattice data compressed by the lattice compression unit.
This compresses the lattice of the speech recognition result and reduces the storage capacity it requires.

[5] In one aspect of the present invention, the above audio information extraction device further comprises a search client unit that sends the search request to the search server unit using a search term based on user input, displays the search results from the search server unit on a screen, and plays back the corresponding video and audio in response to a user operation.

In one aspect of the present invention, in the above audio information extraction device, the lattice compression unit performs the following processes (1) to (3).
(1) Among the edges of the lattice whose utterance start and end times overlap, edges with the same notation are clustered (that is, the start and end of the cluster are represented by those of the edge with the highest posterior probability, and the sum of the posterior probabilities is assigned to that representative edge).
(2) Overlapping edges of the lattice are clustered (that is, they are given the same start node and end node).
(3) The nodes of the lattice are visited in topological order and the links are merged.
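Process (1) above can be sketched in a few lines. This is a simplified illustration, not the procedure of the embodiment: the `Edge` record and the function names are hypothetical, edges carry times directly instead of node references, and overlap is tested only against the representative edge of each cluster.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    word: str         # notation of the word hypothesis
    start: float      # utterance start time
    end: float        # utterance end time
    posterior: float  # posterior probability of the hypothesis


def overlaps(a, b):
    """True if the time spans of two edges overlap."""
    return a.start < b.end and b.start < a.end


def merge_same_label(edges):
    """Cluster overlapping edges that share a notation.

    Edges are processed from highest to lowest posterior, so the first edge
    of each cluster is its representative. The representative keeps its own
    start and end times and receives the sum of the posteriors.
    """
    merged = []
    for edge in sorted(edges, key=lambda e: e.posterior, reverse=True):
        for rep in merged:
            if rep.word == edge.word and overlaps(rep, edge):
                rep.posterior += edge.posterior
                break
        else:
            merged.append(Edge(edge.word, edge.start, edge.end, edge.posterior))
    return merged


edges = [
    Edge("tokyo", 0.10, 0.55, 0.3),
    Edge("tokyo", 0.00, 0.50, 0.6),
    Edge("kyoto", 0.00, 0.50, 0.1),
]
compressed = merge_same_label(edges)
```

Only exact matches of the notation are compared here, so no pronunciation similarity needs to be computed between different words.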

According to the present invention, video and audio can be stored and searched by search term against their utterance content, and audio information, such as the topic at the matching point in the audio and information about the speaker, can be presented to the user in an easily understood form as the search result.

A block diagram showing the functional configuration of the audio information extraction device according to an embodiment of the present invention.
A schematic diagram showing the structure of the audio information stored in the audio information storage unit 16 in the embodiment.
A schematic diagram showing the layout of the screen displayed on the display device of the search client unit 20 in the embodiment.
A schematic diagram showing the screen layout for displaying search results on the display device of the search client unit 20 in the embodiment.
A flowchart showing the procedure of the language model learning process performed by the text collection unit 3 and the language model learning unit 9 in the embodiment.
A flowchart showing the procedure of the topic extraction process performed by the topic extraction unit 14 in the embodiment.
The first flowchart showing the procedure by which the lattice expansion/compression unit 12 compresses the lattice of the speech recognition result in the embodiment.
The second flowchart showing the procedure by which the lattice expansion/compression unit 12 compresses the lattice of the speech recognition result in the embodiment.
The third flowchart showing the procedure by which the lattice expansion/compression unit 12 compresses the lattice of the speech recognition result in the embodiment.
A flowchart showing the procedure of the inverted index creation process performed by the search inverted index creation unit 15 in the embodiment.
A schematic diagram showing the data structure of the inverted index stored in the search inverted index storage unit 17 in the embodiment.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of the audio information extraction device according to the embodiment. As shown, the audio information extraction device 50 comprises a video/audio recording unit 1, a program information collection unit 2, a text collection unit 3, a speaker data storage unit 4, a speaker identification unit 5, a speech recognition unit 6, an acoustic model storage unit 7, a language model storage unit 8, a language model learning unit 9, a text data storage unit 10, a word dictionary storage unit 11, a lattice expansion/compression unit 12 (lattice compression unit), an audio information integration unit 13, a topic extraction unit 14, a search inverted index creation unit 15 (search index creation unit), an audio information storage unit 16, a search inverted index storage unit 17, a video/audio storage unit 18, a search server unit 19, and a search client unit 20.

The speaker data storage unit 4, the acoustic model storage unit 7, the language model storage unit 8, the text data storage unit 10, the word dictionary storage unit 11, the audio information storage unit 16, the search inverted index storage unit 17, and the video/audio storage unit 18 are each implemented using a magnetic disk device (HDD) or a semiconductor memory (such as semiconductor RAM or semiconductor ROM).

The video/audio recording unit 1 receives the video and audio of broadcasts (General TV, Educational TV, satellite broadcasting, Radio 1, Radio 2, and so on) and converts them into a computer-readable digital video data file. The video/audio recording unit 1 writes this digital video data file to the video/audio storage unit 18. The file is also used by the speaker identification unit 5 and the speech recognition unit 6, described later.

The program information collection unit 2 acquires program information and EPG (Electronic Program Guide) information from an external server computer (such as a web server) via a communication line such as the Internet. This information includes text such as the program title and the names of the performers. The program information collection unit 2 processes the acquired EPG information and the like and saves it as metadata of the digital video data file produced by the video/audio recording unit 1.

The text collection unit 3 acquires web text information from an external server computer (such as a web server) via a communication line such as the Internet, applies natural language processing such as morphological analysis to it, and writes the result to the text data storage unit 10. The web text information acquired by the text collection unit 3 is, for example, news and similar text published on websites operated by television or radio broadcasters. As detailed later, the text data written to the text data storage unit 10 is read and used by the language model learning unit 9 and the topic extraction unit 14.

The speaker data storage unit 4 stores in advance speaker data that statistically represents the acoustic features of each speaker or each speaker attribute. It also stores in advance, as linguistic features, a table that associates a word string or phrase at the end of an utterance with the probability that a change of speaker or speaker attribute occurs immediately after that word string. Here, a speaker attribute is an attribute associated with distinct acoustic features, such as the speaker's gender (male or female).

The speaker identification unit 5 analyzes the audio acquired by the video/audio recording unit 1 using the speaker data read from the speaker data storage unit 4, and determines and outputs the corresponding class (speaker name or speaker attribute). Specifically, it computes acoustic features from the audio and obtains the probability that those features belong to one of the speaker classes (corresponding to individual speakers or speaker attributes) stored in the speaker data storage unit 4, or are unknown. It also obtains the speech recognition result from the speech recognition unit 6 and, using the word string or phrase at its end, looks up the table stored in the speaker data storage unit 4 to obtain the probability that a change of speaker or speaker attribute has occurred. The speaker identification unit 5 combines the probabilities obtained from the acoustic features and from the linguistic features and decides whether the audio belongs to one of the stored speaker classes or is unknown. If the speaker class is known, it outputs the corresponding speaker name or speaker attribute as the identification result; if the class is unknown, it creates a new speaker class and outputs that class's number as the identification result.

The speech recognition unit 6 performs speech recognition on the audio acquired by the video/audio recording unit 1, using the acoustic model read from the acoustic model storage unit 7 and the language model read from the language model storage unit 8, and outputs a speech recognition result. As preprocessing, it first determines whether each portion of the audio is human speech or music (a portion that is not human speech). For a section judged to be music, it outputs metadata indicating a music section together with the section's start time. For a section judged to be speech, it recognizes the section and outputs the utterance content as the speech recognition result. The recognition itself uses existing technology. As detailed later in the description of the lattice expansion/compression unit 12, the speech recognition unit 6 outputs, as the speech recognition result, lattice-structured data: a directed acyclic graph whose edges are word hypotheses and whose nodes correspond to the times between words. This lattice data represents the hypotheses of the speech recognition result and their probabilities. By scanning the lattice with the forward-backward algorithm or a similar method, posterior probabilities can be calculated and the maximum likelihood word sequence obtained.
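The posterior probability of each word hypothesis can be obtained from the lattice with a forward pass and a backward pass. The sketch below is a generic illustration of the forward-backward computation on a directed acyclic graph, not the recognizer's actual implementation. It assumes that node IDs are integers numbered in topological order (every edge goes from a smaller ID to a larger one) and that each edge carries a single combined probability; the function name `edge_posteriors` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def edge_posteriors(edges, start, final):
    """Compute the posterior probability of every edge in a lattice.

    `edges` is a list of (src, dst, word, prob) tuples. Node IDs are assumed
    to be topologically ordered integers, so src < dst for every edge.
    """
    forward = defaultdict(float)   # total probability of reaching a node from `start`
    backward = defaultdict(float)  # total probability of reaching `final` from a node
    forward[start] = 1.0
    backward[final] = 1.0

    # Forward pass: edges sorted by source node, so forward[src] is complete
    # before any edge leaving src is processed.
    for src, dst, _, prob in sorted(edges, key=lambda e: e[0]):
        forward[dst] += forward[src] * prob

    # Backward pass: edges sorted by destination node in descending order.
    for src, dst, _, prob in sorted(edges, key=lambda e: e[1], reverse=True):
        backward[src] += prob * backward[dst]

    total = forward[final]  # probability mass of all complete paths
    return [
        (word, forward[src] * prob * backward[dst] / total)
        for src, dst, word, prob in edges
    ]


lattice = [
    (0, 1, "rain", 0.6),
    (0, 1, "lane", 0.2),
    (1, 2, "today", 1.0),
]
posteriors = dict(edge_posteriors(lattice, start=0, final=2))
```

In this small lattice the two competing first words share the probability mass in proportion to their scores, while the word that lies on every path receives a posterior of 1.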

The acoustic model storage unit 7 stores an acoustic model, which is data that statistically represents the relationship between a linguistic unit such as a phoneme and the acoustic features observed when that unit is uttered as speech. Specifically, the acoustic model is represented as a set of data items that associate a phoneme-level notation, acoustic features, and a probability value. A hidden Markov model (HMM), for example, is used as the acoustic model.

The language model storage unit 8 stores a language model, which is data that statistically represents the frequency (a characteristic) with which linguistic units such as phonemes or words occur in a given language. Specifically, a word n-gram, for example, is used as the language model. A word n-gram is statistical data that accumulates pairs consisting of a sequence of n consecutive words (n being a natural number) occurring in text and an occurrence probability (a real number from 0 to 1) representing how often that n-word sequence occurs.
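A word n-gram table of this kind can be estimated by counting word sequences in text. The following sketch is illustrative only: it assumes the probability attached to each n-gram is the conditional relative frequency of the last word given the preceding n-1 words, uses no smoothing, and the function name `ngram_probabilities` is hypothetical.

```python
from collections import Counter


def ngram_probabilities(words, n=2):
    """Map each n-gram to P(last word | preceding n-1 words), by relative frequency."""
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    # Count how often each (n-1)-word context is followed by any word.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count

    return {gram: count / contexts[gram[:-1]] for gram, count in ngrams.items()}


text = "the cat sat the cat ran".split()
bigrams = ngram_probabilities(text, n=2)
```

Here the word "cat" is followed once by "sat" and once by "ran", so each of those bigrams receives half of the probability for that context.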

The language model learning unit 9 reads from the text data storage unit 10 the text data acquired by the text collection unit 3 (the text data acquisition unit) and updates the language model stored in the language model storage unit 8 by statistically calculating the frequency of occurrence of words in that text data. The details of this language model learning process are described later with reference to a flowchart.

The text data storage unit 10 stores the text data that the text collection unit 3 has acquired from an external web server or the like via the Internet or a similar network. This text data consists of news manuscripts and similar material that has already undergone morphological analysis.

単語辞書記憶部１１は、テキスト取得部３や言語モデル学習部９による処理の際に用いられる単語辞書データを記憶するものである。 The word dictionary storage unit 11 stores word dictionary data used in processing by the text acquisition unit 3 and the language model learning unit 9.

ラティス展開・圧縮部１２は、音声認識部６によって出力される音声認識結果としてラティスのデータ（単語仮説による有向非巡回グラフ）を圧縮する処理を行なう。なお、ラティス展開・圧縮部１２は、バイグラム（ｂｉｇｒａｍ）によるラティスを一旦トライグラム（ｔｒｉｇｒａｍ）によるラティスに展開してから、圧縮する処理を行なう。このラティス展開・圧縮部１２による処理の詳細については、後でフローチャートを参照しながら詳しく説明する。 The lattice expansion / compression unit 12 performs processing for compressing lattice data (a directed acyclic graph based on word hypotheses) as a speech recognition result output by the speech recognition unit 6. Note that the lattice expansion / compression unit 12 performs a process of expanding a lattice based on a bigram into a lattice based on a trigram and then compressing the lattice. Details of the processing by the lattice expansion / compression unit 12 will be described later in detail with reference to a flowchart.

音声情報統合部１３は、少なくとも、音声認識部６から得られる音声認識結果（単語列、発話内容のテキスト）と、話題抽出部１４から得られる話題とを統合し、音声情報として音声情報記憶部１６に書き込む。また、音声情報統合部１３は、話者識別部５から出力される話者または話者属性（例えば、話者の性別など）の識別結果も、発話内容に関連付けて、音声情報の一部として音声情報記憶部１６に書き込む。さらに、音声情報統合部１３は、放送番組のテーマやジングルや、効果音などの音楽や、複数の単語から構成される人名、地名、組織名、構造物など、特定の事物を指し示す固有表現をも音声情報の一部として統合し音声情報記憶部１６に書き込む。なお、音声情報のデータ構造については後述する。 The voice information integration unit 13 integrates at least the speech recognition result (word string, i.e. the text of the utterance content) obtained from the speech recognition unit 6 and the topic obtained from the topic extraction unit 14, and writes the result to the voice information storage unit 16 as voice information. The voice information integration unit 13 also associates the identification result of the speaker or speaker attribute (for example, the speaker's gender) output from the speaker identification unit 5 with the utterance content and writes it to the voice information storage unit 16 as part of the voice information. Furthermore, the voice information integration unit 13 also integrates, as part of the voice information, music such as broadcast program themes, jingles, and sound effects, as well as named entities, that is, multi-word expressions that refer to specific things such as person names, place names, organization names, and structures, and writes them to the voice information storage unit 16. The data structure of the voice information will be described later.

話題抽出部１４は、テキストデータ取得部３が取得したテキストデータをテキストデータ記憶部１０から読み出し、このテキストデータを前記音声認識部６から出力された音声認識結果と比較することにより話題を抽出する処理を行なう。より具体的には、話題抽出部１４は、前記のテキストデータに含まれる所定数の単語組（３つ組など）が音声認識結果の所定数の単語中に含まれる数により類似度を算出し、この類似度に基づいてテキストデータと音声認識結果との間の対応付けを行なうことによって、テキストデータから話題を抽出する、なお、話題抽出部１４の処理の詳細については、後でフローチャートを参照しながら説明する。 The topic extraction unit 14 reads the text data acquired by the text data acquisition unit 3 from the text data storage unit 10 and extracts topics by comparing this text data with the speech recognition result output from the speech recognition unit 6. More specifically, the topic extraction unit 14 calculates a similarity as the number of word groups (such as word triples) from the text data that are found within a predetermined number of words of the speech recognition result, and extracts topics from the text data by associating the text data with the speech recognition result based on this similarity. Details of the processing of the topic extraction unit 14 will be described later with reference to a flowchart.

検索用転置インデックス作成部１５は、音声認識部６による音声認識結果に基づき検索用転置インデックス（検索用インデックス）のデータを作成して検索用インデックス記憶部１７に書き込む処理を行なう。なお、本実施形態では、検索用転置インデックス作成部１５は、ラティス展開・圧縮部１２により圧縮されたラティスのデータを基に検索用転置インデックスを作成する。なお、検索用転置インデックスのデータ構造については後述する。 The search inverted index creation unit 15 creates the data of a search inverted index (search index) based on the speech recognition result from the speech recognition unit 6 and writes it to the search index storage unit 17. In this embodiment, the search inverted index creation unit 15 creates the search inverted index from the lattice data compressed by the lattice expansion / compression unit 12. The data structure of the search inverted index will be described later.

音声情報記憶部１６は、音声情報を記憶する。ここで、音声情報とは、番組ＩＤ、発話開始時刻、発話内容（単語列）のテキスト、話者名、話者性別、音楽（非音声情報）、話題、固有表現を含む情報である。この音声情報は、話者識別部５や、音声認識部６や、話題抽出部１４の各部の処理によって得られた情報である。 The voice information storage unit 16 stores voice information. Here, the voice information is information that includes a program ID, utterance start time, text of the utterance content (word string), speaker name, speaker gender, music (non-speech information), topic, and named entities. This voice information is obtained through the processing of the speaker identification unit 5, the speech recognition unit 6, and the topic extraction unit 14.

検索用転置インデックス記憶部１７は、音声認識結果に基づいて作られる検索用転置インデックスを記憶するものである。この検索用転置インデックスは、単語と、前記音声における発話時刻との対応関係の情報を含んでいる。ここで、本実施形態における発話時刻とは、番組を識別するための番組ＩＤと発話開始時刻の組み合わせによって特定されるものである。 The search inverted index storage unit 17 stores the search inverted index created from the speech recognition result. This search inverted index contains information on the correspondence between words and their utterance times in the audio. Here, an utterance time in this embodiment is specified by the combination of a program ID identifying the program and an utterance start time.
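
The word-to-utterance-time correspondence described above can be sketched as a small inverted index in Python. The function names `build_inverted_index` and `lookup`, and the use of an in-memory dictionary, are assumptions for illustration only; the source builds its index from the compressed lattice and does not specify a storage format.

```python
from collections import defaultdict


def build_inverted_index(utterances):
    """Build {word: set of (program_id, start_time)}.

    `utterances` is an iterable of (program_id, start_time, words)
    tuples; this input shape is hypothetical.
    """
    index = defaultdict(set)
    for program_id, start_time, words in utterances:
        for word in words:
            # An utterance time is the pair (program ID, start time).
            index[word].add((program_id, start_time))
    return index


def lookup(index, word):
    """Return the sorted utterance times at which `word` occurs."""
    return sorted(index.get(word, ()))
```

Looking up a word returns every (program ID, utterance start time) pair at which it was recognized, which is the information a search server would need to jump to the matching utterance.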

映像音声記憶部１８は、映像・音声収録部１によって得られるデジタル動画データファイルを記憶するものである。このデジタル動画データファイルは、映像データおよび音声データを含んでいる。 The video / audio storage unit 18 stores a digital moving image data file obtained by the video / audio recording unit 1. This digital moving image data file includes video data and audio data.

検索サーバ部１９は、音声情報記憶部１６と検索用転置インデックス記憶部１７と映像音声記憶部１８からデータを読み出せるように構成されており、これらのデータを用いて検索クライアント部２０からの検索要求に応じた検索処理を行なうとともに、その応答として、検索結果のデータを検索クライアント部２０に返す。なお、検索結果のデータとは、検索の結果得られる音声情報（音声情報記憶部１６から読み出された情報）や、デジタル動画データファイル（映像音声記憶部１８から読み出された情報）である。 The search server unit 19 is configured to be able to read data from the voice information storage unit 16, the search inverted index storage unit 17, and the video / audio storage unit 18. Using these data, it performs search processing in response to a search request from the search client unit 20 and returns the search result data to the search client unit 20 as the response. The search result data is, for example, the voice information obtained by the search (read from the voice information storage unit 16) and digital moving image data files (read from the video / audio storage unit 18).

検索クライアント部２０は、利用者からの入力に基づき検索要求を検索サーバ部１９に送信するとともに、その応答として検索サーバ部１９から返される検索結果のデータを画面等に表示する。これにより、利用者は、音声情報を検索し、検索結果を閲覧することができる。 The search client unit 20 transmits a search request to the search server unit 19 based on input from the user, and displays the search result data returned from the search server unit 19 in response on a screen or the like. This allows the user to search the voice information and browse the search results.

図２は、音声情報記憶部１６が記憶する音声情報の構造を示す概略図である。図示するように、音声情報は、表形式のデータであり、番組ＩＤと、発話開始時刻と、発話内容（単語列）と、話者名と、話者性別（話者属性）と、音楽（非音声情報）と、話題と、固有表現の各項目を含む。音楽（非音声情報）は、放送番組のテーマ音楽や、ジングルや、効果音などの音楽である。固有表現は、複数の単語で構成される表現であり、人名、地名、組織名、構造物などといった特定の事物を指し示すものである。 FIG. 2 is a schematic diagram showing the structure of the voice information stored in the voice information storage unit 16. As shown in the figure, the voice information is tabular data and includes the following items: program ID, utterance start time, utterance content (word string), speaker name, speaker gender (speaker attribute), music (non-speech information), topic, and named entity. Music (non-speech information) is music such as the theme music of a broadcast program, jingles, and sound effects. A named entity is an expression composed of multiple words that refers to a specific thing, such as a person name, place name, organization name, or structure.
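
One row of the table in FIG. 2 could be modelled roughly as follows. The class name `AudioInfoRecord`, the field names, the field types, and the choice of which fields are optional are all assumptions for this sketch; the source lists only the items, not their representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AudioInfoRecord:
    """Hypothetical model of one row of the voice information table."""
    program_id: str                        # program ID
    start_time: float                      # utterance start time (seconds, assumed)
    words: List[str]                       # utterance content as a word string
    speaker_name: Optional[str] = None     # speaker name, if identified
    speaker_gender: Optional[str] = None   # speaker attribute
    music: Optional[str] = None            # non-speech information (theme, jingle, ...)
    topic: Optional[str] = None            # topic assigned by topic extraction
    named_entities: List[str] = field(default_factory=list)
```

Making the speaker, music, and topic fields optional reflects that they come from separate processing units and may not be available for every utterance.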

図３は、検索クライアント部２０に設けられている表示装置に表示される画面の構成を示す概略図である。クライアント検索部２０は、検索の結果得られる音声情報およびデジタル動画データファイルの情報（映像と音声）をこの画面により利用者に提示する。
図示するように、この画面は、大きく３つの要素で構成されている。その第１は、音声情報が付与された番組一覧を表示するためのウィンドウ（符号１１３）である。そして、第２は、前記の番組一覧から選択された番組の映像・音声を表示するためのウィンドウ（符号１１１）である。そして、その第３は、音声認識結果（発話内容）を表示するウィンドウ（符号１１２）である。 FIG. 3 is a schematic diagram illustrating a configuration of a screen displayed on a display device provided in the search client unit 20. The client search unit 20 presents audio information and digital video data file information (video and audio) obtained as a result of the search to the user on this screen.
As shown in the figure, this screen consists of three main elements. The first is a window (reference numeral 113) for displaying a list of programs to which voice information has been attached. The second is a window (reference numeral 111) for displaying the video and audio of the program selected from the program list. The third is a window (reference numeral 112) for displaying the speech recognition result (utterance content).

まず第１の番組一覧のためのウィンドウ１１３は、同図に示す画面の左側に配置されており、（ａ）番組の代表的シーンを表わすサムネイル画像の表示エリア（符号１０２）と、（ｂ）番組のタイトルの表示エリア（符号１０３）と、（ｃ）番組に含まれる話題一覧（符号１０４）の各要素からなるものを一番組に対応する組として、複数番組分の表示を行なうようになっている。これら複数番組は縦に並べられており、新しい番組ほど上に、そして古い番組ほど下に表示されるようにしている。ここで表示される番組タイトルは、元々番組情報収集部２が取得したデータに基づくものであり、デジタル動画データファイルの中にメタデータとして含まれているものである。検索クライアント部２０は、このメタデータの中から番組タイトルを読み出して表示エリア１０３に表示する。また、ここで表示される話題一覧は、元々話題抽出部１４が抽出した情報である。検索クライアント部２０は、音声情報の中から話題のデータを読み出して表示エリア１０４に一覧表示する。また、サムネイル画像は、デジタル動画データファイルから適宜抽出された静止画像である。 First, the window 113 for the program list is arranged on the left side of the screen shown in the figure. It displays, for each of multiple programs, a set consisting of (a) a display area (reference numeral 102) for a thumbnail image representing a typical scene of the program, (b) a display area (reference numeral 103) for the program title, and (c) a list of the topics contained in the program (reference numeral 104). The programs are arranged vertically, with newer programs displayed higher and older programs lower. The program title displayed here is based on data originally acquired by the program information collection unit 2 and is included as metadata in the digital moving image data file. The search client unit 20 reads the program title from this metadata and displays it in the display area 103. The topic list displayed here is information originally extracted by the topic extraction unit 14. The search client unit 20 reads the topic data from the voice information and displays it as a list in the display area 104. The thumbnail image is a still image extracted as appropriate from the digital moving image data file.

次に、第２の、番組の映像・音声を表示するためのウィンドウ１１１は、デジタル動画データファイルを再生することで得られる映像を表示するものである。利用者が前記の表示エリア１０２に表示されたサムネイル画像或いは前記の表示エリア１０３に表示された番組タイトルをクリックする操作を行なうと、検索クライアント部２０は、当該番組のデジタル動画データファイルを番組冒頭部分から再生する。また、利用者が前記の表示エリア１０４に表示された話題のいずれかをクリックする操作を行なうと、検索クライアント部２０は、当該番組のデジタル動画データファイルを、クリックされた話題に対応する箇所（当該話題の開始点）から再生する。 Next, the second window 111, which displays the video and audio of a program, shows the video obtained by playing back the digital moving image data file. When the user clicks a thumbnail image displayed in the display area 102 or a program title displayed in the display area 103, the search client unit 20 plays back the digital moving image data file of that program from the beginning of the program. When the user clicks one of the topics displayed in the display area 104, the search client unit 20 plays back the digital moving image data file of that program from the position corresponding to the clicked topic (the starting point of that topic).

なお、このウィンドウ１１１の上の部分には、各種の操作ボタン等が表示されており、利用者がこれら操作ボタン等を操作することにより、検索クライアント部２０は、番組の再生を開始したり停止したり、或いは再生箇所を変更したりする処理を行なう。
具体的には、符号１０８は、映像・音声の再生／停止ボタンである。映像・音声が停止されている状態のときにこのボタン１０８がクリックされると、検索クライアント部２０は映像・音声の再生を開始する。また、映像・音声が再生されている状態のときにこのボタン１０８がクリックされると、検索クライアント部２０は映像・音声の再生を停止させる。
また、符号１０７は再生位置を現再生位置から開始位置方向に３０秒戻すためのボタンであり、符号１０６は再生位置を現再生位置から開始位置方向に１０分戻すためのボタンであり、符号１０９は再生位置を現再生位置から終了位置方向に３０秒進めるためのボタンであり、符号１１０は再生位置を現再生位置から終了位置方向に１０分進めるためのボタンである。利用者がこれらのボタン１０６〜１１０のいずれかをクリックすると、検索クライアント部２０は、それぞれのボタンに従って映像・音声の再生位置を変更する制御を行なう。
また、符号１０５は、再生位置を開始位置から終了位置までの間の任意の位置に移動させるためのスライダーであり、利用者がこのスライダー１０５を移動させる操作を行なうと、検索クライアント部２０は、スライダー１０５の移動先の位置に応じた箇所に、映像・音声の再生位置を変更する制御を行なう。 Various operation buttons and the like are displayed in the upper part of the window 111. When the user operates these buttons, the search client unit 20 starts or stops playback of the program, or changes the playback position.
Specifically, reference numeral 108 denotes a video / audio playback / stop button. When this button 108 is clicked when the video / audio is stopped, the search client unit 20 starts playback of the video / audio. If the button 108 is clicked when the video / audio is being played back, the search client unit 20 stops the playback of the video / audio.
Reference numeral 107 denotes a button for moving the playback position back 30 seconds from the current position toward the start, reference numeral 106 denotes a button for moving it back 10 minutes, reference numeral 109 denotes a button for advancing it 30 seconds from the current position toward the end, and reference numeral 110 denotes a button for advancing it 10 minutes. When the user clicks one of these buttons 106 to 110, the search client unit 20 changes the video / audio playback position according to the clicked button.
Reference numeral 105 denotes a slider for moving the playback position to any position between the start position and the end position. When the user moves the slider 105, the search client unit 20 changes the video / audio playback position to the point corresponding to the slider's new position.
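
The seek buttons and slider described above amount to simple position arithmetic, sketched below in Python. The names `BUTTON_OFFSETS`, `seek`, and `slider_to_position` are hypothetical, and clamping the result to the range from the start to the end of the program is an assumption; the source does not say what happens when a jump would pass either end.

```python
# Offsets in seconds for the four jump buttons, keyed by reference numeral.
BUTTON_OFFSETS = {106: -600.0, 107: -30.0, 109: 30.0, 110: 600.0}


def seek(position, offset, duration):
    """Return the new playback position, clamped to [0, duration] (assumed)."""
    return max(0.0, min(duration, position + offset))


def slider_to_position(fraction, duration):
    """Map a slider fraction in [0, 1] to a playback position in seconds."""
    return max(0.0, min(1.0, fraction)) * duration
```

For example, pressing button 107 at 20 seconds into a 30-minute program would, under the clamping assumption, return playback to the start rather than to a negative position.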

次に、第３の、ウィンドウ１１２は、番組に対応する音声認識結果（発話内容）を表示するためのものである。検索クライアント部２０は、発話内容のテキストをこのウィンドウ１１２に表示するとともに、再生中の映像・音声に同期させ、現時点で再生中の位置に対応する発話内容の単語を強調表示する。強調表示の方法としては、例えば、当該単語の背景を通常背景色とは異なる色で表示（いわゆるハイライト表示）させる方法をとる。つまり、映像・音声の再生が進むにつれて、順次、ハイライト表示される単語が遷移していく。これは、音声認識部６による音声認識結果を基に、単語毎の発話時刻を記憶しておき、再生時の経過時間に沿って現在発話中の単語をハイライト表示することによって実現する。また、音声情報として話者名あるいは話者属性が得られている場合には、話者名や話者属性を併せて表示するようにしても良い。 Next, a third window 112 is for displaying a speech recognition result (utterance content) corresponding to the program. The search client unit 20 displays the text of the utterance content on the window 112 and synchronizes with the video / audio being reproduced, and highlights the word of the utterance content corresponding to the currently reproduced position. As a highlighting method, for example, a method of displaying the background of the word in a color different from the normal background color (so-called highlight display) is used. That is, as the video / audio reproduction progresses, the highlighted words are sequentially shifted. This is realized by storing the utterance time for each word based on the voice recognition result by the voice recognition unit 6 and highlighting the currently uttered word along the elapsed time at the time of reproduction. In addition, when a speaker name or speaker attribute is obtained as voice information, the speaker name or speaker attribute may be displayed together.
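
The synchronized highlighting described above can be sketched as a lookup of the current playback time against the stored per-word utterance times. The function name `current_word_index` and the use of a sorted list of word start times in seconds are assumptions for this example.

```python
from bisect import bisect_right


def current_word_index(word_start_times, playback_time):
    """Return the index of the word being spoken at `playback_time`.

    `word_start_times` must be sorted in ascending order. Returns None
    when playback has not yet reached the first word.
    """
    # Number of words whose start time is <= playback_time, minus one.
    i = bisect_right(word_start_times, playback_time) - 1
    return i if i >= 0 else None
```

A client could call this on every playback tick and move the highlight whenever the returned index changes. Binary search keeps the lookup cheap even for long transcripts.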

さらに、図３に示す画面には、検索のためのテキスト入力部１００と検索ボタン１０１が設けられている。利用者がキーボード等を操作することによりテキスト入力部１００に検索語を入力した後に検索ボタン１０１を押すと、検索クライアント部２０は、検索サーバ部１９に対して入力された検索語を含んだ検索要求を送信する。検索サーバ１９では、検索語を形態素解析して形態素解析済みの検索語を用いて索引を検索する。そして、検索サーバ１９からの応答により検索結果のデータが得られると、検索クライアント部２０は、前記のウィンドウ１１３に、番組一覧の代わりに検索結果を表示する。 Further, the screen shown in FIG. 3 is provided with a text input unit 100 for searching and a search button 101. When the user enters a search term in the text input unit 100 using a keyboard or the like and then presses the search button 101, the search client unit 20 sends a search request containing the entered search term to the search server unit 19. The search server 19 performs morphological analysis on the search term and searches the index using the morphologically analyzed search term. When search result data is obtained in the response from the search server 19, the search client unit 20 displays the search results in the window 113 in place of the program list.

図４は、検索結果の表示画面の構成を示す概略図である。前述の通り、この検索結果は、ウィンドウ１１３に表示されるものである。同図に示すように、検索結果を表示するときのウィンドウ１１３は、（ａ）検索時に用いられた検索語を含む発話に対応する代表的画像をサムネイル画像として表示するための表示エリア（符号１２０）と、（ｂ）番組のタイトルの表示エリア（符号１２１）と、（ｃ）当該番組内で上記検索語にマッチした発話の開始時刻の表示エリア（符号１２２）と、（ｄ）その発話内容の表示エリア（符号１２３）とを含む。
なお、検索クライアント部２０は、表示エリア１２２と表示エリア１２３を一組として、当該番組内で上記検索語にマッチした発話の出現数分の組の表示を行なう。
また、検索結果として複数の番組がマッチした場合には、検索クライアント部２０は、それらそれぞれの番組についての表示を行なう。
なお、同図に示す表示においても、表示される番組タイトルは、元々番組情報収集部２が取得したデータに基づくものであり、デジタル動画データファイルの中にメタデータとして含まれているものである。また、表示されるサムネイル画像は、デジタル動画データファイルから適宜抽出された静止画像である。 FIG. 4 is a schematic diagram showing the configuration of the search result display screen. As described above, this search result is displayed in the window 113. As shown in the figure, the window 113 when displaying the search result is (a) a display area (reference numeral 120) for displaying a representative image corresponding to the utterance including the search word used at the time of the search as a thumbnail image. ), (B) program title display area (reference numeral 121), (c) utterance start time display area that matches the search term in the program (reference numeral 122), and (d) the utterance content Display area (reference numeral 123).
The search client unit 20 displays a set of the display area 122 and the display area 123 as many as the number of appearances of utterances matching the search word in the program.
When a plurality of programs are matched as a search result, the search client unit 20 displays each of the programs.
In the display shown in the figure as well, the displayed program title is based on data originally acquired by the program information collection unit 2 and is included as metadata in the digital moving image data file. The displayed thumbnail image is a still image extracted as appropriate from the digital moving image data file.

次に、テキスト収集部３と言語モデル学習部９の詳細な処理手順について説明する。
図５は、テキスト収集部３および言語モデル学習部９による処理の手順を示すフローチャートである。
ステップＳ２０１において、テキスト収集部３は、所定の時間間隔でデータソースチェックを行なう。つまり、テキスト収集部３は、例えば放送局のウェブサイトのサーバなどといった外部のコンピュータにアクセスし、前回アクセス時のウェブサイトのデータと比較することによって、今回そこから新規のニュース原稿や話題のテキストデータが得られるか否かをチェックする。そして、新規のデータが得られた場合（ステップＳ２０１：ＹＥＳ）には次のステップＳ２０２に進み、得られなかった場合（ステップＳ２０１：ＮＯ）にはステップＳ２０１に戻ってさらに前記所定時間経過後にデータソースチェックの処理を繰り返す。 Next, detailed processing procedures of the text collecting unit 3 and the language model learning unit 9 will be described.
FIG. 5 is a flowchart showing a processing procedure by the text collection unit 3 and the language model learning unit 9.
In step S201, the text collection unit 3 performs a data source check at predetermined time intervals. That is, the text collection unit 3 accesses an external computer, such as the server of a broadcast station's website, and compares its content with the website data from the previous access to check whether new news manuscripts or topic text data can be obtained this time. If new data is obtained (step S201: YES), the process proceeds to the next step S202. If not (step S201: NO), the process returns to step S201 and repeats the data source check after the predetermined time has elapsed.

次に、ステップＳ２０２において、テキスト収集部３は、ステップＳ２０１で得られたテキストデータの形態素解析処理を行い、その結果をテキストデータ記憶部１０に書き込む。ここで、形態素解析処理自体は、既存の技術を利用する。このステップでの処理の結果、テキストデータ記憶部１０には、単語単位に分割されたテキストデータ（ニュース原稿等）が保存される。 Next, in step S 202, the text collection unit 3 performs a morphological analysis process on the text data obtained in step S 201 and writes the result in the text data storage unit 10. Here, the morphological analysis process itself uses an existing technique. As a result of the processing in this step, text data (news manuscript or the like) divided into words is stored in the text data storage unit 10.

ステップＳ２０４において、言語モデル学習部９は、テキストデータ記憶部１０へのデータの蓄積状況を監視し、新規のデータが所定量以上蓄積されたか否かをチェックする。そして、新規データが所定量以上蓄積されていた場合（ステップＳ２０４：ＹＥＳ）には次のステップＳ２０５に進み、そうでない場合（ステップＳ２０４：ＮＯ）にはステップＳ２０１の処理に戻る。 In step S204, the language model learning unit 9 monitors the accumulation state of data in the text data storage unit 10 and checks whether or not new data is accumulated in a predetermined amount or more. If new data is accumulated in a predetermined amount or more (step S204: YES), the process proceeds to the next step S205. If not (step S204: NO), the process returns to step S201.

次に、ステップＳ２０５において、言語モデル学習部９は、テキストデータ記憶部１０から新規データを読み出し、そのデータに基づいて言語モデルを作成する処理を行なう。このとき、言語モデル学習部９は単語辞書記憶部１１から読み出す辞書データを参照する。前述の通り、ここで作成される言語モデルはｎグラムであり、言語モデル学習部９は、テキストデータ記憶部１０から読み出した形態素解析済みのテキストデータを基に、連続するｎ個の単語列ごとの出現頻度をカウントし、統計的処理をすることによって言語モデルのデータを作成する。そして、その結果に基づき、言語モデル学習部９は、言語モデル記憶部８のデータを書き換える。 Next, in step S205, the language model learning unit 9 reads the new data from the text data storage unit 10 and creates a language model based on that data. In doing so, the language model learning unit 9 refers to dictionary data read from the word dictionary storage unit 11. As described above, the language model created here is an n-gram: the language model learning unit 9 counts the appearance frequency of each sequence of n consecutive words in the morphologically analyzed text data read from the text data storage unit 10, and creates the language model data through statistical processing. Based on the result, the language model learning unit 9 rewrites the data in the language model storage unit 8.

そして、ステップＳ２０６において、言語モデル学習部９は、音声認識部６に対して、更新された言語モデル記憶部８のデータをロードし直すように通知する。その通知に基づき、音声認識部６が言語モデルをロードしなおすことにより、音声認識部６は常に最新の言語モデルを用いて音声認識の処理を行なうことができる。 In step S206, the language model learning unit 9 notifies the voice recognition unit 6 to reload the updated data of the language model storage unit 8. Based on the notification, the speech recognition unit 6 reloads the language model, so that the speech recognition unit 6 can always perform speech recognition processing using the latest language model.
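
The flow of FIG. 5 from step S202 to step S206 can be sketched as follows. The class name `LanguageModelUpdater`, its callback parameters, and the choice to measure the "predetermined amount" of new data in words are all assumptions for this example; tokenization and training are passed in as functions because the source treats them as existing techniques.

```python
class LanguageModelUpdater:
    """Hypothetical sketch of steps S202 to S206 in FIG. 5."""

    def __init__(self, tokenize, train, on_model_updated, min_new_words):
        self.tokenize = tokenize                  # stand-in for morphological analysis
        self.train = train                        # builds a model from tokenized texts
        self.on_model_updated = on_model_updated  # reload notification to the recognizer
        self.min_new_words = min_new_words        # assumed unit for the "predetermined amount"
        self.pending = []                         # tokenized texts not yet used for training

    def handle_new_text(self, raw_text):
        """Process one newly collected text; return True if the model was rebuilt."""
        # S202: split the text into words and store the result.
        self.pending.append(self.tokenize(raw_text))
        # S204: has enough new data accumulated?
        if sum(len(words) for words in self.pending) < self.min_new_words:
            return False
        # S205: create the language model from the new data.
        model = self.train(self.pending)
        self.pending = []
        # S206: notify the recognizer so that it reloads the model.
        self.on_model_updated(model)
        return True
```

Calling `handle_new_text` once per collected manuscript reproduces the loop back to step S201 whenever too little new data has accumulated.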

図６は、話題抽出部１４による処理の手順を示すフローチャートである。以下では、話題抽出部１４による処理の詳細を説明する。
この処理においては、話題抽出部１４は、ウェブサイトから得られたニュース原稿等のテキストデータの冒頭ｍ単語と、音声認識部６から取得した発話内容における発話開始からのｍ単語とを比較し、両者間の類似度を計算することによって音声認識結果がどのテキストデータと一致するものであるかを判定する。なお、ｍは正整数である。
なお、話題抽出部１４による処理を行なうに当たり、音声認識部６は、音声認識結果に対して１から始まる一連の番号を予め付与する。また、テキスト収集部３がウェブサイトから収集したテキストのうちの最新のＫ個（Ｋは正整数）のファイルを話題抽出部１４による処理の対象とし、これらＫ個のファイルにも１から始まる一連の番号が付与されている。 FIG. 6 is a flowchart illustrating a processing procedure performed by the topic extraction unit 14. Below, the detail of the process by the topic extraction part 14 is demonstrated.
In this process, the topic extraction unit 14 compares the first m words of text data such as a news manuscript obtained from the website with the m words from the start of utterance in the utterance content acquired from the speech recognition unit 6, It is determined which text data the speech recognition result matches by calculating the similarity between the two. Note that m is a positive integer.
In performing the processing by the topic extraction unit 14, the speech recognition unit 6 assigns a series of numbers starting from 1 to the speech recognition results in advance. In addition, the latest K files (K is a positive integer) of the text collected from the website by the text collection unit 3 are the targets of processing by the topic extraction unit 14, and these K files are likewise assigned a series of numbers starting from 1.

以下、同図のフローチャートに沿って説明する。
ステップＳ３０１において、話題抽出部１４は、音声認識部６から音声認識結果（発話内容）を取得する。ここで取得する音声認識結果は、事後確率による最尤単語列である。
次に、ステップＳ３０２において、話題抽出部１４は、変数ｎを１に設定（初期化）する。
そして、ステップＳ３０３において、話題抽出部１４は、第ｎ発話の冒頭ｍ単語取り出す。
ステップＳ３０４において、話題抽出部１４は、テキストデータ記憶部１０から読み出した第ｋ番目（ｋ＝１，２，・・・，Ｋ）のテキストデータの冒頭ｍ単語と、ステップＳ３０３において取り出したｍ単語との間の類似度を計算する。第ｎ発話の冒頭ｍ単語と第ｋ番目のテキストデータの冒頭ｍ単語との間の類似度は、例えば次のように定義される。即ち、その類似度は、ｋ番目のテキストデータのｍ単語に含まれる単語３つ組（単語組）が、第ｎ発話のｍ単語に含まれる数とする。
ステップＳ３０５において、話題抽出部１４は、算出された類似度が閾値以上か否かを判定する。なお、この閾値は、予め適切に定められ設定されている。そして、類似度がこの閾値以上の場合（ステップＳ３０５：ＹＥＳ）はステップＳ３０７に進む。そして、類似度がこの閾値未満の場合（ステップＳ３０５：ＮＯ）はステップＳ３０６に進む。
ステップＳ３０６において、話題抽出部１４は、変数ｎをインクリメントする（ｎ←ｎ＋１）。ステップＳ３０６の処理を終えると、ステップＳ３０３の処理に戻る。
ステップＳ３０７においては、話題抽出部１４は、この第ｎ番目の発話を、第ｋ番目の話題の開始点とする。即ち、話題抽出部１４は、音声認識結果のデータに話題境界情報を付与する。これにより、音声認識結果を話題境界にて分割することが可能になるとともに、分割された結果に対して話題を関連付けて記憶させることができる。
以上述べたステップＳ３０１からＳ３０７までの一連の処理を、話題抽出部１４は、第１番目から第Ｋ番目までの各々のテキストデータに対して行なう。 Hereinafter, description will be made along the flowchart of FIG.
In step S 301, the topic extraction unit 14 acquires a speech recognition result (utterance content) from the speech recognition unit 6. The speech recognition result acquired here is a maximum likelihood word string based on the posterior probability.
Next, in step S302, the topic extraction unit 14 sets (initializes) the variable n to 1.
In step S303, the topic extraction unit 14 extracts the first m words of the nth utterance.
In step S304, the topic extraction unit 14 calculates the similarity between the first m words of the k-th (k = 1, 2, ..., K) text data read from the text data storage unit 10 and the m words extracted in step S303. The similarity between the first m words of the n-th utterance and the first m words of the k-th text data is defined, for example, as follows: it is the number of word triples (word groups) contained in the m words of the k-th text data that also appear in the m words of the n-th utterance.
In step S305, the topic extraction unit 14 determines whether the calculated similarity is greater than or equal to a threshold value. This threshold value is appropriately determined and set in advance. If the similarity is equal to or higher than this threshold (step S305: YES), the process proceeds to step S307. If the similarity is less than this threshold (step S305: NO), the process proceeds to step S306.
In step S306, the topic extraction unit 14 increments the variable n (n ← n + 1). When the process of step S306 is completed, the process returns to step S303.
In step S307, the topic extraction unit 14 sets the nth utterance as the starting point of the kth topic. That is, the topic extraction unit 14 adds topic boundary information to the speech recognition result data. As a result, the speech recognition result can be divided at the topic boundary, and the topic can be stored in association with the divided result.
The topic extraction unit 14 performs the series of processes from steps S301 to S307 described above on each of the text data from the first to the Kth.
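
The procedure of steps S301 to S307 can be sketched in Python as below for a single text. The function names are hypothetical, triples are counted as distinct consecutive three-word sequences (the source does not say whether repeats count), and returning `None` when no utterance reaches the threshold is an assumption, since the flowchart as described does not cover that case.

```python
def trigrams(words):
    """Return the set of consecutive word triples in `words`."""
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def similarity(text_words, utterance_words, m):
    """Count triples from the text's first m words found in the utterance's first m words."""
    return len(trigrams(text_words[:m]) & trigrams(utterance_words[:m]))


def find_topic_start(text_words, utterances, m, threshold):
    """Return the 1-based number n of the first utterance matching the text.

    Mirrors the loop over n in steps S302 to S307; returns None if no
    utterance reaches the threshold (assumed behaviour).
    """
    for n, utterance_words in enumerate(utterances, start=1):
        if similarity(text_words, utterance_words, m) >= threshold:
            return n
    return None
```

Running `find_topic_start` once for each of the K texts would yield the topic boundaries that are attached to the speech recognition result.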

図７，図８，図９は、ラティス展開・圧縮部１２による処理の手順を示す一連のフローチャートである。ラティス展開・圧縮部１２は、前掲の［非特許文献１］および［非特許文献２］に記載されている従来法を改良した方法により音声認識結果のラティスの展開および圧縮を行なう。
音声認識部６は、音声認識結果を表わすラティス構造（有向非巡回グラフ）のデータを出力する。このデータは、音声認識結果の単語をエッジとし、開始点、中間点、終了点のいずれかをノードとする有向グラフである。開始点と終了点のノードは１つずつ存在し、中間点のノードは通常は複数存在する。これらのノードは、それぞれ所定の時刻に対応している。つまり、ノードＡを始端としてノードＢを終端とするエッジが存在するとき、ノードＡの時刻が当該エッジに対応する単語の始端時刻であり、ノードＢの時刻が当該エッジに対応する単語の終端時刻である。すべてのノードは連結されており、開始点のノードからはエッジをたどって全ての中間点のノードに到達可能であり、任意の中間点のノードからはエッジをたどって終了点のノードに到達可能である。音声認識部６による出力は確率を伴う音声認識結果の仮説であり、開始点と終了点との間において並列する経路（つまり時刻的に重なりを有する複数の経路）は互いに対立する仮説に対応するものである。
なお、本実施形態では、このようなラティス構造を、ノードおよびエッジをそれぞれエンティティとするリレーショナルデータで表現し、各処理部間での受け渡しを行なう。
また、このラティスは、隣り合う２つの単語を結合するバイグラム（ｂｉｇｒａｍ）言語モデルに基づくものである。 7, 8, and 9 are a series of flowcharts showing a processing procedure by the lattice expansion / compression unit 12. The lattice expansion / compression unit 12 expands and compresses the lattice of the speech recognition result by a method improved from the conventional method described in [Non-Patent Document 1] and [Non-Patent Document 2].
The speech recognition unit 6 outputs lattice-structured data (a directed acyclic graph) representing the speech recognition result. This data is a directed graph in which the words of the speech recognition result are edges and each node is a start point, an intermediate point, or an end point. There is exactly one start node and one end node, and there are usually multiple intermediate nodes. Each node corresponds to a particular time. That is, when there is an edge that starts at node A and ends at node B, the time of node A is the start time of the word corresponding to that edge, and the time of node B is the end time of that word. All nodes are connected: every intermediate node can be reached from the start node by following edges, and the end node can be reached from any intermediate node by following edges. The output of the speech recognition unit 6 is a set of speech recognition hypotheses with probabilities, and parallel paths between the start point and the end point (that is, multiple paths that overlap in time) correspond to mutually competing hypotheses.
In the present embodiment, such a lattice structure is expressed by relational data having nodes and edges as entities, and exchanged between the processing units.
The lattice is based on a bigram language model that combines two adjacent words.

以下、このフローチャートに沿って説明する。
まず、図７のステップＳ４０１において、ラティス展開・圧縮部１２は、音声認識部６から上記のラティス構造の音声認識結果データを取得する。
次に、ステップＳ４０２において、ラティス展開・圧縮部１２は、上で取得したラティスを、連続する３つの単語を結合するトライグラム（ｔｒｉｇｒａｍ）言語モデルに基づくラティスに展開する。この展開処理自体は前述の従来技術を利用する。 Hereinafter, it demonstrates along this flowchart.
First, in step S 401 of FIG. 7, the lattice expansion / compression unit 12 acquires the above-described lattice structure speech recognition result data from the speech recognition unit 6.
Next, in step S402, the lattice expansion / compression unit 12 expands the lattice acquired above into a lattice based on a trigram language model that combines three consecutive words. This unfolding process itself uses the above-described conventional technique.

次に、ステップＳ４０３において、ラティス展開・圧縮部１２は、上で得られたラティスをフォワード・バックワード（ｆｏｒｗａｒｄ−ｂａｃｋｗａｒｄ）アルゴリズムにより走査し、事後確率を計算する。そして、事後確率が最大となる経路（最尤系列）を取得し、圧縮ラティスの基礎となるグラフpを構成する。 Next, in step S403, the lattice expansion / compression unit 12 scans the lattice obtained above using a forward-backward algorithm, and calculates a posterior probability. Then, a route (maximum likelihood sequence) having the maximum posterior probability is acquired, and a graph p serving as a basis of the compression lattice is constructed.
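
The posterior computation in step S403 can be illustrated with a forward-backward pass over a small lattice. This sketch assumes that each edge is a tuple `(from_node, to_node, word, score)` whose score is a non-negative probability-like weight, and that nodes are integers numbered in topological order (for example by time). Neither the tuple layout nor the function name `edge_posteriors` comes from the source, and extracting the best path is not shown.

```python
from collections import defaultdict


def edge_posteriors(edges, start, end):
    """Return the posterior probability of each edge, in input order."""
    # Forward pass: alpha[v] is the total score of all paths start -> v.
    # Sorting by source node ensures alpha[u] is final before it is used.
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for u, v, _, score in sorted(edges, key=lambda e: e[0]):
        alpha[v] += alpha[u] * score

    # Backward pass: beta[u] is the total score of all paths u -> end.
    # Sorting by destination (descending) ensures beta[v] is final.
    beta = defaultdict(float)
    beta[end] = 1.0
    for u, v, _, score in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[u] += score * beta[v]

    total = alpha[end]  # total score of all complete paths
    # Posterior of an edge: share of the total carried by paths through it.
    return [alpha[u] * score * beta[v] / total for u, v, _, score in edges]
```

In a lattice with two competing paths scored 0.1 and 0.4, the edges on the second path each receive a posterior of 0.8, which is the quantity the later clustering and aggregation steps sort and threshold on.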

次に、ステップＳ４０４からＳ４０９までにおいて、ラティス展開・圧縮部１２は、エッジのクラスタリング処理を行なう。このクラスタリング処理の詳細は次の通りである。
即ち、ラティス展開・圧縮部１２は、ステップＳ４０４において、エッジ集合Ｅ｛ｅ_１，ｅ_２，ｅ_３，ｅ_４，・・・｝から、この集合要素を事後確率の降順に並べ替えたリスト｛ｅ’_１，ｅ’_２，ｅ’_３，ｅ’_４，・・・，ｅ’_ｍ，・・・｝を生成する。
そして、ラティス展開・圧縮部１２は、ステップＳ４０５において、クラスタリングのための変数ｎを１に初期化する。
次のステップＳ４０６からＳ４０９までは、上記リストの要素を順次走査する処理である。
ラティス展開・圧縮部１２は、上記リストのｎ番目のエッジｅ’_ｎを取り出したとき、発話時刻の重なりが予め定められた所定の閾値よりも大きく、且つエッジ上の単語表記が同一となる巡回済み（走査済み）のｍ番目のエッジｅ’_ｍ（ｎ＞ｍ）があれば（ステップＳ４０６：ＹＥＳ）、ステップＳ４０７において、エッジｅ’_ｎをエッジ集合Ｅから取り除くとともに、エッジｅ’_ｍの事後確率にエッジｅ’_ｎの事後確率を加える。なお、ステップＳ４０６における判定結果が否定的である場合には、ステップＳ４０７をスキップして次のステップＳ４０８に進む。
そして、ラティス展開・圧縮部１２は、次のステップＳ４０８において、クラスタリングのための変数ｎをインクリメントする（ｎ←ｎ＋１）。
そして、ステップＳ４０９において、ラティス展開・圧縮部１２は、エッジクラスタリングが全て終了したか否かを判定する。そして、全て終了していない場合（ステップＳ４０９：ＮＯ）には、残りのエッジ集合について同様の処理を行なうためにステップＳ４０６に戻る。全て終了していた場合（ステップＳ４０９：ＹＥＳ）には、次のステップＳ４１０に進む。 Next, in steps S404 to S409, the lattice expansion / compression unit 12 performs edge clustering processing. The details of this clustering process are as follows.
That is, in step S404, the lattice expansion / compression unit 12 generates, from the edge set E {e_1, e_2, e_3, e_4, ...}, a list {e'_1, e'_2, e'_3, e'_4, ..., e'_m, ...} in which the elements of the set are sorted in descending order of posterior probability.
In step S405, the lattice expansion / compression unit 12 initializes a variable n for clustering to 1.
The next steps S406 to S409 are processes for sequentially scanning the elements of the list.
When the lattice expansion / compression unit 12 takes the n-th edge e'_n from the list, if there is an already visited (scanned) m-th edge e'_m (n > m) whose utterance-time overlap with e'_n is larger than a predetermined threshold and whose word notation is identical (step S406: YES), then in step S407 it removes the edge e'_n from the edge set E and adds the posterior probability of e'_n to the posterior probability of e'_m. If the determination result in step S406 is negative, step S407 is skipped and the process proceeds to the next step S408.
The lattice expansion / compression unit 12 then increments the variable n for clustering (n ← n + 1) in the next step S408.
In step S409, the lattice expansion / compression unit 12 determines whether or not all edge clustering has been completed. If all the processing has not been completed (step S409: NO), the process returns to step S406 in order to perform the same processing for the remaining edge set. If all have been completed (step S409: YES), the process proceeds to the next step S410.
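The clustering loop of steps S404 to S409 can be sketched as follows. This is an illustrative sketch, not the patented implementation: the `Edge` fields, the definition of overlap as the length in seconds of the shared time interval, and the choice to compare only against edges that are still kept are all assumptions.

```python
from dataclasses import dataclass, replace

@dataclass
class Edge:
    word: str         # word notation on the edge
    start: float      # utterance start time (seconds, assumed unit)
    end: float        # utterance end time
    posterior: float  # posterior probability from step S403

def overlap(a, b):
    """Length of the time interval shared by two edges (0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def cluster_edges(edges, min_overlap):
    # S404: sort by descending posterior (copies keep the input untouched).
    ranked = sorted((replace(e) for e in edges),
                    key=lambda e: e.posterior, reverse=True)
    kept = []
    for e in ranked:  # S405, S408, S409: scan the list in order
        # S406: look for a higher-ranked edge with the same word and
        # enough time overlap.
        rep = next((k for k in kept
                    if k.word == e.word and overlap(k, e) > min_overlap), None)
        if rep is None:
            kept.append(e)
        else:
            rep.posterior += e.posterior  # S407: drop e, give its mass to rep
    return kept

clustered = cluster_edges(
    [Edge("tokyo", 0.00, 0.50, 0.5),
     Edge("tokyo", 0.05, 0.50, 0.2),
     Edge("kyoto", 0.00, 0.50, 0.3)],
    min_overlap=0.1,
)
```

Here the two "tokyo" hypotheses collapse into one edge with posterior 0.7, while "kyoto" survives as a separate edge.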

Next, in steps S410 to S417 of FIG. 8, the lattice expansion/compression unit 12 aggregates the edges.
First, in step S410, it generates an edge list in which the elements of the edge set E are sorted in descending order of posterior probability.
In step S411, the lattice expansion/compression unit 12 initializes the aggregation variable n to 1.

In step S412, the lattice expansion/compression unit 12 determines whether the posterior probability of the n-th edge e'_n in the edge list is equal to or greater than a predetermined threshold. If the posterior probability of edge e'_n is below the threshold (step S412: NO), the process proceeds to step S413; if it is equal to or greater than the threshold (step S412: YES), the process proceeds to step S414.
In step S413, the lattice expansion/compression unit 12 removes e'_n from the edge set E, and the process proceeds to step S416.
In step S414, the lattice expansion/compression unit 12 searches for an edge e'_m (where n > m) whose utterance-time overlap with edge e'_n is equal to or greater than a predetermined threshold.
If such an e'_m exists (step S414: YES), then in step S415 the lattice expansion/compression unit 12 changes the start and end nodes of edge e'_m to the start and end nodes of edge e'_n.
If the determination in step S414 is negative, step S415 is skipped and the process proceeds to step S416.
In step S416, the variable n is incremented (n ← n + 1).
In step S417, the lattice expansion/compression unit 12 determines whether the aggregation process has been completed. If the processing of step S415 above has been finished for all edges in the edge set, taken in ascending order (step S417: YES), the process proceeds to step S418; if edges remain (step S417: NO), the process returns to step S412 to handle the next edge.
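The aggregation loop of steps S410 to S417 can be sketched as below. The sketch follows the wording of step S415 literally (the higher-ranked edge e'_m takes over the node pair of e'_n), searches only among edges that have not been pruned, and stops at the first match; these choices, the `Edge` fields, and the overlap definition are assumptions made for illustration.

```python
from dataclasses import dataclass, replace

@dataclass
class Edge:
    word: str
    start_node: int
    end_node: int
    start: float      # utterance start time (seconds, assumed unit)
    end: float        # utterance end time
    posterior: float

def overlap(a, b):
    """Length of the time interval shared by two edges (0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def aggregate_edges(edges, min_posterior, min_overlap):
    # S410: edge list in descending order of posterior (copies of the input).
    ranked = sorted((replace(e) for e in edges),
                    key=lambda e: e.posterior, reverse=True)
    kept = []
    for e in ranked:                       # S411, S416, S417: scan in order
        if e.posterior < min_posterior:    # S412: below threshold?
            continue                       # S413: remove from the edge set
        for m in kept:                     # S414: higher-ranked overlapping edge
            if overlap(m, e) >= min_overlap:
                # S415: e'_m adopts the start/end nodes of e'_n, so both
                # edges now connect the same pair of nodes.
                m.start_node, m.end_node = e.start_node, e.end_node
                break
        kept.append(e)
    return kept

aggregated = aggregate_edges(
    [Edge("x", 0, 1, 0.0, 1.0, 0.60),
     Edge("y", 2, 3, 0.1, 0.9, 0.30),
     Edge("z", 4, 5, 0.0, 1.0, 0.01)],
    min_posterior=0.05,
    min_overlap=0.5,
)
```

After the call, the low-probability edge "z" has been pruned and the overlapping edges "x" and "y" share one node pair, which is what lets them later sit side by side as alternatives in the compressed lattice.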

Then, in steps S418 to S429 of FIG. 9, the lattice expansion/compression unit 12 obtains the compressed lattice by merging the edge set produced by the clustering and aggregation above into the sequence p.
First, in step S418, the lattice expansion/compression unit 12 obtains a list in which the node set of the lattice is sorted in topological order.
In step S419, the merge variable k is initialized to 1.
In step S420, the lattice expansion/compression unit 12 generates, from the edge set E, a list of the edges that start at node v_k.
In step S421, the variable l (lowercase L) is initialized to 1.
In step S422, for the l-th edge e_l of the edge list, the lattice expansion/compression unit 12 searches for the edge f_h of the compressed lattice whose utterance-time overlap with e_l is largest.
In step S423, it determines whether this f_h has already been visited. If it has (step S423: YES), the process proceeds to step S424; if it has not (step S423: NO), the process proceeds to step S425.
In step S424, the lattice expansion/compression unit 12 splits the end node of f_h into two, creates a new node f on the compressed lattice, and copies the word notation and posterior probability of edge e_l. The process then proceeds to step S426.
In step S425, it generates a new edge f connecting the start and end of f_h and copies the word notation and posterior probability of edge e_l. At this point, f_h is marked as visited. The process then proceeds to step S426.
In step S426, the variable l is incremented (l ← l + 1).
In step S427, it is determined whether the end of the edge list has been reached. If so (step S427: YES), the process proceeds to step S428; if not (step S427: NO), the process returns to step S422.
In step S428, a further end determination is made. If processing is finished (step S428: YES), the entire flowchart ends; if not (step S428: NO), the process proceeds to step S429.
In step S429, the variable k is incremented (k ← k + 1), and the process returns to step S420.
In other words, the lattice expansion/compression unit 12 performs the operations of steps S422 to S425 for every edge in the edge set E and thereby obtains the compressed lattice.
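The outer loops of this merge (steps S418 to S422) can be sketched as follows. The sketch only covers visiting nodes in topological order and finding, for each outgoing edge, the backbone edge with the largest time overlap; the restructuring of steps S423 to S425 is deliberately left out. The tuple layouts and the function name `align_to_backbone` are hypothetical, and `graphlib` from the Python standard library (3.9+) is used here in place of whatever ordering routine the actual system uses.

```python
from graphlib import TopologicalSorter

def align_to_backbone(edges, backbone):
    """edges: (src, dst, word, t_start, t_end); backbone: list of
    (t_start, t_end) spans for the edges f_1..f_H of the sequence p.
    Returns (word, h) pairs, h being the index of the best-overlapping f_h."""
    # S418: topological order of the lattice nodes.
    preds = {}
    for src, dst, *_ in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    order = list(TopologicalSorter(preds).static_order())

    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    result = []
    for node in order:                                   # S419, S428, S429
        outgoing = [e for e in edges if e[0] == node]    # S420
        for _, _, word, t0, t1 in outgoing:              # S421, S426, S427
            # S422: backbone edge with maximum utterance-time overlap.
            h = max(range(len(backbone)),
                    key=lambda i: overlap((t0, t1), backbone[i]))
            result.append((word, h))
    return result

alignment = align_to_backbone(
    [(0, 1, "a", 0.0, 0.5), (0, 2, "ab", 0.0, 1.2), (1, 2, "b", 0.5, 1.2)],
    backbone=[(0.0, 0.5), (0.5, 1.2)],
)
```

Each pair tells which slot of the backbone an edge would be merged into; the long edge "ab" lands in the second slot because it overlaps that slot for 0.7 s versus 0.5 s for the first.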

In short, the key points of the processing by the lattice expansion/compression unit 12 described above are the following (1) to (3).
(1) Among the edges on the lattice whose utterance start and end times overlap, edges with the same notation are clustered (that is, the start and end of the edges are represented by those of the edge with the larger posterior probability, and the sum of the posterior probabilities is given to that representative edge).
(2) Overlapping edges on the lattice are clustered (that is, they are made to share the same start node and end node).
(3) The nodes of the lattice are visited in topological order and the links are merged.
As a result, the lattice data of the speech recognition result can be compressed faster and at a higher compression ratio than with prior-art methods.

Through the procedure described above, the lattice expansion/compression unit 12 compresses the expanded lattice and creates a compressed lattice (confusion network).
In the compressed lattice obtained in this way, if the sum of the posterior probabilities of the edges connecting a pair of adjacent nodes exceeds 1, the posterior probability of each of those edges is divided by that sum. Conversely, if the sum of the posterior probabilities of the edges connecting a pair of adjacent nodes is less than 1, a new edge with an empty word notation is generated between those nodes, and its posterior probability is set so that the posterior probabilities of the edges sum to 1. The posterior probability of the newly generated edge is therefore 1 − (the sum of the posterior probabilities of the other edges).
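The normalization rule for the edges between one pair of adjacent nodes can be written as a short sketch. Representing such a group of edges as a dictionary from word notation to posterior probability, and using the empty string for the empty word notation, are illustrative assumptions; `normalize_slot` is a hypothetical name.

```python
EMPTY = ""  # stands for the empty word notation (assumed representation)

def normalize_slot(slot):
    """slot maps word notation -> posterior for all edges between two
    adjacent nodes. Returns a new mapping whose posteriors sum to 1."""
    total = sum(slot.values())
    if total > 1.0:
        # Sum exceeds 1: divide each posterior by the sum.
        return {word: p / total for word, p in slot.items()}
    if total < 1.0:
        # Sum falls short of 1: add an empty-word edge with the remainder.
        padded = dict(slot)
        padded[EMPTY] = 1.0 - total
        return padded
    return dict(slot)

over = normalize_slot({"tokyo": 0.9, "kyoto": 0.6})
under = normalize_slot({"tokyo": 0.7})
```

In the first call the two posteriors are rescaled to 0.6 and 0.4; in the second an empty-word edge with posterior 0.3 is added.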

The volume of lattice data in a speech recognition result is enormous, but because the lattice expansion/compression unit 12 compresses the lattice as described above, the data can be reduced to a manageable size and processing can be sped up.

FIG. 10 is a flowchart showing the processing procedure of the search inverted index creation unit 15. The method of creating the search index is described below with reference to this flowchart.
First, in step S501, the search inverted index creation unit 15 acquires, from the lattice expansion/compression unit 12, the compactly compressed lattice data for one utterance. The following steps process each edge contained in this lattice.
Next, in step S502, the search inverted index creation unit 15 determines whether a word notation is assigned to the current edge. If one is assigned (step S502: YES), the process proceeds to step S503; if not (step S502: NO), the process jumps to step S504.
In step S503, the search inverted index creation unit 15 updates the search inverted index storage unit 17 by adding one record based on the word notation assigned to the current edge.

FIG. 11 is a schematic diagram showing the data structure of the inverted index stored in the search inverted index storage unit 17. As shown in the figure, the inverted index is tabular data with three fields: word notation ID, program ID, and utterance start time. The word notation ID uniquely identifies a word notation; if the word assigned to an edge does not yet have an ID, a new ID is assigned as an unsigned 32-bit integer. The program ID uniquely identifies the broadcast program that is the target of speech recognition. The utterance start time is time information that indicates, with one utterance as the unit, where that utterance begins within the program. This time information may be expressed as a time relative to the start of the program or as an actual date and time (for example, Japan Standard Time). Because the inverted index has this structure, the number of occurrences of a word notation in each program can easily be retrieved from the search inverted index storage unit 17. In other words, the search server unit 19 can make use of the information read from the search inverted index storage unit 17 when performing the search processing described earlier.

Returning to FIG. 10, in step S504 the search inverted index creation unit 15 determines whether all edges in the given one-utterance lattice have been processed. If all edges have been processed (step S504: YES), the entire flowchart ends; if edges remain (step S504: NO), the process returns to step S502 to handle the next edge.
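The index-building loop of FIG. 10 and the record layout of FIG. 11 can be sketched together as follows. The class name `InvertedIndex`, the in-memory list of records, and the sequential assignment of word notation IDs are illustrative assumptions; the specification only states that IDs are unsigned 32-bit integers and that each record holds a word notation ID, a program ID, and an utterance start time.

```python
from collections import Counter

class InvertedIndex:
    """In-memory stand-in for the search inverted index storage unit 17."""

    def __init__(self):
        self.word_ids = {}   # word notation -> word notation ID
        self.records = []    # (word_id, program_id, utterance_start_time)

    def word_id(self, word):
        # Assign a new ID when the word has none yet. Sequential IDs are an
        # assumption; they must stay within the unsigned 32-bit range.
        if word not in self.word_ids:
            new_id = len(self.word_ids)
            assert new_id < 2 ** 32
            self.word_ids[word] = new_id
        return self.word_ids[word]

    def add_utterance(self, program_id, utterance_start, edge_words):
        """edge_words: word notation of each edge in one compressed lattice
        (S501); an empty string marks an edge without a word notation."""
        for word in edge_words:      # loop closed by S504
            if not word:             # S502: no word notation assigned
                continue
            # S503: add one record for this edge.
            self.records.append((self.word_id(word), program_id, utterance_start))

    def counts_by_program(self, word):
        """Number of occurrences of a word notation per program."""
        wid = self.word_ids.get(word)
        return Counter(p for w, p, _ in self.records if w == wid)

index = InvertedIndex()
index.add_utterance("news-0900", 12.5, ["tokyo", "", "weather"])
index.add_utterance("news-0900", 48.0, ["tokyo"])
index.add_utterance("sports-2100", 3.0, ["tokyo", "marathon"])
tokyo_counts = index.counts_by_program("tokyo")
```

Because every record carries the program ID, the per-program occurrence counts that the search server unit 19 needs fall out of a single pass over the records for one word notation ID.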

Note that the functions of the audio information extraction apparatus described above are realized by electronic circuits.
In particular, it is preferable to realize the functions of the apparatus with one or more stored-program computers. In that case, the functions may be realized by recording a program for implementing them on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. The term "computer system" here includes an OS and hardware such as peripheral devices. The term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. Furthermore, the "computer-readable recording medium" may also include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system that serves as the server or client in that case. The program may implement only part of the functions described above, or may implement them in combination with a program already recorded in the computer system.

<Example>
To verify the operation of the embodiment described above, a system was actually built. Its outline is as follows.
The video/audio recording unit 1 was configured to acquire video and audio signals from a television tuner, and it actually received and captured the broadcast of a set channel at a preset date and time. The date, time, and channel are set automatically based on the broadcast program guide on the NHK (Japan Broadcasting Corporation) website or on the EPG superimposed on the broadcast wave. They can also be set on screen through the user interface. In addition, the user can at any time start or stop the acquisition of video and audio and change the channel setting by button operation.
The text collection unit 3 was configured to obtain text information about broadcast programs from the NHK website via the Internet.
Each function of the audio information extraction apparatus 50 was realized by writing computer programs and running them on multiple computers linked over a LAN.
In the search client unit 20, search results were displayed by the method described above, and the video and audio selected by the user from those results were played back.
Furthermore, because the acoustic features of announcers and others appearing in the programs were stored in the speaker data storage unit 4 in advance, speaker identification was performed with high accuracy, and the speaker name could be displayed together with the text of the speech recognition result.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like that do not depart from the gist of the invention.

By using the present invention, broadcast programs and video/audio resources can be indexed so that they can be easily searched and browsed.
In addition, by using the present invention, the extracted audio information can be provided to a metadata production system and used for broadcasting services.
Furthermore, by using the present invention, speech data and related text data can be collected efficiently in order to build the statistical acoustic models and statistical language models used in speech recognition apparatuses.

DESCRIPTION OF SYMBOLS
1 Video/audio recording unit
2 Program information collection unit
3 Text collection unit
4 Speaker data storage unit
5 Speaker identification unit
6 Speech recognition unit
7 Acoustic model storage unit
8 Language model storage unit
9 Language model learning unit
10 Text data storage unit
11 Word dictionary storage unit
12 Lattice expansion/compression unit (lattice compression unit)
13 Audio information integration unit
14 Topic extraction unit
15 Search inverted index creation unit (search index creation unit)
16 Audio information storage unit
17 Search inverted index storage unit (search index storage unit)
18 Video/audio storage unit
19 Search server unit
20 Search client unit
50 Audio information extraction apparatus

Claims

An audio information extraction apparatus comprising:
a video/audio storage unit that stores video and audio;
a search index storage unit that stores a search index including correspondences between words and utterance times of the audio;
an audio information storage unit that stores audio information in which an utterance time, utterance content that is a word string, a topic, and at least one of a speaker name and a speaker attribute are associated with one another;
an acoustic model storage unit that stores an acoustic model that statistically represents the acoustic features of speech;
a language model storage unit that stores a language model that statistically represents the appearance frequency of words;
a speaker data storage unit that stores in advance speaker data that statistically represents acoustic features for each speaker or speaker attribute;
a video/audio recording unit that acquires video and audio from outside and writes the video and the audio to the video/audio storage unit;
a speech recognition unit that performs speech recognition processing on the audio acquired by the video/audio recording unit, using the acoustic model read from the acoustic model storage unit and the language model read from the language model storage unit, and outputs a speech recognition result;
a speaker identification unit that, using the speaker data read from the speaker data storage unit, calculates and outputs a speaker name or speaker attribute corresponding to the audio acquired by the video/audio recording unit;
a text data acquisition unit that acquires text data related to the video and the audio acquired by the video/audio recording unit;
a topic extraction unit that extracts a topic by comparing the text data acquired by the text data acquisition unit with the speech recognition result output by the speech recognition unit;
an audio information integration unit that writes, to the audio information storage unit, audio information obtained by integrating the speech recognition result, the topic, and at least one of the speaker name and the speaker attribute;
a search index creation unit that creates data for the search index based on the speech recognition result and writes the data to the search index storage unit; and
a search server unit that, in response to a search request based on a search word, searches the search index storage unit and the audio information storage unit, reads from the audio information storage unit the audio information associated with the video and the audio corresponding to the search word, presents it to the search requester as a search result, and makes the video and the audio stored in the video/audio storage unit available for playback.

The audio information extraction apparatus according to claim 1, wherein the topic extraction unit calculates a similarity based on the number of word sets included in the text data that are contained within a predetermined number of words of the speech recognition result and, based on the similarity, associates the text data with the speech recognition result, thereby extracting the topic from the text data.

The audio information extraction apparatus according to claim 1 or 2, further comprising a language model learning unit that updates the language model stored in the language model storage unit by calculating the appearance frequency of words in the text data acquired by the text data acquisition unit.

The audio information extraction apparatus according to any one of claims 1 to 3, further comprising a lattice compression unit that compresses lattice data representing a directed acyclic graph of the word hypotheses output as the speech recognition result by the speech recognition unit, wherein the search index creation unit creates the search index based on the lattice data compressed by the lattice compression unit.

The audio information extraction apparatus according to claim 1, further comprising a search client unit that transmits the search request to the search server unit using a search word based on input from a user, displays the search result from the search server unit on a screen, and plays back the corresponding video and audio based on an operation by the user.