JP5912729B2

JP5912729B2 - Speech recognition apparatus, speech recognition program, and speech recognition method

Info

Publication number: JP5912729B2
Application number: JP2012067192A
Authority: JP
Inventors: 航笠井
Original assignee: Dwango Co Ltd
Current assignee: Dwango Co Ltd
Priority date: 2012-03-23
Filing date: 2012-03-23
Publication date: 2016-04-27
Anticipated expiration: 2032-03-23
Also published as: JP2013200362A

Description

本発明は、マルチメディア情報に含まれる音声を認識する音声認識装置、音声認識プログラム、及び音声認識方法に関する。 The present invention relates to a speech recognition apparatus, a speech recognition program, and a speech recognition method for recognizing speech included in multimedia information.

従来から、生放送による動画や音声の配信や、あらかじめ録画、録音された動画や音声のストリーミング等によるオンデマンド配信等により、各種のマルチメディア情報が広く提供されるようになりつつある。 Various types of multimedia information are now widely provided through live distribution of video and audio and through on-demand distribution, such as streaming, of pre-recorded video and audio.

ここで、マルチメディア情報を聴取するユーザが、聴取をしながら当該マルチメディア情報に対するコメントを入力すると、当該マルチメディア情報を聴取する他のユーザにそのコメントが提示されるコメント配信システムが提案されている（特許文献１参照）。 Here, a comment distribution system is proposed in which when a user who listens to multimedia information inputs a comment on the multimedia information while listening, the comment is presented to other users who listen to the multimedia information. (See Patent Document 1).

一方、あらかじめ用意された候補語とその出現確率とを用いて、単語単位で音声認識を行う技術が提案されている（非特許文献１参照）。さらに、音声と、ディクテーションによって当該音声から書き起こされたテキストと、の時間的な対応関係を解析して、音声認識の精度を上げる技術が提案されている（特許文献２参照）。 On the other hand, a technique has been proposed in which speech recognition is performed in units of words using candidate words prepared in advance and their appearance probabilities (see Non-Patent Document 1). Furthermore, a technique for improving the accuracy of speech recognition by analyzing temporal correspondence between speech and text transcribed from the speech by dictation has been proposed (see Patent Document 2).

特許第４２６３２１８号公報 Japanese Patent No. 4263218
特許第４７５８９１９号公報 Japanese Patent No. 4758919

Akinobu Lee and Tatsuya Kawahara, Recent Development of Open-Source Speech Recognition Engine Julius, Proceedings: APSIPA ASC 2009: Asia-Pacific Signal and Information Processing Association, 2009 Annual Summit and Conference, pp. 131-137, published October 4, 2009, http://hdl.handle.net/2115/39653

多数のマルチメディア情報が提供される現状では、マルチメディア情報に含まれる動画に対する字幕の付与や、マルチメディア情報の要約のテキストによる提供や、マルチメディア情報のテキストによる検索などの要望が高まりつつある。したがって、マルチメディア情報に含まれる音声のテキスト化をより一層適切に行えるようにしたい、との要望は強い。 Now that a large amount of multimedia information is provided, demand is growing for adding subtitles to videos included in multimedia information, for providing text summaries of multimedia information, and for text-based search of multimedia information. There is therefore a strong demand for converting the speech included in multimedia information into text more appropriately.

一方で、音声中に出現する単語は、話題や、時代の流行や、発言者ならびに聴取者の嗜好等によって変化するため、このような変化に即応できるようなディクテーション技術が求められている。 On the other hand, since words appearing in speech change according to topics, trends in the times, preferences of speakers and listeners, etc., a dictation technique that can immediately respond to such changes is required.

本発明は、このような課題を解決しようとするものであり、マルチメディア情報に付されたコメントを利用して、マルチメディア情報に含まれる音声を適切に認識する音声認識装置、音声認識プログラム、及び音声認識方法を提供することを目的とする。 The present invention is intended to solve these problems, and its object is to provide a speech recognition apparatus, a speech recognition program, and a speech recognition method that use comments attached to multimedia information to appropriately recognize speech included in the multimedia information.

上記目的を達成するため、本発明の第１の観点に係る音声認識装置は、
ユーザがマルチメディア情報の再生により発せられる音声を聴取しながら入力したコメントを蓄積する蓄積部、
前記蓄積されたコメントを含む文集合に出現する単語及び当該文集合における当該単語の共起語を含む候補語を抽出する抽出部、
前記抽出された候補語に基づいて、前記マルチメディア情報の再生により発せられる音声を音声認識する音声認識部、を備える、
ことを特徴とする。 In order to achieve the above object, a speech recognition apparatus according to the first aspect of the present invention comprises:
an accumulation unit that accumulates comments input by users while listening to the audio produced by playback of multimedia information;
an extraction unit that extracts candidate words, including words that appear in a sentence set containing the accumulated comments and words that co-occur with those words in the sentence set; and
a speech recognition unit that, based on the extracted candidate words, recognizes the speech produced by playback of the multimedia information.
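As a rough illustration of the extraction described in the first aspect, the following Python sketch collects the words of a sentence set together with the words that co-occur with them. It assumes the sentences have already been split into words (for Japanese this would need a morphological analyzer, which is not shown) and that two words co-occur when they appear in the same sentence. The function name `extract_candidates` and this definition of co-occurrence are assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter, defaultdict


def extract_candidates(sentences):
    """Return (word_counts, cooccurrence) for a tokenized sentence set.

    sentences: list of word lists, e.g. accumulated comments after tokenization.
    Assumption: two words co-occur when they appear in the same sentence.
    """
    word_counts = Counter()
    cooccurrence = defaultdict(Counter)
    for words in sentences:
        word_counts.update(words)
        unique = set(words)
        for word in unique:
            for other in unique - {word}:
                cooccurrence[word][other] += 1
    return word_counts, cooccurrence


# Hypothetical tokenized comments.
sentences = [["都政", "混乱"], ["混乱", "し過ぎ"]]
word_counts, cooccurrence = extract_candidates(sentences)
candidates = set(word_counts)  # in this sketch every word seen is a candidate
```

In this sketch the co-occurrence counts are kept per word so that a later stage could add, for each word found in a comment, the words that tend to appear alongside it.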

また、第１の観点に係る音声認識装置において、
前記文集合は、前記マルチメディア情報を聴取したユーザが閲覧した文書に出現する文を含む
としても良い。 In the speech recognition apparatus according to the first aspect,
The sentence set may include a sentence that appears in a document viewed by a user who has listened to the multimedia information.

また、第１の観点に係る音声認識装置において、
前記抽出部は、前記候補語のそれぞれの出現尤度を算定し、
前記音声認識部は、前記音声から認識された音素と前記候補語を表す音素との一致度及び当該候補語の出現尤度に基づいて、音声認識する、
としても良い。 In the speech recognition apparatus according to the first aspect,
the extraction unit may calculate an appearance likelihood for each of the candidate words, and
the speech recognition unit may recognize speech based on the degree of agreement between the phonemes recognized from the speech and the phonemes representing a candidate word, and on the appearance likelihood of that candidate word.
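One way to picture how a degree of phoneme agreement and an appearance likelihood could be combined is sketched below in Python. The text above does not say how the two quantities are combined, so the choices here are assumptions for illustration only: agreement is approximated with `difflib.SequenceMatcher` from the Python standard library, the combined score is the product of agreement and likelihood, and the names `phoneme_agreement` and `best_candidate` are hypothetical.

```python
from difflib import SequenceMatcher


def phoneme_agreement(recognized, candidate):
    """Similarity in [0, 1] between two phoneme sequences (assumed measure)."""
    return SequenceMatcher(None, recognized, candidate).ratio()


def best_candidate(recognized, candidates):
    """Pick the candidate word with the highest combined score.

    candidates: dict mapping word -> (phoneme list, appearance likelihood).
    Assumption: score = phoneme agreement * appearance likelihood.
    """
    def score(word):
        phonemes, likelihood = candidates[word]
        return phoneme_agreement(recognized, phonemes) * likelihood

    return max(candidates, key=score)


# Hypothetical candidate table with made-up likelihoods.
candidates = {
    "都政": (["t", "o", "s", "e", "i"], 0.3),
    "統制": (["t", "o", "o", "s", "e", "i"], 0.1),
}
chosen = best_candidate(["t", "o", "s", "e", "i"], candidates)
```

A product is only one possible combination; a weighted sum of log scores would serve the same illustrative purpose.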

また、第１の観点に係る音声認識装置において、
前記候補語のうち、前記コメントに出現する単語には、当該コメントが入力された入力時点が対応付けられ、
前記音声認識部は、前記入力時点が対応付けられている候補語に対しては、当該候補語に対応付けられた入力時点と、前記音素が発せられた発音時点との合致度を求め、当該求められた合致度にさらに基づいて、音声認識する、
としても良い。 In the speech recognition apparatus according to the first aspect,
among the candidate words, a word that appears in a comment may be associated with the input time point at which that comment was input, and
for a candidate word associated with an input time point, the speech recognition unit may obtain a degree of match between that input time point and the pronunciation time point at which the phonemes were uttered, and may recognize speech further based on the obtained degree of match.

また、第１の観点に係る音声認識装置において、
前記入力時点と、前記発音時点と、は、前記マルチメディア情報の再生が開始されてからの再生時間により表現される、
としても良い。 In the speech recognition apparatus according to the first aspect,
the input time point and the pronunciation time point may each be expressed as a playback time measured from the start of playback of the multimedia information.

また、第１の観点に係る音声認識装置において、
前記合致度は、前記入力時点と前記発音時点との差及び前記マルチメディア情報の再生が可能となった時点と当該ユーザがマルチメディア情報の再生を開始した時点との差に基づいて定められる、
としても良い。 In the speech recognition apparatus according to the first aspect,
the degree of match may be determined based on the difference between the input time point and the pronunciation time point, and on the difference between the time point at which the multimedia information became available for playback and the time point at which the user started playing the multimedia information.

また、本発明の第２の観点に係る音声認識プログラムは、
コンピュータを、
ユーザがマルチメディア情報の再生により発せられる音声を聴取しながら入力したコメントを蓄積する蓄積部、
前記蓄積されたコメントを含む文集合に出現する単語及び当該文集合における当該単語の共起語を含む候補語を抽出する抽出部、
前記抽出された候補語に基づいて、前記マルチメディア情報の再生により発せられる音声を音声認識する音声認識部、として機能させる、
ことを特徴とする。 A speech recognition program according to the second aspect of the present invention causes a computer to function as:
an accumulation unit that accumulates comments input by users while listening to the audio produced by playback of multimedia information;
an extraction unit that extracts candidate words, including words that appear in a sentence set containing the accumulated comments and words that co-occur with those words in the sentence set; and
a speech recognition unit that, based on the extracted candidate words, recognizes the speech produced by playback of the multimedia information.

さらに、本発明の第３の観点に係る音声認識方法は、
蓄積部、抽出部、及び音声認識部を備える音声認識装置が実行する方法であって、
前記蓄積部が、ユーザがマルチメディア情報の再生により発せられる音声を聴取しながら入力したコメントを蓄積する蓄積ステップ、
前記抽出部が、前記蓄積されたコメントを含む文集合に出現する単語及び当該文集合における当該単語の共起語を含む候補語を抽出する抽出ステップ、
前記音声認識部が、前記抽出された候補語に基づいて、前記マルチメディア情報の再生により発せられる音声を音声認識する音声認識ステップ、を有する、
ことを特徴とする。 Furthermore, a speech recognition method according to the third aspect of the present invention is
a method performed by a speech recognition apparatus including an accumulation unit, an extraction unit, and a speech recognition unit, the method comprising:
an accumulation step in which the accumulation unit accumulates comments input by users while listening to the audio produced by playback of multimedia information;
an extraction step in which the extraction unit extracts candidate words, including words that appear in a sentence set containing the accumulated comments and words that co-occur with those words in the sentence set; and
a speech recognition step in which the speech recognition unit, based on the extracted candidate words, recognizes the speech produced by playback of the multimedia information.

本発明に係る音声認識装置、音声認識プログラム、及び音声認識方法によれば、マルチメディア情報に付されたコメントを利用して、マルチメディア情報に含まれる音声を適切に認識できる。 According to the voice recognition device, the voice recognition program, and the voice recognition method according to the present invention, the voice included in the multimedia information can be appropriately recognized using the comment attached to the multimedia information.

音声認識システムの一構成例を表すシステム構成図である。 A system configuration diagram showing a configuration example of the speech recognition system.
本発明の実施例に係る音声認識装置の一例を表すハードウェア構成図である。 A hardware configuration diagram showing an example of the speech recognition apparatus according to an embodiment of the present invention.
音声認識装置が実行する生放送処理の一例を表すフローチャートである。 A flowchart showing an example of the live broadcast process executed by the speech recognition apparatus.
実施例１に係る音声認識装置が有する機能の一例を表す機能ブロック図である。 A functional block diagram showing an example of the functions of the speech recognition apparatus according to Embodiment 1.
音声認識装置が記憶する放送テーブルの一例を表す図である。 A diagram showing an example of the broadcast table stored by the speech recognition apparatus.
音声認識装置が記憶するコメントテーブルの一例を表す図である。 A diagram showing an example of the comment table stored by the speech recognition apparatus.
実施例１に係る端末装置が表示する視聴画面の一例を表す図である。 A diagram showing an example of the viewing screen displayed by the terminal device according to Embodiment 1.
音声認識装置が実行する再放送処理の一例を表すフローチャートである。 A flowchart showing an example of the rebroadcast process executed by the speech recognition apparatus.
実施例１に係る音声認識装置が実行する要約生成処理の一例を表すフローチャートである。 A flowchart showing an example of the summary generation process executed by the speech recognition apparatus according to Embodiment 1.
音声認識装置が記憶する参照テーブルの一例を表す図である。 A diagram showing an example of the reference table stored by the speech recognition apparatus.
音声認識装置が記憶する文集合テーブルの一例を表す図である。 A diagram showing an example of the sentence set table stored by the speech recognition apparatus.
実施例１に係る音声認識装置が記憶する共起テーブルの一例を表す図である。 A diagram showing an example of the co-occurrence table stored by the speech recognition apparatus according to Embodiment 1.
音声認識装置が記憶する候補語テーブルの一例を表す図である。 A diagram showing an example of the candidate word table stored by the speech recognition apparatus.
音声認識装置が記憶する情報で表される合致度曲線の一例を表す図である。 A diagram showing an example of the match degree curve represented by information stored by the speech recognition apparatus.
音声認識装置が実行する文集合生成処理の一例を表すフローチャートである。 A flowchart showing an example of the sentence set generation process executed by the speech recognition apparatus.
音声認識装置が実行する候補語抽出処理の一例を表すフローチャートである。 A flowchart showing an example of the candidate word extraction process executed by the speech recognition apparatus.
音声認識装置が実行する連続音声認識処理の一例を表すフローチャートである。 A flowchart showing an example of the continuous speech recognition process executed by the speech recognition apparatus.
実施例２に係る音声認識装置が実行する要約生成処理の一例を表すフローチャートである。 A flowchart showing an example of the summary generation process executed by the speech recognition apparatus according to Embodiment 2.
実施例２に係る音声認識装置が有する機能の一例を表す機能ブロック図である。 A functional block diagram showing an example of the functions of the speech recognition apparatus according to Embodiment 2.
実施例２に係る音声認識装置が記憶する共起テーブルの一例を表す図である。 A diagram showing an example of the co-occurrence table stored by the speech recognition apparatus according to Embodiment 2.
実施例３に係る端末装置が表示する視聴画面の一例を表す図である。 A diagram showing an example of the viewing screen displayed by the terminal device according to Embodiment 3.

以下、本発明の実施例について添付図面を参照しつつ説明する。 Embodiments of the present invention will be described below with reference to the accompanying drawings.

＜実施例１＞
本発明の実施例１に係る音声認識装置１００は、図１に示すような音声認識システム１を構成する。 <Example 1>
A speech recognition apparatus 100 according to Embodiment 1 of the present invention constitutes a speech recognition system 1 as shown in FIG. 1.

音声認識システム１は、音声認識装置１００の他に、例えば、インターネットなどのコンピュータ通信網１０（以下単に、通信網１０という）と、通信網１０に接続された端末装置２０、３０、及び４０と、で構成される。 In addition to the speech recognition apparatus 100, the speech recognition system 1 includes, for example, a computer communication network 10 such as the Internet (hereinafter simply referred to as the communication network 10) and terminal devices 20, 30, and 40 connected to the communication network 10.

端末装置２０から４０は、例えば、ＬＣＤ（Liquid Crystal Display）などの表示部と、スピーカなどの音声出力部と、キーボード及びマウスなどの入力部と、を備えたパーソナル・コンピュータでそれぞれ構成される。 The terminal devices 20 to 40 are each configured by a personal computer including a display unit such as an LCD (Liquid Crystal Display), an audio output unit such as a speaker, and an input unit such as a keyboard and a mouse.

また、端末装置２０は、例えば、ウェブカメラなどの撮像装置２１と、例えば、マイクロフォンなどの音声収集装置２２と、に接続されている。 In addition, the terminal device 20 is connected to an imaging device 21 such as a web camera and a sound collection device 22 such as a microphone.

音声認識装置１００は、撮像装置２１で撮影された動画及び音声収集装置２２で収集された音声を表すマルチメディア情報を端末装置２０から受信し、受信したマルチメディア情報を端末装置２０から４０へ配信する。これにより、撮像装置２１で撮影された動画及び音声収集装置２２で収集された音声が番組の映像及び音声として放送される。 The speech recognition apparatus 100 receives from the terminal device 20 multimedia information representing the moving image shot by the imaging device 21 and the sound collected by the sound collection device 22, and distributes the received multimedia information to the terminal devices 20 to 40. As a result, the moving image shot by the imaging device 21 and the sound collected by the sound collection device 22 are broadcast as the video and audio of a program.

ここでは、音声認識装置１００は、端末装置２０のユーザが出演する番組を、当該番組の収録から所定時間以内に端末装置２０及び３０へ放送する（以下、生放送するという）として説明を行う。尚、端末装置２０のユーザは、放送された当該番組を視聴しながら出演を行う。 Here, the voice recognition device 100 will be described as broadcasting a program in which the user of the terminal device 20 appears to the terminal devices 20 and 30 within a predetermined time from the recording of the program (hereinafter referred to as live broadcasting). In addition, the user of the terminal device 20 performs while viewing the broadcasted program.

またここでは、音声認識装置１００は、生放送された番組（以下、生放送番組という）を、当該番組の収録から所定時間経過後に端末装置４０へ放送する（以下、再放送するという）として説明を行う。 It is also assumed here that the speech recognition apparatus 100 broadcasts a program that was broadcast live (hereinafter referred to as a live broadcast program) to the terminal device 40 after a predetermined time has elapsed since the recording of the program (hereinafter referred to as rebroadcasting).

次に、図２を参照して、音声認識装置１００のハードウェア構成について説明する。
音声認識装置１００は、図２に示すようなサーバ機で構成され、ＣＰＵ（Central Processing Unit）１０１、ＲＯＭ（Read Only Memory）１０２、ＲＡＭ（Random Access Memory）１０３、ハードディスク１０４、メディアコントローラ１０５、ＬＡＮカード（Local Area Network）１０６、ビデオカード１０７、ＬＣＤ（Liquid Crystal Display）１０８、キーボード１０９、スピーカ１１０、及びタッチパッド１１１で構成される。 Next, the hardware configuration of the speech recognition apparatus 100 will be described with reference to FIG. 2.
The speech recognition apparatus 100 is configured as a server machine as shown in FIG. 2, and includes a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, a hard disk 104, a media controller 105, a LAN (Local Area Network) card 106, a video card 107, an LCD (Liquid Crystal Display) 108, a keyboard 109, a speaker 110, and a touch pad 111.

ＣＰＵ１０１は、ＲＯＭ１０２又はハードディスク１０４に保存されたプログラムに従ってプログラムを実行することで、音声認識装置１００の全体制御を行う。ＲＡＭ１０３は、ＣＰＵ１０１によるプログラムの実行時において、処理対象とするデータを一時的に記憶するワークメモリである。 The CPU 101 performs overall control of the speech recognition apparatus 100 by executing a program according to a program stored in the ROM 102 or the hard disk 104. The RAM 103 is a work memory that temporarily stores data to be processed when the CPU 101 executes a program.

ハードディスク１０４は、各種のデータを蓄積したテーブルを記憶する蓄積部である。尚、音声認識装置１００は、ハードディスク１０４の代わりに、フラッシュメモリを備えても良い。 The hard disk 104 is a storage unit that stores a table storing various data. Note that the speech recognition apparatus 100 may include a flash memory instead of the hard disk 104.

メディアコントローラ１０５は、フラッシュメモリ、ＣＤ（Compact Disc）、ＤＶＤ（Digital Versatile Disc）、及びブルーレイディスク（Blu-ray Disc）（登録商標）を含む記録媒体から各種のデータ及びプログラムを読み出す。 The media controller 105 reads various data and programs from recording media including flash memory, CD (Compact Disc), DVD (Digital Versatile Disc), and Blu-ray Disc (registered trademark).

ＬＡＮカード１０６は、通信網１０を介して接続する端末装置２０から４０との間でデータを送受信する。キーボード１０９及びタッチパッド１１１は、ユーザの操作に応じた信号を入力する。 The LAN card 106 transmits and receives data to and from the terminal devices 20 to 40 connected via the communication network 10. The keyboard 109 and the touch pad 111 input signals according to user operations.

ビデオカード１０７は、ＣＰＵ１０１から出力されたデジタル信号に基づいて画像を描画（つまり、レンダリング）すると共に、描画された画像を表す画像信号を出力する。ＬＣＤ１０８は、ビデオカード１０７から出力された画像信号に従って画像を表示する。なお、音声認識装置１００は、ＬＣＤ１０８の代わりに、ＰＤＰ（Plasma Display Panel）又はＥＬ（Electroluminescence）ディスプレイを備えても良い。スピーカ１１０は、ＣＰＵ１０１から出力された信号に基づいて音声を出力する。 The video card 107 draws (that is, renders) an image based on the digital signal output from the CPU 101, and outputs an image signal representing the drawn image. The LCD 108 displays an image according to the image signal output from the video card 107. The voice recognition apparatus 100 may include a PDP (Plasma Display Panel) or an EL (Electroluminescence) display instead of the LCD 108. The speaker 110 outputs sound based on the signal output from the CPU 101.

次に、音声認識装置１００の有する機能について説明する。
ＣＰＵ１０１は、図３に示す生放送処理を実行することにより、図４に示す入力部１２０、保存部１３０、及び出力部１４０として機能する。また、ＣＰＵ１０１は、図２に示したハードディスク１０４と協働して、蓄積部１９０として機能する。 Next, functions of the speech recognition apparatus 100 will be described.
The CPU 101 functions as the input unit 120, the storage unit 130, and the output unit 140 illustrated in FIG. 4 by executing the live broadcast process illustrated in FIG. 3. Further, the CPU 101 functions as the accumulation unit 190 in cooperation with the hard disk 104 shown in FIG. 2.

図４に示す入力部１２０は、図２に示すＬＡＮカード１０６が受信した各種の情報を入力する。保存部１３０は、入力部１２０で入力された各種の情報を蓄積部１９０へ保存する。出力部１４０は、入力部１２０で入力された各種の情報を、配信先を指定してＬＡＮカード１０６へ出力する。蓄積部１９０は、保存部１３０によって保存された各種の情報を蓄積する。 The input unit 120 illustrated in FIG. 4 inputs the various types of information received by the LAN card 106 illustrated in FIG. 2. The storage unit 130 saves the various types of information input by the input unit 120 to the accumulation unit 190. The output unit 140 outputs the various types of information input by the input unit 120 to the LAN card 106, designating a distribution destination. The accumulation unit 190 accumulates the various types of information saved by the storage unit 130.

次に、蓄積部１９０に蓄積される各種情報について説明する。
蓄積部１９０は、放送された番組の書誌的事項が保存される、図５に示す放送テーブルを記憶している。放送テーブルには、番組の放送を識別する放送ＩＤと、当該番組の放送開始日時と、当該放送のシフト時間と、当該番組で放送された動画及び音声を表すマルチメディア情報のパスと、が対応付けられたデータが複数保存される。尚、番組の放送開始日時とは、番組の放送が開始された日時をいう。また、放送のシフト時間は、当該放送が生放送の場合、値「0」であり、当該放送が再放送の場合、当該再放送の開始日時から生放送の開始日時を減算した値である。 Next, various information stored in the storage unit 190 will be described.
The accumulation unit 190 stores the broadcast table shown in FIG. 5, which holds the bibliographic items of broadcast programs. The broadcast table stores multiple records, each associating a broadcast ID that identifies a broadcast of a program, the broadcast start date and time of the program, the shift time of the broadcast, and the path of the multimedia information representing the video and audio broadcast in the program. The broadcast start date and time of a program is the date and time at which the broadcast of the program started. The shift time of a broadcast is the value "0" when the broadcast is a live broadcast and, when the broadcast is a rebroadcast, the value obtained by subtracting the start date and time of the live broadcast from the start date and time of the rebroadcast.
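The shift time rule can be written down as a small Python sketch. The record layout follows the four columns named above; the class name `BroadcastRecord`, the field names, the use of `datetime` and `timedelta`, and the example path are assumptions made for illustration, since the actual storage format is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BroadcastRecord:
    broadcast_id: str      # identifies one broadcast of a program
    start: datetime        # broadcast start date and time
    shift: timedelta       # zero for a live broadcast
    media_path: str        # path of the saved multimedia information


def shift_time(start: datetime, live_start: Optional[datetime] = None) -> timedelta:
    """Zero for a live broadcast; otherwise rebroadcast start minus live start."""
    if live_start is None:
        return timedelta(0)
    return start - live_start


# Hypothetical dates and path, for illustration only.
live_start = datetime(2012, 3, 23, 20, 0)
rerun_start = datetime(2012, 3, 24, 20, 0)
rerun = BroadcastRecord("b2", rerun_start, shift_time(rerun_start, live_start), "media/b1.dat")
```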

また、蓄積部１９０は、番組の動画若しくは音声に対するコメントが保存される、図６に示すコメントテーブルを記憶している。コメントテーブルには、番組の放送ＩＤと、当該番組に対するコメントを識別するコメントＩＤと、当該コメントの入力時点と、当該コメントと、コメントしたユーザを識別するユーザＩＤと、が対応付けられたデータが複数保存される。尚、入力時点は、番組の放送が開始した時点からの経過時間で表される。 The accumulation unit 190 also stores the comment table shown in FIG. 6, which holds comments on the video or audio of programs. The comment table stores multiple records, each associating the broadcast ID of a program, a comment ID that identifies a comment on the program, the input time point of the comment, the comment itself, and a user ID that identifies the user who made the comment. The input time point is expressed as the elapsed time from the point at which the broadcast of the program started.

次に、図４に示す入力部１２０、保存部１３０、及び出力部１４０で行われるＣＰＵ１０１の動作について説明する。 Next, operations of the CPU 101 performed by the input unit 120, the storage unit 130, and the output unit 140 illustrated in FIG. 4 will be described.

ユーザは、音声認識装置１００のキーボード１０９に対して、生放送の開始を指示する操作（以下、生放送開始指示操作という）を行う。次に、ユーザは、キーボード１０９に対して、放送を開始する予定の日時（以下、放送開始予定日時という）と、放送を終了する予定の日時（以下、放送終了予定日時という）と、を指示する操作を行う。 The user performs an operation for instructing the start of live broadcast (hereinafter referred to as a live broadcast start instruction operation) on the keyboard 109 of the speech recognition apparatus 100. Next, the user instructs the keyboard 109 to specify the date and time when the broadcast is scheduled to start (hereinafter referred to as the broadcast start scheduled date and time) and the date and time when the broadcast is scheduled to end (hereinafter referred to as the scheduled broadcast end date and time). Perform the operation.

ＣＰＵ１０１は、キーボード１０９によって、生放送開始指示操作に応じた操作信号を入力されると、図３に示す生放送処理の実行を開始する。 When an operation signal corresponding to a live broadcast start instruction operation is input from the keyboard 109, the CPU 101 starts execution of the live broadcast process shown in FIG.

生放送処理の実行を開始すると、入力部１２０は、放送ＩＤを生成し、キーボード１０９から入力される操作信号に基づいて、ユーザの操作で指定された放送開始予定日時及び放送終了予定日時を取得する（ステップＳ０１）。 When the execution of the live broadcast process is started, the input unit 120 generates a broadcast ID, and acquires a scheduled broadcast start date and a scheduled broadcast end date and time designated by the user operation based on an operation signal input from the keyboard 109. (Step S01).

次に、保存部１３０は、例えば、ＯＳ（Operating System）が管理するシステム日時を参照し、参照したシステム日時が、放送開始予定日時を経過した日時であるか否かを判別する（ステップＳ０２）。このとき、保存部１３０は、放送開始予定日時を経過していないと判別すると（ステップＳ０２；Ｎｏ）、所定時間スリープした後に、ステップＳ０２の処理を繰り返す。 Next, the storage unit 130 refers to, for example, the system date and time managed by the OS (Operating System), and determines whether the referenced system date and time is later than the scheduled broadcast start date and time (step S02). If the storage unit 130 determines that the scheduled broadcast start date and time has not yet passed (step S02; No), it sleeps for a predetermined time and then repeats the processing of step S02.

ステップＳ０２において、保存部１３０は、放送開始予定日時を経過したと判別すると（ステップＳ０２；Ｙｅｓ）、参照したシステム日時を放送開始日時とする。また、保存部１３０は、生放送であるので、当該番組のシフト時間を値「0」とする。さらに、保存部１３０は、当該番組の動画及び音声を表すマルチメディア情報が保存される電子ファイルのパスを生成し、生成したパスに電子ファイルを作成する。次に、保存部１３０は、放送ＩＤと、放送開始日時と、シフト時間と、パスと、を対応付けて、図５の放送テーブルへ追加保存する（ステップＳ０３）。 If the storage unit 130 determines in step S02 that the scheduled broadcast start date and time has passed (step S02; Yes), it sets the referenced system date and time as the broadcast start date and time. Because this is a live broadcast, the storage unit 130 sets the shift time of the program to the value "0". The storage unit 130 further generates the path of an electronic file in which the multimedia information representing the video and audio of the program will be saved, and creates the electronic file at the generated path. Next, the storage unit 130 associates the broadcast ID, the broadcast start date and time, the shift time, and the path with one another, and adds them to the broadcast table of FIG. 5 (step S03).
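The polling in step S02 and the choice of the broadcast start date and time can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `wait_for_start`, the one-second default interval, and the injectable `now` and `sleep` parameters are all hypothetical and are included only to make the loop easy to follow and to exercise.

```python
import time
from datetime import datetime


def wait_for_start(scheduled, now=datetime.now, sleep=time.sleep, interval=1.0):
    """Poll the system clock until the scheduled start has been reached.

    Returns the system date and time that was referenced when the check
    succeeded; the description uses that value as the broadcast start time.
    The polling interval is an assumed placeholder for the "predetermined time".
    """
    while True:
        current = now()          # step S02: refer to the system date and time
        if current >= scheduled:
            return current       # step S02; Yes
        sleep(interval)          # step S02; No: sleep, then check again
```

Passing the clock and the sleep function as parameters keeps the sketch independent of real time, so the loop can be tried with fixed values instead of waiting for an actual schedule.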

次に、保存部１３０は、ソフトウェアタイマをスタートさせて、番組の放送開始からの経過時間を計時し始める（ステップＳ０４）。 Next, the storage unit 130 starts a software timer and starts counting the elapsed time from the start of program broadcasting (step S04).

ここで、放送開始予定日時を経過したので、端末装置２０のユーザは、端末装置２０に接続された撮影装置２１に撮影を開始させ、かつ音声収集装置２２に音声の収集を開始させる操作を、端末装置２０に行うとして説明する。 Here, since the scheduled broadcast start date and time has passed, the user of the terminal device 20 performs an operation of causing the photographing device 21 connected to the terminal device 20 to start photographing and causing the sound collecting device 22 to start collecting sound. A description will be given assuming that the process is performed on the terminal device 20.

端末装置２０は、当該操作に応じて、撮影装置２１の撮影及び音声収集装置２２の音声収集を開始させる。次に、端末装置２０は、例えば、出演者の姿などを撮影した動画を表すデータ（以下、動画データという）を撮影装置２１から入力し始める。また、端末装置２０は、例えば、出演者の発言などの音声を表す電気信号（以下、音声信号という）を音声入力装置２２から入力し始める。その後、端末装置２０は、入力した音声信号に基づいて音声データを生成し、生成した音声データと、撮影装置２１から入力した動画データと、を、データの入力日時及び生成日時で対応付けたマルチメディア情報を音声認識装置１００へ送信し始める。 In response to the operation, the terminal device 20 causes the photographing device 21 to start photographing and the sound collection device 22 to start collecting sound. Next, the terminal device 20 begins receiving from the photographing device 21 data representing a moving image of, for example, the performer's appearance (hereinafter referred to as moving image data). The terminal device 20 also begins receiving from the audio input device 22 an electric signal representing audio such as the performer's speech (hereinafter referred to as an audio signal). Thereafter, the terminal device 20 generates audio data based on the input audio signal, and begins transmitting to the speech recognition apparatus 100 multimedia information in which the generated audio data and the moving image data input from the photographing device 21 are associated with each other by the input date and time and the generation date and time of the data.

次に、入力部１２０は、図２に示したＬＡＮカード１０６から、端末装置２０からＬＡＮカード１０６が受信したマルチメディア情報を入力する（ステップＳ０５）。 Next, the input unit 120 inputs the multimedia information received by the LAN card 106 from the terminal device 20 from the LAN card 106 shown in FIG. 2 (step S05).

次に、保存部１３０は、入力されたマルチメディア情報を、前述のパスにある電子ファイルに追加保存する（ステップＳ０６）。 Next, the storage unit 130 additionally stores the input multimedia information in the electronic file in the path described above (step S06).

その後、出力部１４０は、入力されたマルチメディア情報を、端末装置２０及び３０を宛先として、図２に示したＬＡＮカード１０６に出力する（ステップＳ０７）。その後、ＬＡＮカード１０６は、マルチメディア情報を端末装置２０及び３０へ配信（つまり、生放送）する。 Thereafter, the output unit 140 outputs the input multimedia information to the LAN card 106 shown in FIG. 2 with the terminal devices 20 and 30 as destinations (step S07). Thereafter, the LAN card 106 distributes the multimedia information to the terminal devices 20 and 30 (that is, live broadcast).

ここで、端末装置２０及び３０は、音声認識装置１００からマルチメディア情報を受信すると、マルチメディア情報で表される動画を表示する、図７に示す視聴画面を表示する。次に、端末装置２０及び３０は、マルチメディア情報を再生した動画を、視聴画面の動画表示領域ＡＭに表示し、再生した音声を音声出力装置から出力する。 Here, when receiving the multimedia information from the voice recognition device 100, the terminal devices 20 and 30 display a viewing screen shown in FIG. 7 that displays a moving image represented by the multimedia information. Next, the terminal devices 20 and 30 display the moving image in which the multimedia information is reproduced in the moving image display area AM of the viewing screen, and output the reproduced sound from the audio output device.

ここでは、端末装置２０のユーザは、撮影装置２１の正面に向った状態で、「都政が混乱するので」という内容の発言をしたとして説明を行う。このため、図７に示す視聴画面には、端末装置２０のユーザが発言する様子を正面から撮影した動画が表示され、端末装置２０及び３０から「都政が混乱するので」という音声が出力される。 Here, it is assumed that the user of the terminal device 20, facing the front of the photographing device 21, says "Because the metropolitan government is confused". As a result, the viewing screen shown in FIG. 7 displays a moving image of the user of the terminal device 20 speaking, shot from the front, and the speech "Because the metropolitan government is confused" is output from the terminal devices 20 and 30.

その後、番組を視聴した端末装置２０及び端末装置３０のユーザは、視聴した番組のコメントを入力させる操作を端末装置３０に行っても良いし、行わなくて良い。このとき、ユーザが端末装置３０に当該操作を行うと、端末装置３０は、コメントを入力し、入力したコメントを表すコメント情報と、コメントしたユーザのユーザＩＤと、を、音声認識装置１００へ送信する。 Thereafter, the users of the terminal device 20 and the terminal device 30 who have viewed the program may or may not perform, on the terminal device 30, an operation for inputting a comment on the viewed program. When a user performs this operation on the terminal device 30, the terminal device 30 inputs the comment and transmits comment information representing the input comment, together with the user ID of the user who made the comment, to the speech recognition apparatus 100.

図３に示すステップＳ０７が実行された後に、入力部１２０は、ステップＳ０５と同様の処理を実行することで、マルチメディア情報を入力する（ステップＳ０８）。 After step S07 shown in FIG. 3 is executed, the input unit 120 inputs the multimedia information by executing the same process as step S05 (step S08).

その後、入力部１２０は、図２に示したＬＡＮカード１０６から出力される信号に基づいて、ＬＡＮカード１０６がコメント情報を受信したか否かを判別する（ステップＳ０９）。 Thereafter, the input unit 120 determines whether or not the LAN card 106 has received comment information based on a signal output from the LAN card 106 illustrated in FIG. 2 (step S09).

このとき、入力部１２０は、ＬＡＮカード１０６がコメント情報を受信しなかったと判別すると（ステップＳ０９；Ｎｏ）、ステップＳ０６及びステップＳ０７と同様の処理を実行することで、コメント情報の保存及び出力を行う（ステップＳ１０及びステップＳ１１）。 At this time, if the input unit 120 determines that the LAN card 106 has not received the comment information (step S09; No), the input unit 120 performs the same processing as step S06 and step S07, thereby saving and outputting the comment information. It performs (step S10 and step S11).

これに対して、入力部１２０は、ＬＡＮカード１０６がコメント情報を受信したと判別すると（ステップＳ０９；Ｙｅｓ）、ＬＡＮカード１０６が受信したコメント情報と、ユーザＩＤと、を、ＬＡＮカード１０６から入力する（ステップＳ１２）。 In contrast, when the input unit 120 determines that the LAN card 106 has received the comment information (step S09; Yes), the input unit 120 inputs the comment information received by the LAN card 106 and the user ID from the LAN card 106. (Step S12).

その後、保存部１３０は、ソフトウェアタイマを参照し、生放送の開始日時からの経過時間を取得する（ステップＳ１３）。次に、保存部１３０は、取得した経過時間をコメントの入力時点とする（ステップＳ１４）。その後、保存部１３０は、コメント情報で表されるコメントのコメントＩＤを生成する。 After that, the storage unit 130 refers to the software timer and acquires the elapsed time from the start date and time of live broadcasting (step S13). Next, the storage unit 130 sets the acquired elapsed time as a comment input time (step S14). Thereafter, the storage unit 130 generates a comment ID of the comment represented by the comment information.

次に、保存部１３０は、番組の放送ＩＤと、当該番組に対するコメントの入力時点及びコメントＩＤと、当該コメントと、当該コメントを発したユーザのユーザＩＤと、を、対応付けて、図６のコメントテーブルに追加保存する（ステップＳ１５）。 Next, the storage unit 130 associates the broadcast ID of the program, the input time point and comment ID of the comment on the program, the comment itself, and the user ID of the user who made the comment with one another, and adds them to the comment table of FIG. 6 (step S15).
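Steps S13 to S15 can be pictured with the following Python sketch, which stamps each comment with the elapsed time since the broadcast started and appends it to an in-memory table. The class names `CommentRecord` and `CommentTable`, the sequential integer comment IDs, the use of seconds as the time unit, and the injectable `clock` are assumptions made for illustration; the description does not specify how IDs are generated or how the table is stored.

```python
import itertools
import time
from dataclasses import dataclass


@dataclass
class CommentRecord:
    broadcast_id: str
    comment_id: int
    input_time: float   # seconds elapsed since the broadcast started
    text: str
    user_id: str


class CommentTable:
    def __init__(self, broadcast_id, clock=time.monotonic):
        self.broadcast_id = broadcast_id
        self._clock = clock
        self._start = clock()            # step S04: start timing at broadcast start
        self._ids = itertools.count(1)   # assumed ID scheme: 1, 2, 3, ...
        self.rows = []

    def add(self, text, user_id):
        elapsed = self._clock() - self._start     # steps S13 and S14
        row = CommentRecord(self.broadcast_id, next(self._ids), elapsed, text, user_id)
        self.rows.append(row)                     # step S15
        return row
```

Because the input time point is relative to the start of the broadcast, the same record can later be compared against playback positions regardless of the wall-clock time at which the program is viewed.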

その後、出力部１４０は、入力されたコメント情報を、端末装置２０及び３０を宛先として、図２に示したＬＡＮカード１０６に出力する（ステップＳ１６）。その後、ＬＡＮカード１０６は、コメント情報を端末装置２０及び３０へ配信する。 Thereafter, the output unit 140 outputs the input comment information to the LAN card 106 illustrated in FIG. 2 with the terminal devices 20 and 30 as destinations (step S16). Thereafter, the LAN card 106 distributes the comment information to the terminal devices 20 and 30.

端末装置２０及び３０は、コメント情報を音声認識装置１００から受信すると、コメント情報で表されるコメントを、図７に示す視聴画面のコメント表示領域ＡＣに表示する。 When the terminal devices 20 and 30 receive the comment information from the speech recognition apparatus 100, they display the comment represented by the comment information in the comment display area AC of the viewing screen shown in FIG. 7.

次に、保存部１３０は、ステップＳ１２で入力されたコメント情報で表されるコメントを、ステップＳ０８で入力されたマルチメディア情報で表される動画に合成する（ステップＳ１７）。 Next, the storage unit 130 synthesizes the comment represented by the comment information input at step S12 with the moving image represented by the multimedia information input at step S08 (step S17).

その後、保存部１３０は、コメントが合成された動画を表すマルチメディア情報を、前述のパスにあるファイルに追加保存する（ステップＳ１８）。 Thereafter, the storage unit 130 additionally stores the multimedia information representing the moving image with the combined comment in the file in the above-described path (step S18).

次に、出力部１４０は、コメントが合成されたマルチメディア情報を、端末装置２０及び３０を宛先として、図２に示したＬＡＮカード１０６に出力する（ステップＳ１９）。その後、ＬＡＮカード１０６は、マルチメディア情報を端末装置２０及び３０へ配信する。 Next, the output unit 140 outputs the multimedia information combined with the comments to the LAN card 106 shown in FIG. 2 with the terminal devices 20 and 30 as destinations (step S19). Thereafter, the LAN card 106 distributes the multimedia information to the terminal devices 20 and 30.

端末装置２０及び３０は、マルチメディア情報を音声認識装置１００から受信すると、マルチメディア情報を再生し、コメントが合成された動画を、図７に示す視聴画面の動画表示領域ＡＭに表示する。 When the terminal devices 20 and 30 receive the multimedia information from the voice recognition device 100, they reproduce it and display the moving image with the composited comments in the moving image display area AM of the viewing screen shown in FIG. 7.

ここでは、端末装置３０を使用する視聴者は、出力された音声「都政が混乱するので」を聴取し、当該音声に対するコメント「混乱し過ぎだろ」を端末装置３０に入力させたとして説明を行う。また、当該視聴者は、視聴画面に表示された出演者の映像を視認し、出演者の氏名に言及するコメント「佐藤一郎きたー！」を端末装置３０に入力させたとして説明を行う。このため、図７に示す視聴画面のコメント表示領域ＡＣには、「混乱し過ぎだろ」及び「佐藤一郎きたー！」というコメントが表示される。また、動画表示領域ＡＭには、出演者の正面像に対して「混乱し過ぎだろ」及び「佐藤一郎きたー！」というコメントが合成された動画が表示される。 In this description, it is assumed that the viewer using the terminal device 30 hears the output speech “Because the metropolitan government is in confusion” and enters the comment “Way too confused” on that speech into the terminal device 30. It is further assumed that the viewer sees the video of the performer displayed on the viewing screen and enters into the terminal device 30 the comment “Ichiro Sato is here!”, which mentions the performer's name. As a result, the comments “Way too confused” and “Ichiro Sato is here!” are displayed in the comment display area AC of the viewing screen shown in FIG. 7. In the moving image display area AM, a moving image is displayed in which the comments “Way too confused” and “Ichiro Sato is here!” are composited onto the front view of the performer.

ステップＳ１１若しくはステップＳ１９が実行された後に、入力部１２０は、システム日時を参照し、参照したシステム日時が、ステップＳ０１で取得した生放送終了予定日時を経過した日時であるか否かを判別する（ステップＳ２０）。このとき、入力部１２０は、生放送終了予定日時を経過していないと判別すると（ステップＳ２０；Ｎｏ）、ステップＳ０８から上記処理を繰り返す。 After step S11 or step S19 is executed, the input unit 120 refers to the system date and time to determine whether or not the referenced system date is the date and time when the scheduled live broadcast end date acquired in step S01 has passed ( Step S20). At this time, if the input unit 120 determines that the scheduled live broadcast end date has not elapsed (step S20; No), the input unit 120 repeats the above-described processing from step S08.

ステップＳ２０において、入力部１２０は、生放送終了予定日時を経過したと判別すると（ステップＳ２０；Ｙｅｓ）、生放送処理の実行を終了する。 In step S20, when the input unit 120 determines that the scheduled live broadcast end date and time has passed (step S20; Yes), it ends execution of the live broadcast process.

次に、ＣＰＵ１０１の動作について、音声認識装置１００が、既に生放送した番組を再放送し、端末装置４０のユーザが当該番組を視聴する場合を例に挙げて説明する。 Next, the operation of the CPU 101 will be described by taking as an example a case where the voice recognition device 100 rebroadcasts a program that has already been broadcast live and the user of the terminal device 40 views the program.

ここで、端末装置４０のユーザは、生放送の開始から所定時間経過後に、生放送された番組の再放送を要求するリクエスト（以下、再放送リクエストという）を音声認識装置１００へ送信させる操作を端末装置４０に行う。端末装置４０は、当該操作に応じて再放送リクエストを音声認識装置１００へ送信する。 Here, after a predetermined time has elapsed since the start of the live broadcast, the user of the terminal device 40 operates the terminal device 40 so that it transmits to the voice recognition device 100 a request for a rebroadcast of the live-broadcast program (hereinafter referred to as a rebroadcast request). In response to this operation, the terminal device 40 transmits the rebroadcast request to the voice recognition device 100.

ＣＰＵ１０１は、図２に示したＬＡＮカード１０６が再放送リクエストを受信すると、図８に示す再放送処理の実行を開始する。 When the LAN card 106 shown in FIG. 2 receives the rebroadcast request, the CPU 101 starts executing the rebroadcast process shown in FIG. 8.

先ず、入力部１２０は、放送ＩＤを生成し、ＬＡＮカード１０６から、受信された再放送リクエストを入力する。次に、入力部１２０は、再放送リクエストから、再放送が求められた生放送番組の放送ＩＤ、及び再放送の開始を求める日時（以下、再放送要求日時という）を取得する（ステップＳ３１）。 First, the input unit 120 generates a broadcast ID and inputs the received rebroadcast request from the LAN card 106. Next, the input unit 120 acquires from the rebroadcast request the broadcast ID of the live broadcast program for which rebroadcast is requested, and the date and time for requesting the start of rebroadcast (hereinafter referred to as rebroadcast request date and time) (step S31).

次に、保存部１３０は、システム日時を参照し、参照したシステム日時が、再放送開始要求日時を経過した日時であるか否かを判別する（ステップＳ３２）。このとき、保存部１３０は、再放送開始要求日時を経過していないと判別すると（ステップＳ３２；Ｎｏ）、所定時間待機した後に、ステップＳ３２の処理を繰り返す。 Next, the storage unit 130 refers to the system date and time and determines whether or not it is later than the rebroadcast start request date and time (step S32). If the storage unit 130 determines that the rebroadcast start request date and time has not yet passed (step S32; No), it waits for a predetermined time and then repeats the process of step S32.

ステップＳ３２において、保存部１３０は、再放送開始要求日時を経過したと判別すると（ステップＳ３２；Ｙｅｓ）、システム日時を参照し、参照したシステム日時を、再放送の放送開始日時とする。また、保存部１３０は、再放送が求められた生放送番組の放送ＩＤに対応付けられた放送開始日時とパスとを、図５に示した放送テーブルから検索する。その後、保存部１３０は、再放送の放送開始日時と、生放送の放送開始日時と、の差異を算出し、算出した差異をシフト時間とする。次に、保存部１３０は、再放送の放送ＩＤと、当該再放送の放送開始日時と、当該再放送のシフト時間と、再放送された生番組のパスと、を対応付けて、図５の放送テーブルへ追加保存する（ステップＳ３３）。 In step S32, when the storage unit 130 determines that the rebroadcast start request date and time has passed (step S32; Yes), it refers to the system date and time and uses it as the broadcast start date and time of the rebroadcast. The storage unit 130 also searches the broadcast table shown in FIG. 5 for the broadcast start date and time and the path associated with the broadcast ID of the live broadcast program for which the rebroadcast was requested. The storage unit 130 then calculates the difference between the broadcast start date and time of the rebroadcast and that of the live broadcast, and uses this difference as the shift time. Next, the storage unit 130 associates the broadcast ID of the rebroadcast, the broadcast start date and time of the rebroadcast, the shift time of the rebroadcast, and the path of the rebroadcast live program, and appends them to the broadcast table of FIG. 5 (step S33).
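
The shift-time bookkeeping of step S33 can be sketched as follows. This is only an illustration: the dictionary layout of the broadcast table, the function names, and the sign convention (rebroadcast start minus live start) are assumptions, since the text only says that the "difference" between the two start times is used.

```python
from datetime import datetime, timedelta


def compute_shift_time(live_start: datetime, rebroadcast_start: datetime) -> timedelta:
    """Shift time, assumed here to be rebroadcast start minus live start."""
    return rebroadcast_start - live_start


def add_rebroadcast_row(broadcast_table, rebroadcast_id, live_id,
                        rebroadcast_start: datetime):
    """Append a rebroadcast row that reuses the file path of the live program."""
    live = broadcast_table[live_id]      # look up the live start time and path
    broadcast_table[rebroadcast_id] = {
        "start": rebroadcast_start,
        "shift": compute_shift_time(live["start"], rebroadcast_start),
        "path": live["path"],            # both broadcasts share one stored file
    }
    return broadcast_table[rebroadcast_id]
```

Because the rebroadcast row points at the same path as the live row, a later lookup by path returns every broadcast ID of the program, each with its own shift time.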

次に、保存部１３０は、ステップＳ０４と同様の処理を実行することで、再放送開始日時からの経過時間の計時を開始する（ステップＳ３４）。 Next, the storage unit 130 starts measuring the elapsed time from the rebroadcast start date and time by performing the same process as step S04 (step S34).

次に、入力部１２０は、前述のパスにある電子ファイルから、所定サイズのマルチメディア情報を読み出す（ステップＳ３５）。 Next, the input unit 120 reads multimedia information of a predetermined size from the electronic file in the above path (step S35).

その後、出力部１４０は、読み出されたマルチメディア情報を、端末装置４０を宛先として、図２に示したＬＡＮカード１０６に出力する（ステップＳ３７）。その後、ＬＡＮカード１０６は、マルチメディア情報を端末装置４０へ送信する。端末装置４０は、受信したマルチメディア情報を再生することで（いわゆる、タイムシフト再生）、端末装置３０のユーザが入力したコメントが合成された動画を表示し、音声を出力する。 Thereafter, the output unit 140 outputs the read multimedia information to the LAN card 106 shown in FIG. 2 with the terminal device 40 as the destination (step S37). The LAN card 106 then transmits the multimedia information to the terminal device 40. By reproducing the received multimedia information (so-called time-shifted playback), the terminal device 40 displays the moving image with the composited comments entered by the user of the terminal device 30 and outputs the sound.

その後、端末装置４０のユーザは、再放送された番組を視聴し、番組に対するコメントを入力させる操作を端末装置４０に行っても良いし、行わなくて良い。 Thereafter, the user of the terminal device 40 may or may not perform an operation of viewing the rebroadcast program and inputting a comment on the program on the terminal device 40.

次に、入力部１２０は、ステップＳ３５と同様の処理を実行し、マルチメディア情報を読み出す（ステップＳ３８）。 Next, the input unit 120 executes the same processing as step S35 and reads the multimedia information (step S38).

その後、入力部１２０は、図３に示したステップＳ０９と同様の処理を実行することで、ＬＡＮカード１０６がコメント情報を受信したか否かを判別する（ステップＳ３９）。 Thereafter, the input unit 120 determines whether or not the LAN card 106 has received the comment information by executing the same process as in step S09 illustrated in FIG. 3 (step S39).

このとき、入力部１２０は、ＬＡＮカード１０６がコメント情報を受信しなかったと判別すると（ステップＳ３９；Ｎｏ）、ステップＳ３７の処理と同様の処理を実行することで、ステップＳ３８で読み出されたマルチメディア情報の出力を行う（ステップＳ４１）。 At this time, if the input unit 120 determines that the LAN card 106 has not received comment information (step S39; No), it performs the same process as step S37, thereby outputting the multimedia information read in step S38 (step S41).

ステップＳ３９において、入力部１２０は、ＬＡＮカード１０６がコメント情報を受信したと判別すると（ステップＳ３９；Ｙｅｓ）、図３のステップＳ１２からステップＳ１７までの処理と同様の処理を実行する（ステップＳ４２からステップＳ４７）。これにより、ステップＳ３８で読み出されたマルチメディア情報で表される動画に、ステップＳ４２で入力されたコメント情報で表されるコメントが合成されたマルチメディア情報が生成される。 In step S39, when the input unit 120 determines that the LAN card 106 has received the comment information (step S39; Yes), the input unit 120 executes processing similar to the processing from step S12 to step S17 in FIG. 3 (from step S42). Step S47). As a result, multimedia information in which the comment represented by the comment information input at step S42 is combined with the moving image represented by the multimedia information read at step S38 is generated.

次に、保存部１３０は、前述のパスにある電子ファイルに保存されたマルチメディア情報の内で、ステップＳ３８で読み出されたマルチメディア情報を、ステップＳ４７で生成されたマルチメディア情報に書き換える（ステップＳ４８）。 Next, the storage unit 130 rewrites the multimedia information read in step S38 among the multimedia information stored in the electronic file in the above-described path with the multimedia information generated in step S47 ( Step S48).

その後、出力部１４０は、図３に示したステップＳ１９と同様の処理を実行する（ステップＳ４９）。これにより、端末装置４０へ、端末装置４０のユーザが入力したコメントが合成された動画を表すマルチメディア情報が送信される。 Thereafter, the output unit 140 executes the same process as step S19 shown in FIG. 3 (step S49). Thereby, the multimedia information representing the moving image in which the comment input by the user of the terminal device 40 is combined is transmitted to the terminal device 40.

ステップＳ４１若しくはステップＳ４９の処理が実行された後に、入力部１２０は、前述のパスにある電子ファイルからマルチメディア情報を読み出す位置（以下、読出位置という）を、読み出したマルチメディア情報のサイズだけ後側にシフトさせる。次に、入力部１２０は、読出位置が、電子ファイルの最後であるＥＯＦ（End Of File）であるか否かを判別する（ステップＳ５０）。このとき、入力部１２０は、読出位置がＥＯＦでないと判別すると（ステップＳ５０；Ｎｏ）、ステップＳ３８から上記処理を繰り返す。 After the process of step S41 or step S49 is executed, the input unit 120 moves the position where multimedia information is read from the electronic file in the above-described path (hereinafter referred to as a read position) by the size of the read multimedia information. Shift to the side. Next, the input unit 120 determines whether or not the reading position is an end of file (EOF) that is the last of the electronic file (step S50). At this time, if the input unit 120 determines that the reading position is not EOF (step S50; No), the above processing is repeated from step S38.

ステップＳ５０において、入力部１２０は、読出位置がＥＯＦであると判別すると（ステップＳ５０；Ｙｅｓ）、再放送処理の実行を終了する。 In step S50, when the input unit 120 determines that the reading position is EOF (step S50; Yes), the execution of the rebroadcast process is terminated.
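
The read, composite, rewrite, and send loop of the rebroadcast process can be summarized in a short Python sketch. It is a simplification under stated assumptions: the initial read and send of steps S35 and S37 are folded into the loop, `poll_comment`, `overlay`, and `send` are hypothetical callables supplied by the caller, and `overlay` is assumed to return data of the same length as the chunk it receives so that the in-place rewrite of step S48 does not shift later data.

```python
import io


def rebroadcast(stream, chunk_size, poll_comment, overlay, send):
    """Stream a stored program chunk by chunk, folding new comments into the file."""
    pos = 0                                   # read position
    while True:
        stream.seek(pos)
        chunk = stream.read(chunk_size)       # steps S35/S38: read a fixed-size chunk
        if not chunk:                         # step S50: the read position reached EOF
            break
        comment = poll_comment()              # step S39: was a comment received?
        if comment is not None:
            chunk = overlay(chunk, comment)   # steps S42 to S47: composite the comment
            stream.seek(pos)
            stream.write(chunk)               # step S48: rewrite the stored chunk
        send(chunk)                           # step S41 or S49: deliver to the terminal
        pos += len(chunk)                     # shift the read position by the chunk size


def demo():
    """Toy run: upper-casing stands in for compositing a comment onto video."""
    stream = io.BytesIO(b"abcdef")
    comments = iter([None, "x", None])
    sent = []
    rebroadcast(stream, 2, lambda: next(comments),
                lambda chunk, _comment: chunk.upper(), sent.append)
    return sent, stream.getvalue()
```

Rewriting the stored chunk means that a later rebroadcast of the same file also shows the comments entered during this one, which matches the behavior described for the terminal device 40.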

音声認識装置１００のＣＰＵ１０１は、放送された番組の検索キー、若しくは番組で放送される動画に付される字幕として、番組での発言内容を要約したテキストを生成する、図９に示す要約生成処理を実行する。これにより、ＣＰＵ１０１は、図４に示す前述の入力部１２０、保存部１３０、及び出力部１４０の他に、抽出部１５０及び音声認識部１６０として機能する。また、ＣＰＵ１０１は、前述のように、ハードディスク１０４と協働して蓄積部１９０として機能する。 The CPU 101 of the speech recognition apparatus 100 executes the summary generation process shown in FIG. 9, which generates text summarizing what was said in a program, for use as a search key for the broadcast program or as captions attached to the moving image broadcast in the program. The CPU 101 thereby functions as the extraction unit 150 and the voice recognition unit 160 in addition to the input unit 120, storage unit 130, and output unit 140 shown in FIG. 4. As described above, the CPU 101 also functions as the storage unit 190 in cooperation with the hard disk 104.

抽出部１５０は、番組で発言された音声を表す単語の候補となる単語（以下、候補語という）を、蓄積部１９０に蓄積されたコメント等から抽出する。音声認識部１６０は、抽出された候補語に基づいてマルチメディア情報の再生により発せられる音声を認識する。 The extraction unit 150 extracts words (hereinafter referred to as “candidate words”) that are candidates for the words representing the voice uttered in the program from comments and the like stored in the storage unit 190. The voice recognition unit 160 recognizes a voice uttered by playing multimedia information based on the extracted candidate words.

次に、要約生成処理に用いられる各種情報について説明する。 Next, the various kinds of information used in the summary generation process will be described.

蓄積部１９０は、番組にコメントしたユーザが参照した文書のＵＲＬが保存された、図１０に示す参照テーブルを記憶している。参照テーブルには、ユーザのユーザＩＤと、当該ユーザが参照した文書のＵＲＬ（Uniform Resource Locator）と、当該ＵＲＬにある文書を当該ユーザが参照した日時（以下、参照日時という）と、が対応付けられたデータが複数保存されている。 The storage unit 190 stores a reference table, shown in FIG. 10, that holds the URLs of documents referred to by users who commented on programs. The reference table stores a plurality of records, each associating a user's user ID, the URL (Uniform Resource Locator) of a document the user referred to, and the date and time at which the user referred to the document at that URL (hereinafter referred to as the reference date and time).

尚、ユーザが参照した文書は、例えば、ニュースや百科事典や辞書の内容を掲載したウェブページ若しくはブログなどを含む。また、音声認識装置１００は、文書サーバとして機能し、端末装置２０から４０それぞれから、文書の送信リクエストと、送信を要求する文書のＵＲＬと、送信を要求するユーザのユーザＩＤと、を受信する。音声認識装置１００は、送信が要求された文書を返信すると共に、ユーザＩＤと、リクエストの返信日時（つまり、ユーザの参照日時）と、文書のＵＲＬと、を対応付けて、図１０に示す参照テーブルへ蓄積する。 The documents referred to by users include, for example, web pages or blogs carrying news, encyclopedia, or dictionary content. The speech recognition apparatus 100 also functions as a document server and receives, from each of the terminal devices 20 to 40, a document transmission request, the URL of the requested document, and the user ID of the requesting user. The speech recognition apparatus 100 returns the requested document and, at the same time, associates the user ID, the date and time of the reply to the request (that is, the user's reference date and time), and the URL of the document, and accumulates them in the reference table shown in FIG. 10.

また、蓄積部１９０は、番組に関連した文を要素とする文集合が保存される、図１１に示す文集合テーブルを記憶している。ここでは、番組に関連した文は、入力された番組のコメントを構成する文（以下、入力文という）及び番組にコメントしたユーザが参照した文書に掲載された文（以下、参照文という）を含む。 The storage unit 190 also stores a sentence set table, shown in FIG. 11, that holds a sentence set whose elements are sentences related to a program. Here, the sentences related to a program include the sentences that make up the comments entered for the program (hereinafter referred to as input sentences) and the sentences appearing in documents referred to by users who commented on the program (hereinafter referred to as reference sentences).

文集合テーブルには、番組に関連した文が入力文である場合に、当該文を識別する文ＩＤと、当該文と、当該文の種類と、当該文の入力時点と、当該番組の放送開始日時のシフト時間（以下、当該文に対応したシフト時間という）と、が、が対応付けられたデータが複数保存される。 In the sentence set table, when a sentence related to a program is an input sentence, the sentence ID for identifying the sentence, the sentence, the type of the sentence, the input time of the sentence, and the broadcast start of the program A plurality of data in which the date / time shift time (hereinafter referred to as the shift time corresponding to the sentence) is associated with each other are stored.

また、文集合テーブルには、文集合に含まれる番組に関連した文が参照文である場合に、当該文を識別する文ＩＤと、当該文と、当該文の種類と、当該文の検索に用いられたコメントの入力時点と、当該文に対応したシフト時間と、が、が対応付けられたデータが複数保存される。 In the sentence set table, when a sentence related to a program included in the sentence set is a reference sentence, the sentence ID for identifying the sentence, the sentence, the type of the sentence, and the search for the sentence are included. A plurality of data in which the input time of the used comment is associated with the shift time corresponding to the sentence is stored.

また、蓄積部１９０は、コメントや文書に含まれることがある単語と、コメントや文書において当該単語と共に使用されることがある共起語が保存された、図１２に示す共起語テーブルを記憶している。共起語テーブルには、単語と、当該単語の共起語と、当該単語と当該共起語とがコメントや文書で共に使用される（つまり、共起する）ことがどの程度尤もであるかを表す尤度（以下、共起尤度という）と、が対応付けられたデータが複数保存されている。 The storage unit 190 also stores a co-occurrence word table, shown in FIG. 12, that holds words that may appear in comments and documents together with co-occurrence words that may be used alongside them. The co-occurrence word table stores a plurality of records, each associating a word, a co-occurrence word of that word, and a likelihood (hereinafter referred to as the co-occurrence likelihood) representing how plausible it is that the word and the co-occurrence word are used together (that is, co-occur) in a comment or document.
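
One way to picture the co-occurrence word table is as a mapping from a word to its co-occurrence words and their likelihoods. The class below is a hedged sketch, not the table layout of FIG. 12: the class name, the method names, and the choice to return co-occurrence words sorted by likelihood are assumptions made for illustration.

```python
from collections import defaultdict


class CooccurrenceTable:
    """In-memory stand-in for a table of (word, co-occurrence word, likelihood) records."""

    def __init__(self):
        self._rows = defaultdict(dict)   # word -> {co-occurrence word: likelihood}

    def add(self, word, coword, likelihood):
        """Store one record; a repeated (word, coword) pair overwrites the old value."""
        self._rows[word][coword] = likelihood

    def cowords(self, word):
        """Return (co-occurrence word, likelihood) pairs, most likely first."""
        pairs = self._rows.get(word, {}).items()
        return sorted(pairs, key=lambda kv: kv[1], reverse=True)
```

A lookup such as `table.cowords("都政")` would then yield the input co-occurrence words or reference co-occurrence words for a word taken from a comment or a referenced document.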

さらに、蓄積部１９０は、候補語が保存される、図１３に示す候補語テーブルを記憶している。本実施例では、音声認識装置１００は、番組で発言された音声を表す単語の候補として、入力文に含まれる単語（以下、入力語という）、入力文が入力された時期にユーザが参照した参照文に含まれる単語（以下、参照語という）、及びこれらの共起語（以下それぞれ、入力共起語及び参照共起語という）を用いる。 Furthermore, the storage unit 190 stores a candidate word table, shown in FIG. 13, that holds candidate words. In this embodiment, as candidates for the words representing the speech uttered in a program, the speech recognition apparatus 100 uses words contained in input sentences (hereinafter referred to as input words), words contained in reference sentences that the user referred to around the time the input sentence was entered (hereinafter referred to as reference words), and the co-occurrence words of these (hereinafter referred to as input co-occurrence words and reference co-occurrence words, respectively).

このため、候補語テーブルには、候補語が入力語である場合に、当該入力語を識別する候補語ＩＤと、当該入力語と、当該入力語を含む入力文の入力時点（以下、当該入力語に対応した入力時点という）と、当該入力語を含む文に対応したシフト時間（以下、当該入力語に対応したシフト時間という）と、当該入力語の出現尤度と、が対応付けて保存される。出現尤度とは、候補語の抽出に用いられたコメントが入力された条件の下で、当該候補語が番組中の発言に出現することの尤もらしさを表す値をいう。 Therefore, when a candidate word is an input word, the candidate word table stores, in association with one another, a candidate word ID identifying the input word, the input word, the input time of the input sentence containing the input word (hereinafter referred to as the input time corresponding to the input word), the shift time corresponding to the sentence containing the input word (hereinafter referred to as the shift time corresponding to the input word), and the appearance likelihood of the input word. The appearance likelihood is a value representing how plausible it is that the candidate word appears in an utterance in the program, given that the comment used to extract the candidate word was entered.

また、候補語テーブルには、候補語が参照語の場合に、当該参照語の候補語ＩＤと、当該参照語と、当該参照語を含む文書の検索に用いられたコメントの入力時点（以下、当該参照語に対応した入力時点という）と、当該参照語を含む文に対応したシフト時間（以下、当該参照語に対応したシフト時間という）と、当該参照語の出現尤度と、が対応付けて保存される。 When a candidate word is a reference word, the candidate word table stores, in association with one another, the candidate word ID of the reference word, the reference word, the input time of the comment used to search for the document containing the reference word (hereinafter referred to as the input time corresponding to the reference word), the shift time corresponding to the sentence containing the reference word (hereinafter referred to as the shift time corresponding to the reference word), and the appearance likelihood of the reference word.

さらに、候補語テーブルには、候補語が入力共起語の場合に、当該入力共起語の候補語ＩＤと、当該入力共起語と、当該入力共起語と共に用いられると推測される入力語の入力時点（以下、当該入力共起語に対応した入力時点という）と、当該入力語を含む文に対応したシフト時間（以下、当該入力共起語に対応したシフト時間という）と、当該入力共起語の出現尤度と、が対応付けて保存される。 Furthermore, when a candidate word is an input co-occurrence word, the candidate word table stores, in association with one another, the candidate word ID of the input co-occurrence word, the input co-occurrence word, the input time of the input word presumed to be used together with the input co-occurrence word (hereinafter referred to as the input time corresponding to the input co-occurrence word), the shift time corresponding to the sentence containing that input word (hereinafter referred to as the shift time corresponding to the input co-occurrence word), and the appearance likelihood of the input co-occurrence word.

またさらに、候補語テーブルには、候補語が参照共起語の場合に、当該参照共起語の候補語ＩＤと、当該参照共起語と、当該参照共起語と共に用いられると推測される参照語に対応した入力時点（以下、当該参照共起語に対応した入力時点）と、当該参照語を含む文に対応したシフト時間（以下、当該参照共起語に対応したシフト時間という）と、当該参照共起語の出現尤度と、が対応付けて保存される。 Furthermore, in the candidate word table, when the candidate word is a reference co-occurrence word, it is estimated that the candidate word ID, the reference co-occurrence word, and the reference co-occurrence word are used together with the reference co-occurrence word. An input time corresponding to the reference word (hereinafter referred to as an input time corresponding to the reference co-occurrence word) and a shift time corresponding to a sentence including the reference word (hereinafter referred to as a shift time corresponding to the reference co-occurrence word); The appearance likelihood of the reference co-occurrence word is stored in association with each other.

また、蓄積部１９０は、番組の音声を認識するために用いられる、音響モデル、単語辞書、及び言語モデルを記憶している。音響モデルは、音素や音節の周波数パターンを表し、番組の音声を音素若しくは音節（以下、音素等という）の配列（以下、音素等列という）に分解するために用いられる。単語辞書は、単語と当該単語の発音を表す音素等列とを複数対応付けた辞書である。言語モデルは、複数の単語の連鎖を規定するモデルであり、２つの単語の連鎖を規定するバイグラムモデルであっても、３つの単語の連鎖を規定するトライグラムモデルであっても、Ｎ個の単語の連鎖を規定するＮグラムモデルであっても良い。 In addition, the storage unit 190 stores an acoustic model, a word dictionary, and a language model that are used to recognize the sound of the program. The acoustic model represents a frequency pattern of phonemes and syllables, and is used to decompose a program sound into an array of phonemes or syllables (hereinafter referred to as phonemes) (hereinafter referred to as a phoneme sequence). The word dictionary is a dictionary in which a plurality of words and phoneme sequences representing pronunciations of the words are associated with each other. The language model is a model that defines a chain of a plurality of words, and is a bigram model that defines a chain of two words or a trigram model that defines a chain of three words. It may be an N-gram model that defines a chain of words.
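
As a rough illustration of what "a bigram model that defines a chain of two words" can mean in practice, the sketch below estimates bigram probabilities by relative frequency from tokenized sentences. It is a generic textbook construction, not the language model of this specification: the function name, the `<s>` and `</s>` boundary markers, and the absence of smoothing are all assumptions.

```python
from collections import Counter


def train_bigram(sentences):
    """Return a function giving P(next | prev), estimated by relative frequency."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]      # assumed sentence boundary markers
        contexts.update(padded[:-1])                   # count each word as a left context
        bigrams.update(zip(padded[:-1], padded[1:]))   # count adjacent word pairs

    def prob(prev, nxt):
        # Unseen contexts get probability 0.0 because no smoothing is applied here.
        return bigrams[(prev, nxt)] / contexts[prev] if contexts[prev] else 0.0

    return prob
```

A trigram or general N-gram model follows the same pattern with longer tuples as contexts.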

また、蓄積部１９０は、ある発音時点で発音された音声が、ある入力時点で入力されたコメントの対象とされた音声と、どの程度の確率で合致するかを表す合致度を表す合致度データを記憶している。合致度データは、入力時点から発音時点を減算した差異（以下、時点差異という）の変化に伴って、合致度がどのように推移するかを表す合致度曲線を表す。 The storage unit 190 also stores match degree data representing a match degree, that is, the probability that the speech uttered at a given utterance time is the speech targeted by a comment entered at a given input time. The match degree data represents a match degree curve that shows how the match degree changes as the difference obtained by subtracting the utterance time from the input time (hereinafter referred to as the time difference) changes.

蓄積部１９０が記憶する合致度曲線は、生放送合致度曲線と、再放送合致度曲線と、を含む。生放送合致度曲線は、生放送された番組の音声と、当該番組の放送中に入力されたコメントの対象となった音声と、の合致度を表す。再放送合致度曲線は、再放送された番組の音声と、当該番組の再放送中に入力されたコメントの対象となった音声と、の合致度を表す。 The matching degree curve stored by the storage unit 190 includes a live broadcast matching degree curve and a rebroadcast matching degree curve. The live broadcast match level curve represents the degree of match between the sound of a program broadcast live and the sound of a comment input during the program broadcast. The rebroadcast match curve represents the degree of match between the sound of the rebroadcast program and the sound of the comment input during the rebroadcast of the program.

再放送合致度曲線上の点は、時点差異が所定の値「-TD1」以上「+TD2」以下の範囲で、生放送合致度曲線上の点よりも合致度が大きくなっている。既に生放送で番組を視聴している視聴者や、再放送で同じ番組を繰り返し視聴している視聴者は、予め番組で放送される音声を把握している。このため、これらの視聴者は、生放送で初めて番組を視聴する視聴者よりも、コメント対象とする音声の発音時点に近い時点でコメントを入力する傾向にあるからである。 The points on the rebroadcast match score curve have a greater match score than the points on the live broadcast match score curve within a time difference between the predetermined value “−TD1” and “+ TD2”. A viewer who has already watched a program by live broadcasting or a viewer who has repeatedly viewed the same program by rebroadcast knows in advance the sound broadcast by the program. For this reason, these viewers tend to input comments at a time closer to the time of pronunciation of the speech to be commented than viewers viewing the program for the first time in a live broadcast.

また、生放送合致度曲線は、時点差異が「TP」のときがピークであり、時点差異が「TP」から離れるに従って減衰する。これは、生放送の場合には、出演者の音声を聞いた後で当該音声にコメントを入力することが多いためである。但し、出演者が入力されたコメントに対して発言する場合もあるため、必ずしも時点差異は正となる（すなわち、コメントの入力時点の方が発音時点よりも遅くなる）訳ではない。 In addition, the live broadcast match degree curve has a peak when the time difference is “TP”, and attenuates as the time difference moves away from “TP”. This is because in the case of live broadcasting, a comment is often input to the sound after listening to the sound of the performer. However, since the performer may speak in response to the input comment, the time difference is not necessarily positive (that is, the comment input time is later than the pronunciation time).

さらに、再放送合致度曲線は、時点差異が「0」のときがピークであり、時点差異「0」から離れるに従って減衰する。前述のように、既に生放送で番組を視聴している視聴者などは、コメント対象とする音声の発音時点と同じ時点でコメントを入力することが多いためである。 Further, the rebroadcast match degree curve has a peak when the time difference is “0”, and attenuates as the distance from the time difference “0” increases. This is because, as described above, viewers who have already watched a program on a live broadcast often input a comment at the same time as the sound of the voice to be commented.
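
The two match degree curves can be visualized with a small Python sketch. The specification only fixes where each curve peaks and that it decays away from the peak, so the bell-shaped (Gaussian) form, the numeric values of `TP`, `SIGMA_LIVE`, and `SIGMA_RE`, and the rule of choosing the curve by whether the shift time is zero are all assumptions made purely for illustration.

```python
import math

TP = 5.0          # hypothetical peak position of the live curve, in seconds
SIGMA_LIVE = 4.0  # hypothetical spread of the live curve
SIGMA_RE = 2.0    # hypothetical spread of the rebroadcast curve


def live_match(diff):
    """Match degree for a live broadcast; diff = input time - utterance time."""
    return math.exp(-((diff - TP) ** 2) / (2 * SIGMA_LIVE ** 2))


def rebroadcast_match(diff):
    """Match degree for a rebroadcast; peaks at a time difference of 0."""
    return math.exp(-(diff ** 2) / (2 * SIGMA_RE ** 2))


def match_degree(diff, shift_time):
    """Pick a curve, assuming a shift time of 0 marks a live broadcast."""
    return live_match(diff) if shift_time == 0 else rebroadcast_match(diff)
```

With these illustrative parameters the rebroadcast curve lies above the live curve in a band around a time difference of 0, which mirrors the stated relation between the two curves in the range from -TD1 to +TD2.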

次に、図４に示した入力部１２０、保存部１３０、出力部１４０、抽出部１５０、及び音声認識部１６０で行われるＣＰＵ１０１の動作について説明する。 Next, operations of the CPU 101 performed by the input unit 120, the storage unit 130, the output unit 140, the extraction unit 150, and the voice recognition unit 160 illustrated in FIG. 4 will be described.

放送が終了すると、音声認識装置１００のユーザは、番組で放送された音声の内容を要約したテキストを生成するように指示する操作（以下、要約生成指示操作という）と、要約を生成させる番組のマルチメディア情報のパスを指定する操作（以下、パス指定操作という）と、を、図２に示したキーボード１０９に行う。 When the broadcast ends, the user of the speech recognition apparatus 100 performs an operation for instructing to generate a text summarizing the contents of the audio broadcast in the program (hereinafter referred to as a summary generation instruction operation), and a program for generating a summary. An operation for designating a path of multimedia information (hereinafter referred to as a path designating operation) is performed on the keyboard 109 shown in FIG.

音声認識装置１００のＣＰＵ１０１は、キーボード１０９から要約生成指示操作に応じた信号を入力すると、図９に示す要約生成処理の実行を開始する。 When the CPU 101 of the speech recognition apparatus 100 receives from the keyboard 109 a signal corresponding to the summary generation instruction operation, it starts executing the summary generation process shown in FIG. 9.

先ず、入力部１２０は、キーボード１０９から出力される信号を入力し、入力した信号に基づいて、パス指定操作で指定されたパス（以下、指定パスという）を特定する（ステップＳ６１）。 First, the input unit 120 receives a signal output from the keyboard 109, and specifies a path specified by a path specifying operation (hereinafter referred to as a specified path) based on the input signal (step S61).

次に、抽出部１５０は、パスにあるマルチメディア情報で表される番組に関連した文を要素とする文集合を生成する、図１５に示す文集合生成処理を実行する（ステップＳ６２）。 Next, the extraction unit 150 executes a sentence set generation process shown in FIG. 15 that generates a sentence set having a sentence related to the program represented by the multimedia information in the path as an element (step S62).

文集合生成処理を開始すると、抽出部１５０は、指定パスに対応付けられた放送ＩＤを、図５に示した放送テーブルから全て検索する（ステップＳ７１）。 When the sentence set generation process is started, the extraction unit 150 searches all broadcast IDs associated with the designated path from the broadcast table shown in FIG. 5 (step S71).

次に、抽出部１５０は、検索された放送ＩＤ（以下、検索放送ＩＤという）それぞれについて、検索放送ＩＤに対応付けられたコメントと、入力時点と、ユーザＩＤと、を、図６に示したコメントテーブルから全て検索する（ステップＳ７２）。これにより、抽出部１５０は、指定パスにあるメディア情報で表される番組が生放送若しくは再放送されたときに入力されたコメントと、当該コメントを発したユーザと、放送の開始日時からの経過時間で表されるコメントの入力時点と、を特定する。 Next, for each retrieved broadcast ID (hereinafter referred to as a search broadcast ID), the extraction unit 150 retrieves from the comment table shown in FIG. 6 all comments, input times, and user IDs associated with the search broadcast ID (step S72). The extraction unit 150 thereby identifies the comments entered while the program represented by the media information at the designated path was broadcast live or rebroadcast, the users who posted those comments, and the input time of each comment, expressed as the elapsed time from the start of the broadcast.

その後、抽出部１５０は、検索されたコメント（以下、検索コメントという）の全てについて、コメントを構成する文（つまり、入力文）を取得し、取得した入力文を、指定されたマルチメディア情報で表される番組に関連した文とする。次に、抽出部１５０は、入力文を要素とする文集合を生成する（ステップＳ７３）。 Thereafter, the extraction unit 150 acquires a sentence (that is, an input sentence) that constitutes a comment for all of the searched comments (hereinafter referred to as a search comment), and the acquired input sentence is designated with the specified multimedia information. The sentence is related to the program being represented. Next, the extraction unit 150 generates a sentence set having the input sentence as an element (step S73).

その後、抽出部１５０は、検索された放送ＩＤそれぞれについて、放送ＩＤに対応付けられたシフト時間を、図５に示した放送テーブルから検索する。次に、抽出部１５０は、入力文の文ＩＤを生成する。その後、検索されたシフト時間を、同じ放送ＩＤで検索されたコメントの入力文に対応したシフト時間とする。 Thereafter, the extraction unit 150 searches the broadcast table shown in FIG. 5 for the shift time associated with the broadcast ID for each searched broadcast ID. Next, the extraction unit 150 generates a sentence ID of the input sentence. Thereafter, the searched shift time is set as the shift time corresponding to the input sentence of the comment searched with the same broadcast ID.

その後、抽出部１５０は、生成した文ＩＤと、当該文と、当該文の種類と、当該文で構成されるコメントの入力時点と、当該文に対応したシフト時間と、を対応付けて、図１１に示した文集合テーブルに保存する（ステップＳ７４）。 Thereafter, the extraction unit 150 associates the generated sentence ID, the sentence, the type of the sentence, the input time of the comment made up of the sentence, and the shift time corresponding to the sentence, and stores them in the sentence set table shown in FIG. 11 (step S74).

コメントから抽出された入力文にシフト時間を対応付けておくのは、シフト時間によって、音声の出力タイミングに対するコメントの入力タイミングが異なると推測されるからである。このため、後の処理のために入力文とシフト時間とを対応付けておく必要があるからである。 The reason why the shift time is associated with the input sentence extracted from the comment is that it is presumed that the input timing of the comment differs from the output timing of the voice depending on the shift time. For this reason, it is necessary to associate the input sentence with the shift time for later processing.
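
Steps S73 and S74 amount to splitting each retrieved comment into sentences and tagging every sentence with the comment's input time and the shift time of its broadcast. The sketch below shows one possible shape for this. The punctuation-based splitting rule, the dictionary keys, the `"input"` type label, and the sequential sentence IDs are assumptions; the specification does not say how sentences are delimited or how IDs are generated.

```python
import re
from itertools import count

_sentence_ids = count(1)   # assumed ID scheme: sequential integers


def split_sentences(text):
    """Split on Japanese or Western sentence terminators (an assumed, simplistic rule)."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [p.strip() for p in parts if p.strip()]


def build_input_rows(comments):
    """comments: iterable of (comment text, input time, shift time of its broadcast)."""
    rows = []
    for text, input_time, shift_time in comments:
        for sentence in split_sentences(text):
            rows.append({
                "sentence_id": next(_sentence_ids),
                "sentence": sentence,
                "kind": "input",            # sentence type: taken from a comment
                "input_time": input_time,   # inherited from the comment
                "shift_time": shift_time,   # inherited from the broadcast
            })
    return rows
```

Keeping the shift time on every row is what lets a later step treat a sentence from a live broadcast differently from one entered during a rebroadcast.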

その後、抽出部１５０は、ステップＳ７１で検索された放送ＩＤそれぞれについて、放送ＩＤに対応付けられた放送開始日時を、図５に示した放送テーブルから検索する（ステップＳ７５）。 Thereafter, the extraction unit 150 searches the broadcast table shown in FIG. 5 for the broadcast start date and time associated with the broadcast ID for each broadcast ID searched in step S71 (step S75).

その後、抽出部１５０は、ステップＳ７２で検索されたコメントそれぞれについて、検索された放送開始日時を入力時点に加算することで、コメントが入力された日時（以下、コメント入力日時という）を特定する（ステップＳ７６）。 Thereafter, the extraction unit 150 identifies the date and time when the comment was input (hereinafter referred to as the comment input date and time) by adding the searched broadcast start date and time to the input time for each of the comments searched at step S72 ( Step S76).

次に、抽出部１５０は、コメント入力日時より所定の時間Ａだけ前の日時から、コメント入力日時より所定の時間Ｂだけ後の日時までの時間区間（以下、コメント入力時期という）を算出する。次に、抽出部１５０は、ステップＳ７２で検索されたコメントそれぞれについて、コメント入力時期に含まれる参照日時と、ステップＳ７２で検索されたユーザＩＤと、に対応付けられたＵＲＬを、図１０に示した参照テーブルから検索する（ステップＳ７７）。これにより、抽出部１５０は、コメント入力時期にユーザが参照した文書を特定し、特定した文書を、当該コメントを入力するためにユーザが参照したページとする。尚、好適な所定の時間Ａ及び所定の時間Ｂは、当業者が実験により定めることはできる。 Next, the extraction unit 150 calculates a time interval (hereinafter referred to as a comment input time) from a date and time that is a predetermined time A before a comment input date and time to a date and time that is a predetermined time B after the comment input date and time. Next, for each comment searched in step S72, the extraction unit 150 shows the URL associated with the reference date and time included in the comment input time and the user ID searched in step S72 in FIG. Search from the reference table (step S77). Accordingly, the extraction unit 150 identifies the document referred to by the user at the comment input time, and sets the identified document as the page referred to by the user for inputting the comment. Note that a suitable predetermined time A and predetermined time B can be determined by experiments by those skilled in the art.
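
Steps S76 and S77 can be sketched in Python as follows. The function names, the tuple layout of the reference table rows, and the treatment of the window boundaries as inclusive are assumptions for illustration; the values of the predetermined times A and B are left to the caller, as the text leaves them to experiment.

```python
from datetime import datetime, timedelta


def comment_datetime(broadcast_start: datetime, input_time_s: float) -> datetime:
    """Step S76: broadcast start plus elapsed input time gives the comment input date and time."""
    return broadcast_start + timedelta(seconds=input_time_s)


def referenced_urls(reference_table, user_id, comment_dt: datetime,
                    time_a: timedelta, time_b: timedelta):
    """Step S77: URLs the user referred to within [comment_dt - A, comment_dt + B]."""
    start, end = comment_dt - time_a, comment_dt + time_b   # the comment input period
    return [url for uid, url, ref_dt in reference_table
            if uid == user_id and start <= ref_dt <= end]
```

Filtering by both user ID and time window restricts the result to documents that the commenting user had open around the moment of commenting, which is the basis for treating those documents as related to the program.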

Next, for every URL retrieved in step S77, the extraction unit 150 acquires the document located at that URL (step S78).

Thereafter, for every acquired document, the extraction unit 150 acquires the sentences contained in the document (hereinafter, reference sentences) and treats them as sentences related to the program represented by the designated multimedia information. The extraction unit 150 then adds the reference sentences to the sentence set (step S79).

This is because a document that a viewer consults while watching a program often contains material related to the program, for example something in the broadcast that the viewer wondered about or wanted to confirm.

Thereafter, the extraction unit 150 stores the reference sentences in the sentence set table shown in FIG. 11 (step S78) and then ends the sentence set generation process. Specifically, the extraction unit 150 generates a sentence ID for each reference sentence and stores, in association with one another in the sentence set table, the generated sentence ID, the sentence, the type of the sentence, the input time point of the comment used to retrieve the document containing the sentence, and the shift time corresponding to the sentence.

The shift time is associated with a reference sentence extracted from a referenced document because the timing at which a document is referred to, relative to the timing at which the speech is output, is presumed to differ depending on the shift time. The reference sentence and the shift time therefore need to be associated with each other for later processing.

After step S62 in FIG. 9, the extraction unit 150 executes the candidate word extraction process shown in FIG. 16, which extracts, from the sentences in the sentence set, candidates for the words that represent the speech broadcast in the program (that is, candidate words) (step S63).

When the candidate word extraction process starts, the extraction unit 150 acquires all the sentences in the sentence set (step S81) and performs morphological analysis on each of them (step S82). The extraction unit 150 thereby extracts, from the respective sentences, all the words that make up the input sentences (that is, input words) and all the words that make up the reference sentences (that is, reference words) (step S83).

Thereafter, for each extracted input word, the extraction unit 150 searches the co-occurrence word table shown in FIG. 12 for the co-occurrence words associated with that input word (that is, input co-occurrence words). The extraction unit 150 treats an input co-occurrence word retrieved for an input word as a word presumed to be used in what a performer of the program says (that is, to co-occur in the utterance) when that input word has been entered as part of a comment on the program.

Likewise, for each extracted reference word, the extraction unit 150 searches the co-occurrence word table for the co-occurrence words associated with that reference word (that is, reference co-occurrence words) (step S84). The extraction unit 150 treats a reference co-occurrence word retrieved for a reference word as a word presumed to be used in a performer's utterance when a viewer has consulted that reference word in order to comment on the program.

Next, the extraction unit 150 takes as candidate words the input words and reference words extracted in step S83 together with the input co-occurrence words and reference co-occurrence words retrieved in step S84 (step S85).

Thereafter, the extraction unit 150 stores the candidate words in the candidate word table shown in FIG. 13 (step S86) and then ends the candidate word extraction process.

Specifically, the extraction unit 150 generates, for each candidate word, a candidate word ID that identifies it. Next, for an input word, for the co-occurrence words of that input word, for the reference words contained in a document retrieved on the basis of a comment containing that input word, and for the co-occurrence words of those reference words, the extraction unit 150 sets the corresponding input time point to the input time point of the input sentence from which the input word was extracted.

Next, for each candidate word that is an input word, the extraction unit 150 stores in the candidate word table, in association with one another, the candidate word ID, the candidate word, its type, the input time point corresponding to it, and the shift time associated with the input sentence containing it. For each candidate word that is an input co-occurrence word, it stores the candidate word ID, the candidate word, its type, the corresponding input time point, and the shift time corresponding to the input word with which co-occurrence is presumed. For each candidate word that is a reference word, it stores the candidate word ID, the candidate word, its type, the corresponding input time point, and the shift time corresponding to the reference sentence containing it. For each candidate word that is a reference co-occurrence word, it stores the candidate word ID, the candidate word, its type, the corresponding input time point, and the shift time associated with the reference word with which co-occurrence is presumed.
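The candidate word extraction process (steps S81 to S86) might be sketched as below. This is only an illustration under stated assumptions: whitespace splitting stands in for morphological analysis, the co-occurrence word table is modeled as a plain dictionary, and the row fields and the function name `extract_candidate_rows` are hypothetical rather than taken from FIG. 13.

```python
def extract_candidate_rows(sentences, cooccurrence_table):
    """Build candidate word rows from input and reference sentences.

    sentences: dicts with "text", "kind" ("input" or "reference"),
               "input_time" and "shift_time" (assumed layout).
    cooccurrence_table: word -> list of co-occurrence words (assumed layout).
    """
    rows = []
    for sentence in sentences:
        kind = sentence["kind"]
        # Whitespace splitting is a stand-in for morphological analysis (step S82).
        for word in sentence["text"].split():
            # The word itself (step S83) followed by its co-occurrence words (step S84).
            candidates = [(word, f"{kind} word")]
            candidates += [
                (co_word, f"{kind} co-occurrence word")
                for co_word in cooccurrence_table.get(word, [])
            ]
            for candidate, word_type in candidates:
                rows.append(
                    {
                        "candidate_id": len(rows) + 1,
                        "word": candidate,
                        "type": word_type,
                        # Every row inherits the time data of the originating sentence.
                        "input_time": sentence["input_time"],
                        "shift_time": sentence["shift_time"],
                    }
                )
    return rows


rows = extract_candidate_rows(
    [{"text": "goal keeper", "kind": "input", "input_time": 120, "shift_time": 0}],
    {"goal": ["soccer"]},
)
```

Here the co-occurrence word "soccer" receives the same input time point and shift time as the input word "goal" from which it was derived, as the description requires.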

After the candidate words have been extracted in step S63 shown in FIG. 9, the speech recognition unit 160 shown in FIG. 4 calculates an appearance likelihood for each candidate word (step S64).

An example of the processing in step S64 is described here.
The speech recognition unit 160 retrieves all the candidate words stored in the candidate word table shown in FIG. 13. Next, the speech recognition unit 160 assigns a first predetermined value as the appearance likelihood of each candidate word that is an input word. The first predetermined value represents how likely it is that the input word appears in the program's speech (for example, that the input word is spoken in the program), given that the input word has been entered as part of a comment on the program.

The speech recognition unit 160 also assigns a second predetermined value as the appearance likelihood of each candidate word that is a reference word. The second predetermined value represents how likely it is that the reference word appears in the program's speech, given that the comment used to retrieve the reference word has been entered as a comment on the program. Suitable values for the first and second predetermined values can be determined experimentally by a person skilled in the art.

In addition, for each candidate word that is a co-occurrence word of an input word, the extraction unit 150 retrieves from the co-occurrence word table shown in FIG. 12 the co-occurrence likelihood associated with that input word and that co-occurrence word. The extraction unit 150 then assigns, as the appearance likelihood of the co-occurrence word, a value obtained by adjusting the first predetermined value using the retrieved co-occurrence likelihood (hereinafter, the first adjusted value). The first adjusted value represents how likely it is that the co-occurrence word appears in an utterance in the program, given that a comment containing the input word has been entered; the higher the co-occurrence likelihood, the higher the adjusted value.

Furthermore, for each candidate word that is a co-occurrence word of a reference word, the extraction unit 150 retrieves from the co-occurrence word table shown in FIG. 12 the co-occurrence likelihood associated with that reference word and that co-occurrence word. The extraction unit 150 then assigns, as the appearance likelihood of the co-occurrence word, a value obtained by adjusting the second predetermined value using the retrieved co-occurrence likelihood (hereinafter, the second adjusted value). The second adjusted value represents how likely it is that the co-occurrence word appears in an utterance in the program, given that the comment used to retrieve the reference word has been entered; the higher the co-occurrence likelihood, the higher the adjusted value.
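The assignment of appearance likelihoods in step S64 can be illustrated as follows. The description does not state how the predetermined values are adjusted, only that the adjusted value rises with the co-occurrence likelihood, so the multiplication used here is an assumption, as are the constants `FIRST_VALUE` and `SECOND_VALUE` and the function name.

```python
# Hypothetical predetermined values; the description leaves them to experiment.
FIRST_VALUE = 0.8   # appearance likelihood of an input word
SECOND_VALUE = 0.5  # appearance likelihood of a reference word


def appearance_likelihood(source, cooccurrence_likelihood=None):
    """Return the appearance likelihood of a candidate word.

    source: "input" or "reference", the kind of word the candidate stems from.
    cooccurrence_likelihood: None for the word itself, or a value in [0, 1]
        for a co-occurrence word of that word.
    """
    base = FIRST_VALUE if source == "input" else SECOND_VALUE
    if cooccurrence_likelihood is None:
        return base
    # Assumed adjustment: scale the base value, so a higher co-occurrence
    # likelihood always yields a higher adjusted value.
    return base * cooccurrence_likelihood


input_word_value = appearance_likelihood("input")
first_adjusted_value = appearance_likelihood("input", 0.5)
second_adjusted_value = appearance_likelihood("reference", 0.5)
```

Any other monotonically increasing adjustment would satisfy the description equally well; multiplication is simply the shortest one to write down.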

After step S64 shown in FIG. 9 has been executed, the input unit 120 reads multimedia information of a predetermined size from the designated path identified in step S61 (step S65).

Next, the speech recognition unit 160 shown in FIG. 4 executes the continuous speech recognition process shown in FIG. 17, which recognizes the speech X of the program represented by the multimedia information read in step S65 (hereinafter, program speech X) (step S66).

The continuous speech recognition process executed by the speech recognition unit 160 is described in Non-Patent Document 1, so only an outline is given below.

Given the program speech X read in step S65 as input, the continuous speech recognition process searches for the word string W* that maximizes the probability p(W|X) that the content of the program speech X is represented by the word string W.

By Bayes' rule, the probability p(W|X) can be rewritten as in equation (1) below.

The denominator p(X) can be ignored, because it is a normalization factor that does not affect the choice of the word string W.

Consequently, the word string W* that maximizes p(W|X), expressed by equation (2) below, is also expressed by equation (3) or equation (4) below.

In this embodiment the speech recognition unit 160 is described as searching for the word string W* that satisfies equation (3); however, this is not a limitation, and it may instead search for the word string W* that satisfies equation (4).
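Equations (1) to (4) are not reproduced in this text. Based on the surrounding description (Bayes' rule, then dropping the denominator p(X)), equations (1) to (3) would take the standard form below. The logarithmic form given for equation (4) is an assumption, chosen because it is the usual equivalent of equation (3); the original equation may differ.

```latex
\begin{align}
p(W \mid X) &= \frac{p(X \mid W)\, p(W)}{p(X)} \tag{1} \\
W^{*} &= \operatorname*{arg\,max}_{W}\; p(W \mid X) \tag{2} \\
W^{*} &= \operatorname*{arg\,max}_{W}\; p(X \mid W)\, p(W) \tag{3} \\
W^{*} &= \operatorname*{arg\,max}_{W}\; \bigl\{ \log p(X \mid W) + \log p(W) \bigr\} \tag{4}
\end{align}
```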

When the speech recognition process starts, the speech recognition unit 160 performs signal processing that extracts the program speech X from the audio signal represented by the multimedia information read in step S65 shown in FIG. 9, based on, for example, frequency and sound pressure (step S91).

Next, the speech recognition unit 160 matches the changes in frequency of the extracted program speech X against the frequency patterns of phonemes and syllables represented by the acoustic model stored in the storage unit 190. It thereby decomposes the program speech X into phonemes and the like and generates the sequence X = {x_1, x_2, ..., x_k} of phonemes and the like that represents the program speech X (step S92).

Thereafter, the speech recognition unit 160 identifies the utterance time point at which the program speech X was uttered and expresses it as the time elapsed from the broadcast start date and time until the speech was uttered (step S93).

Next, for every candidate word stored in the candidate word table shown in FIG. 13, the speech recognition unit 160 calculates the difference between the input time point corresponding to the candidate word and the utterance time point of the extracted program speech (that is, the time point difference) (step S94).

Thereafter, the speech recognition unit 160 retrieves the shift time corresponding to each candidate word stored in the candidate word table shown in FIG. 13. For a candidate word whose shift time is less than or equal to a predetermined value, the speech recognition unit 160 calculates the matching degree from the time point difference calculated in step S94 and the live broadcast matching degree curve represented by data stored in the storage unit 190. For a candidate word whose shift time is greater than the predetermined value, it calculates the matching degree from the calculated time point difference and the rebroadcast matching degree curve represented by data stored in the storage unit 190 (step S95).

Next, the speech recognition unit 160 initializes to the value "0" the variable j used to count the number of word strings W generated (step S96).

Next, the speech recognition unit 160 selects the candidate words w_1 to w_k that make up the word string W = {w_1, w_2, ..., w_k} such that a candidate word with a higher matching degree is selected with higher probability. Likewise, a candidate word with a higher appearance likelihood is selected with higher probability. The speech recognition unit 160 then generates the word string W made up of the selected candidate words w_1 to w_k (step S97). The number k of candidate words in the word string W is determined stochastically when step S97 is executed.

Thereafter, using the word dictionary stored in the storage unit 190, the speech recognition unit 160 generates a sequence of phonemes and the like for each candidate word in the word string W, and thereby generates the sequence M = {m_1, m_2, ..., m_i} of phonemes and the like that represents the pronunciation of the word string W (step S98).

Next, using equation (5) below, the speech recognition unit 160 calculates the probability p(X|W) that the program speech X arises from the word string W (step S99). Because p(X|W) expresses how closely the phoneme sequence representing the pronunciation of the word string W matches the phoneme sequence of the program speech, it is referred to as the degree of coincidence.

The speech recognition unit 160 compares the acoustic features of the phoneme or the like m_i represented by the acoustic model with those of the phoneme or the like x_i decomposed from the audio signal, and sets p(x_i|m_i) closer to "1" the more closely they agree and closer to "0" the more they differ.
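Equation (5) is not reproduced in this text. Given that the degree of coincidence is built from the per-element probabilities p(x_i|m_i) described above, a plausible reconstruction is the product below. It assumes that the sequences X and M have been aligned element by element, which the description does not state explicitly.

```latex
\begin{equation}
p(X \mid W) = \prod_{i} p(x_i \mid m_i) \tag{5}
\end{equation}
```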

Next, the speech recognition unit 160 calculates, using equation (6) below, the degree of connection p(W), which is the probability that the word string W occurs at the time the program speech X is input and which expresses linguistic plausibility independent of the program speech X. In doing so, the speech recognition unit 160 approximates equation (6) by equation (7) and calculates an approximate value of p(W) using an N-gram language model (step S100). The approximation reduces the amount of computation.
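Equations (6) and (7) are likewise missing from this text. For an N-gram language model, the usual chain-rule expansion and its N-gram approximation are as follows; this is offered as a likely reconstruction under that assumption, not as the original equations.

```latex
\begin{align}
p(W) &= \prod_{i=1}^{k} p(w_i \mid w_1, \ldots, w_{i-1}) \tag{6} \\
p(W) &\approx \prod_{i=1}^{k} p(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) \tag{7}
\end{align}
```

Equation (7) conditions each word on only the preceding N-1 words, which is what reduces the amount of computation.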

Thereafter, the speech recognition unit 160 multiplies p(X|W) calculated in step S99 by the degree of connection p(W) calculated in step S100 to obtain p(W|X), up to the normalization factor p(X) that is ignored as explained above (step S101).

Thereafter, the speech recognition unit 160 increments the variable j by "1" (step S102) and then determines whether the variable j is greater than a predetermined value Th (step S103). If the speech recognition unit 160 determines that the variable j is less than or equal to the predetermined value Th (step S103; No), it returns to step S97 and repeats the above processing. A suitable value for Th can be determined experimentally by a person skilled in the art.

In contrast, when the speech recognition unit 160 determines that the variable j is greater than the predetermined value Th (step S103; Yes), it identifies, among the Th word strings W for which the calculation was performed, the word string W* that maximizes p(W|X) (that is, that satisfies equations (2) to (4)) (step S104), and then ends the continuous speech recognition process.
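The generate-and-score loop of steps S96 to S104 can be sketched as below. Several points are assumptions made for the illustration: each candidate carries a single selection weight (for example, the product of its matching degree and appearance likelihood), the string length k is drawn uniformly from 1 to `max_len`, and the scoring function standing in for p(X|W) p(W) is supplied by the caller. The names `recognize` and `toy_score` are hypothetical.

```python
import random


def recognize(candidates, score, th, rng, max_len=3):
    """Sample word strings and keep the one with the highest score.

    candidates: list of (word, weight) pairs; a higher weight means the word
        is more likely to be selected (step S97).
    score: function mapping a list of words to p(X|W) * p(W) (steps S98 to S101).
    th: the predetermined value Th controlling the number of iterations.
    """
    words = [word for word, _ in candidates]
    weights = [weight for _, weight in candidates]
    best, best_score = None, float("-inf")
    j = 0  # step S96
    while True:
        k = rng.randint(1, max_len)  # the length k is chosen stochastically
        word_string = rng.choices(words, weights=weights, k=k)  # step S97
        current = score(word_string)
        if current > best_score:
            best, best_score = word_string, current
        j += 1  # step S102
        if j > th:  # step S103
            return best, best_score  # step S104


calls = []


def toy_score(word_string):
    """Toy stand-in for p(X|W) * p(W): favors the single word "goal"."""
    calls.append(word_string)
    if all(word == "goal" for word in word_string):
        return 1.0 / len(word_string)
    return 0.0


best, best_score = recognize(
    [("goal", 0.9), ("gold", 0.1)], toy_score, th=200, rng=random.Random(0)
)
```

Followed literally, the flow (initialize j to 0, generate, increment, stop once j exceeds Th) scores Th + 1 word strings, which is what this sketch does.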

After the continuous speech recognition process of step S66 shown in FIG. 9 has been executed, the speech recognition unit 160 appends the recognized word string W* to the summary (step S67).

Thereafter, the input unit 120 advances the read position in the electronic file at the above path by the size of the multimedia information that was read. Next, the input unit 120 determines whether the read position is at EOF, the end of the electronic file (step S68). If the input unit 120 determines that the read position is not at EOF (step S68; No), the above processing is repeated from step S65.
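The read loop of steps S65 to S68 amounts to reading fixed-size chunks until the end of the file. A minimal sketch follows; it reads from any binary stream so that it can be tried without a real media file, and the chunk size, the sample bytes, and the function name `read_chunks` are assumptions for the example.

```python
import io


def read_chunks(stream, chunk_size):
    """Yield successive chunks of at most chunk_size bytes until EOF."""
    while True:
        chunk = stream.read(chunk_size)  # step S65; the read position advances
        if not chunk:                    # step S68: an empty read means EOF
            return
        yield chunk


# An in-memory stream stands in for the electronic file at the designated path.
stream = io.BytesIO(b"abcdefghij")
chunks = list(read_chunks(stream, 4))
```

In the described process, each chunk would be passed to the continuous speech recognition process and the recognized word string appended to the summary before the next chunk is read.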

If, in step S68, the input unit 120 determines that the read position is at EOF (step S68; Yes), the output unit 140 outputs the summary to the video card 107 shown in FIG. 2 (step S69). The video card 107 then causes the LCD 108 to display the summary.

Next, the output unit 140 stores the designated path in the storage unit 190 in association with the text representing the summary of the speech represented by the multimedia information at that path (step S70), and then ends the summary generation process. This makes it possible to search for multimedia information by keyword.

A comment on speech output by playing back multimedia information often contains a word that represents the content of that speech or a co-occurrence word of such a word. With these configurations, the speech recognition apparatus 100 therefore treats the words that make up a comment (that is, input words) and their co-occurrence words (that is, input co-occurrence words) as candidates for the words representing the content of the speech (that is, candidate words), and so can recognize the speech more appropriately than before. In other words, the speech recognition apparatus 100 uses the comments attached to multimedia information to recognize the speech contained in that multimedia information more appropriately than before.

In addition, a user who enters a comment about the speech of a program often looks up or checks the meaning of what was said in a document. A document viewed by a user who listened to the multimedia information and entered a comment therefore often contains a word that represents the content of the speech output by playing back the multimedia information, or a co-occurrence word of such a word. With these configurations, the speech recognition apparatus 100 treats the words that make up a document referred to by the user (that is, reference words) and their co-occurrence words (that is, reference co-occurrence words) as candidates for the words representing the content of the speech (that is, candidate words), and so can recognize the speech more appropriately than before.

Furthermore, with these configurations, speech is recognized on the basis not only of the degree of coincidence between the phonemes recognized from the speech and the phonemes representing the pronunciation of a candidate word, but also of the appearance likelihood of the candidate word. The speech can therefore be recognized more accurately than by a conventional speech recognition apparatus that recognizes speech on the basis of the degree of coincidence alone.

In addition, the time point at which speech is uttered and the time point at which a comment on that speech is entered usually agree closely; for example, they seldom differ by more than a predetermined time. The speech recognition apparatus 100 recognizes speech on the basis of the matching degree between the input time point corresponding to a candidate word and the utterance time point of the speech, together with the comment containing that candidate word, and can therefore recognize speech more accurately than before.

As described above, viewers who have already watched a program in its live broadcast, and viewers who watch the same program repeatedly in rebroadcasts, tend to enter comments closer to the utterance time point of the speech they are commenting on than viewers watching the program for the first time in its live broadcast. As shown in FIG. 14, the rebroadcast matching degree curve stored in the speech recognition apparatus 100 lies above the live broadcast matching degree curve in the range of time point differences from "-TD1" to "TD2". Consequently, for the same candidate word and the same time point difference within the range from "-TD1" to "TD2", a word entered or referred to during a rebroadcast, or a co-occurrence word of such a word, is more likely to be adopted in the word string W generated by the continuous speech recognition process shown in FIG. 17 than a word entered or referred to during the live broadcast, or a co-occurrence word of such a word.

Also as described above, viewers who have already watched a program in its live broadcast often enter comments during a rebroadcast at a time point close to the utterance time point of the speech they are commenting on. As shown in FIG. 14, the rebroadcast matching degree curve stored in the speech recognition apparatus 100 peaks at a time point difference of "0" and decays as the time point difference moves away from "0". Consequently, for the same candidate word, when both instances are words entered or referred to during a rebroadcast or co-occurrence words of such words, the one with the smaller difference between the utterance time point and the time point of entry or reference is more likely to be adopted in the word string W generated by the continuous speech recognition process.

In contrast, viewers of a live broadcast often enter a comment on a performer's speech after hearing it. As shown in FIG. 14, the live broadcast matching degree curve stored in the speech recognition apparatus 100 peaks at a time point difference of "TP" and decays as the time point difference moves away from "TP". Consequently, for the same candidate word, when both instances are words entered or referred to during the live broadcast or co-occurrence words of such words, the one whose difference between the utterance time point and the time point of entry or reference is closer to "TP" is more likely to be adopted in the word string W generated by the continuous speech recognition process. The speech recognition apparatus 100 can thereby recognize speech more accurately than before.
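The choice between the two matching degree curves can be illustrated with a small sketch. The actual curve shapes are defined by FIG. 14 and by data in the storage unit 190, neither of which is available here, so the triangular shape, the half width, the sign convention (input time point minus utterance time point), and all parameter names below are assumptions. Only the peak positions, "0" for rebroadcast and "TP" for live broadcast, follow the description.

```python
def triangular(diff, peak, half_width):
    """Hypothetical curve: 1.0 at the peak, falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(diff - peak) / half_width)


def matching_degree(time_diff, shift_time, shift_threshold, tp, half_width):
    """Pick the curve from the shift time, then evaluate it at time_diff.

    A shift time at or below the threshold is treated as live viewing
    (peak at TP); a larger shift time is treated as a rebroadcast (peak at 0).
    """
    peak = tp if shift_time <= shift_threshold else 0.0
    return triangular(time_diff, peak, half_width)


# Hypothetical numbers, in seconds: TP = 10, half width = 30, threshold = 60.
live_at_tp = matching_degree(10, shift_time=0, shift_threshold=60, tp=10, half_width=30)
live_at_zero = matching_degree(0, shift_time=0, shift_threshold=60, tp=10, half_width=30)
rebroadcast_at_zero = matching_degree(0, shift_time=3600, shift_threshold=60, tp=10, half_width=30)
```

With these numbers a live comment scores highest when it trails the utterance by TP, while a rebroadcast comment scores highest when it coincides with the utterance.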

In this embodiment the communication network 10 shown in FIG. 1 has been described as the Internet; however, this is not a limitation, and it may be a LAN (Local Area Network) or a public line network.

In this embodiment the multimedia information has been described as representing the video and audio of a program; however, this is not a limitation, and it may represent only the audio of a program.

<Example 2>
Like the speech recognition apparatus 100 according to the first embodiment, the speech recognition apparatus 200 according to the second embodiment of the present invention forms part of the speech recognition system 1 shown in FIG. 1. The following description focuses on the differences from the first embodiment and omits what the two embodiments have in common.

The hardware configuration of the speech recognition apparatus 200 is the same as that of the speech recognition apparatus 100 according to the first embodiment, so its description is omitted.

Next, the functions of the speech recognition apparatus 200 are described.
By executing the summary generation process shown in FIG. 18, the CPU of the speech recognition apparatus 200 according to the second embodiment functions as the input unit 220, the saving unit 230, the output unit 240, the extraction unit 250, the speech recognition unit 260, and the co-occurrence likelihood calculation unit 270 shown in FIG. 19. The CPU of the speech recognition apparatus 200 also functions as the storage unit 290 in cooperation with the hard disk 104. The input unit 220, the saving unit 230, the output unit 240, the extraction unit 250, the speech recognition unit 260, and the storage unit 290 have the same functions as the input unit 120, the saving unit 130, the output unit 140, the extraction unit 150, the speech recognition unit 160, and the storage unit 190 described in the first embodiment.

For each user of the terminal devices 20 to 40, the co-occurrence likelihood calculation unit 270 determines the words appearing in documents the user has referred to, the co-occurrence words used together with each such word in those documents, and the co-occurrence likelihood of each co-occurrence word.

The accumulation unit 290 stores the co-occurrence word table shown in FIG. 20 instead of the one shown in FIG. 12. This table holds multiple records, each associating a user ID, a word appearing in a document referred to by the user identified by that user ID, a co-occurrence word of that word, and a likelihood (hereinafter, the co-occurrence likelihood) indicating how likely the word and the co-occurrence word are to be used together (that is, to co-occur) in comments or documents.

Next, the operations that the CPU performs in each functional unit shown in FIG. 19 will be described.

When a signal corresponding to a summary generation instruction operation is input from the keyboard, the CPU of the speech recognition apparatus 200 starts executing the summary generation process shown in FIG. 18.

When the summary generation process starts, the co-occurrence likelihood calculation unit 270 executes a co-occurrence likelihood calculation process that calculates co-occurrence likelihoods (step S60).

In the co-occurrence likelihood calculation process, the co-occurrence likelihood calculation unit 270 looks up, for each user ID stored in the reference table shown in FIG. 10, the URLs associated with that user ID. It then acquires the document at each of the retrieved URLs. For every acquired document, it determines the posted words appearing in the document, the co-occurrence words used together with each posted word in that document, and the co-occurrence count, that is, the number of times each co-occurrence word was used together with the posted word. It then calculates a co-occurrence likelihood from the co-occurrence count for every combination of posted word and co-occurrence word. Finally, for each co-occurrence likelihood equal to or greater than a predetermined value, it stores the user ID, the posted word, the co-occurrence word, and the co-occurrence likelihood in association with one another in the co-occurrence word table shown in FIG. 20.
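The calculation process above can be illustrated with a minimal Python sketch. The text does not state how the likelihood is derived from the co-occurrence count, so the relative-frequency formula used here is an assumption, as are the function name `build_cooccurrence_table`, the whitespace tokenizer (Japanese text would need morphological analysis), the `threshold` default, and the choice to pass document texts in directly rather than fetch them from URLs.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence_table(user_docs, threshold=0.1):
    """Return rows (user_id, word, co_word, likelihood) with likelihood >= threshold.

    user_docs maps a user ID to the texts of the documents that user referred to.
    """
    rows = []
    for user_id, docs in user_docs.items():
        pair_counts = defaultdict(Counter)  # posted word -> counts of co-occurring words
        for text in docs:
            words = set(text.lower().split())  # assumed tokenizer: whitespace split
            for word, co_word in permutations(words, 2):
                pair_counts[word][co_word] += 1  # one count per shared document
        for word, counts in pair_counts.items():
            total = sum(counts.values())
            for co_word, n in counts.items():
                likelihood = n / total  # assumed formula: relative frequency
                if likelihood >= threshold:
                    rows.append((user_id, word, co_word, likelihood))
    return rows
```

Each returned row corresponds to one record of the per-user table: the same word pair can carry a different likelihood for a different user ID.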

After step S60 shown in FIG. 18, the processes of steps S61 to S63 are executed.

The speech recognition unit 260 then calculates an appearance likelihood for each candidate word (step S64). When a candidate word is an input co-occurrence word, the speech recognition unit 260 identifies the user ID of the user who input the input word that co-occurs with it, and looks up the co-occurrence likelihood associated with that user ID, the input word, and the input co-occurrence word in the co-occurrence word table shown in FIG. 20. It then calculates the appearance likelihood using the retrieved co-occurrence likelihood. Likewise, when a candidate word is a reference co-occurrence word, the speech recognition unit 260 identifies the user ID of the user who referred to the reference word that co-occurs with it, looks up the co-occurrence likelihood associated with that user ID, the reference word, and the reference co-occurrence word in the same table, and calculates the appearance likelihood using the retrieved co-occurrence likelihood.
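The per-user lookup in step S64 might be sketched as follows. The text says only that the appearance likelihood is calculated "using" the retrieved co-occurrence likelihood, so multiplying it by the appearance likelihood of the source word is an assumption made for illustration. The names `appearance_likelihood`, `source_word`, and `source_likelihood` are hypothetical.

```python
def appearance_likelihood(user_id, source_word, candidate, table, source_likelihood):
    """Appearance likelihood of a co-occurrence candidate word.

    table maps (user_id, word, co_word) to a co-occurrence likelihood.
    source_word is the input word or reference word the candidate co-occurs with.
    """
    # Look up the record keyed by the user who input (or referred to) the source word.
    co_likelihood = table.get((user_id, source_word, candidate), 0.0)
    # Assumed combination rule: scale the source word's likelihood by the
    # co-occurrence likelihood found for this particular user.
    return source_likelihood * co_likelihood
```

Keying the lookup on the user ID is what separates this step from a table shared by all users.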

The speech recognition unit 260 then executes steps S65 to S70 and ends the summary generation process.

With this configuration, the speech recognition apparatus 200 treats words appearing in documents referred to by viewers as posted words and the words used together with them in those documents as co-occurrence words, and calculates a co-occurrence likelihood based on the number of times a posted word and a co-occurrence word co-occurred in the documents. Using the calculated co-occurrence likelihoods, the speech recognition apparatus 200 calculates the appearance likelihood of the co-occurrence words of words that viewers input or referred to, and recognizes speech based on those appearance likelihoods and on the degree of match between the pronunciation of each co-occurrence word and the speech. The words that viewers use together in comments, and the words that appear together in documents, change with current topics, trends of the times, and viewer preferences. Because the co-occurrence likelihoods are calculated from the documents the viewers actually referred to, the speech recognition apparatus 200 can recognize speech accurately even when topics, trends, or viewer preferences change.
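As a rough illustration of recognition based on both the appearance likelihood and the degree of match between pronunciation and speech, the sketch below scores each candidate by the product of the two and picks the best. The product rule, the function name `recognize`, and the use of `difflib.SequenceMatcher` as a stand-in for an acoustic match score are all assumptions; the text does not specify how the two quantities are combined.

```python
from difflib import SequenceMatcher


def recognize(heard_phonemes, candidates):
    """Pick the candidate word that best explains the heard phonemes.

    candidates is a list of (word, phonemes, appearance_likelihood) tuples.
    """
    def score(candidate):
        _, phonemes, likelihood = candidate
        # Stand-in match degree: similarity ratio of the two phoneme sequences.
        match = SequenceMatcher(None, heard_phonemes, phonemes).ratio()
        return match * likelihood  # assumed combination rule: product

    return max(candidates, key=score)[0]
```

With this rule, a candidate whose phonemes match imperfectly can still win if its appearance likelihood is high enough.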

<Example 3>
In the first embodiment, the speech recognition apparatus 100 was described as generating a video with comments composited into it in step S17 shown in FIG. 3, and as outputting multimedia information representing that video to the LAN card 106 shown in FIG. 2 in step S19. The LAN card 106 was described as transmitting the multimedia information to the terminal devices 20 and 30, which display the comment-composited video in the video display area AM of the viewing screen shown in FIG. 7.

The speech recognition apparatus according to the third embodiment does not generate a comment-composited video in step S17 shown in FIG. 3; instead, in step S19, it outputs the multimedia information and the comment information to the LAN card 106. The LAN card 106 transmits the multimedia information and the comment information to the terminal device.

The terminal device according to the third embodiment displays a viewing screen such as that shown in FIG. 21. This viewing screen has the video display area AM and the comment display area AC described in the first embodiment, and a comment display field UL overlaid on the video display area AM (that is, belonging to a layer above the video display area AM). On receiving the multimedia information and the comment information, the terminal device displays the video represented by the multimedia information in the video display area AM, and displays the comments represented by the comment information in both the comment display field UL overlaid on the video display area AM and the comment display area AC. The frame of the comment display field UL is drawn with a dotted line in the figure for convenience, but it is not displayed on the viewing screen.

<Example 4>
The speech recognition apparatus 100 according to the fourth embodiment not only broadcasts programs live and rebroadcasts them but also delivers programs by VOD (Video On Demand). The terminal devices 20 to 40 display the video and output the audio not only of live or rebroadcast programs but also of programs delivered by VOD.

The following description assumes that the user of the terminal device 40 performs, on the terminal device 40, an operation that causes it to transmit a request for VOD delivery of a program that was broadcast live (hereinafter, a VOD delivery request).

In response to the operation, the terminal device 40 transmits a VOD delivery request to the speech recognition apparatus 100. On receiving the VOD delivery request from the terminal device 40, the speech recognition apparatus 100 reads the multimedia information representing the requested program and starts delivering it to the terminal device 40. The terminal device 40 starts displaying the program video and outputting the program audio represented by the multimedia information received from the speech recognition apparatus 100.

Suppose the user of the terminal device 40 then performs a skip operation on the terminal device 40 that advances the playback position of the delivered program by a predetermined time.

The terminal device 40 stops displaying the program video and outputting the program audio, and transmits to the speech recognition apparatus 100 a skip command specifying the skip and the amount of time to skip. On receiving the skip command, the speech recognition apparatus 100 shifts the read position of the multimedia information toward the end by a size corresponding to the time specified in the skip command, and then continues reading and delivering the multimedia information. The terminal device 40 then resumes displaying the program video and outputting the program audio represented by the delivered multimedia information.

If the user of the terminal device 40 then performs a skip operation that moves the playback position of the delivered program back by a predetermined time, the terminal device 40 stops displaying the program video and outputting the program audio and, using the multimedia information it has already saved, resumes playback of the program video and output of the program audio from the position that precedes the current playback position by a size corresponding to the time specified in the skip operation.

When the user of the terminal device 40 performs an operation to pause playback of the delivered program, the terminal device 40 stops displaying the program video and outputting the program audio. If the user then performs an operation for frame-by-frame playback of the delivered program, the terminal device 40 stops outputting the program audio and starts frame-by-frame playback of the program video using the delivered or already saved multimedia information.

When the user then performs an operation to stop playback of the program, the terminal device 40 stops the display of the program video and the audio output, and transmits a stop command instructing the stop to the speech recognition apparatus 100. On receiving the stop command from the terminal device 40, the speech recognition apparatus 100 stops delivering the multimedia information in accordance with the command.
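The server-side handling of the skip and stop commands described above could look roughly like the following Python sketch. The class `VodSession`, the command names, and the constant-bitrate conversion from time to size are assumptions for illustration; the text says only that the read position is shifted by a size corresponding to the specified time.

```python
class VodSession:
    """Server-side read state for one VOD delivery (hypothetical helper)."""

    def __init__(self, total_bytes, bytes_per_second):
        self.total_bytes = total_bytes
        self.bytes_per_second = bytes_per_second  # assumed constant bitrate
        self.read_pos = 0  # byte offset of the next data to deliver
        self.active = True

    def handle(self, command, seconds=0):
        """Apply a command from the terminal device and return the read position."""
        if command == "skip":
            # Convert the requested time to a size and move toward the end,
            # without running past the end of the multimedia information.
            offset = int(seconds * self.bytes_per_second)
            self.read_pos = min(self.read_pos + offset, self.total_bytes)
        elif command == "stop":
            self.active = False  # delivery ends; the read position is kept
        else:
            raise ValueError(f"unknown command: {command}")
        return self.read_pos
```

Backward skips, pause, and frame-by-frame playback are absent from this sketch because the description has the terminal device handle them locally, using the multimedia information it has already saved or received.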

The first to fourth embodiments can be combined with one another. The invention can be provided as a speech recognition apparatus 100 equipped with a configuration for realizing the functions of any of the first to fourth embodiments, and also as a system composed of multiple apparatuses that, as a whole, has a configuration for realizing the functions of any of the first to fourth embodiments.

The invention can be provided as the speech recognition apparatus 100 pre-equipped with a configuration for realizing the functions of the first embodiment, the speech recognition apparatus 200 pre-equipped with a configuration for realizing the functions of the second embodiment, or a speech recognition apparatus pre-equipped with a configuration for realizing the functions of the third or fourth embodiment. In addition, by applying a program, an existing speech recognition apparatus can be made to function as any of these apparatuses. That is, when a speech recognition program that realizes the functional configuration of the speech recognition apparatus 100 exemplified in the first embodiment, the speech recognition apparatus 200 exemplified in the second embodiment, or the speech recognition apparatus exemplified in the third or fourth embodiment is applied so that a computer (such as a CPU) controlling an existing speech recognition apparatus can execute it, the existing apparatus can be made to function as the corresponding speech recognition apparatus.

Such a program may be distributed by any method. For example, it can be stored and distributed on a recording medium such as a memory card, a CD-ROM, or a DVD-ROM, or distributed via a communication medium such as the Internet. The speech recognition method according to the present invention can be carried out using the speech recognition apparatus 100 according to the first embodiment, the speech recognition apparatus 200 according to the second embodiment, or the speech recognition apparatus according to the third or fourth embodiment.

Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to those specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

10: communication network
20, 30, 40: terminal device
21: imaging device
22: audio collection device
100, 200: speech recognition apparatus
101: CPU
102: ROM
103: RAM
104: hard disk
105: media controller
106: LAN card
107: video card
108: LCD
109: keyboard
110: speaker
111: touch pad
120, 220: input unit
130, 230: storage unit
140, 240: output unit
150, 250: extraction unit
160, 260: speech recognition unit
190, 290: accumulation unit
270: co-occurrence likelihood calculation unit

Claims

A speech recognition apparatus comprising:
an accumulation unit that accumulates comments input by users while listening to speech produced by playback of multimedia information;
an extraction unit that extracts candidate words including words appearing in a sentence set that includes the accumulated comments and co-occurrence words of those words in the sentence set; and
a speech recognition unit that recognizes, based on the extracted candidate words, the speech produced by playback of the multimedia information.

The speech recognition apparatus according to claim 1,
wherein the sentence set includes sentences appearing in documents viewed by users who have listened to the multimedia information.

The speech recognition apparatus according to claim 1 or 2, wherein
the extraction unit calculates an appearance likelihood for each of the candidate words, and
the speech recognition unit recognizes the speech based on the degree of match between phonemes recognized from the speech and phonemes representing a candidate word, and on the appearance likelihood of that candidate word.

The speech recognition apparatus according to claim 3, wherein
among the candidate words, a word appearing in a comment is associated with the input time point at which the comment was input, and
for a candidate word associated with an input time point, the speech recognition unit obtains a degree of match between that input time point and the utterance time point at which the phonemes were uttered, and recognizes the speech based on the obtained degree of match.

The speech recognition apparatus according to claim 4,
wherein the input time point and the utterance time point are expressed as playback time elapsed since playback of the multimedia information started.

The speech recognition apparatus according to claim 5,
wherein the degree of match is determined based on the difference between the input time point and the utterance time point, and on the difference between the time point at which the multimedia information became available for playback and the time point at which the user started playing the multimedia information.

A speech recognition program that causes a computer to function as:
an accumulation unit that accumulates comments input by users while listening to speech produced by playback of multimedia information;
an extraction unit that extracts candidate words including words appearing in a sentence set that includes the accumulated comments and co-occurrence words of those words in the sentence set; and
a speech recognition unit that recognizes, based on the extracted candidate words, the speech produced by playback of the multimedia information.

A speech recognition method performed by a speech recognition apparatus including an accumulation unit, an extraction unit, and a speech recognition unit, the method comprising:
an accumulation step in which the accumulation unit accumulates comments input by users while listening to speech produced by playback of multimedia information;
an extraction step in which the extraction unit extracts candidate words including words appearing in a sentence set that includes the accumulated comments and co-occurrence words of those words in the sentence set; and
a speech recognition step in which the speech recognition unit recognizes, based on the extracted candidate words, the speech produced by playback of the multimedia information.