JP2016201678A

JP2016201678A - Recognition device and image content presentation system

Info

Publication number: JP2016201678A
Application number: JP2015080592A
Authority: JP
Inventors: 優鎌本; Masaru Kamamoto; 善史白木; Yoshifumi Shiraki; 尚佐藤; Takashi Sato; パブロナバガブリエル; Pablo Nava Gabriele; 健弘守谷; Takehiro Moriya
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-04-10
Filing date: 2015-04-10
Publication date: 2016-12-01
Anticipated expiration: 2035-04-10
Also published as: JP6367748B2

Abstract

PROBLEM TO BE SOLVED: To provide a recognition device and an image content presentation system for displaying an image with information superimposed thereon with suitable timing, without clicking a comment contribution key or depressing an enter key.
SOLUTION: A recognition unit 130 of a viewer terminal 100 obtains, on the basis of acoustic signals including at least non-speech sounds collected from a microphone 120, predetermined visual information corresponding to the non-speech sounds and being other than character strings obtained by writing the non-speech sounds as they are, and transmits the visual information to a moving image distribution server via a communication line.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for superimposing information input by a person viewing a video on that video and displaying it.

Non-Patent Document 1 is known as prior art that superimposes text information input by a person viewing a video on that video and displays it. In Non-Patent Document 1, a viewer can post comments while watching a video.

(Non-Patent Document 1) “Watching videos and posting comments”, [online], NIWANGO.INC, [retrieved February 2, 2015], Internet <URL: http://info.nicovideo.jp/help/player/howto/>

However, in the conventional technology, a viewer who decides to comment on a video must type the comment and then click the comment post button or press the Enter key, so the comment may be displayed later than the moment at which the viewer wanted to comment. Conversely, a viewer who knows the content of the video in advance can type the comment beforehand and time the click of the comment post button or the press of the Enter key. Even then, the comment may appear earlier or later than the viewer intended. For example, when posting the text information “8”, which signifies applause, in time with the tempo of a song in a music video or live footage, the timing drifts more often, or by a larger margin, than it does with actual clapping.

An object of the present invention is to provide a recognition device and a video content presentation system for superimposing information on video at an appropriate timing and displaying it, without clicking a comment post button or pressing the Enter key.

To solve the above problem, according to one aspect of the present invention, a recognition device includes a recognition unit that obtains, based on a collected sound signal containing at least a sound other than an utterance, predetermined visual information that corresponds to the sound other than an utterance and is not a character string transcribing that sound as is.

The present invention has the effect that information can be superimposed on video at an appropriate timing and displayed without clicking a comment post button or pressing the Enter key.

FIG. 1 is a functional block diagram of the video content presentation system according to the first embodiment.
FIG. 2 is a diagram showing an example of the processing flow of the video content presentation system according to the first embodiment.
FIG. 3 is a diagram showing an example of video content with visual information.
FIG. 4 is a diagram showing an example of a visual information database.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted.

<Video Content Presentation System 1 According to First Embodiment>
FIG. 1 is a functional block diagram of the video content presentation system 1 according to the first embodiment, and FIG. 2 shows its processing flow.

The video content presentation system 1 includes one or more viewer terminals 91, a viewer terminal 100, and a video distribution server 92 that distributes video content to the viewer terminals 91 and 100. The viewer terminals 91 and 100 and the video distribution server 92 can communicate with each other via a communication line.

<Viewer terminal 91>
The viewer terminal 91 is operated by a person viewing video content (for example, a viewer of a video). It includes an input unit (keyboard, mouse, touch panel, etc.) and a display unit (display, touch panel, etc.), and is, for example, a personal computer, smartphone, or tablet. Through the input unit of the viewer terminal 91, the viewer can request the video distribution server 92 to play back video content, and through the display unit of the viewer terminal 91 the viewer can watch it. The viewer can also use the input unit to enter visual information (for example, a comment) to be displayed superimposed on the video content. Here, “visual information” is information that can be recognized visually via the display unit, for example characters, figures, symbols, a combination of these, or a combination of these with colors. It is not limited to still images and may be a moving image. Examples include:
(1) Text information signifying a predetermined act such as “laughter” or “applause” (for example, “w” or “8”).
(2) Bit information on a computer, other than text information, that signifies and identifies a predetermined act such as “laughter” or “applause”.
(3) Emoticons, pictographs, and other items that are not ordinary text information, for example the pictographs common to mobile phones of different carriers (see Reference 1).
(4) ASCII art and similar items that, taken as a whole, form a picture built from text information and the layout information of that text (see Reference 2).
(5) Net slang corresponding to (1) to (4) above. For example, the text information “wwwww…”, which signifies laughter, has the net slang equivalent 「草生えた」 (literally “grass has grown”).
(Reference 1) “docomo/au common pictograms”, NTT DOCOMO, Inc., [online], [retrieved February 9, 2015], Internet <URL: https://www.nttdocomo.co.jp/service/developer/smart_phone/make_contents/pictograph/>
(Reference 2) “ASCII art”, [online], February 2, 2015, Wikipedia, [retrieved February 9, 2015], Internet <URL: http://ja.wikipedia.org/wiki/%E3%82%A2%E3%82%B9%E3%82%AD%E3%83%BC%E3%82%A2%E3%83%BC%E3%83%88>

<Video distribution server 92>
The video distribution server 92 receives videos from a video database and a video camera and, in response to requests from the viewer terminals 91 and 100, distributes videos stored in the video database or distributes video captured by the video camera in real time. Distribution is not limited to video captured by a video camera: CG composited and edited in real time, or CG generated from motion capture or the like, may also be distributed in real time. In this embodiment, a video means video content provided together with an audio signal synchronized on the time axis. The video database stores, together with each video, the visual information added to that video. Metadata is attached to the visual information. The metadata includes the input time of the visual information, its size, its color, its appearance method, its movement speed, and its movement position. For example, the size, color, appearance method, movement speed, and movement position may be selectable by the person who enters the visual information; the viewer terminals 91 and 100 transmit them as metadata along with the visual information, and they are stored in the video database together with the video.
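As a rough illustration of the metadata described above, the following sketch models one piece of visual information with its metadata as a Python dataclass. The class name, field names, units, and default values are all assumptions made for this example; the source lists only the kinds of metadata, not a data format.

```python
from dataclasses import dataclass, asdict


@dataclass
class VisualInfoRecord:
    """One piece of visual information plus its metadata (hypothetical field names)."""
    visual: str                     # the visual information itself, e.g. "8" or "w"
    input_time_s: float             # input time, here as seconds from the start of the video
    size_px: int = 24               # display size
    color: str = "#FFFFFF"          # display color
    appearance: str = "scroll"      # appearance method
    speed_px_per_s: float = 120.0   # movement speed
    position: str = "top"           # movement position


record = VisualInfoRecord(visual="8", input_time_s=12.5, size_px=32)
payload = asdict(record)  # plain dict that a terminal could serialize and send
```

A terminal could send such a record alongside the visual information, and the server could store it with the video; how the patent's system actually encodes this is not stated.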

<Viewer terminal 100>
The viewer terminal 100 is operated by a person viewing video content (for example, a viewer of a video). It includes a display unit (display, touch panel, etc.) 110, a sound collection unit (microphone, etc.) 120, a recognition unit 130, and a display size acquisition unit 140, and is, for example, a personal computer, smartphone, or tablet. Through the input unit of the viewer terminal 100 (the sound collection unit 120, keyboard, mouse, touch panel, etc.), the viewer can request the video distribution server 92 to play back video content, and through the display unit 110 the viewer can watch it. The viewer of the viewer terminal 100 can use the input unit to enter visual information (for example, a comment) to be displayed superimposed on the video content.

The viewer terminal 100 displays video content via the display unit 110 (S110A). The video content is distributed from the video distribution server 92. Via the sound collection unit 120, the viewer terminal 100 collects the sound that the viewer makes in response to the video content as a sound signal x(t) (S120), where t is an index representing time.

<Recognition unit 130>
The recognition unit 130 receives the collected sound signal x(t), which contains at least a sound other than an utterance. Based on x(t), it obtains predetermined visual information v(t) that corresponds to the sound other than an utterance, is not a character string transcribing that sound as is, and can be recognized visually via the display unit of the viewer terminal 91 or the display unit 110 (S130). It then transmits v(t) to the video distribution server 92 via the communication line.

The video distribution server 92 receives the visual information v(t), superimposes it on the video, and distributes the result. It also stores the visual information v(t) added to the video in the video database together with the video. The display unit of the viewer terminal 91 or the display unit 110 receives and displays the video content on which v(t) is superimposed (S110B). Alternatively, during the playback in which v(t) was transmitted, only the video may be distributed without v(t) superimposed, and the video with the visual information superimposed may be distributed in subsequent playbacks.

Here, a “sound other than an utterance” means any sound other than “speech produced by voicing language”, for example laughter or applause.

The “predetermined visual information” corresponds to such a sound other than an utterance (for example, applause or laughter). In this embodiment, the text information “8” representing applause, the text information “w” representing laughter, and the like are used as the predetermined visual information.

A “character string transcribing a sound other than an utterance as is” is, in short, the text information obtained by feeding that sound into a conventional speech recognition device. A conventional speech recognition device targets utterances, that is, speech produced by voicing language, so it cannot produce an appropriate result when the target is a sound other than an utterance. For example, when the sound is applause, feeding it into a conventional speech recognition device is unlikely to yield text such as “pachi pachi pachi” (the Japanese onomatopoeia for clapping); it will most likely be treated as noise and excluded from recognition. To obtain the text “pachi pachi pachi” from a conventional speech recognition device, a person would have to pronounce “pachi pachi pachi”. Similarly, when the sound is laughter, feeding it into a conventional speech recognition device is unlikely to yield text such as “wahaha”; it is either treated as noise and excluded from recognition, or it produces a recognition result that cannot be identified as laughter. The recognition unit 130 therefore obtains, for a sound other than an utterance, predetermined visual information (for example, the text information “8” representing applause or “w” representing laughter) that differs from the text information obtained by feeding that sound into a conventional speech recognition device.

The following describes how a “sound other than an utterance” is recognized and how visual information is obtained as the recognition result. In this embodiment, the predetermined visual information is obtained in order to be displayed superimposed on the video content.

From the received sound signal x(t), the recognition unit 130 computes an index representing its magnitude (for example, volume, power, or energy) and determines whether the signal is silent based on how that index compares with a predetermined threshold. When it determines that the signal is not silent and that some sound was collected, it obtains the predetermined visual information by one of the methods below. For example, if the index grows as the sound gets louder (volume, power, energy), the signal is judged silent when the index is below the threshold and judged to contain some sound when the index is at or above it. If the index shrinks as the sound signal gets larger, the signal is judged silent when the index is above the threshold and judged to contain some sound when the index is at or below it. In this embodiment, some sound is judged to have been collected when the volume of x(t) is at or above the predetermined threshold.
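The silence check can be sketched as follows. This is a minimal illustration, assuming the magnitude index is the root-mean-square level of one frame of samples and assuming a threshold of 0.05; the source names volume, power, and energy only as examples and gives no threshold value, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of one frame of samples (grows as the sound gets louder)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def sound_detected(samples, threshold=0.05):
    """Return True when the frame is judged non-silent (index at or above the threshold)."""
    return rms(samples) >= threshold
```

For an index that shrinks as the signal gets larger, the comparison would simply be reversed.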

(Acquisition method 1)
The recognition unit 130 obtains the predetermined visual information from the one or more types of visual information already superimposed on the video content. For example, at the time when the volume of the sound signal x(t) reaches or exceeds the predetermined threshold, it selects one type from among the one or more types already superimposed on the video content and uses it as the predetermined visual information. The selection may be made in any of the following ways:
(1-1) Select the type of visual information with the largest number of superimposed instances.
(1-2) Select at random, weighted by each type's share of the superimposed instances.
(1-3) Select at random, with equal probability, from the one or more types of superimposed visual information.

For example, suppose the video content with visual information shown in FIG. 3 is received at the time when the volume of the sound signal x(t) reaches or exceeds the predetermined threshold. Two types of visual information are already superimposed on the video content: the text information “8” representing applause and the text information “w” representing laughter, with four and two instances respectively. In this embodiment, a piece of visual information (for example, the text information “8”) and its repetition (for example, the text information “888…”) are treated as the same type of visual information, although they may instead be treated as different types. One of the two types is selected as the predetermined visual information. In case (1-1), the type with the most superimposed instances is “8”, so it is obtained as the predetermined visual information. In case (1-2), the share of “8” is 4/6 and the share of “w” is 2/6, and the predetermined visual information is selected at random according to these shares: “8” with probability 4/6 and “w” with probability 2/6. In case (1-3), “8” is selected with probability 1/2 and “w” with probability 1/2.
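The three selection rules (1-1) to (1-3) can be sketched with the counts from the FIG. 3 example (four instances of “8”, two of “w”). The function names are hypothetical, and the tie-breaking in `select_most_frequent` (first type encountered wins) is an assumption, since the source does not say how ties are resolved.

```python
import random
from collections import Counter


def select_most_frequent(counts):
    """(1-1) Pick the type with the most superimposed instances."""
    return max(counts, key=counts.get)


def select_proportional(counts, rng=random):
    """(1-2) Pick a type at random, weighted by its share of the instances."""
    kinds = list(counts)
    return rng.choices(kinds, weights=[counts[k] for k in kinds], k=1)[0]


def select_uniform(counts, rng=random):
    """(1-3) Pick a type at random with equal probability per type."""
    return rng.choice(list(counts))


# Counts of visual information already superimposed, as in the FIG. 3 example.
overlaid = Counter({"8": 4, "w": 2})
```

With these counts, `select_most_frequent(overlaid)` returns "8", `select_proportional` returns "8" with probability 4/6, and `select_uniform` returns either type with probability 1/2.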

Instead of selecting from the visual information superimposed “at the time when the volume of the sound signal x(t) reaches or exceeds the predetermined threshold”, the selection may be made from the one or more types of visual information superimposed on the video content “up to the time when the volume of the sound signal x(t) reaches or exceeds the predetermined threshold”. For example, when the volume of x(t) reaches or exceeds the threshold, one type may be selected from the one or more types of visual information obtained before then and used as the predetermined visual information. Methods (1-1) to (1-3) can be used for the selection.

(Acquisition method 2)
The recognition unit 130 obtains the predetermined visual information from one or more types of visual information prepared in advance for superimposed display on the video content. For example, a visual information database such as the one shown in FIG. 4 is prepared in advance, and (2) the predetermined visual information is selected at random from the one or more types in it. If only information signifying a particular act, for example “laughter”, should be selected as the predetermined visual information, the visual information database need only contain information signifying that act, for example “w”, “（笑）”, “:-)”, and “(^o^)”.
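A minimal sketch of acquisition method 2, assuming the visual information database is a plain list of strings. The database contents below are illustrative only (FIG. 4 is not reproduced in the text), and returning `None` for an empty database is an assumption.

```python
import random

# Hypothetical database contents; the entries are taken from examples in the text.
VISUAL_INFO_DB = ["8", "w", "（笑）", ":-)", "(^o^)"]
# A database restricted to entries signifying "laughter".
LAUGHTER_ONLY_DB = ["w", "（笑）", ":-)", "(^o^)"]


def acquire_method2(database, rng=random):
    """Pick one entry at random with equal probability; None if the database is empty."""
    if not database:
        return None
    return rng.choice(database)
```

Passing `LAUGHTER_ONLY_DB` restricts the result to laughter, which mirrors the approach of preparing a database that contains only the desired kind of information.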

(Acquisition method 3)
(Acquisition method 2) may be combined with processing that recognizes what kind of sound the sound signal x(t) is.

(3-1) For example, VAD (voice activity detection) is applied to the sound signal x(t). If the signal is judged to be a voice section, the “sound other than an utterance” is judged to correspond to laughter, and visual information corresponding to laughter is selected as the predetermined visual information. If the signal is judged to be a non-voice section, the “sound other than an utterance” is judged to correspond to applause, and visual information corresponding to applause is selected as the predetermined visual information. When the predetermined visual information covers two or more kinds of acts, the visual information database stores, for each piece of visual information, whether it corresponds to a voice section or to a non-voice section. When several types of visual information correspond to a voice section or to a non-voice section, the predetermined visual information is selected at random from among them.
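The mapping in (3-1) can be sketched as below. The VAD itself is out of scope here: its result is passed in as a boolean from some upstream detector, which is an assumption of this example. The database layout and the `"voice"` / `"non_voice"` labels are likewise hypothetical.

```python
import random

# Hypothetical database: each entry records which VAD outcome it corresponds to.
SECTION_DB = [
    {"visual": "w", "section": "voice"},       # laughter
    {"visual": "（笑）", "section": "voice"},   # laughter
    {"visual": "8", "section": "non_voice"},   # applause
]


def acquire_method3_1(is_voice_section, database, rng=random):
    """Map the VAD result to laughter (voice) or applause (non-voice) visual information."""
    wanted = "voice" if is_voice_section else "non_voice"
    candidates = [entry["visual"] for entry in database if entry["section"] == wanted]
    # Several candidates for the same section: pick one at random.
    return rng.choice(candidates) if candidates else None
```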

(3-2) For example, audio features are extracted in advance from “sounds other than utterances”, and their similarity to the audio features extracted from the sound signal x(t) is computed. When the similarity is at or above a predetermined threshold, the visual information corresponding to that “sound other than an utterance” is selected as the predetermined visual information. When several types of visual information correspond to laughter, applause, and so on, the predetermined visual information may be selected at random from among them. Whereas a conventional speech recognition device extracts audio features from utterances, this embodiment extracts them from “sounds other than utterances”. In this case, a “sound other than an utterance” signifies (corresponds to) a predetermined act of the viewer, such as laughing or clapping, and does not include background noise or the like.
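The matching step in (3-2) might look like the following sketch. The source does not specify the audio features or the similarity measure; this example assumes features are plain numeric vectors, uses cosine similarity, and assumes a threshold of 0.8. The two-dimensional templates in the usage note are toy values, not real audio features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_sound(feature, templates, threshold=0.8):
    """Return the label of the best-matching template at or above the threshold, else None."""
    best_label, best_score = None, threshold
    for label, template in templates.items():
        score = cosine_similarity(feature, template)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label
```

For instance, with `templates = {"laughter": [1.0, 0.0], "applause": [0.0, 1.0]}`, the feature `[0.9, 0.1]` matches "laughter", while `[1.0, 1.0]` falls below the threshold for both templates and yields `None`. The returned label would then be used to look up the corresponding visual information.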

(Acquisition method 4)
(Acquisition method 1) may be combined with (Acquisition method 2) or (Acquisition method 3).

The recognition unit 130 obtains a type of visual information as the predetermined visual information when, at the time in the video content corresponding to the collection time of the sound signal, (4-a) only one type of visual information is already superimposed, or (4-b) several types are superimposed but one of them accounts for an extremely high share.

In cases other than (4-a) and (4-b), the predetermined visual information is obtained from the multiple types of visual information prepared in advance for superimposed display on the video content.

For example, at the time when the volume of the sound signal x(t) reaches or exceeds the predetermined threshold (or “up to the time when the volume of the sound signal x(t) reaches or exceeds the predetermined threshold”), the recognition unit determines whether one type or two or more types of visual information are already superimposed on the video content. If there is one type, that visual information is obtained as the predetermined visual information. If there are two or more types, the share of the type with the most superimposed instances is computed; when that share is greater than a predetermined threshold (for example, 0.5), that type is selected as the predetermined visual information. When the share of the type with the most superimposed instances is at or below the threshold, the predetermined visual information is selected by (Acquisition method 2) or (Acquisition method 3).
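The decision in acquisition method 4 can be sketched as follows. The function name is hypothetical, `fallback` stands in for acquisition method 2 or 3, and calling the fallback when nothing is superimposed yet is an assumption, since the source does not cover that case.

```python
def acquire_method4(counts, fallback, threshold=0.5):
    """Use the dominant superimposed type if it is dominant enough, else fall back.

    counts:   mapping from visual information type to number of superimposed instances
    fallback: zero-argument callable standing in for acquisition method 2 or 3
    """
    total = sum(counts.values())
    if total == 0:
        return fallback()            # nothing superimposed yet (assumed behavior)
    if len(counts) == 1:
        return next(iter(counts))    # (4-a) exactly one type is superimposed
    top = max(counts, key=counts.get)
    if counts[top] / total > threshold:
        return top                   # (4-b) one type has a high enough share
    return fallback()
```

With the FIG. 3 counts, “8” has a share of 4/6, which exceeds 0.5, so “8” is returned; with three instances of each type the share is exactly 0.5 and the fallback is used.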

With this acquisition method, a sound other than an utterance can be converted into the visual information that accounts for the majority of the visual information shown on the display unit at that moment, and displayed on the screen.

(Acquisition method 5)
The following is another possible combination of (Acquisition method 1) with (Acquisition method 2) or (Acquisition method 3).

(5) When one type of visual information accounts for a high share of the visual information already superimposed at the time in the video content corresponding to the collection time of the sound signal, the recognition unit 130 gives priority to obtaining that type as the predetermined visual information. In cases other than (5), it gives priority to obtaining the predetermined visual information from the multiple types of visual information prepared in advance for superimposed display on the video content.

For example, at the time when the volume of the sound signal x(t) reaches or exceeds the predetermined threshold (or “up to the time when the volume of the sound signal x(t) reaches or exceeds the predetermined threshold”), the share of each type of visual information already superimposed on the video content is computed. When a share is greater than a predetermined threshold a (for example, a > 0.5), the visual information corresponding to that share is selected as the predetermined visual information with a predetermined probability b (0.5 < b < 1), and with probability 1 − b the predetermined visual information is selected by (Acquisition method 2) or (Acquisition method 3). Conversely, when the share is at or below the threshold a, the predetermined visual information is selected by (Acquisition method 2) or (Acquisition method 3) with a predetermined probability c (0.5 < c < 1), and with probability 1 − c the visual information corresponding to that share is selected as the predetermined visual information.
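A sketch of the probabilistic priority in acquisition method 5. The default values a = 0.6, b = 0.8, and c = 0.8 are assumptions chosen within the stated ranges. The source speaks of "the visual information corresponding to that share" without saying which type is meant when no share exceeds a; this example assumes the most frequent type. `fallback` again stands in for acquisition method 2 or 3.

```python
import random


def acquire_method5(counts, fallback, a=0.6, b=0.8, c=0.8, rng=random):
    """Prefer the dominant superimposed type or the fallback, depending on its share.

    If the top share exceeds a, the superimposed type is used with probability b;
    otherwise it is used only with probability 1 - c.
    """
    total = sum(counts.values())
    if total == 0:
        return fallback()  # nothing superimposed yet (assumed behavior)
    top = max(counts, key=counts.get)  # assumption: the most frequent type
    share = counts[top] / total
    p_overlay = b if share > a else 1.0 - c
    if rng.random() < p_overlay:
        return top
    return fallback()
```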

＜表示サイズ取得部１４０＞ <Display size acquisition unit 140>
表示サイズ取得部１４０は、収音された少なくとも発話以外の音を含む音信号x(t)を受け取り、その大きさを表す指標（例えば、音量、パワー、エネルギー）を求め、大きさを表す指標と所定の閾値との大小関係に基づき、無音か否かを判定する。例えば、大きさを表す指標が音が大きいほど大きくなる値（例えば、音量、パワー、エネルギー）の場合には、大きさを表す指標が所定の閾値未満のときに無音と判定し、閾値以上のときに何らかの音を収音できたと判定する。また、大きさを表す指標が音信号が大きいほど小さくなる値の場合には、大きさを表す指標が所定の閾値より大きいときに無音と判定し、閾値以下のときに何らかの音を収音できたと判定する。さらに、無音ではなく、何らかの音を収音できたと判定したときに、表示サイズ取得部１４０は、大きさを表す指標に対応する情報を、所定の視覚情報が視聴者端末９１の表示部または表示部１１０を介して表示される際の大きさの情報s(t)として得（Ｓ１４０）、通信回線を介して動画配信サーバ９２に送信する。例えば、音信号x(t)の大きさが大きいほど、表示される際の大きさが大きくなるように情報s(t)を取得する。 The display size acquisition unit 140 receives the collected sound signal x(t), which includes at least a sound other than an utterance, obtains an index representing its magnitude (for example, volume, power, or energy), and determines whether the signal is silent based on how that index compares with a predetermined threshold. For example, when the index is a value that increases as the sound becomes louder (for example, volume, power, or energy), the signal is judged silent when the index is less than the threshold, and some sound is judged to have been collected when the index is equal to or greater than the threshold. Conversely, when the index is a value that decreases as the sound signal becomes larger, the signal is judged silent when the index is greater than the threshold, and some sound is judged to have been collected when the index is equal to or less than the threshold. When it is determined that some sound, rather than silence, has been collected, the display size acquisition unit 140 obtains information corresponding to the index as size information s(t) for when the predetermined visual information is displayed via the display unit of the viewer terminal 91 or the display unit 110 (S140), and transmits it to the video distribution server 92 via the communication line. For example, the information s(t) is obtained so that the larger the sound signal x(t), the larger the displayed size.
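As a rough illustration of the display size acquisition described above, the following Python sketch computes a loudness index for one frame of samples, treats it as silence when below a threshold, and otherwise maps it to a display size. The RMS index, the linear mapping, the function names, and all default values (threshold, size range in pixels, full-scale amplitude) are assumptions for the example; the specification only requires that a larger sound yield a larger displayed size.

```python
import math


def rms_level(frame):
    """Root-mean-square amplitude of one frame of samples (assumed index)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def display_size(frame, threshold=0.02, min_size=16, max_size=64, full_scale=1.0):
    """Return a size s(t) for the visual information, or None for silence.

    The index grows with loudness, so a value below the threshold is
    treated as silence and no size is produced.
    """
    level = rms_level(frame)
    if level < threshold:
        return None
    # Linear mapping (an assumption): threshold -> min_size, full_scale -> max_size.
    clipped = min(level, full_scale)
    frac = (clipped - threshold) / (full_scale - threshold)
    return round(min_size + frac * (max_size - min_size))
```

A logarithmic (decibel) mapping may match perceived loudness better than the linear one used here; the linear form is chosen only to keep the sketch short.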

このような構成により、音量に合わせても文字の大きさを変えて表示部に表示することができ、より視聴者の雰囲気を詳細に伝えることができる。 With such a configuration, the size of the characters shown on the display unit can also be varied according to the volume, so that the viewers' mood can be conveyed in more detail.

＜効果＞ <Effect>
以上の構成により、コメント投稿ボタンのクリックまたはエンターキーの押下を行わずに、視覚情報を映像コンテンツに適切なタイミングで重畳して表示することができる。 With the above configuration, visual information can be superimposed and displayed on video content at an appropriate timing without clicking a comment posting button or pressing the Enter key.

＜変形例＞ <Modification>
本実施形態では、表示部１１０は、映像コンテンツと共にそれに重畳される視覚情報を表示しているが、映像コンテンツのみを表示する表示部を別途設けてもよい。 In the present embodiment, the display unit 110 displays the video content together with the visual information superimposed on it, but a display unit that displays only the video content may be provided separately.

本実施形態の視聴者端末１００内に従来の音声認識装置を組み込んでもよい。例えば、認識部１３０の前段に従来の音声認識装置を組み込み、適切な音声認識ができなかった場合にのみ認識部１３０で認識処理を行う構成としてもよい。 A conventional speech recognition apparatus may be incorporated in the viewer terminal 100 of the present embodiment. For example, a conventional speech recognition device may be incorporated in the preceding stage of the recognition unit 130 and the recognition unit 130 may perform the recognition process only when appropriate speech recognition cannot be performed.

本実施形態では、表示サイズ取得部１４０を備えるが、必ずしも備えなくともよい。なお、表示サイズ取得部１４０を備えない場合、視覚情報の大きさを表す情報として予めデフォルト値を設定しておけばよい。また、視覚情報の大きさは視聴者の操作により図示しない入力部から変更可能としてもよい。 Although the display size acquisition unit 140 is provided in the present embodiment, it is not always necessary. If the display size acquisition unit 140 is not provided, a default value may be set in advance as information representing the size of visual information. The size of the visual information may be changeable from an input unit (not shown) by the operation of the viewer.

本実施形態では、認識部１３０が、視聴者端末１００に組み込まれる構成としたが、独立した認識装置として構成してもよい。また、認識部１３０が、動画配信サーバ９２、または、視聴者端末１００以外の動画を再生する側の視聴者端末９１に組み込まれる構成としてもよい。その場合には、認識部１３０が組み込まれた装置に、音信号x(t)を送信する必要がある。データの伝送量を考慮すると、本実施形態のように、視聴者端末１００に認識部１３０が組み込まれ、視覚情報v(t)を送信する構成が望ましい。 In the present embodiment, the recognition unit 130 is incorporated in the viewer terminal 100, but may be configured as an independent recognition device. In addition, the recognition unit 130 may be configured to be incorporated in the video distribution server 92 or the viewer terminal 91 on the side of reproducing a video other than the viewer terminal 100. In that case, it is necessary to transmit the sound signal x (t) to a device in which the recognition unit 130 is incorporated. In consideration of the data transmission amount, a configuration in which the recognition unit 130 is incorporated in the viewer terminal 100 and the visual information v (t) is transmitted as in the present embodiment is desirable.

なお、本実施形態では、表示部１１０において、視聴者に対して映像コンテンツを提示しているが、他のコンテンツを提示してもよい。端末は、対象者に対して何らかの刺激によってコンテンツを提示することができればよく、本実施形態のように音刺激及び光刺激による映像コンテンツを提示してもよいし、音刺激のみによる音響コンテンツ(ラジオ放送等)を提示してもよいし、対象者が持つ他の感覚器(触覚器、嗅覚器、味覚器)で受け取ることができる他の刺激(化学物質、温度、圧力)、または、各刺激の組合せによってコンテンツを提示してもよい。その場合であっても、表示部１１０は所定の視覚情報を表示するために用いる。なお、対象者が持つ感覚器で受け取ることができる刺激(光、音、化学物質、温度、圧力等)、または、それらの組合せによって提示されるコンテンツを纏めて「メディアコンテンツ」ともいい、メディアコンテンツを対象者に提示するための構成を提示部という。収音部１２０では、メディアコンテンツから刺激を感じ取った対象者がメディアコンテンツに対応して発する音を収音する。 In the present embodiment, the display unit 110 presents video content to the viewer, but other content may be presented. The terminal only needs to be able to present content to the subject by some kind of stimulus: it may present video content by sound and light stimuli as in the present embodiment, acoustic content (such as a radio broadcast) by sound stimuli alone, or content by other stimuli (chemical substances, temperature, pressure) that the subject can receive through other sensory organs (tactile, olfactory, gustatory), or by a combination of these stimuli. Even in that case, the display unit 110 is used to display the predetermined visual information. Content presented by stimuli that the subject's sensory organs can receive (light, sound, chemical substances, temperature, pressure, etc.), or by a combination thereof, is collectively referred to as "media content", and a configuration for presenting media content to the subject is referred to as a presentation unit. The sound collection unit 120 collects the sound that the subject, having perceived a stimulus from the media content, emits in response to the media content.

視聴者端末１００は、図示しない表示長取得部を含んでもよい。表示長取得部は、収音された少なくとも発話以外の音を含む音信号x(t)を受け取り、その大きさを表す指標（例えば、音量の移動平均）を求め、大きさを表す指標と所定の閾値との大小関係に基づき、無音か否かを判定する。例えば、大きさを表す指標が音が大きいほど大きくなる値（例えば、音量、パワー、エネルギー）の場合には、大きさを表す指標が所定の閾値未満のときに無音と判定し、閾値以上のときに何らかの音を収音できたと判定する。また、大きさを表す指標が音信号が大きいほど小さくなる値の場合には、大きさを表す指標が所定の閾値より大きいときに無音と判定し、閾値以下のときに何らかの音を収音できたと判定する。表示長取得部は、無音ではなく、何らかの音を収音できたと判定したときに、その音の継続時間に対応する情報を、所定の視覚情報が視聴者端末９１の表示部または表示部１１０を介して表示される際の繰り返し回数として得、認識部１３０に出力する。例えば、音信号x(t)の継続時間が大きいほど、繰り返し回数が大きくなるような構成とする。認識部１３０は、求めた所定の視覚情報v(t)を繰り返し回数に応じて、繰り返した情報を、改めて所定の視覚情報として動画配信サーバ９２に出力する。例えば、所定の視覚情報v(t)が「ｗ」であり、繰り返し回数が5回のとき、改めて所定の視覚情報v(t)として「ｗｗｗｗｗ」、または、これと同じ意味を表すネットスラングである「草生えた」等を通信回線を介して動画配信サーバに送信する。このような構成により、より視聴者の雰囲気を詳細に伝えることができる。なお、認識部１３０では、繰り返し回数を得てから、所定の視覚情報v(t)を出力するため、視覚情報v(t)を送信した際の再生時において、視覚情報v(t)を動画に重畳する場合、発話以外の音が発生してから、所定の視覚情報v(t)を出力するまでに、ズレが生じる。このズレをなくすために、送信した際の再生時においては、繰り返さずに所定の視覚情報v(t)を動画配信サーバ９２に送信してもよい。そして、動画配信サーバ９２は、視覚情報v(t)を受け取り、繰り返さずに動画に重畳して配信する。認識部１３０では、繰り返し回数を得てから、求めた所定の視覚情報v(t)を繰り返し回数に応じて、繰り返した情報を、改めて所定の視覚情報として動画配信サーバ９２に出力する。動画配信サーバ９２では、動画データベースに、動画と共に繰り返した視覚情報v(t)を格納する。このような構成により、動画配信サーバ９２では、繰り返した視覚情報v(t)を格納した後になされた再生要求に対して、繰り返した視覚情報を重畳した動画を配信することができる。 The viewer terminal 100 may include a display length acquisition unit (not shown). The display length acquisition unit receives the collected sound signal x(t), which includes at least a sound other than an utterance, obtains an index representing its magnitude (for example, a moving average of the volume), and determines whether the signal is silent based on how that index compares with a predetermined threshold. For example, when the index is a value that increases as the sound becomes louder (for example, volume, power, or energy), the signal is judged silent when the index is less than the threshold, and some sound is judged to have been collected when the index is equal to or greater than the threshold. Conversely, when the index is a value that decreases as the sound signal becomes larger, the signal is judged silent when the index is greater than the threshold, and some sound is judged to have been collected when the index is equal to or less than the threshold. When it determines that some sound, rather than silence, has been collected, the display length acquisition unit obtains information corresponding to the duration of that sound as the number of repetitions used when the predetermined visual information is displayed via the display unit of the viewer terminal 91 or the display unit 110, and outputs it to the recognition unit 130. For example, the longer the duration of the sound signal x(t), the larger the number of repetitions. The recognition unit 130 repeats the obtained predetermined visual information v(t) according to the number of repetitions and outputs the repeated information to the video distribution server 92 as the new predetermined visual information. For example, when the predetermined visual information v(t) is "w" and the number of repetitions is 5, it transmits "wwwww", or net slang with the same meaning such as "草生えた" ("kusa haeta"), to the video distribution server via the communication line as the new predetermined visual information v(t). With such a configuration, the viewers' mood can be conveyed in more detail. Because the recognition unit 130 outputs the predetermined visual information v(t) only after obtaining the number of repetitions, when v(t) is superimposed on the video during playback at the time of transmission, a lag arises between the occurrence of the sound other than an utterance and the output of v(t). To eliminate this lag, for playback at the time of transmission the predetermined visual information v(t) may be transmitted to the video distribution server 92 without repetition; the video distribution server 92 then receives v(t) and distributes it superimposed on the video without repetition. After obtaining the number of repetitions, the recognition unit 130 also outputs the information repeated according to that number to the video distribution server 92 as the new predetermined visual information, and the video distribution server 92 stores the repeated visual information v(t) in the video database together with the video. With such a configuration, the video distribution server 92 can distribute a video with the repeated visual information superimposed in response to playback requests made after the repeated v(t) has been stored.
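The mapping from sound duration to a repetition count can be illustrated with the following Python sketch. It is a simplified example under stated assumptions: the input is a list of per-frame loudness values (for instance a moving average of the volume), duration is measured as the longest run of consecutive non-silent frames, and the names `repetition_count` and `repeat_visual_info` and the parameters `frames_per_repeat` and `max_repeat` are hypothetical.

```python
def repetition_count(levels, threshold=0.02, frames_per_repeat=5, max_repeat=10):
    """Derive a repetition count from how long the sound lasted.

    levels: per-frame loudness indices that grow with loudness.
    Returns 0 when every frame is below the threshold (silence).
    """
    longest = run = 0
    for level in levels:
        # Count consecutive frames judged to contain some sound.
        run = run + 1 if level >= threshold else 0
        longest = max(longest, run)
    if longest == 0:
        return 0
    # Longer sounds give more repetitions; at least one, capped at max_repeat.
    return min(max_repeat, max(1, longest // frames_per_repeat))


def repeat_visual_info(v, count):
    """Repeat the visual information v, e.g. "w" with count 5 gives "wwwww"."""
    return v * count
```

For live playback, the unrepeated `v` could be sent as soon as it is recognized, with the repeated string sent afterwards for storage, in line with the two transmission paths described above.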

本実施形態では、認識部１３０において、受け取った音信号x(t)から、その大きさを表す指標を求め、大きさを表す指標と所定の閾値との大小関係に基づき、無音か否かを判定しているが、認識部１３０の前段に既存のVAD(voice activity detection)を設け、VADで音信号x(t)が無音か否かを判定してもよい。さらに、何らかの音を収音できたと判定したときに、音信号x(t)が認識部１３０に入力される構成とし、認識部１３０では、音信号x(t)を用いて、所定の視覚情報v(t)を得ればよい。 In the present embodiment, the recognition unit 130 obtains an index representing the magnitude of the received sound signal x(t) and determines whether the signal is silent based on how that index compares with a predetermined threshold. Alternatively, an existing VAD (voice activity detection) stage may be provided before the recognition unit 130 to determine whether the sound signal x(t) is silent. In that case, the sound signal x(t) is input to the recognition unit 130 only when it is determined that some sound has been collected, and the recognition unit 130 simply obtains the predetermined visual information v(t) from the sound signal x(t).

＜その他の変形例＞ <Other modifications>
本発明は上記の実施形態及び変形例に限定されるものではない。例えば、上述の各種の処理は、記載に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。その他、本発明の趣旨を逸脱しない範囲で適宜変更が可能である。 The present invention is not limited to the above-described embodiment and modifications. For example, the various processes described above need not be executed only in time series as described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

＜プログラム及び記録媒体＞ <Program and recording medium>
また、上記の実施形態及び変形例で説明した各装置における各種の処理機能をコンピュータによって実現してもよい。その場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、上記各装置における各種の処理機能がコンピュータ上で実現される。 The various processing functions of each device described in the above embodiment and modifications may be realized by a computer. In that case, the processing contents of the functions that each device should have are described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.

このようなプログラムを実行するコンピュータは、例えば、まず、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、一旦、自己の記憶部に格納する。そして、処理の実行時、このコンピュータは、自己の記憶部に格納されたプログラムを読み取り、読み取ったプログラムに従った処理を実行する。また、このプログラムの別の実施形態として、コンピュータが可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することとしてもよい。さらに、このコンピュータにサーバコンピュータからプログラムが転送されるたびに、逐次、受け取ったプログラムに従った処理を実行することとしてもよい。また、サーバコンピュータから、このコンピュータへのプログラムの転送は行わず、その実行指示と結果取得のみによって処理機能を実現する、いわゆるＡＳＰ（Application Service Provider）型のサービスによって、上述の処理を実行する構成としてもよい。なお、プログラムには、電子計算機による処理の用に供する情報であってプログラムに準ずるもの（コンピュータに対する直接の指令ではないがコンピュータの処理を規定する性質を有するデータ等）を含むものとする。 A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage unit. When executing the processing, the computer reads the program stored in its own storage unit and executes processing according to the read program. As another embodiment of the program, the computer may read the program directly from the portable recording medium and execute processing according to it. Alternatively, each time a program is transferred from the server computer to the computer, the computer may sequentially execute processing according to the received program. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

また、コンピュータ上で所定のプログラムを実行させることにより、各装置を構成することとしたが、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 In addition, although each device is configured by executing a predetermined program on a computer, at least a part of these processing contents may be realized by hardware.

Claims

A recognition device comprising
a recognition unit that obtains, based on a collected sound signal including at least a sound other than an utterance, predetermined visual information that corresponds to the sound other than the utterance and is not a character string transcribing that sound as it is.

The recognition device according to claim 1, wherein
the predetermined visual information is obtained for superimposed display on video content, and
the recognition unit obtains the predetermined visual information from at least one of one or more types of visual information already superimposed on the video content and one or more types of visual information prepared in advance for superimposed display on the video content.

The recognition device according to claim 2, wherein,
when a plurality of types of visual information are already superimposed at the time of the video content corresponding to the sound collection time of the sound signal, the recognition unit obtains, as the predetermined visual information, the type of visual information that has been superimposed the greatest number of times.

The recognition device according to claim 2, wherein
the recognition unit, (1-a) when one type of visual information is already superimposed at the time of the video content corresponding to the sound collection time of the sound signal, or (1-b) when a plurality of types are superimposed but the proportion of one type is extremely high, obtains that type of visual information as the predetermined visual information, and (2) in other cases, obtains the predetermined visual information from a plurality of types of visual information prepared in advance for superimposed display on the video content.

The recognition device according to claim 2, wherein
the recognition unit obtains the predetermined visual information by (1) giving priority to obtaining one type of visual information as the predetermined visual information when that type accounts for a high proportion of the visual information already superimposed at the time of the video content corresponding to the sound collection time of the sound signal, and (2) in other cases, giving priority to obtaining the predetermined visual information from a plurality of types of visual information prepared in advance for superimposed display on the video content.

The recognition device according to any one of claims 1 to 5, further comprising
a display size acquisition unit that obtains information corresponding to an index representing the loudness of the sound other than the utterance, as size information for when the predetermined visual information is displayed via a display unit.

A video content presentation system comprising:
a first display unit that displays video content;
a sound collection unit that collects sound emitted in response to the video content and converts the sound into the sound signal;
the recognition device according to any one of claims 1 to 6; and
a second display unit that displays the predetermined visual information obtained by the recognition device superimposed on the video content.

A video content presentation system comprising:
a presentation unit that presents media content;
a sound collection unit that collects sound emitted in response to the media content and converts the sound into the sound signal;
the recognition device according to any one of claims 1 to 6; and
a display unit that displays the predetermined visual information obtained by the recognition device superimposed on video content corresponding to the media content.