JP5346797B2

JP5346797B2 - Sign language video synthesizing device, sign language video synthesizing method, sign language display position setting device, sign language display position setting method, and program

Info

Publication number: JP5346797B2
Application number: JP2009293628A
Authority: JP
Inventors: 雄三大嶋
Original assignee: ASTEM, INC.
Current assignee: ASTEM, INC.
Priority date: 2009-12-25
Filing date: 2009-12-25
Publication date: 2013-11-20
Anticipated expiration: 2029-12-25
Also published as: JP2011135388A

Abstract

PROBLEM TO BE SOLVED: To provide a sign language video compositing device which generates and outputs composited video for displaying a sign language video at a position adjacent to an area of a person in a program video. SOLUTION: The sign language video compositing device is provided with a program video acceptance section 11, which accepts the program video which is a video of a program; a sign language video acceptance section 12, which accepts the sign language video which is a video of a sign language corresponding to the program video; a person area specification section 13, which specifies the area of the person in the program video; a display position setting section 15, which sets a display position of the sign language video at the position adjacent to the area of the person specified by the person area specification section 13; a video compositing section 20, which generates the composited video obtained by compositing the sign language video at the display position set by the display position setting section 15 in the program video; and a video output section 21 which outputs the composited image. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a sign language video display device and the like that synthesize and output a program video and a sign language video, and to a sign language display position setting device and the like that determine the display position of a sign language video.

Conventionally, a program video has sometimes been displayed together with a sign language video relating to that program video. In such cases, the display position of the sign language video has sometimes been fixed in advance (see, for example, Patent Document 1).

Patent Document 1: JP 2006-135828 A

Inconvenience may arise when the sign language video is displayed within the display area of the program video (for example, when a sign language video smaller than the program video is displayed in the upper-right area of the program video) and its display position is fixed. For example, when the speaker appears toward the left of the program video and the sign language video is displayed at a predetermined position toward the right, a hearing-impaired viewer must frequently shift their gaze between the speaker and the sign language video, which increases eye fatigue. There is also the problem that the speaker cannot be seen when the sign language video overlaps the speaker.

The present invention has been made to solve these problems, and its object is to provide a sign language video synthesizing device and the like that identify a person area in a program video and synthesize a sign language video in the vicinity of that person area.

To achieve the above object, a sign language video synthesizing device according to the present invention includes: a program video receiving unit that receives a program video, which is the video of a program; a sign language video receiving unit that receives a sign language video, which is a video of sign language corresponding to the program video; a person area specifying unit that specifies a person area in the program video; a display position setting unit that sets the display position of the sign language video at a position adjacent to the person area specified by the person area specifying unit; a video composition unit that generates a composite video in which the sign language video is synthesized at the display position, set by the display position setting unit, in the program video; and a video output unit that outputs the composite video.

With this configuration, a composite video that displays the sign language video at a position adjacent to the person area can be generated and output. A viewer of the composite video can therefore see the sign language video displayed near the person. As a result, for example, the distance the gaze must travel between the person who is the speaker and the sign language video becomes shorter, and fatigue such as eye strain can be prevented. It is also possible, for example, to prevent the speaker from being hidden by the sign language video.
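The placement described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the person area is an axis-aligned rectangle in pixel coordinates, and the names `Rect` and `adjacent_position` as well as the right-then-left preference are hypothetical choices made for the example.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels


def adjacent_position(person: Rect, sign_w: int, sign_h: int,
                      frame_w: int, frame_h: int) -> Tuple[int, int]:
    """Return a top-left corner for the sign language video next to the person area."""
    # Keep the inset inside the frame vertically (assumption: align with the top of the person area).
    y = min(max(person.y, 0), max(frame_h - sign_h, 0))
    # Prefer the right side of the person area if the inset fits there.
    right_x = person.x + person.w
    if right_x + sign_w <= frame_w:
        return right_x, y
    # Otherwise try the left side.
    left_x = person.x - sign_w
    if left_x >= 0:
        return left_x, y
    # Neither side fits: clamp to the right frame edge (the inset may then overlap the person).
    return max(frame_w - sign_w, 0), y
```

Placing the inset flush against the person rectangle keeps the gaze distance short while avoiding overlap whenever one side has enough room.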

In the sign language video synthesizing device according to the present invention, the person area specifying unit may specify a plurality of person areas; the device may further include a speaker specifying unit that specifies, among the plurality of person areas specified by the person area specifying unit, the person area of the speaker; and the display position setting unit may set the display position of the sign language video at a position adjacent to the person area corresponding to the speaker specified by the speaker specifying unit.

With this configuration, the sign language video can be displayed near the speaker even when a plurality of person areas are specified. A viewer watching several displayed persons is likely to pay attention to the speaker, so the distance the gaze must travel between the speaker and the sign language video can be shortened.
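One way to pick the speaker among several person areas is sketched below, following the approach the embodiment describes for the speaker specifying unit: a motion value (for example, of the mouth area) is compared with a threshold, and the person with the largest motion is chosen. The function name `pick_speaker`, the dictionary layout, and the numeric scale of the motion values are assumptions made for illustration.

```python
from typing import Dict, Optional


def pick_speaker(motion_by_person: Dict[str, float],
                 threshold: float) -> Optional[str]:
    """Return the person ID with the largest motion at or above the threshold, or None."""
    # Keep only the persons whose detected motion reaches the threshold.
    moving = {pid: m for pid, m in motion_by_person.items() if m >= threshold}
    if not moving:
        return None  # nobody appears to be speaking
    # If several persons are moving, choose the one with the largest motion.
    return max(moving, key=moving.get)
```

Returning `None` when nobody exceeds the threshold leaves it to the caller to decide what to do, for example keeping the previous display position.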

In the sign language video synthesizing device according to the present invention, the sign language video receiving unit may receive a plurality of sign language videos and the person area specifying unit may specify a plurality of person areas; the device may further include a correspondence specifying unit that associates each sign language video with a person area; the display position setting unit may set the display position of each sign language video at a position adjacent to the person area associated with that sign language video by the correspondence specifying unit; and the video composition unit may synthesize the plurality of sign language videos at the display positions set in the program video.

With this configuration, when a plurality of sign language videos are received, each sign language video can be displayed near its corresponding person area. As a result, gaze movement between a person and the sign language is reduced, and it becomes easy to see which sign language video corresponds to which person.

In the sign language video synthesizing device according to the present invention, the correspondence specifying unit may associate, among the plurality of sign language videos and the plurality of person areas, those whose degrees of movement are close to each other.
With this configuration, a sign language video and a person area can be associated by examining the similarity of their degrees of movement. When a person is speaking, movement around the mouth and gestures are expected to be larger, and the movement in the sign language video is expected to increase in step with that speech, so accurate association may be possible.
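The association by similar degree of movement could be sketched as a greedy matching, as below. The patent does not specify the matching procedure; the greedy strategy, the function name `associate_by_motion`, and the representation of each degree of movement as a single number are all assumptions for this example.

```python
from typing import Dict


def associate_by_motion(sign_motion: Dict[str, float],
                        person_motion: Dict[str, float]) -> Dict[str, str]:
    """Pair each sign language video with the person area whose motion degree is closest."""
    # Enumerate every (sign video, person area) pair, closest motion degrees first.
    candidates = sorted(
        (abs(sm - pm), sid, pid)
        for sid, sm in sign_motion.items()
        for pid, pm in person_motion.items()
    )
    result: Dict[str, str] = {}
    used_persons = set()
    for _diff, sid, pid in candidates:
        # Each sign video and each person area is used at most once.
        if sid in result or pid in used_persons:
            continue
        result[sid] = pid
        used_persons.add(pid)
    return result
```

A greedy pass is simple but not guaranteed to minimize the total difference; an optimal assignment method could be substituted if that matters.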

In the sign language video synthesizing device according to the present invention, the movement of a person area may be the movement of the mouth in that person area.
With this configuration, mouth movement makes it possible to determine more accurately whether a person is speaking, and as a result more accurate association becomes possible.

The sign language video synthesizing device according to the present invention may further include a display determination unit that determines whether to display a sign language video, and the video composition unit need not synthesize a sign language video that the display determination unit has determined not to display.
With this configuration, for example, when display of a sign language video is determined to be unnecessary, part of the program video can be prevented from being hidden by that unnecessary sign language video.

In the sign language video synthesizing device according to the present invention, the display determination unit may determine not to display a sign language video when there is no movement in that sign language video.
With this configuration, a sign language video without movement is not displayed. This is because, when there is no movement in the sign language video, no signing is being performed, and displaying such a video is considered to serve no purpose.
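The display determination based on movement might look like the following sketch. It assumes grayscale frames given as nested lists of pixel values and measures movement as the mean absolute difference between the first and last frame of a short window; the names `motion_degree` and `should_display` and the default threshold are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # grayscale pixel values, row by row


def motion_degree(first: Frame, last: Frame) -> float:
    """Mean absolute pixel difference between two frames of equal size."""
    total = 0
    count = 0
    for row_a, row_b in zip(first, last):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def should_display(first: Frame, last: Frame, threshold: float = 1.0) -> bool:
    """Display the sign language video only if it moved at least `threshold` on average."""
    return motion_degree(first, last) >= threshold
```

A real system would likely smooth this decision over time so that brief pauses in signing do not make the inset flicker.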

The sign language video synthesizing device according to the present invention may further include: a program-related information receiving unit that receives program-related information, which is information indicating an attribute of the program video received by the program video receiving unit; a correspondence information storage unit that stores correspondence information, which is information associating program-related information with size information indicating the display size of the sign language video; and an acquisition unit that acquires the size information associated, by the correspondence information, with the program-related information received by the program-related information receiving unit. The video composition unit may then synthesize, with the program video, a sign language video of the size indicated by the size information acquired by the acquisition unit.

With this configuration, the size of the displayed sign language video can be changed according to the program-related information. For example, for program-related information indicating that the sign language video is important, the size indicated by the corresponding size information can be set large in advance, so that the sign language video is displayed large for the program video corresponding to that program-related information.

In the sign language video synthesizing device according to the present invention, the program-related information may include information indicating the genre of the video program.
With this configuration, the size of the sign language video can be switched according to the genre of the video program. For example, the sign language video is considered important in news, so its size can be set large. In sports such as baseball and soccer, on the other hand, the sign language video is considered less important, so its size can be set small.
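The lookup from genre to display size can be illustrated with a small table. The table contents, the pixel sizes, the default size, and the names `CORRESPONDENCE` and `acquire_size` are invented for this sketch; the source states only that news may use a larger size and sports a smaller one.

```python
from typing import Dict, Tuple

# Hypothetical correspondence information: genre -> (width, height) in pixels.
CORRESPONDENCE: Dict[str, Tuple[int, int]] = {
    "news": (320, 240),    # sign language video considered important: large
    "sports": (160, 120),  # considered less important: small
}
DEFAULT_SIZE: Tuple[int, int] = (240, 180)  # assumed fallback for unlisted genres


def acquire_size(program_related_info: Dict[str, str]) -> Tuple[int, int]:
    """Return the size information associated with the genre in the program-related information."""
    genre = program_related_info.get("genre", "")
    return CORRESPONDENCE.get(genre, DEFAULT_SIZE)
```

For example, `acquire_size({"genre": "news"})` yields the larger size from the table, and an unlisted genre falls back to the assumed default.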

In the sign language video synthesizing device according to the present invention, the display position setting unit may set a predetermined position as the display position of the sign language video when the person area specifying unit cannot specify a person area.
With this configuration, the sign language video can at least be displayed even when no person area can be specified.
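The fallback behavior can be sketched as a single branch. The value of `DEFAULT_POSITION`, the rectangle layout, and the choice to place the video to the right of the first person area are assumptions for illustration only.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) of a person area

DEFAULT_POSITION: Tuple[int, int] = (480, 20)  # hypothetical predetermined position


def set_display_position(person_areas: List[Rect]) -> Tuple[int, int]:
    """Use a position next to the first person area, or the predetermined one if none was found."""
    if not person_areas:
        # No person area could be specified: fall back to the predetermined position.
        return DEFAULT_POSITION
    x, y, w, _h = person_areas[0]
    return (x + w, y)  # immediately to the right of the person area
```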

Another sign language video synthesizing device according to the present invention includes: a program video receiving unit that receives a program video, which is the video of a program; a sign language video receiving unit that receives a plurality of sign language videos, which are videos of sign language corresponding to the program video; a video composition unit that generates a composite video in which the plurality of sign language videos are synthesized with the program video; and a video output unit that outputs the composite video.

Another sign language video synthesizing device according to the present invention includes: a program video receiving unit that receives a program video, which is the video of a program; a sign language video receiving unit that receives a sign language video, which is a video of sign language corresponding to the program video; a display determination unit that determines whether to display the sign language video; a video composition unit that generates a composite video, the composite video being a video in which the sign language video is synthesized with the program video when the display determination unit determines that the sign language video is to be displayed, and being the sign language video when the display determination unit determines that the sign language video is not to be displayed; and a video output unit that outputs the composite video.

A sign language display position setting device according to the present invention includes: a program video receiving unit that receives a program video, which is the video of a program; a person area specifying unit that specifies a person area in the program video; a display position setting unit that sets the display position of a sign language video at a position adjacent to the person area specified by the person area specifying unit; and an output unit that outputs position information, which is information indicating the display position, set by the display position setting unit, in the program video.
With this configuration, the sign language display position setting device can generate position information indicating the position at which the sign language video is to be synthesized. That position information can then be used, for example, by a sign language video synthesizing device to synthesize the program video and the sign language video.

Another sign language video synthesizing device according to the present invention includes: a program video receiving unit that receives a program video, which is the video of a program; a sign language video receiving unit that receives a sign language video, which is a video of sign language corresponding to the program video; a position information receiving unit that receives position information, which is information indicating the display position of the sign language video in the program video; a video composition unit that generates a composite video in which the sign language video is synthesized at the display position, indicated by the position information, in the program video; and a video output unit that outputs the composite video.
With this configuration, the sign language video synthesizing device can synthesize the sign language video at the position indicated by the position information. The position information may be, for example, information generated by the sign language display position setting device described above.

According to the sign language video synthesizing device and the like of the present invention, it is possible, for example, to generate and output a composite video that displays a sign language video at a position adjacent to a person area in a program video.

Block diagram showing the configuration of the sign language video synthesizing device according to Embodiment 1 of the present invention
Flowchart showing the operation of the sign language video synthesizing device according to the same embodiment
Diagram showing an example of correspondence information in the same embodiment
Diagram showing an example of a program video in the same embodiment
Diagram showing an example of a sign language video in the same embodiment
Diagram for explaining the specified person areas in the same embodiment
Diagram showing an example of person area specifying information in the same embodiment
Diagram for explaining the setting of the display position of the sign language video in the same embodiment
Diagram showing an example of a composite video in the same embodiment
Diagram showing an example of a program video in the same embodiment
Diagram showing an example of a composite video in the same embodiment
Diagram showing an example of a program video in the same embodiment
Diagram for explaining the specified person areas and mouth areas in the same embodiment
Diagram showing an example of person area specifying information in the same embodiment
Diagram showing an example of a composite video in the same embodiment
Diagram showing an example of a composite video in the same embodiment
Block diagram showing the configuration of the sign language video synthesizing device according to Embodiment 2 of the present invention
Flowchart showing the operation of the sign language video synthesizing device according to the same embodiment
Flowchart showing the operation of the sign language video synthesizing device according to the same embodiment
Diagram showing an example of correspondence result information in the same embodiment
Diagram showing an example of a composite video in the same embodiment
Block diagram showing the configuration of a sign language display position setting device and a sign language video synthesizing device according to another embodiment
Block diagram showing the configuration of a sign language display position setting device and a sign language video synthesizing device according to another embodiment
Schematic diagram showing an example of the external appearance of the computer system in the above embodiments
Diagram showing an example of the configuration of the computer system in the above embodiments

Hereinafter, the sign language video synthesizing device and the sign language display position setting device according to the present invention will be described by way of embodiments. In the following embodiments, components and steps denoted by the same reference numerals are identical or equivalent, and their description may not be repeated.

(Embodiment 1)
A sign language video synthesizing device according to Embodiment 1 of the present invention will be described with reference to the drawings. The sign language video synthesizing device according to this embodiment recognizes a person area in a program video and displays a sign language video at a position adjacent to that person area.

FIG. 1 is a block diagram showing the configuration of the sign language video synthesizing device 1 according to this embodiment. The sign language video synthesizing device 1 according to this embodiment includes a program video receiving unit 11, a sign language video receiving unit 12, a person area specifying unit 13, a speaker specifying unit 14, a display position setting unit 15, a display determination unit 16, a program-related information receiving unit 17, a correspondence information storage unit 18, an acquisition unit 19, a video composition unit 20, and a video output unit 21.

The program video receiving unit 11 receives a program video, which is the video of a program. The program video may be of any genre, for example a drama, movie, news program, documentary, sports program, or variety show. The data format of the program video is likewise not restricted. For example, the program video may be analog data or digital data. In the latter case, the format may be, for example, MPEG (Moving Picture Experts Group), AVI (Audio Video Interleave), or some other format. When the program video is compressed, the compression format is not restricted either. The program video may or may not include sound information.

The program video receiving unit 11 may, for example, receive a program video input from a device such as a camera, receive a program video transmitted over a wired or wireless communication line, or accept a program video read from a predetermined recording medium (for example, an optical disc, a magnetic disk, or a semiconductor memory). This embodiment describes the case of receiving a broadcast program video. The program video receiving unit 11 may or may not include a device for receiving (for example, a modem, a network card, or a tuner). The program video receiving unit 11 may be implemented in hardware, or in software such as a driver that drives a predetermined device.

The sign language video receiving unit 12 receives a sign language video, which is a video of sign language corresponding to the program video. This sign language video corresponds to the program video received by the program video receiving unit 11. A hearing-impaired person can therefore learn the audio information of the program video by watching the program video together with its corresponding sign language video. The sign language video may be, for example, CG (computer graphics) video, animation video, or live-action video. How the sign language video is generated does not matter. For example, a broadcasting station may generate the program video and the sign language video and broadcast them in synchronization; a sign language interpreter may interpret the program video (which may be broadcast or read from a recording medium), and the sign language video may be generated by filming that interpreter; the sign language video may be generated automatically from text information corresponding to the program video (for example, subtitle information corresponding to the program video, or text obtained by speech recognition of the audio information corresponding to the program video); or it may be generated by some other method.

The sign language video receiving unit 12 may, for example, receive a sign language video input from a device such as a camera, receive a sign language video transmitted over a wired or wireless communication line, or accept a sign language video read from a predetermined recording medium (for example, an optical disc, a magnetic disk, or a semiconductor memory). The sign language video receiving unit 12 may or may not include a device for receiving (for example, a modem or a network card). The sign language video receiving unit 12 may be implemented in hardware, or in software such as a driver that drives a predetermined device.

It is preferable that the program video received by the program video receiving unit 11 and the sign language video received by the sign language video receiving unit 12 can be synchronized, that is, that the correspondence between their temporal positions is known. For example, a program video and a sign language video whose temporal positions correspond may be received simultaneously by the program video receiving unit 11 and the sign language video receiving unit 12, respectively; or both may contain time codes that can be used to synchronize them. In the latter case, the same time code may be assigned to synchronized temporal positions, or the two sets of time codes may be assigned independently with separate information that associates the time codes to be synchronized. When such separate associating information exists, it may be superimposed on, for example, the program information or the sign language information.
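The two time code arrangements described above can be sketched with one lookup function. The frame store, the string time codes, and the names `find_sign_frame` and `tc_map` are assumptions for this example; the source does not define a data layout.

```python
from typing import Dict, Optional


def find_sign_frame(program_tc: str,
                    sign_frames: Dict[str, str],
                    tc_map: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the sign language frame synchronized with a program time code, if any."""
    # Case 1: both videos share the same time codes, so look up directly.
    # Case 2: time codes were assigned independently, so translate through the mapping first.
    sign_tc = program_tc if tc_map is None else tc_map.get(program_tc)
    if sign_tc is None:
        return None
    return sign_frames.get(sign_tc)
```

Passing no mapping corresponds to shared time codes; passing `tc_map` corresponds to the case where separate associating information exists.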

The person area specifying unit 13 specifies a person area in the program video. When the program video contains a plurality of person areas, the person area specifying unit 13 may specify each of them. A person may be a live-action human or a character in CG or animation. The person area specifying unit 13 may specify a person area from the moving image of the program video, or from a single frame (still image) contained in that moving image. The person area specifying unit 13 specifies the image area of a person appearing in the program video. Specifying a person area may mean, for example, specifying the image area of the whole person or specifying the image area of part of the person (for example, the face).
Known methods for specifying the image area of a person's face include, for example, extracting the face area using feature points such as the eyes, nose, mouth, and ears, and extracting the face area using the skin color of the face. Methods for specifying the image area of the whole person include, for example, extracting a background difference; holding a pattern representing the shape of a person and specifying the person area by pattern matching; and extracting the face area of a person and then extracting the area of the whole person based on that face area. Methods for extracting the whole-person area from the extracted face area include, for example, extending the contour of the face with a contour extraction algorithm to extract the image area of the whole person, and holding a pattern representing the shape of a human body and extracting, by pattern matching, the image area of the whole person that continues from the extracted face area. Needless to say, other methods may also be used to specify the image area of a person. Methods for specifying a person's image area are conventionally known, and their detailed description is omitted. The person area specifying unit 13 may also store information indicating a specified person area in association with a person ID that identifies the person. The person area may change over time, but even then it is preferable that the same person ID be associated with information indicating the area of the same person. For example, techniques for tracking the area of the same person by tracking a group of feature points are already known, and their detailed description is omitted.

When the person area specifying unit 13 has specified a plurality of person areas, the speaker specifying unit 14 specifies, among those areas, the area of the person who is the speaker. For example, the speaker specifying unit 14 specifies the mouth area within each person area specified by the person area specifying unit 13, using the feature points of the mouth. The speaker specifying unit 14 then performs motion detection on the mouth area and, when there is motion (for example, when the detected motion is equal to or greater than a threshold), may specify the person area containing that mouth area as the speaker's area. Motion detection methods such as the block matching method and the gradient method are already known, and a detailed description is omitted. Motion detection is assumed to be performed using the program video from a time point a fixed period before the detection time up to the detection time. The same applies when other components perform motion detection. For motion detection in the speaker specifying unit 14, the fixed period is preferably not very long, because the aim is to know whether there is motion at the detection time. For example, the fixed period may be set to about 2 seconds. Motion detection may also be performed, for example, by computing the similarity between the first frame and the last frame of the fixed period: the higher the similarity, the smaller the motion, and the lower the similarity, the larger the motion. Because it is preferable that exactly one of the person areas be specified as the speaker's, when motion is detected in the mouth areas of several person areas, the person area corresponding to the mouth area with the largest detected motion may be specified as the speaker's. Although the case of specifying the speaker by motion detection on the mouth area has been described here, the speaker may instead be specified by motion detection on the face area. Specifically, specifying the speaker's person area may mean setting a flag indicating the speaker in association with information identifying that person area, accumulating the information identifying that person area in a predetermined storage area, or some other method.
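
The first-frame/last-frame comparison and the "largest motion wins" rule can be sketched as follows. This is a minimal illustration, assuming grayscale frames stored as nested lists and mouth areas given as `(x, y, width, height)` boxes. The mean absolute difference stands in for the unspecified similarity measure, and the threshold value and function names are hypothetical.

```python
from typing import Optional

Frame = list  # grayscale image as rows of 0-255 values (assumed representation)
Box = tuple   # (x, y, width, height)


def region_motion(first: Frame, last: Frame, box: Box) -> float:
    """Mean absolute pixel difference inside box between two frames.

    This is a simple dissimilarity score: 0.0 means the two frames are
    identical in the region, and larger values mean more motion.
    """
    x, y, w, h = box
    total = sum(
        abs(first[row][col] - last[row][col])
        for row in range(y, y + h)
        for col in range(x, x + w)
    )
    return total / (w * h)


def identify_speaker(first: Frame, last: Frame, mouths: dict,
                     threshold: float = 8.0) -> Optional[str]:
    """Return the person ID whose mouth area moved most, or None.

    first and last are the earliest and latest frames of the detection
    window (about 2 seconds in the text). threshold is an assumed value.
    """
    scores = {pid: region_motion(first, last, box) for pid, box in mouths.items()}
    moving = {pid: score for pid, score in scores.items() if score >= threshold}
    if not moving:
        return None  # no mouth area moved enough; no speaker is specified
    return max(moving, key=moving.get)
```

Returning `None` when no mouth area reaches the threshold leaves the caller free to fall back to a predetermined display position.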

The display position setting unit 15 sets the display position of the sign language video at a position adjacent to the person area specified by the person area specifying unit 13. The position adjacent to the person area may be to the right of, to the left of, above, or below the person area. “Adjacent” may mean that there is no space at all between the person area and the sign language video, or it may include a small space between them. Even in the latter case, the space is preferably small enough that the correspondence between the person and the sign language video remains apparent. Being adjacent to the person area also means that the sign language video does not overlap the person area. Setting the display position of the sign language video may mean, for example, when the sign language video is rectangular, setting the position in the program video of a specific point of the sign language video (for example, any vertex, the center point, or some other point). This embodiment describes the case in which the display position setting unit 15 sets information identifying a vertex of the sign language video (for example, information indicating the upper-left vertex) and information indicating the position in the program video of the vertex so identified. When the person area specifying unit 13 has specified a plurality of person areas and the speaker specifying unit 14 has specified the speaker's person area, the display position setting unit 15 sets the display position of the sign language video at a position adjacent to the person area corresponding to the speaker specified by the speaker specifying unit 14. When the person area specifying unit 13 could not specify a person area, the display position setting unit 15 may set a predetermined position as the display position of the sign language video. Likewise, when the person area specifying unit 13 specified a plurality of person areas but the speaker specifying unit 14 did not specify a speaker, the display position setting unit 15 may set a predetermined position as the display position. The predetermined position may be stored, for example, in a recording medium (not shown) accessible to the display position setting unit 15.
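
A minimal sketch of choosing an adjacent position for the upper-left vertex is given below. The candidate order (right, left, above, below), the `gap` parameter, and the function name are assumptions for illustration. The text only requires that the position be adjacent, not overlap the person area, and fall back to a predetermined position when nothing suitable exists.

```python
from typing import Optional


def adjacent_position(person: Optional[tuple], sign_size: tuple,
                      frame_size: tuple, default: tuple, gap: int = 0) -> tuple:
    """Return the upper-left corner (x, y) for the sign language video.

    person is (x, y, width, height) or None. Candidates are tried in the
    order right, left, above, below (an assumed preference). A candidate is
    accepted only if the whole sign language video fits inside the program
    frame; by construction it never overlaps the person area. If nothing
    fits, or no person area is given, the default position is returned.
    """
    if person is None:
        return default
    px, py, pw, ph = person
    sw, sh = sign_size
    fw, fh = frame_size
    # Clamp the free axis so the sign language video stays inside the frame.
    side_y = min(max(py, 0), max(fh - sh, 0))
    side_x = min(max(px, 0), max(fw - sw, 0))
    candidates = [
        (px + pw + gap, side_y),  # right of the person
        (px - gap - sw, side_y),  # left of the person
        (side_x, py - gap - sh),  # above the person
        (side_x, py + ph + gap),  # below the person
    ]
    for x, y in candidates:
        if 0 <= x and x + sw <= fw and 0 <= y and y + sh <= fh:
            return (x, y)
    return default
```

Keeping `gap` small preserves the visual link between the speaker and the sign language video.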

The display determination unit 16 determines whether to display the sign language video. This determination is made because it is not appropriate for the sign language video to occupy part of the program video even while no signing is shown in it. The display determination unit 16 therefore preferably determines whether signing is being shown in the sign language video. To do so, the display determination unit 16 may, for example, perform motion detection on the sign language video, determining not to display it when there is no motion and to display it when there is motion. “No motion” may mean no motion at all (that is, two temporally adjacent frames are exactly identical), or may include the case where the motion is at or below a threshold. Motion detection is as described above, and a detailed description is omitted. When a fixed period of sign language video is used for motion detection, that period may be about the same as, or longer than, the fixed period used by the speaker specifying unit 14, because whether to display may be judged on a relatively long cycle. The display determination unit 16 may also determine not to display the sign language video when a blank sign language video containing no image of a sign language interpreter is received, or when no sign language video is received at all. When the program video includes sound information, the display determination unit 16 may determine not to display the sign language video when the sound information contains no uttered speech, that is, only music or sound effects, and to display it when the sound information contains uttered speech. This is because, when the sound information contains uttered speech, a meaningful sign language video corresponding to that speech can be expected to exist. The display determination unit 16 may determine whether the sound information contains uttered speech by, for example, determining whether it contains acoustic features corresponding to speech; if it does, the sound information contains speech. That determination may be made using, for example, an acoustic model. Alternatively, the display determination unit 16 may run existing speech recognition on the sound information and determine that the sound information contains uttered speech when the result is meaningful text, and that it does not otherwise. Whether the result is meaningful text can be judged from the likelihood obtained during speech recognition: when the likelihood is below a preset threshold, recognition into meaningful text failed, and the sound information can be judged to contain no uttered speech. Features of uttered speech (for example, features relating to changes in frequency or intensity) may also be held in advance, and whether the sound information contains those features may be used to judge whether it contains uttered speech. If the features are present, the sound information is judged to be uttered speech; if not, it is judged not to be. Needless to say, other methods may be used to determine whether the sound information contains uttered speech.
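
The sketch below combines two of the criteria described above: motion in the sign language video and the likelihood from speech recognition. The text presents these as separate options, so applying both together is an assumption of this example, as are the threshold values and the name `should_display`.

```python
from typing import Optional, Sequence


def should_display(motion_scores: Sequence[float],
                   speech_likelihood: Optional[float] = None,
                   motion_threshold: float = 2.0,
                   likelihood_threshold: float = 0.5) -> bool:
    """Decide whether the sign language video should be displayed.

    motion_scores: motion values measured over the recent sign language
        video (empty when no sign language video has been received).
    speech_likelihood: likelihood reported by a speech recognizer for the
        program audio, or None when the program carries no sound information.
    """
    if not motion_scores:
        return False  # no sign language video was received
    if max(motion_scores) <= motion_threshold:
        return False  # motion at or below the threshold counts as no signing
    if speech_likelihood is not None and speech_likelihood < likelihood_threshold:
        return False  # audio is judged to hold no uttered speech
    return True
```

A blank sign language video would produce motion scores at or near zero and is therefore rejected by the motion check.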

The program related information receiving unit 17 receives program related information, which is information indicating attributes of the program video received by the program video receiving unit 11. The program related information may include, for example, information indicating the genre of the program video, information indicating the name of the program video, information about the content of the program (for example, the topic of the program or the names of actors appearing in it), descriptive information about the program, or other information. This embodiment describes the case in which the program related information is information indicating the genre of the program video. The program related information may be EPG (Electronic Program Guide) information itself or a part of it. This EPG information is sometimes called, for example, SI information (official program information).

The process by which the program related information receiving unit 17 receives program related information does not matter. The program related information receiving unit 17 may, for example, receive program related information from a server on a network such as the Internet, receive broadcast program related information, receive program related information superimposed on the program video, or receive program related information read from a predetermined recording medium (for example, an optical disk, a magnetic disk, or a semiconductor memory). In this embodiment, the program related information receiving unit 17 receives program related information superimposed on the program video. The program related information receiving unit 17 may or may not include a device for receiving (for example, a modem or a network card). It may be implemented in hardware, or in software such as a driver that drives a predetermined device.

The correspondence information storage unit 18 stores correspondence information. The correspondence information associates program related information with size information. The size information indicates the display size of the sign language video. It may be, for example, information indicating the size of the sign language video screen itself (for example, its height and width in pixels), information indicating the size of the sign language video screen relative to the program video screen (for example, 50% or 30% in area or length), or, when several sizes for the sign language video screen are set in advance, information identifying one of them (for example, “large”, “medium”, or “small”). Naturally, the display size of the sign language video indicated by the size information is smaller than the display size of the program video.

Here, “associating program related information with size information” means only that the size information can be obtained from the program related information. The correspondence information may therefore contain program related information and size information as a pair, or may be information that links the two. In the latter case, the correspondence information may, for example, associate program related information with a pointer or address indicating where the size information is stored. This embodiment describes the former case. The program related information and the size information need not be directly associated. For example, third information may correspond to the program related information, and the size information may correspond to that third information.
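
As an illustration of correspondence information held as pairs, the sketch below maps a genre to a size label and then resolves the label to pixels. The genre names, labels, percentages, and the fallback to "medium" for an unknown genre are all assumed example values, not data from the specification.

```python
# Assumed example data: genre (program related information) -> size information.
CORRESPONDENCE = {
    "news": "large",
    "drama": "medium",
    "sports": "small",
}

# Assumed meaning of each size label, as a percentage of the program video's
# width and height. Every value is below 100, so the sign language video is
# always smaller than the program video.
SIZE_PERCENT = {"large": 50, "medium": 40, "small": 30}


def acquire_size(genre: str, program_size: tuple,
                 default_label: str = "medium") -> tuple:
    """Resolve a genre to the (width, height) of the sign language video."""
    label = CORRESPONDENCE.get(genre, default_label)  # fallback is an assumption
    percent = SIZE_PERCENT[label]
    width, height = program_size
    return (width * percent // 100, height * percent // 100)
```

Routing the lookup through a label is one way to realize the indirect association mentioned above, with the label playing the role of the third information.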

The process by which the correspondence information comes to be stored in the correspondence information storage unit 18 does not matter. For example, it may be stored there via a recording medium, after being transmitted over a communication line or the like, or after being input through an input device. Storage in the correspondence information storage unit 18 may be temporary storage in RAM or the like, or long-term storage. The correspondence information storage unit 18 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk).

The acquisition unit 19 acquires the size information that the correspondence information associates with the program related information received by the program related information receiving unit 17. Strictly speaking, the acquisition unit 19 acquires the size information associated, in the correspondence information, with the program related information that was received by the program related information receiving unit 17 and that corresponds to the program video into which the video synthesizing unit 20, described later, synthesizes the sign language video. The size information is used to determine the size of the sign language video that the video synthesizing unit 20 synthesizes into the program video. The program video and the program related information are preferably linked so that the program related information corresponding to the program video being synthesized can be identified. For example, a program video and its program related information may be linked by program video identification information or the like. In that case, the acquisition unit 19 can identify the relevant program related information by acquiring the program video identification information of the program video into which the video synthesizing unit 20 synthesizes the sign language video and finding the program related information that corresponds to it.

The video synthesizing unit 20 generates a synthesized video in which the sign language video is synthesized into the program video at the display position set by the display position setting unit 15. Methods for synthesizing another video into part of a video are well known, and their description is omitted. When the display determination unit 16 has determined not to display the sign language video, the video synthesizing unit 20 does not synthesize it; in that case the program video itself serves, unchanged, as the synthesized video. The video synthesizing unit 20 synthesizes into the program video a sign language video of the size indicated by the size information acquired by the acquisition unit 19.
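
For a single frame, the synthesis can be sketched as a resize followed by an overlay. This is a minimal illustration, assuming grayscale frames stored as nested lists and a sign language video that fits entirely inside the program frame at the given position. Nearest-neighbor scaling and the function names are choices made for this example, since the text leaves the synthesis method open.

```python
def resize_nearest(image: list, width: int, height: int) -> list:
    """Scale image to width x height with nearest-neighbor sampling."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[row * src_h // height][col * src_w // width] for col in range(width)]
        for row in range(height)
    ]


def composite(program: list, sign: list, position: tuple, size: tuple,
              display: bool = True) -> list:
    """Overlay the resized sign language frame onto a copy of the program frame.

    position is the upper-left corner (x, y) in program coordinates and size
    is (width, height). When display is False the program frame is returned
    unchanged, matching the case where no sign language video is shown.
    """
    output = [row[:] for row in program]
    if not display:
        return output
    x, y = position
    width, height = size
    scaled = resize_nearest(sign, width, height)
    for row in range(height):
        # Assumes x + width and y + height stay within the program frame.
        output[y + row][x:x + width] = scaled[row]
    return output
```

The program frame is copied rather than modified in place, so the original remains available when the same frame is output without the sign language video.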

The video output unit 21 outputs the synthesized video. As described above, the synthesized video is a program video that partly contains the sign language video when the video synthesizing unit 20 has synthesized the sign language video, and is the program video itself when it has not. The output may be, for example, display on a display device (for example, a CRT or a liquid crystal display), transmission to a predetermined device over a communication line, accumulation in a recording medium, or delivery to another component. The video output unit 21 may or may not include the device that performs the output (for example, a display device). It may be implemented in hardware, or in software such as a driver that drives such a device. When the program video also includes sound information, the video output unit 21 may output that sound information through a speaker.

Needless to say, the program video received by the program video receiving unit 11, the sign language video received by the sign language video receiving unit 12, the information indicating the display position set by the display position setting unit 15, the program related information received by the program related information receiving unit 17, the size information acquired by the acquisition unit 19, and the like may be temporarily stored in a recording medium (not shown).

Next, the operation of the sign language video synthesizing device 1 according to this embodiment will be described with reference to the flowchart of FIG. 2.
(Step S101) The program related information receiving unit 17 determines whether program related information has been received. If it has, the process proceeds to step S102; otherwise, the process proceeds to step S104.

(Step S102) The acquisition unit 19 uses the correspondence information stored in the correspondence information storage unit 18 to acquire the size information corresponding to the program related information received by the program related information receiving unit 17.

(Step S103) The video synthesizing unit 20 temporarily stores the acquired size information in a recording medium (not shown). The process then returns to step S101.

(Step S104) The program video receiving unit 11 determines whether a program video has been received. If it has, the process proceeds to step S105; otherwise, the process proceeds to step S109. The program video receiving unit 11 may receive the program video one frame at a time or several consecutive frames at a time.

(Step S105) The sign language video receiving unit 12 determines whether a sign language video has been received. If it has, the process proceeds to step S106; otherwise, that is, when no sign language video has arrived at the sign language video synthesizing device 1, the process proceeds to step S108. The sign language video receiving unit 12 may receive the sign language video one frame at a time or several consecutive frames at a time.

(Step S106) The display determination unit 16 determines whether to display the sign language video. If it determines to display it, the process proceeds to step S107; otherwise, the process proceeds to step S108. The display determination unit 16 may make this determination using a fixed period (for example, about 1 second or about 3 seconds) of sign language video that has been received so far and is temporarily stored in a recording medium (not shown).

(Step S107) The video synthesizing unit 20 generates a synthesized video in which the sign language video received by the sign language video receiving unit 12 is synthesized into the program video received by the program video receiving unit 11. In doing so, the video synthesizing unit 20 synthesizes the sign language video so that it is displayed at a size corresponding to the size information temporarily stored in step S103, and at the display position temporarily stored in step S114, described later. If the display position has not yet been set at the time of synthesis, the sign language video may be synthesized so that it is displayed at a predetermined position.

(Step S108) The video output unit 21 outputs the synthesized video generated by the video synthesizing unit 20. The process then returns to step S101.

(Step S109) The display position setting unit 15 determines whether to set the display position. If so, the process proceeds to step S110; otherwise, it returns to step S101. The display position setting unit 15 may determine to set the display position periodically (for example, every 2 seconds or every 10 seconds). Because the sign language video becomes hard to watch if its display position changes frequently, the display position is preferably set at a frequency low enough that the position does not change often.

(Step S110) The person area specifying unit 13 specifies the person area in the program video.

(Step S111) The speaker specifying unit 14 determines whether a plurality of person areas have been specified. If so, the process proceeds to step S112; otherwise, the process proceeds to step S113. The process also proceeds to step S113 when no person area could be specified at all.

(Step S112) The speaker specifying unit 14 specifies the speaker's person area from among the specified person areas.

(Step S113) The display position setting unit 15 sets the display position of the sign language video at a position adjacent to the specified person area. The specified person area is the single person area when one was specified in step S110, and the speaker's person area specified in step S112 when two or more were specified. The display position setting unit 15 sets the position so that the entire sign language video is displayed within the bounds of the program video. When no person area could be specified in step S110, the display position setting unit 15 sets a predetermined position as the display position of the sign language video. The display position setting unit 15 may also set a predetermined position as the display position when no position could be found that keeps the entire sign language video within the program video, or when a plurality of person areas were specified but the speaker's person area could not be specified in step S112.

（ステップＳ１１４）映像合成部２０は、表示位置設定部１５によって設定された表示位置を示す情報を図示しない記録媒体において一時的に記憶する。そして、ステップＳ１０１に戻る。 (Step S114) The video composition unit 20 temporarily stores information indicating the display position set by the display position setting unit 15 in a recording medium (not shown). Then, the process returns to step S101.
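The branching in steps S111 to S113 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `choose_target_area` and the use of `None` to mean "fall back to the predetermined position" are hypothetical.

```python
from typing import Optional, Sequence


def choose_target_area(area_ids: Sequence[str],
                       speaker_id: Optional[str]) -> Optional[str]:
    """Pick the person area next to which the sign language video is placed.

    area_ids   -- IDs of the person areas found in step S110 (may be empty)
    speaker_id -- ID chosen in step S112, or None if no speaker was identified
    Returns None when the predetermined position should be used instead.
    """
    if len(area_ids) == 0:
        # No person area could be specified: use the predetermined position.
        return None
    if len(area_ids) == 1:
        # Exactly one area: S112 is skipped and that area is used.
        return area_ids[0]
    # Two or more areas: use the speaker's area; None if S112 failed.
    return speaker_id
```

Returning `None` in both failure cases keeps the fallback to the predetermined position in one place in the caller.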

なお、図２のフローチャートにおいて、ステップＳ１０４，Ｓ１０５において、番組映像と手話映像とが時間的に直列的に受け付けられる場合について説明したが、そうでなくてもよい。例えば、並列して両映像が受け付けられ、受け付けられた手話映像は、図示しない記録媒体においてバッファリングされていてもよい。そして、番組映像が受け付けられた際に、バッファリングされており、まだ合成されていない手話映像が存在する場合には、ステップＳ１０６に進み、そうでない場合には、ステップＳ１０８に進んでもよい。なお、図２のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。 In the flowchart of FIG. 2, the case where the program video and the sign language video are received serially in time in steps S104 and S105 has been described, but this need not be the case. For example, both videos may be accepted in parallel, and the accepted sign language video may be buffered in a recording medium (not shown). Then, when a program video is received, if there is a buffered sign language video that has not been synthesized yet, the process may proceed to step S106, and if not, the process may proceed to step S108. In the flowchart of FIG. 2, the process is terminated by powering off or a process termination interrupt.

また、一の番組映像については、その番組映像に対応する番組関連情報は同じであるため、番組関連情報受付部１７は、新たな番組映像が受け付けられるごとに、その番組映像に対応する番組関連情報を受け付けることが好適である。また、一の番組映像の受け付けが行われている際に、番組関連情報の受け付けが複数行われる場合には、最新の番組関連情報を一時的に記憶しておき、その最新の番組関連情報と異なる番組関連情報が受け付けられた場合にのみ、ステップＳ１０２に進み、そうでない場合には、ステップＳ１０４に進むようにしてもよい。 Since the program related information is the same throughout a single program video, it is preferable that the program related information receiving unit 17 receive the program related information corresponding to a program video each time a new program video is received. If program related information is received multiple times while one program video is being received, the latest program related information may be stored temporarily; the process then proceeds to step S102 only when program related information different from the stored information is received, and proceeds to step S104 otherwise.
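The "proceed to step S102 only when the information differs" behavior amounts to remembering the most recent value and comparing against it. A minimal sketch, assuming program related information can be compared as a plain string; the class name `ProgramInfoTracker` is hypothetical:

```python
from typing import Optional


class ProgramInfoTracker:
    """Remembers the latest program related information that was received."""

    def __init__(self) -> None:
        self._latest: Optional[str] = None

    def changed(self, info: str) -> bool:
        """Return True (go to S102) only if info differs from the stored one."""
        if info == self._latest:
            return False  # same as before: skip the size lookup, go to S104
        self._latest = info
        return True
```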

次に、本実施の形態による手話映像合成装置１の動作について、具体例を用いて説明する。この具体例では、手話映像合成装置１が放送された番組映像、及び手話映像を受信し、その番組映像に重畳されている番組関連情報を用いて大きさ情報の取得が行われるものとする。したがって、番組映像に、その番組映像に対応する番組関連情報が重畳されていることによって、番組映像と番組関連情報との紐付けが行われていることになる。 Next, the operation of the sign language video synthesizing apparatus 1 according to the present embodiment will be described using a specific example. In this specific example, it is assumed that the sign language video synthesizing apparatus 1 receives the broadcast program video and the sign language video, and obtains the size information using the program related information superimposed on the program video. Therefore, the program video and the program related information are linked by superimposing the program related information corresponding to the program video on the program video.

また、この具体例において、対応情報記憶部１８では、図３で示される対応情報が記憶されているものとする。図３において、番組のジャンルを示す番組関連情報と、大きさ情報とが対応付けられている。大きさ情報は、手話映像の画面の高さ及び幅を示すものである。例えば、番組関連情報「ニュース」には、大きさ情報（Ｈ１，Ｗ１）が対応付けられている。したがって、ニュースの番組映像の場合には、画面の高さがＨ１となり、画面の幅がＷ１となるように手話映像が表示されることになる。 In this specific example, the correspondence information storage unit 18 stores the correspondence information shown in FIG. In FIG. 3, the program related information indicating the genre of the program is associated with the size information. The size information indicates the height and width of the sign language video screen. For example, the program related information “news” is associated with size information (H1, W1). Therefore, in the case of a news program video, the sign language video is displayed such that the height of the screen is H1 and the width of the screen is W1.

まず、ユーザが、手話映像合成装置１を操作することによって、ニュース番組を見るようにチャンネル設定を行ったとする。すると、そのチャンネルの図４で示される番組映像が番組映像受付部１１で受信され（ステップＳ１０４）、その番組映像に対応する図５で示される手話映像が手話映像受付部１２で受信される（ステップＳ１０５）。そして、表示判断部１６は、手話映像を表示するかどうか判断する（ステップＳ１０６）。なお、この段階では、判断できるだけの手話映像が受信されていないため、表示判断部１６は、手話映像を表示しないと判断するものとする。すると、映像合成部２０は、手話映像の合成されていない、番組映像そのものである合成映像を映像出力部２１に渡し、映像出力部２１は、その合成映像をディスプレイに表示する（ステップＳ１０８）。その結果、図４で示される表示が行われることになる。この番組映像受付部１１による番組映像の受信と、映像出力部２１による番組映像の表示とは、一定の手話映像が蓄積されて表示判断部１６が手話映像を表示すると判断するまで繰り返して実行されることになる。 First, assume that the user operates the sign language video synthesizing apparatus 1 to select a channel showing a news program. The program video of that channel, shown in FIG. 4, is then received by the program video reception unit 11 (step S104), and the corresponding sign language video, shown in FIG. 5, is received by the sign language video reception unit 12 (step S105). The display determination unit 16 then determines whether or not to display the sign language video (step S106). At this stage, not enough sign language video has been received to make the determination, so the display determination unit 16 determines that the sign language video is not to be displayed. The video composition unit 20 therefore passes to the video output unit 21 a composite video that is simply the program video, with no sign language video composited into it, and the video output unit 21 displays that composite video on the display (step S108). As a result, the display shown in FIG. 4 is produced. The reception of the program video by the program video reception unit 11 and its display by the video output unit 21 are repeated until a certain amount of sign language video has accumulated and the display determination unit 16 determines that the sign language video is to be displayed.

なお、その番組映像の受信に応じて、その番組映像に重畳されている番組関連情報「ニュース」が番組映像受付部１１によって抽出され、その抽出された番組関連情報が図示しない経路を介して番組関連情報受付部１７に渡されたとする。番組関連情報受付部１７は、その番組関連情報を受け付けると、その番組関連情報を取得部１９に渡す（ステップＳ１０１）。取得部１９は、受け取った番組関連情報「ニュース」を検索キーとして図３で示される対応情報の番組関連情報を検索する。すると、１番目のレコードがヒットするため、取得部１９は、その１番目のレコードから大きさ情報（Ｈ１，Ｗ１）を取得して映像合成部２０に渡す（ステップＳ１０２）。映像合成部２０は、受け取った大きさ情報を、図示しない記録媒体に蓄積する（ステップＳ１０３）。 In response to the reception of the program video, program-related information “news” superimposed on the program video is extracted by the program video reception unit 11, and the extracted program-related information is transmitted via a route (not shown). It is assumed that the information is passed to the related information receiving unit 17. When receiving the program related information, the program related information receiving unit 17 passes the program related information to the acquiring unit 19 (step S101). The acquisition unit 19 searches for the program related information of the corresponding information shown in FIG. 3 using the received program related information “news” as a search key. Then, since the first record is hit, the acquisition unit 19 acquires size information (H1, W1) from the first record and passes it to the video composition unit 20 (step S102). The video composition unit 20 accumulates the received size information in a recording medium (not shown) (step S103).
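The lookup performed by the acquisition unit 19 in step S102 is a keyed search of the correspondence information. The sketch below assumes the table of FIG. 3 can be held as a dictionary. The pixel values stand in for (H1, W1) and the second row is illustrative only, since the source gives neither concrete numbers nor the remaining rows; `acquire_size` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

Size = Tuple[int, int]  # (height, width) of the sign language video, in pixels

# Hypothetical stand-in for the correspondence information of FIG. 3.
# The numbers are placeholders, not values taken from the source.
CORRESPONDENCE: Dict[str, Size] = {
    "news": (360, 240),    # plays the role of (H1, W1)
    "sports": (180, 120),  # illustrative second record
}


def acquire_size(program_info: str,
                 table: Dict[str, Size] = CORRESPONDENCE) -> Optional[Size]:
    """Return the size information for the given program related information.

    Returns None when no record matches; how the apparatus behaves in that
    case is not stated in this passage.
    """
    return table.get(program_info)
```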

また、番組映像の受信が開始されたため、表示位置設定部１５は、表示位置の設定を行うと判断し、人物領域特定部１３に対して、人物の領域を特定する処理を行う旨の指示を渡す（ステップＳ１０９）。すると、人物領域特定部１３は、番組映像受付部１１が受け付けた図４で示される番組映像において、人物の領域を特定する（ステップＳ１１０）。その結果、図６の番組映像における太い曲線で囲まれた領域である人物の領域が特定されたとする。その人物の領域の特定に応じて、人物領域特定部１３は、図７で示される人物領域特定情報を生成し、図示しない記録媒体に蓄積する。なお、図７の人物領域特定情報において、人物ＩＤと、領域情報とが対応付けられている。人物ＩＤは、特定した人物の領域ごとに人物領域特定部１３が自動的に付与する識別情報である。また、領域情報は、図６の太い曲線を示す座標値（ピクセル値）である。各座標値（ｘ１，ｙ１）、（ｘ２，ｙ２）…等は、図６の太い曲線に対応する各ピクセルの座標値であってもよく、あるいは、図６の太い曲線に対応する各ピクセルから選択された飛び飛びの座標値（例えば、１０ピクセルごとの座標値）であってもよい。結果として、この領域情報を用いて、人物の領域を特定することができるのであれば、領域情報の内容は問わない。 In addition, since the reception of the program video is started, the display position setting unit 15 determines to set the display position, and instructs the person region specifying unit 13 to perform processing for specifying the person region. (Step S109). Then, the person area specifying unit 13 specifies a person area in the program video shown in FIG. 4 received by the program video receiving unit 11 (step S110). As a result, it is assumed that an area of a person that is an area surrounded by a thick curve in the program video in FIG. 6 is specified. In response to specifying the person area, the person area specifying unit 13 generates the person area specifying information shown in FIG. 7 and stores it in a recording medium (not shown). In the person area specifying information in FIG. 7, the person ID and the area information are associated with each other. The person ID is identification information automatically assigned by the person area specifying unit 13 for each specified person area. The area information is a coordinate value (pixel value) indicating the thick curve in FIG. Each coordinate value (x1, y1), (x2, y2), etc. may be the coordinate value of each pixel corresponding to the thick curve in FIG. 6, or from each pixel corresponding to the thick curve in FIG. It may be a coordinate value of the selected jump (for example, a coordinate value for every 10 pixels). 
As a result, the content of the region information is not limited as long as the region of the person can be specified using the region information.

次に、話者特定部１４は、図７で示される人物領域特定情報を参照し、人物ＩＤが１個しか存在しないため、複数の人物の領域の特定は行われなかったと判断する（ステップＳ１１１）。そして、表示位置設定部１５に表示位置の設定を行う旨の指示を出す。その指示に応じて、表示位置設定部１５は、特定された人物の領域に対応する表示位置の設定の処理を行う（ステップＳ１１３）。 Next, the speaker specifying unit 14 refers to the person area specifying information shown in FIG. 7, and determines that the area of a plurality of persons has not been specified because there is only one person ID (step S111). ). Then, the display position setting unit 15 is instructed to set the display position. In response to the instruction, the display position setting unit 15 performs a process of setting a display position corresponding to the specified person area (step S113).

ここで、表示位置を設定する方法の一例について図８を用いて説明する。まず、表示位置設定部１５は、特定された人物の領域のうち、横方向については最も右側の位置に対応し、上下方向については最も上側の位置に対応する第１の基準点の座標値を取得する。この座標値の取得は、領域情報に含まれる最大のｘ座標の値と、最小のｙ座標の値とを取得することによって行われる。なお、番組映像の座標系では、左上の点が原点であり、その原点から右向きにｘ軸が設定され、下向きにｙ軸が設定されているものとする。その後、表示位置設定部１５は、第１の基準点を手話映像の左上の頂点とする第１の表示位置に手話映像を表示できるかどうか判断する。具体的には、表示位置設定部１５は、映像合成部２０から大きさ情報を受け取り、その大きさ情報を用いて、第１の表示位置における左下の頂点、右上の頂点、右下の頂点のすべてが番組映像内に含まれるかどうか判断する。より具体的には、第１の基準点のｘ座標の値に、Ｗ１を加算したｘ座標の値が、番組映像のｘ座標の最大値を超えているかどうか判断する。そして、超えている場合には、右上と右下の頂点が番組映像内に含まれないことになるため、表示位置設定部１５は、少なくとも１個の頂点が番組映像内に含まれていないと判断する。また、第１の基準点のｙ座標の値に、Ｈ１を加算したｙ座標の値が、番組映像のｙ座標の最大値を超えているかどうか判断する。そして、超えている場合には、左下と右下の頂点が番組映像内に含まれないことになるため、表示位置設定部１５は、少なくとも１個の頂点が番組映像内に含まれていないと判断する。また、第１の基準点のｘ座標の値に、Ｗ１を加算したｘ座標の値が、番組映像のｘ座標の最大値を超えておらず、第１の基準点のｙ座標の値に、Ｈ１を加算したｙ座標の値が、番組映像のｙ座標の最大値を超えていない場合には、表示位置設定部１５は、すべての頂点が番組映像内に含まれると判断する。そして、表示位置設定部１５は、すべての頂点が番組映像内に含まれる場合には、手話映像の表示位置を第１の表示位置に決定する。具体的には、表示位置設定部１５は、表示位置を示す情報として、第１の基準点の座標値と、その座標値に対応するのが手話映像の左上の頂点であることを示す情報（例えば、「左上」でもよい）とを生成し、その表示位置を示す情報を映像合成部２０に渡す。 Here, an example of a method for setting the display position is described with reference to FIG. 8. First, the display position setting unit 15 obtains the coordinates of a first reference point, which corresponds to the rightmost position of the specified person area in the horizontal direction and to its uppermost position in the vertical direction. These coordinates are obtained by taking the maximum x coordinate and the minimum y coordinate contained in the area information. In the coordinate system of the program video, the upper left point is the origin, the x axis points right from the origin, and the y axis points down. The display position setting unit 15 then determines whether the sign language video can be displayed at a first display position, in which the first reference point is the upper left vertex of the sign language video. Specifically, the display position setting unit 15 receives the size information from the video composition unit 20 and uses it to determine whether the lower left, upper right, and lower right vertices at the first display position all lie within the program video. More specifically, it determines whether the x coordinate of the first reference point plus W1 exceeds the maximum x coordinate of the program video. If it does, the upper right and lower right vertices lie outside the program video, so the display position setting unit 15 determines that at least one vertex is not within the program video. It also determines whether the y coordinate of the first reference point plus H1 exceeds the maximum y coordinate of the program video. If it does, the lower left and lower right vertices lie outside the program video, so the display position setting unit 15 again determines that at least one vertex is not within the program video. If neither the x coordinate plus W1 nor the y coordinate plus H1 exceeds the corresponding maximum, the display position setting unit 15 determines that all vertices lie within the program video. In that case, the display position setting unit 15 sets the display position of the sign language video to the first display position. Specifically, it generates, as information indicating the display position, the coordinates of the first reference point together with information indicating that those coordinates correspond to the upper left vertex of the sign language video (for example, "upper left"), and passes this information to the video composition unit 20.
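The first reference point and the containment test for the first display position can be sketched as follows, assuming the area information is a list of (x, y) pixel coordinates and the size information is a (height, width) pair. The function names are hypothetical; the comparisons follow the text (a vertex is outside only if the sum exceeds the maximum coordinate).

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y); origin at upper left, x right, y down


def first_reference_point(contour: List[Point]) -> Point:
    """Rightmost x and uppermost y of the person area (max x, min y)."""
    return (max(x for x, _ in contour), min(y for _, y in contour))


def fits_first_position(ref: Point, size: Tuple[int, int],
                        x_max: int, y_max: int) -> bool:
    """True if a video of the given (height, width), placed with ref as its
    upper left vertex, has all of its vertices inside the program video."""
    x, y = ref
    height, width = size
    # x + width locates the two right vertices, y + height the two lower ones.
    return x + width <= x_max and y + height <= y_max
```

For example, with a contour whose rightmost x is 140 and uppermost y is 50, a 100 by 80 (height by width) video fits a frame whose maximum coordinates are (639, 479), because 140 + 80 and 50 + 100 both stay within the frame.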

また、表示位置設定部１５は、第１の表示位置の少なくとも１個の頂点が番組映像内に含まれていない場合には、第１の基準点を手話映像の左下の頂点とする第２の表示位置に手話映像を表示できるかどうか判断する。この判断の詳細な処理については省略する。そして、第２の表示位置に手話映像を表示できる場合には、表示位置設定部１５は、手話映像の表示位置を第２の表示位置に決定する。具体的には、表示位置設定部１５は、表示位置を示す情報として、第１の基準点の座標値と、その座標値に対応するのが手話映像の左下の頂点であることを示す情報（例えば、「左下」でもよい）とを生成し、その表示位置を示す情報を映像合成部２０に渡す。 If at least one vertex of the first display position is not within the program video, the display position setting unit 15 determines whether the sign language video can be displayed at a second display position, in which the first reference point is the lower left vertex of the sign language video. The details of this determination are omitted. If the sign language video can be displayed at the second display position, the display position setting unit 15 sets the display position of the sign language video to the second display position. Specifically, it generates, as information indicating the display position, the coordinates of the first reference point together with information indicating that those coordinates correspond to the lower left vertex of the sign language video (for example, "lower left"), and passes this information to the video composition unit 20.

また、表示位置設定部１５は、第２の表示位置に手話映像を表示できない場合には、第１の基準点を手話映像の右下の頂点とする第３の表示位置に手話映像を表示できるかどうか判断する。この判断の詳細な処理については省略する。そして、第３の表示位置に手話映像を表示できる場合には、表示位置設定部１５は、手話映像の表示位置を第３の表示位置に決定する。 If the sign language video cannot be displayed at the second display position, the display position setting unit 15 determines whether it can be displayed at a third display position, in which the first reference point is the lower right vertex of the sign language video. The details of this determination are omitted. If the sign language video can be displayed at the third display position, the display position setting unit 15 sets the display position of the sign language video to the third display position.

また、表示位置設定部１５は、第３の表示位置に手話映像を表示できない場合には、第２の基準点を手話映像の右上の頂点とする第４の表示位置に手話映像を表示できるかどうか判断する。この判断の詳細な処理については省略する。なお、第２の基準点は、領域情報に含まれる最小のｘ座標の値と、最小のｙ座標の値とに対応する点である。そして、第４の表示位置に手話映像を表示できる場合には、表示位置設定部１５は、手話映像の表示位置を第４の表示位置に決定する。このようにして、順番に表示位置を変更しながら、手話映像の表示位置を設定する処理を行う。なお、第５の表示位置、第６の表示位置にも手話映像を表示することができなかった場合には、表示位置設定部１５は、図示しない記録媒体から、あらかじめ決められている表示位置を読み出し、手話映像の表示位置を、その読み出した表示位置に設定する。 If the sign language video cannot be displayed at the third display position, the display position setting unit 15 determines whether it can be displayed at a fourth display position, in which a second reference point is the upper right vertex of the sign language video. The details of this determination are omitted. The second reference point is the point corresponding to the minimum x coordinate and the minimum y coordinate contained in the area information. If the sign language video can be displayed at the fourth display position, the display position setting unit 15 sets the display position of the sign language video to the fourth display position. In this way, the display position of the sign language video is set by trying candidate positions in order. If the sign language video cannot be displayed at the fifth or sixth display position either, the display position setting unit 15 reads a predetermined display position from a recording medium (not shown) and sets the display position of the sign language video to the position it read.
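The ordered search over candidate positions can be sketched as below. Several points are assumptions made for this sketch rather than statements from the source: the containment test used for the second to fourth positions (the text omits those details), the lower bound of 0 on coordinates, the `preset` default, and the choice to return the upper left corner of the placed video instead of a reference point plus a vertex label. The fifth and sixth display positions are not defined in this passage, so they are left out.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y); origin at upper left, x right, y down


def set_display_position(contour: List[Point], size: Tuple[int, int],
                         x_max: int, y_max: int,
                         preset: Point = (0, 0)) -> Point:
    """Return the upper left corner at which to composite the sign video.

    Tries the first to fourth display positions in order and falls back to
    the (hypothetical) preset position when none of them fits.
    """
    height, width = size
    x_hi = max(x for x, _ in contour)  # first reference point, x
    x_lo = min(x for x, _ in contour)  # second reference point, x
    y_lo = min(y for _, y in contour)  # both reference points share min y

    candidates = [
        (x_hi, y_lo),                   # 1st: ref 1 is the upper left vertex
        (x_hi, y_lo - height),          # 2nd: ref 1 is the lower left vertex
        (x_hi - width, y_lo - height),  # 3rd: ref 1 is the lower right vertex
        (x_lo - width, y_lo),           # 4th: ref 2 is the upper right vertex
    ]
    for left, top in candidates:
        inside = (left >= 0 and top >= 0
                  and left + width <= x_max and top + height <= y_max)
        if inside:
            return (left, top)
    return preset
```

With a person area near the top right of a 640 by 480 frame, the first three candidates fall outside the frame and the fourth (to the left of the person) is selected.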

なお、この具体例では、手話映像の表示位置が、第１の表示位置に設定されたものとする。すなわち、第１の基準点の座標値（Ａ，Ｂ）と、頂点の位置を示す「左上」とを含む情報が映像合成部２０に渡されたとする。すると、映像合成部２０は、その情報を図示しない記録媒体に蓄積する（ステップＳ１１４）。 In this specific example, it is assumed that the display position of the sign language video is set to the first display position. That is, it is assumed that information including the coordinate value (A, B) of the first reference point and “upper left” indicating the position of the vertex is passed to the video composition unit 20. Then, the video composition unit 20 accumulates the information in a recording medium (not shown) (step S114).

次に、一定の手話映像が蓄積された後に、次の番組映像が番組映像受付部１１で受信され（ステップＳ１０４）、その番組映像に対応する手話映像が手話映像受付部１２で受信されたとする（ステップＳ１０５）。すると、表示判断部１６は、手話映像を表示するかどうか判断する（ステップＳ１０６）。この場合には、すでに過去の手話映像が存在するため、表示判断部１６は、その手話映像を用いて動き検出を行い、動きがしきい値以上であり、手話映像を表示すると判断したとする（ステップＳ１０６）。すると、映像合成部２０は、図９で示されるように、第１の基準点を左上の頂点として、高さがＨ１であり、幅がＷ１である手話映像を番組映像に合成し、その合成後の合成映像を映像出力部２１に渡す（ステップＳ１０７）。映像出力部２１は、その図９で示される合成映像をディスプレイに表示する（ステップＳ１０８）。このようにして、番組映像への手話映像の合成や、定期的な表示位置の設定等が繰り返して実行されることになる。 Next, assume that after a certain amount of sign language video has accumulated, the next program video is received by the program video reception unit 11 (step S104) and the corresponding sign language video is received by the sign language video reception unit 12 (step S105). The display determination unit 16 then determines whether or not to display the sign language video (step S106). In this case, earlier sign language video already exists, so the display determination unit 16 performs motion detection on it; assume that the motion is at or above the threshold and that the unit therefore determines that the sign language video is to be displayed (step S106). The video composition unit 20 then composites the sign language video, with height H1 and width W1 and with the first reference point as its upper left vertex, into the program video as shown in FIG. 9, and passes the resulting composite video to the video output unit 21 (step S107). The video output unit 21 displays the composite video shown in FIG. 9 on the display (step S108). In this way, compositing of the sign language video into the program video, periodic setting of the display position, and so on are executed repeatedly.

なお、そのニュースの番組において、ニュースキャスターの位置が図１０で示されるように左の方に移動したとする。すると、それに応じて、新たな表示位置の設定が行われる（ステップＳ１０９〜Ｓ１１４）。そして、その新たに設定された表示位置に応じて、図１１で示されるように、合成される手話映像の位置も変更されることになる（ステップＳ１０４〜Ｓ１０８）。このように、番組映像において人物が移動しても、その移動に追随して手話映像も移動するため、ユーザは、人物の近くに絶えず表示される手話映像を見ることができることになる。 In the news program, it is assumed that the position of the news caster has moved to the left as shown in FIG. Then, a new display position is set accordingly (steps S109 to S114). Then, according to the newly set display position, as shown in FIG. 11, the position of the sign language video to be synthesized is also changed (steps S104 to S108). Thus, even if a person moves in the program image, the sign language image also moves following the movement, so that the user can see a sign language image constantly displayed near the person.

次に、番組映像に二人の人物が含まれる場合について説明する。図４の番組映像で表示されていたニュースキャスターの横に、図１２で示されるように、別のニュース解説者が登場したとする。すると、次の表示位置の設定を行うタイミングで、人物領域特定部１３は、図１３で示されるように、２個の人物の領域を特定する。なお、その際の人物領域特定情報において、左の人物に対応する人物ＩＤが「Ｕ００１」であり、右の人物に対応する人物ＩＤが「Ｕ００２」であったとする。また、その特定によって生成された人物領域特定情報は、図１４で示されるものであったとする。 Next, a case where two persons are included in the program video will be described. Assume that another news commentator appears next to the news caster displayed in the program video of FIG. 4 as shown in FIG. Then, at the timing of setting the next display position, the person area specifying unit 13 specifies the areas of two persons as shown in FIG. In the person area specifying information at this time, it is assumed that the person ID corresponding to the left person is “U001” and the person ID corresponding to the right person is “U002”. Further, it is assumed that the person area specifying information generated by the specifying is as shown in FIG.

この場合には、複数の人物の領域が特定されているため（ステップＳ１１１）、話者特定部１４は、話者を特定する処理を行う（ステップＳ１１２）。具体的には、話者特定部１４は、人物ＩＤ「Ｕ００１」で識別される人物の領域における口の領域である第１の口領域を特定し、その口領域の動き検出を行う。また、話者特定部１４は、人物ＩＤ「Ｕ００２」で識別される人物の領域における口の領域である第２の口領域を特定し、その口領域の動き検出も行う。そして、話者特定部１４は、両者の動きを比較して、動きの大きい方を話者に特定する（ステップＳ１１２）。この場合には、人物ＩＤ「Ｕ００１」に対応する動きの方が大きかったとする。すると、話者特定部１４は、その人物ＩＤ「Ｕ００１」を表示位置設定部１５に渡す。表示位置設定部１５は、その人物ＩＤ「Ｕ００１」に対応する領域情報を用いて、表示位置の設定を行う（ステップＳ１１３）。そして、その表示位置を示す情報が映像合成部２０に渡され、図示しない記録媒体に蓄積される（ステップＳ１１４）。その後、番組情報が受け付けられると、新たな表示位置に応じた手話映像と番組映像との合成が行われ（ステップＳ１０４〜Ｓ１０７）、ディスプレイに図１５で示される合成映像が表示される（ステップＳ１０８）。なお、その後に、人物ＩＤ「Ｕ００２」で識別される人物の領域における口の動きの方が大きくなると、それに応じて表示位置が変更される（ステップＳ１０９〜Ｓ１１４）。そして、その変更後の表示位置に応じて、図１６で示されるように、手話映像の合成位置が変更されることになる。 In this case, since a plurality of person areas have been specified (step S111), the speaker specifying unit 14 performs the process of specifying the speaker (step S112). Specifically, the speaker specifying unit 14 identifies a first mouth area, which is the mouth area within the person area identified by the person ID "U001", and performs motion detection on it. The speaker specifying unit 14 also identifies a second mouth area, which is the mouth area within the person area identified by the person ID "U002", and performs motion detection on it as well. The speaker specifying unit 14 then compares the two motions and specifies the person with the larger motion as the speaker (step S112). Assume here that the motion corresponding to the person ID "U001" is larger. The speaker specifying unit 14 then passes the person ID "U001" to the display position setting unit 15. The display position setting unit 15 sets the display position using the area information corresponding to the person ID "U001" (step S113). The information indicating that display position is passed to the video composition unit 20 and stored in a recording medium (not shown) (step S114). Thereafter, when the program information is received, the sign language video and the program video are composited according to the new display position (steps S104 to S107), and the composite video shown in FIG. 15 is displayed on the display (step S108). If the mouth motion in the person area identified by the person ID "U002" later becomes the larger of the two, the display position is changed accordingly (steps S109 to S114), and the position at which the sign language video is composited changes with it, as shown in FIG. 16.
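The comparison of mouth motion in step S112 can be sketched as follows. The motion measure used here, a sum of absolute differences between consecutive frames, is an assumption chosen for illustration; this passage only says that motion detection is performed. Each frame is taken to be a flat list of grayscale values for the mouth area, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

Frame = List[int]  # flattened grayscale pixels of one mouth area image


def motion_amount(frames: List[Frame]) -> int:
    """Sum of absolute pixel differences between consecutive frames."""
    total = 0
    for prev, cur in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur))
    return total


def identify_speaker(mouth_frames: Dict[str, List[Frame]]) -> Optional[str]:
    """Return the person ID whose mouth area moves the most, or None if
    no person areas are given."""
    if not mouth_frames:
        return None
    return max(mouth_frames, key=lambda pid: motion_amount(mouth_frames[pid]))
```

Re-running `identify_speaker` at each display position update reproduces the switch from "U001" to "U002" described above once the second person's mouth motion becomes larger.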

なお、ニュースの間などにおいて、単に音楽が流れるだけであり、ニュースキャスターや解説者による発話が中断したとする。すると、それに応じて手話映像の動きもなくなるため、その際には、表示判断部１６が手話映像を表示しないと判断し、手話映像の合成が行われないことになる（ステップＳ１０６，Ｓ１０８）。その後に、ニュースキャスター等による発話が開始されると、それに応じて手話映像の動きも生じることになり、表示判断部１６は手話映像を表示すると判断して、手話映像の表示が再開されることになる（ステップＳ１０６〜Ｓ１０８）。 It is assumed that music is simply played during the news, etc., and the utterances by newscasters and commentators are interrupted. As a result, there is no movement of the sign language video accordingly. At that time, the display determination unit 16 determines that the sign language video is not displayed, and the sign language video is not synthesized (steps S106 and S108). After that, when the utterance by the news caster or the like is started, the sign language video also moves accordingly, and the display determination unit 16 determines that the sign language video is displayed, and the display of the sign language video is resumed. (Steps S106 to S108).
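The display decision of step S106 can be sketched as a threshold test on the motion of the sign language video. As before, the frame-difference measure and the flat-list frame representation are assumptions for this sketch, and `should_display` is a hypothetical name. The only rules taken from the text are that too little accumulated video means "do not display" and that motion at or above the threshold means "display".

```python
from typing import List

Frame = List[int]  # flattened grayscale pixels of one sign language frame


def should_display(sign_frames: List[Frame], threshold: int) -> bool:
    """Decide whether the sign language video is composited (step S106)."""
    if len(sign_frames) < 2:
        # Not enough video accumulated to measure motion yet.
        return False
    motion = 0
    for prev, cur in zip(sign_frames, sign_frames[1:]):
        motion += sum(abs(a - b) for a, b in zip(prev, cur))
    return motion >= threshold
```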

以上のように、本実施の形態による手話映像合成装置１によれば、人物の領域の近傍に手話映像が表示されることになる。合成映像を見るユーザは、手話映像と人物の領域との両方を見たいと考えられるが、その場合でも、両者が近くに表示されるため、視線移動の距離が短くなり、眼精疲労等の疲労の起こる可能性を低減することができる。また、表示判断部１６による判断結果に応じて手話映像を表示したり、表示しなかったりするため、意味のない手話映像の表示をしないようにすることができる。また、番組映像に２以上の人物が含まれる場合には、話者を特定して、その話者の人物の領域の近傍に手話映像が表示されることになる。合成映像を見るユーザは、表示されている人物のうち、話者に注目すると考えられるため、この場合にも、その話者と手話映像とが近くに表示されることによって、視線移動を少なくすることができ、疲労を軽減できる。また、番組映像のジャンルに応じた大きさで手話映像を表示することができるため、例えば、ニュースなどのように発話内容が重要である場合には、手話映像を大きく表示し、スポーツなどのように発話内容があまり重要でない場合には、手話映像を小さく表示するようにもできる。 As described above, according to the sign language video synthesizing apparatus 1 according to the present embodiment, the sign language video is displayed in the vicinity of the person's area. A user who sees a composite video may want to see both the sign language video and the person's area, but even in that case, both of them are displayed close to each other. The possibility of fatigue can be reduced. Further, since the sign language image is displayed or not displayed according to the determination result by the display determination unit 16, it is possible to prevent the sign language image from being displayed without meaning. Further, when the program video includes two or more persons, the speaker is specified and the sign language video is displayed in the vicinity of the area of the speaker's person. Since the user who sees the composite video is considered to pay attention to the speaker among the displayed people, in this case as well, the speaker and the sign language video are displayed close to each other, thereby reducing eye movement. Can reduce fatigue. In addition, because the sign language video can be displayed in a size corresponding to the genre of the program video, for example, when the utterance content is important such as news, the sign language video is displayed in a large size, such as sports. When the utterance content is not so important, the sign language video can be displayed in a small size.

なお、本実施の形態では、複数の人物の領域が特定された場合に、話者特定部１４によって特定された話者の人物の領域に対応付けて手話映像が表示される場合について説明したが、そうでなくてもよい。すなわち、話者特定部１４による話者の特定を行わなくてもよい。話者の特定を行わない場合には、手話映像合成装置１は、話者特定部１４を備えていなくてもよい。また、その場合において、複数の人物の領域が特定された際には、例えば、選択された一の人物の領域に隣接する位置に手話映像の位置が設定されてもよく、複数の人物の領域の中心（この中心は、例えば、複数の人物の領域の重心であってもよい）の位置に手話映像の位置が設定されてもよい。なお、選択された一の人物の領域は、例えば、最も大きい人物の領域であってもよく、複数の人物の領域のうち、ちょうど真ん中に位置する人物の領域であってもよい（例えば、５個の人物の領域が横方向に並んでいる場合には、例えば、左から３番目の人物の領域であってもよい）。 In the present embodiment, the case has been described in which, when a plurality of person areas are specified, the sign language video is displayed in association with the speaker's area specified by the speaker specifying unit 14, but this need not be the case. That is, the speaker specifying unit 14 need not specify the speaker, and when the speaker is not specified, the sign language video synthesizing apparatus 1 need not include the speaker specifying unit 14. In that case, when a plurality of person areas are specified, the sign language video may be positioned, for example, adjacent to one selected person area, or at the center of the plurality of person areas (this center may be, for example, the centroid of the plurality of person areas). The selected person area may be, for example, the largest person area, or the person area located exactly in the middle of the plurality of person areas (for example, when five person areas are lined up horizontally, the third person area from the left).
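The two selection rules mentioned as alternatives to speaker identification, the largest person area and the middle person area, can be sketched as follows. Measuring size by bounding box area and ordering areas by the horizontal center of their bounding boxes are assumptions of this sketch, as are the function names; both functions expect at least one area.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinate on the area outline


def _bbox(contour: List[Point]) -> Tuple[int, int, int, int]:
    """Bounding box (left, top, right, bottom) of a person area."""
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    return min(xs), min(ys), max(xs), max(ys)


def largest_area(areas: Dict[str, List[Point]]) -> str:
    """ID of the person area with the largest bounding box."""
    def box_size(pid: str) -> int:
        left, top, right, bottom = _bbox(areas[pid])
        return (right - left) * (bottom - top)
    return max(areas, key=box_size)


def middle_area(areas: Dict[str, List[Point]]) -> str:
    """ID of the area in the middle when ordered left to right.

    For five areas this is the third from the left, as in the text; the
    choice for an even number of areas is an assumption.
    """
    def center_x(pid: str) -> float:
        left, _, right, _ = _bbox(areas[pid])
        return (left + right) / 2
    ordered = sorted(areas, key=center_x)
    return ordered[len(ordered) // 2]
```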

また、本実施の形態では、話者特定部１４が口の領域の動き検出を行うことによって話者を特定する場合について説明したが、話者特定部１４は、それ以外の方法で話者を特定してもよい。例えば、話者を示す情報（例えば、左から２番目の人物が話者である等の情報）が番組映像に重畳されている場合には、話者特定部１４は、その情報を用いて、話者を特定してもよい。具体的には、話者を示す情報によって、左から２番目の人物が話者である旨が示される場合には、話者特定部１４は、人物領域特定部１３が特定した人物の領域のうち、左から２番目の人物の領域を話者の人物の領域に特定してもよい。 Further, in the present embodiment, the case where the speaker specifying unit 14 specifies the speaker by detecting the movement of the mouth region has been described. However, the speaker specifying unit 14 selects the speaker by other methods. You may specify. For example, when information indicating a speaker (for example, information indicating that the second person from the left is a speaker) is superimposed on the program video, the speaker specifying unit 14 uses the information, A speaker may be specified. Specifically, when the information indicating the speaker indicates that the second person from the left is the speaker, the speaker specifying unit 14 specifies the region of the person specified by the person region specifying unit 13. Of these, the area of the second person from the left may be specified as the area of the speaker's person.

（実施の形態２） (Embodiment 2)
本発明の実施の形態２による手話映像合成装置について、図面を参照しながら説明する。本実施の形態による手話映像合成装置は、複数の手話映像を受け付けるものである。 A sign language video synthesizing apparatus according to Embodiment 2 of the present invention will be described with reference to the drawings. The sign language video synthesizing apparatus according to the present embodiment accepts a plurality of sign language videos.

図１７は、本実施の形態による手話映像合成装置３の構成を示すブロック図である。本実施の形態による手話映像合成装置３は、番組映像受付部１１と、手話映像受付部１２と、人物領域特定部１３と、表示位置設定部１５と、表示判断部１６と、番組関連情報受付部１７と、対応情報記憶部１８と、取得部１９と、映像合成部２０と、映像出力部２１と、対応特定部３１とを備える。なお、対応特定部３１以外の構成及び動作は、手話映像受付部１２が複数の手話映像を受け付け、表示位置設定部１５が後述する対応特定部３１による対応付けの結果を用いて、複数の手話映像の表示位置をそれぞれ設定し、映像合成部２０が番組映像の設定されたそれぞれの位置に複数の手話映像を合成する以外は、実施の形態１と同様であり、その詳細な説明を省略する。 FIG. 17 is a block diagram showing the configuration of the sign language video synthesizing apparatus 3 according to the present embodiment. The sign language video synthesizing apparatus 3 according to the present embodiment includes a program video reception unit 11, a sign language video reception unit 12, a person area specifying unit 13, a display position setting unit 15, a display determination unit 16, a program related information receiving unit 17, a correspondence information storage unit 18, an acquisition unit 19, a video composition unit 20, a video output unit 21, and a correspondence specifying unit 31. The configuration and operation of the units other than the correspondence specifying unit 31 are the same as in Embodiment 1, except that the sign language video reception unit 12 receives a plurality of sign language videos, the display position setting unit 15 sets the display position of each of the sign language videos using the result of the association made by the correspondence specifying unit 31 (described later), and the video composition unit 20 composites the sign language videos into the program video at their respective positions; a detailed description is therefore omitted.

手話映像受付部１２は、前述のように複数の手話映像を受け付けるものである。手話映像受付部１２は、例えば、複数のインターフェースによって複数の手話映像を受け付けてもよく、複数のチャンネルで放送された複数の手話映像を受信してもよく、その複数の手話映像を受け付ける方法は問わない。本実施の形態では、手話映像受付部１２が２個の手話映像を受け付ける場合について説明する。なお、手話映像受付部１２が受け付けた複数の手話映像には、それらを識別することができる識別情報が対応付いていることが好適である。また、手話映像受付部１２が受け付ける手話映像の個数は、番組映像によって異なってもよく、また、一の番組映像内で変化してもよい。 As described above, the sign language image receiving unit 12 receives a plurality of sign language images. For example, the sign language video reception unit 12 may receive a plurality of sign language videos through a plurality of interfaces, may receive a plurality of sign language videos broadcast on a plurality of channels, and a method for receiving the plurality of sign language videos is as follows. It doesn't matter. In the present embodiment, a case will be described in which the sign language image receiving unit 12 receives two sign language images. In addition, it is preferable that the plurality of sign language images received by the sign language image receiving unit 12 are associated with identification information that can identify them. Further, the number of sign language images received by the sign language image receiving unit 12 may vary depending on the program image, or may vary within one program image.

In the present embodiment, when the sign language video receiving unit 12 receives two or more sign language videos, the person area specifying unit 13 preferably specifies the areas of a plurality of persons. This is because the program video is expected to contain a plurality of person areas corresponding to the plurality of sign language videos.

The correspondence specifying unit 31 associates each sign language video received by the sign language video receiving unit 12 with one of the person areas specified by the person area specifying unit 13. The correspondence specifying unit 31 may associate sign language videos and person areas whose degrees of motion (degrees of change) are close to each other. The motion of a person area may be the motion of the mouth within that area. The degree of motion can be detected by the motion detection method described earlier. The degree of mouth motion can likewise be detected by specifying the mouth region with the method described for the speaker specifying unit 14 of Embodiment 1 and performing motion detection on that region.

Associating items with similar degrees of motion means pairing the items with large motion with each other and the items with small motion with each other. When the motion in a sign language video is large, the corresponding person is considered to be speaking, and the motion around the mouth and the gestures grow accordingly. When the motion in a sign language video is small, the corresponding person is considered to be silent or speaking only a little, and the motion around the mouth and the gestures shrink accordingly. The association can be made, for example, as follows. The correspondence specifying unit 31 detects the motion of each sign language video and sorts the sign language videos in descending order of motion. It also detects the motion of each person area, or of the mouth region within each area, and sorts the person areas in descending order of motion. By pairing the sign language video and the person area that occupy the same rank after sorting, the correspondence specifying unit 31 associates items with similar degrees of motion. Strictly speaking, what is sorted may be the information identifying the sign language videos and the information identifying the person areas. Whether the degrees of motion are close may also be judged from the correlation of the motion over time. For example, when the motion of a person area and that of a sign language video are highly correlated over time, that is, when their patterns of fast and slow motion are highly similar, the two are considered to correspond.

Associating a sign language video with a person area may mean, for example, generating information that links the identification information of the sign language video to the identification information of the person area and storing it in a recording medium (not shown). That information may, for example, hold the identification information of a sign language video and the identification information of a person area in each record.
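As an illustration only, the sort-based association described above might be sketched in Python as follows. The function name `pair_by_motion_rank` and the scalar motion scores are assumptions made for this sketch; how the degree of motion is measured is taken as given, and the numeric values are hypothetical.

```python
from typing import Dict, List, Tuple


def pair_by_motion_rank(
    sign_motion: Dict[str, float],
    person_motion: Dict[str, float],
) -> List[Tuple[str, str]]:
    """Pair sign language videos with person areas by rank of motion.

    Both arguments map an ID to a scalar degree of motion (hypothetical).
    Returns (person_id, sign_video_id) records, largest motion first.
    """
    if len(sign_motion) != len(person_motion):
        raise ValueError("counts differ; predetermined positions would be used")
    # Sort the identifiers (not the videos themselves) by descending motion.
    signs = sorted(sign_motion, key=sign_motion.get, reverse=True)
    persons = sorted(person_motion, key=person_motion.get, reverse=True)
    # Items at the same rank are treated as having a similar degree of motion.
    return list(zip(persons, signs))


records = pair_by_motion_rank(
    {"F001": 0.82, "F002": 0.10},  # hypothetical motion of each sign video
    {"U001": 0.64, "U002": 0.05},  # hypothetical motion of each person area
)
print(records)
```

Each returned tuple plays the role of one record holding a person area ID and a sign language video ID.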

The display position setting unit 15 sets the display position of each sign language video to a position adjacent to the person area that the correspondence specifying unit 31 has associated with that sign language video. The method of setting the display position of one sign language video for one person area is the same as the method described in Embodiment 1. Because the positions of a plurality of sign language videos are set here, the display position setting unit 15 preferably sets the display positions so that the sign language videos do not overlap one another. For example, when display positions are set as in FIG. 8 and the sign language video being positioned would overlap a sign language video whose display position has already been set, the display position setting unit 15 may determine that the sign language video cannot be displayed at that position.

When the person area specifying unit 13 cannot specify any person area, the display position setting unit 15 may set the display position of the sign language video to a predetermined position. It may also use predetermined positions when the display positions cannot be set without the sign language videos overlapping, and when the number of person areas specified by the person area specifying unit 13 does not match the number of sign language videos received by the sign language video receiving unit 12. In these cases, the predetermined display positions of the plurality of sign language videos are assumed to have been set in advance so that they do not overlap. When the number of specified person areas does not match the number of received sign language videos and exactly one person area has been specified, the display position setting unit 15 may set the display positions of two or more sign language videos adjacent to that one person area. For example, if display positions are set as in FIG. 8, then after the display position of the first sign language video has been fixed at the Nth display position, the display position of the second sign language video may be set by determining in turn whether it can be placed at the (N+1)th or a subsequent display position. The same applies to the third and subsequent sign language videos.
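The overlap-avoiding placement with a fallback to predetermined positions could be sketched roughly as below. This is a minimal sketch under stated assumptions: rectangles are `(x, y, width, height)` tuples, the candidate order (right, left, above, below) is invented for illustration and is not the order of FIG. 8, and the names `candidates` and `place_all` are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def overlaps(a: Rect, b: Rect) -> bool:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def inside(r: Rect, frame: Tuple[int, int]) -> bool:
    x, y, w, h = r
    return x >= 0 and y >= 0 and x + w <= frame[0] and y + h <= frame[1]


def candidates(person: Rect, size: Tuple[int, int]) -> List[Rect]:
    """Positions adjacent to the person area, in an assumed priority order."""
    px, py, pw, ph = person
    w, h = size
    return [(px + pw, py, w, h), (px - w, py, w, h),
            (px, py - h, w, h), (px, py + ph, w, h)]


def place_all(persons: List[Rect], size: Tuple[int, int],
              frame: Tuple[int, int], defaults: List[Rect]) -> List[Rect]:
    """Place one sign video per entry of `persons`, avoiding mutual overlap.

    `persons[i]` is the person area associated with the i-th sign video.
    If any video cannot be placed, the predetermined positions are used.
    """
    placed: List[Rect] = []
    for person in persons:
        spot = next((c for c in candidates(person, size)
                     if inside(c, frame)
                     and not any(overlaps(c, p) for p in placed)), None)
        if spot is None:
            return defaults[:len(persons)]  # fall back to preset positions
        placed.append(spot)
    return placed
```

Passing the same person area twice models the single-person case: the second sign language video skips the candidate already taken and tries the following ones.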

The video synthesis unit 20 composites each of the sign language videos at the display position in the program video set for it by the display position setting unit 15. Apart from repeating the compositing of a sign language video into the program video once per sign language video, the processing is the same as described in Embodiment 1.

The present embodiment has described the processing performed when two or more sign language videos are received. When only one sign language video is received, each component may operate in the same way as in Embodiment 1.

Next, the operation of the sign language video synthesizing device 3 according to the present embodiment is described with reference to the flowchart of FIG. 18. In the flowchart of FIG. 18, the processing other than steps S201 to S205 is the same as in the flowchart of FIG. 2 of Embodiment 1, and its description is omitted. When the sign language video receiving unit 12 has received a plurality of sign language videos, the display determination unit 16 determines in step S106, individually for each sign language video, whether to display it. If it determines that none of the sign language videos is to be displayed, the process proceeds to step S108; if it determines that at least one sign language video is to be displayed, the process proceeds to step S107. In step S107, only the sign language videos that the display determination unit 16 has determined to display are composited into the program video.
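The per-video branching at step S106 can be illustrated with a small Python sketch. The function name `next_step` and the dictionary of display flags are assumptions; the criterion used by the display determination unit 16 itself is not modelled here.

```python
from typing import Dict, List, Tuple


def next_step(display_flags: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return the next step and the sign videos to composite.

    `display_flags` maps a sign video ID to the individual decision
    made for it in step S106 (hypothetical representation).
    """
    shown = [vid for vid, show in display_flags.items() if show]
    # At least one video to show: composite only those in step S107.
    # None to show: go straight to output in step S108.
    return ("S107", shown) if shown else ("S108", [])
```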

(Step S201) The correspondence specifying unit 31 determines whether the number of specified person areas equals the number of sign language videos. If it does, the process proceeds to step S202; otherwise, the process proceeds to step S204. The process also proceeds to step S204 when no person area could be specified at all.

(Step S202) The correspondence specifying unit 31 associates the sign language videos with the person areas. Details of this processing are described later with reference to the flowchart of FIG. 19.

(Step S203) The display position setting unit 15 sets the display position of each sign language video using the result of the association made by the correspondence specifying unit 31.

(Step S204) The display position setting unit 15 sets the display position of each sign language video to a predetermined position.

(Step S205) The video synthesis unit 20 temporarily stores, in a recording medium (not shown), information indicating the display positions set by the display position setting unit 15. This information indicates the display position of each sign language video. The process then returns to step S101.

In step S204 of the flowchart of FIG. 18, when exactly one person area has been specified, the display positions of the plurality of sign language videos may be set adjacent to that one person area, as described above. The processing in the flowchart of FIG. 18 ends when the power is turned off or an interrupt for ending the processing occurs.

FIG. 19 is a flowchart showing details of the processing for specifying the correspondence (step S202) in the flowchart of FIG. 18.
(Step S301) The correspondence specifying unit 31 performs motion detection on each of the person areas specified by the person area specifying unit 13.

(Step S302) The correspondence specifying unit 31 performs motion detection on each of the sign language videos received by the sign language video receiving unit 12.

(Step S303) The correspondence specifying unit 31 associates the sign language videos and the person areas whose degrees of motion are close to each other. The process then returns to the flowchart of FIG. 18.
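Where closeness of motion is judged from the correlation of motion over time rather than by rank, step S303 might look like the following Python sketch. The Pearson coefficient and the exhaustive search over pairings are choices made for this illustration, not something the description prescribes, and the per-frame motion series are assumed to be available already from steps S301 and S302.

```python
from itertools import permutations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long motion series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def pair_by_correlation(
    person_series: Dict[str, Sequence[float]],
    sign_series: Dict[str, Sequence[float]],
) -> List[Tuple[str, str]]:
    """Choose the pairing with the highest total correlation.

    Brute force over permutations; adequate for the handful of
    sign language videos considered here.
    """
    persons = list(person_series)
    best = max(
        permutations(sign_series, len(persons)),
        key=lambda perm: sum(
            pearson(person_series[p], sign_series[s])
            for p, s in zip(persons, perm)
        ),
    )
    return list(zip(persons, best))
```

With two videos there are only two candidate pairings, so the search cost is negligible.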

Next, the operation of the sign language video synthesizing device 3 according to the present embodiment is described using a specific example. Operations other than associating the plurality of sign language videos with the plurality of person areas and setting the display position of each sign language video according to the result are the same as in the specific example of Embodiment 1, and a detailed description of them is omitted.

As in the specific example of Embodiment 1, suppose that the channel has been set so that the user can watch a news program and that the program video is displayed on the display (steps S104 to S108). The program video continues to be displayed in this way until a certain amount of sign language video has been accumulated and the display determination unit 16 determines that the sign language video is to be displayed. Suppose also that program-related information has been received and that size information has been acquired accordingly and temporarily stored (steps S101 to S103). Two sign language videos are assumed to be received for the news program.

After reception of the program video has started and enough time has passed for motion detection to be possible, the display position setting unit 15 determines that the display positions are to be set and instructs the person area specifying unit 13 to specify the person areas (step S109). The person area specifying unit 13 then specifies two person areas in the program video received by the program video receiving unit 11, which is similar to that of FIG. 12 (step S110). Person area specifying information indicating the result, similar to that of FIG. 14, is stored in a recording medium (not shown).

Next, the correspondence specifying unit 31 determines that the number of sign language videos equals the number of specified person areas (step S201) and executes the processing for specifying the correspondence (step S202). Specifically, the correspondence specifying unit 31 performs motion detection on the area indicated by the area information corresponding to person ID "U001" and on the area indicated by the area information corresponding to person ID "U002" (step S301). It also performs motion detection on each of the two sign language videos received by the sign language video receiving unit 12 (step S302). The sign language video IDs "F001" and "F002" are assumed to be associated with the respective sign language videos as information identifying them. The correspondence specifying unit 31 then sorts the person IDs and the sign language video IDs in descending order of the corresponding motion, generates the correspondence result information shown in FIG. 20, in which the sorted sign language video IDs and person IDs are associated in order from the first, and passes that correspondence result information to the display position setting unit 15 (step S303). The display position setting unit 15 stores the correspondence result information in a recording medium (not shown). In FIG. 20, for example, the first record of the correspondence result information associates person ID "U001" with sign language video ID "F001". The person area corresponding to person ID "U001" has therefore been associated with the sign language video corresponding to sign language video ID "F001".

The display position setting unit 15 then refers to the stored correspondence result information and sets a display position for the person area corresponding to person ID "U001" in the same way as in the specific example of Embodiment 1. That display position becomes the display position of the sign language video identified by sign language video ID "F001". The display position setting unit 15 likewise sets a display position for the person area corresponding to person ID "U002" in the same way as in the specific example of Embodiment 1, taking care that it does not overlap the display position of the sign language video identified by sign language video ID "F001" (step S203). The display position setting unit 15 then passes information associating each display position with a sign language video ID to the video synthesis unit 20. The video synthesis unit 20 stores the received information in a recording medium (not shown) (step S205).

Suppose that the display determination unit 16 subsequently determines that each sign language video is to be displayed (step S106). The video synthesis unit 20 then composites the two sign language videos at the respective display positions in the program video received from the display position setting unit 15 and passes the composite video to the video output unit 21 (step S107). The video output unit 21 displays the composite video on the display (step S108). FIG. 21 shows the composite video displayed in this way. The sign language video associated with each person area has been composited next to it, so a user watching the composite video of FIG. 21 can easily tell which sign language video corresponds to which person.

As described above, when two or more sign language videos are received, the sign language video synthesizing device 3 according to the present embodiment can display each sign language video adjacent to the area of the person to whom it corresponds. A viewer of the composite video can therefore easily grasp the correspondence between persons and sign language videos. As in Embodiment 1, because each sign language video is displayed near its person, the viewer's line of sight moves less between the sign language video and the person, which can reduce fatigue such as eye strain.

The present embodiment has described the case in which the correspondence between the sign language videos and the person areas is specified from their motion, but the two may of course be associated by other methods. For example, suppose that person identification information identifying the person to whom a sign language video corresponds is superimposed on that sign language video, so that the correspondence between the sign language video and the person identification information can be known. Suppose also that person feature correspondence information, which associates each piece of person identification information with feature information indicating the features of an image of the person it identifies, is stored in a recording medium (not shown). The correspondence specifying unit 31 extracts features from a person area specified by the person area specifying unit 13 and finds the feature information that matches them. It can thereby learn that the person area belongs to the person identified by the person identification information corresponding to that feature information. The person area can then be associated with a sign language video by way of the person identification information. Here, "match" covers not only the case where the two are completely identical but may also include the case where their similarity is equal to or greater than a threshold.
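A minimal Python sketch of this feature-based association follows. Cosine similarity over plain feature vectors is an assumption made for illustration (the description only requires a similarity at or above a threshold), and the names `identify_person`, `associate_by_identity`, and the 0.9 default threshold are hypothetical.

```python
from math import sqrt
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def identify_person(area_feature: Sequence[float],
                    person_features: Dict[str, Sequence[float]],
                    threshold: float = 0.9) -> Optional[str]:
    """Return the stored person whose features best match, if any.

    A match is accepted only when the similarity reaches the threshold.
    """
    best_id, best_sim = None, threshold
    for person_id, feature in person_features.items():
        sim = cosine(area_feature, feature)
        if sim >= best_sim:
            best_id, best_sim = person_id, sim
    return best_id


def associate_by_identity(area_features: Dict[str, Sequence[float]],
                          person_features: Dict[str, Sequence[float]],
                          sign_to_person: Dict[str, str],
                          threshold: float = 0.9) -> Dict[str, str]:
    """Map each person area ID to a sign video ID via the person identity."""
    person_to_sign = {p: s for s, p in sign_to_person.items()}
    result = {}
    for area_id, feature in area_features.items():
        person_id = identify_person(feature, person_features, threshold)
        if person_id in person_to_sign:
            result[area_id] = person_to_sign[person_id]
    return result
```

Areas that match no stored person are simply left out of the result, which a caller could treat as a reason to fall back to predetermined positions.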

In the specific examples of the above embodiments, the same figure is used for every sign language video. This is only for convenience of explanation; in practice the signer moves and can take various poses.

The above embodiments have described the case in which the display determination unit 16 determines whether to display a sign language video and the video synthesis unit 20 composites the sign language video according to that determination, but this need not be so. The determination by the display determination unit 16 may be omitted, in which case the sign language video synthesizing devices 1 and 3 need not include the display determination unit 16. When the determination is not made, the video synthesis unit 20 always composites the sign language video into the program video.

The above embodiments have mainly described the case in which the program-related information indicates the genre of the program, but this need not be so. As mentioned earlier, the program-related information may be the name of an actor appearing in the program or the title of the program. In that case too, size information corresponding to the actor's name or the program title may be acquired, and the sign language video may be composited using the acquired size information.

The above embodiments have described the case in which a sign language video of the size indicated by the size information acquired by the acquisition unit 19 is composited into the program video, but this need not be so. A sign language video of a predetermined size may be composited into the program video instead. In that case, the sign language video synthesizing devices 1 and 3 need not include the program-related information receiving unit 17, the correspondence information storage unit 18, or the acquisition unit 19.

In the above embodiments, when a sign language video is composited into the program video, a correspondence indication may be added, that is, an indication linking the sign language video to the person area to which it corresponds. For example, the correspondence indication may be outlines of the same color drawn around the sign language video and around the corresponding person area. It may be added by the video synthesis unit 20 or by another component. When two or more sign language videos are composited, the color preferably differs from one sign language video to another, which makes the correspondence between persons and sign language videos easier to see. As another example, the correspondence indication may be a line connecting the sign language video to the corresponding person area. Other kinds of correspondence indication may of course be added as well.

In the above embodiments, when a person area specified by the person area specifying unit 13 is smaller than a predetermined size, the display position setting unit 15 need not set a sign language video position for that person area. This is because, when the image of a person is very small, it is often unclear whether the sign language video corresponds to that person. Likewise, when the number of person areas specified by the person area specifying unit 13 exceeds a predetermined number, the display position setting unit 15 need not set sign language video positions for those person areas. This is because, when there are very many persons in the picture, it is often unclear which person a sign language video corresponds to.
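The two exclusion rules above can be expressed as a small filter. In this Python sketch the size of a person area is measured as its pixel area, which is an assumption, and `min_area` and `max_count` stand in for the unspecified predetermined size and number.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def areas_to_place(areas: List[Rect], min_area: int,
                   max_count: int) -> List[Rect]:
    """Return the person areas for which a display position is set."""
    # Too many persons: the correspondence would be unclear, so set none.
    if len(areas) > max_count:
        return []
    # Drop areas too small for a viewer to relate a sign video to them.
    return [a for a in areas if a[2] * a[3] >= min_area]
```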

The sign language video synthesizing device may also be one that receives two or more sign language videos and generates a composite video from them. In that case, the device may include, for example, a program video receiving unit, a sign language video receiving unit, a video synthesis unit, and a video output unit. The program video receiving unit receives a program video and is the same as the program video receiving unit 11 described above. The sign language video receiving unit receives a plurality of sign language videos, which are videos of sign language corresponding to the program video, and is the same as the sign language video receiving unit 12 of Embodiment 2 described above. The video synthesis unit generates a composite video in which the plurality of sign language videos are composited into the program video. The positions in the program video at which the sign language videos are composited may or may not be predetermined. If they are not, they may be positions determined by the display position setting unit 15 or the like, as described above. The video output unit outputs the composite video and is the same as the video output unit 21 described above. Such a sign language video synthesizing device can generate a composite video by compositing two or more sign language videos into one program video.

The sign language video synthesizing device may also display or not display the sign language video according to the result of a determination by a display determination unit. In that case, the device may include, for example, a program video receiving unit, a sign language video receiving unit, a display determination unit, a video synthesizing unit, and a video output unit. The program video receiving unit receives a program video and is the same as the program video receiving unit 11 described above. The sign language video receiving unit receives a plurality of sign language videos, each being a video of sign language corresponding to the program video, and is the same as the sign language video receiving unit 12 of Embodiment 2. The display determination unit determines whether to display the sign language video and is the same as the display determination unit 16 described above. When the display determination unit determines that the sign language video is to be displayed, the video synthesizing unit generates a composite video in which the sign language video is combined with the program video. When the display determination unit determines that the sign language video is not to be displayed, the video synthesizing unit generates a composite video that consists of the program video alone. The video synthesizing unit is the same as the video synthesizing unit 20 described above. The video output unit outputs the composite video and is the same as the video output unit 21 described above. With such a device, a sign language video that does not need to be displayed is not combined, so that part of the program video is not occupied by an unnecessary sign language video.
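The determination-dependent behavior above can be sketched as follows. The claims state that the sign language video is not displayed when it does not move; how movement is detected here (a summed absolute difference between consecutive frames compared with a threshold) is an assumption of this sketch, as are the names `has_motion` and `compose_frame` and the default threshold value.

```python
from typing import List

# Hypothetical frame type: rows of grayscale pixel values.
Frame = List[List[int]]


def has_motion(prev: Frame, curr: Frame, threshold: int = 10) -> bool:
    """Assumed criterion: summed absolute pixel difference exceeds `threshold`."""
    diff = sum(abs(a - b)
               for prev_row, curr_row in zip(prev, curr)
               for a, b in zip(prev_row, curr_row))
    return diff > threshold


def compose_frame(program: Frame, sign_prev: Frame, sign_curr: Frame,
                  top: int, left: int, threshold: int = 10) -> Frame:
    """Combine the sign frame only when the sign language video is moving."""
    out = [row[:] for row in program]
    if not has_motion(sign_prev, sign_curr, threshold):
        # Not displayed: the composite frame is the program frame alone.
        return out
    for r, sign_row in enumerate(sign_curr):
        for c, value in enumerate(sign_row):
            y, x = top + r, left + c
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = value
    return out
```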

In each of the above embodiments, the sign language video synthesizing devices 1 and 3 both set the display position of the sign language video and combine the program video with the sign language video. However, setting the display position and combining the videos may be performed separately. In that case, for example, as shown in FIG. 22, the sign language display position setting device 5 may set the display position of the sign language video, and the sign language video synthesizing device 6 may combine the program video with the sign language video.

In FIG. 22, the sign language display position setting device 5 includes a program video receiving unit 11, a sign language video receiving unit 12, a person area specifying unit 13, a speaker specifying unit 14, a display position setting unit 15, and an output unit 51. The configuration and operation of the units other than the output unit 51 are the same as in Embodiment 1, and their description is omitted.

The output unit 51 outputs position information, which is information indicating the display position in the program video set by the display position setting unit 15. The output unit 51 may output the program video received by the program video receiving unit 11 and the sign language video received by the sign language video receiving unit 12 together with the position information. This output may be, for example, transmission to a predetermined device via a communication line, or storage on a recording medium. Here, the output unit 51 is assumed to transmit the position information, the program video, and the sign language video to the sign language video synthesizing device 6 via a wired or wireless communication line 500. Transmission via the communication line 500 may be, for example, broadcasting, or transmission via the Internet, an intranet, or a public telephone network. The output unit 51 may or may not include a device that performs the output (for example, a communication device). The output unit 51 may be realized by hardware, or by software such as a driver that drives such a device.

The sign language display position setting device 5 may be, for example, a device that is the source of the program video (for example, a device at a broadcasting station) or a device that relays the program video. In the former case, the program video receiving unit 11 and the sign language video receiving unit 12 may, for example, receive the program video and the like by reading them from a recording medium. In the latter case, the program video receiving unit 11 and the sign language video receiving unit 12 receive the program video and the like by reception over a communication line. As described above, the two videos are preferably synchronized. When the output unit 51 transmits the program video, the sign language video, and the position information, it preferably transmits them in a way that allows them to be synchronized. For example, the output unit 51 may transmit the synchronized program video, sign language video, and position information on separate channels. Alternatively, the output unit 51 may transmit the program video, the sign language video, and the position information together with information for synchronizing them (for example, a time code). The same applies when the output unit 51 performs output other than transmission. The sign language display position setting device 5 may receive the program video and the like in real time, generate position information from them, and output the program video and the like together with the position information in real time. Alternatively, it may store the generated position information on a recording medium (not shown) and output that position information collectively.
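One way the time code mentioned above could be used on the receiving side is sketched below. The record layout (`PositionInfo` with a time code in seconds and a top-left position) and the rule "use the latest record whose time code is not later than the frame's time code" are assumptions of this sketch; the specification only says that information such as a time code may be transmitted for synchronization.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PositionInfo:
    """Hypothetical position information record."""
    timecode: float  # seconds from the start of the program (assumed unit)
    top: int         # display position, row of the top-left corner
    left: int        # display position, column of the top-left corner


def position_at(records: List[PositionInfo],
                timecode: float) -> Optional[Tuple[int, int]]:
    """Return the display position in effect at `timecode`.

    `records` must be sorted by time code. The latest record whose time
    code is <= `timecode` applies; None is returned before the first one.
    """
    keys = [record.timecode for record in records]
    index = bisect_right(keys, timecode)
    if index == 0:
        return None
    record = records[index - 1]
    return (record.top, record.left)
```

A synthesizing device could call `position_at` once per program frame and fall back to a predetermined position when it returns `None`.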

In FIG. 22, the sign language video synthesizing device 6 includes a program video receiving unit 11, a sign language video receiving unit 12, a display determination unit 16, a program-related information receiving unit 17, a correspondence information storage unit 18, an acquisition unit 19, a video output unit 21, a position information receiving unit 61, and a video synthesizing unit 62. The configuration and operation of the units other than the position information receiving unit 61 and the video synthesizing unit 62 are the same as in Embodiment 1, and their description is omitted. Here, the program video receiving unit 11 and the sign language video receiving unit 12 are assumed to receive the program video and the like transmitted from the sign language display position setting device 5.

The position information receiving unit 61 receives position information indicating the position at which the sign language video is to be displayed. This may be, for example, reception of information transmitted via a wired or wireless communication line, or acceptance of information read from a predetermined recording medium (for example, an optical disk, a magnetic disk, or a semiconductor memory). Here, the case where the position information receiving unit 61 receives position information transmitted from the sign language display position setting device 5 is described. The position information receiving unit 61 may or may not include a device for receiving (for example, a modem or a network card). The position information receiving unit 61 may be realized by hardware, or by software such as a driver that drives a predetermined device. The position information received by the position information receiving unit 61 relates to the program video received by the program video receiving unit 11 of the sign language video synthesizing device 6. That is, it is position information for displaying the sign language video at a position adjacent to a person area in that program video.

The video synthesizing unit 62 is the same as the video synthesizing unit 20 of Embodiment 1, except that it combines the sign language video at the display position indicated by the position information received by the position information receiving unit 61 instead of the display position set by the display position setting unit 15. A detailed description is therefore omitted.

As shown in FIG. 22, the sign language display position setting device 5 may set the display position, and the sign language video synthesizing device 6 may combine the sign language video using the display position thus set.

The sign language video is not needed to set its display position, so the sign language display position setting device 5 need not receive the sign language video. In that case, the sign language display position setting device 5 need not include the sign language video receiving unit 12. The sign language display position setting device 5 also need not set the display position based on an identified speaker. In that case, it need not include the speaker specifying unit 14. Further, the output unit 51 of the sign language display position setting device 5 need not output the program video or the sign language video. In that case, the output unit 51 may output only the position information.

The sign language video synthesizing device 6 need not combine the sign language video according to the determination by the display determination unit 16. In that case, the sign language video synthesizing device 6 need not include the display determination unit 16. The sign language video synthesizing device 6 also need not combine the sign language video according to program-related information. In that case, it need not include the program-related information receiving unit 17, the correspondence information storage unit 18, or the acquisition unit 19.

Needless to say, the position information may be passed from the sign language display position setting device 5 to the sign language video synthesizing device 6 via a recording medium or the like. For example, the output unit 51 may store the position information on a recording medium, and the position information receiving unit 61 may read the position information from that recording medium. Likewise, the program video and the sign language video need not be transmitted from the sign language display position setting device 5 to the sign language video synthesizing device 6. In that case, the program video receiving unit 11 and the sign language video receiving unit 12 of the sign language video synthesizing device 6 may receive the program video and the like by a method other than reception over a communication line.
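The hand-off via a recording medium can be pictured with the following minimal sketch, in which a file stands in for the recording medium. The JSON format, the record fields, and the function names are assumptions made for illustration; the specification does not define a storage format for the position information.

```python
import json
import os
import tempfile
from typing import Dict, List

# Hypothetical record: {"timecode": seconds, "top": row, "left": column}.
Record = Dict[str, float]


def store_position_info(path: str, records: List[Record]) -> None:
    """Role of the output unit 51: store position information on the medium."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def read_position_info(path: str) -> List[Record]:
    """Role of the position information receiving unit 61: read it back."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def demo() -> List[Record]:
    """Store two records in a temporary file and read them back."""
    records = [{"timecode": 0.0, "top": 10, "left": 20},
               {"timecode": 2.5, "top": 30, "left": 40}]
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "position_info.json")
        store_position_info(path, records)
        return read_position_info(path)
```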

For the sign language video synthesizing device 3 of Embodiment 2 as well, setting the display position of the sign language video and combining the program video and the like may be performed by separate devices, in the same way as with the sign language display position setting device 5 and the sign language video synthesizing device 6 of FIG. 22. In that case, as shown in FIG. 23, the sign language display position setting device 7 includes a program video receiving unit 11, a sign language video receiving unit 12, a person area specifying unit 13, a display position setting unit 15, a correspondence specifying unit 31, and an output unit 51. Their configuration and operation are the same as in Embodiment 2 and the description above, and their description is omitted. The sign language video synthesizing device 6 in FIG. 23 is the same as the sign language video synthesizing device 6 in FIG. 22.

In the above embodiments, each process or function may be realized by centralized processing in a single device or a single system, or by distributed processing across a plurality of devices or systems.

In the above embodiments, information related to the processing executed by each component may be held temporarily or for a long period on a recording medium (not shown), even where this is not stated explicitly in the above description. Such information includes, for example, information that each component has received, acquired, selected, generated, transmitted, or received over a communication line, and information such as thresholds, formulas, and addresses that each component uses in its processing. The information may be stored on the recording medium (not shown) by each component or by a storage unit (not shown). The information may be read from the recording medium (not shown) by each component or by a reading unit (not shown).

In the above embodiments, where information used by a component (for example, thresholds, addresses, and various setting values used in its processing) may be changed by a user, the user may or may not be allowed to change that information as appropriate, even where this is not stated explicitly in the above description. Where the user can change the information, the change may be realized, for example, by a receiving unit (not shown) that receives a change instruction from the user and a changing unit (not shown) that changes the information according to that instruction. The receiving unit (not shown) may receive the change instruction, for example, from an input device, as information transmitted via a communication line, or as information read from a predetermined recording medium.

In the above embodiments, when two or more components included in the sign language video synthesizing devices 1 and 3 each have a communication device, an input device, or the like, those components may share a physically single device or may have separate devices.

In the above embodiments, each component may be configured by dedicated hardware, or a component that can be realized by software may be realized by executing a program. For example, each component can be realized by a program execution unit such as a CPU reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. The software that realizes the sign language video synthesizing device of the above embodiments is the following program. That is, the program causes a computer to function as: a program video receiving unit that receives a program video, which is a video of a program; a sign language video receiving unit that receives a sign language video, which is a video of sign language corresponding to the program video; a person area specifying unit that specifies a person area in the program video; a display position setting unit that sets the display position of the sign language video at a position adjacent to the person area specified by the person area specifying unit; a video synthesizing unit that generates a composite video by combining the sign language video at the display position set by the display position setting unit in the program video; and a video output unit that outputs the composite video.

The functions realized by the above program do not include functions that can be realized only by hardware. For example, functions that can be realized only by hardware such as a modem or an interface card in a receiving unit that receives information or an output unit that outputs information are, at the least, not included in the functions realized by the program.

The program may be executed after being downloaded from a server or the like, or by reading a program recorded on a predetermined recording medium (for example, an optical disk such as a CD-ROM, a magnetic disk, or a semiconductor memory). The program may also be used as a program constituting a program product.

The program may be executed by a single computer or by a plurality of computers. That is, centralized processing or distributed processing may be performed.

FIG. 24 is a schematic diagram showing an example of the external appearance of a computer that executes the above program and realizes the sign language video synthesizing device according to the above embodiments. The above embodiments can be realized by computer hardware and a computer program executed on it.

In FIG. 24, the computer system 900 includes a computer 901 that includes a CD-ROM (Compact Disk Read Only Memory) drive 905 and an FD (Floppy (registered trademark) Disk) drive 906, a keyboard 902, a mouse 903, and a monitor 904.

FIG. 25 is a diagram showing the internal configuration of the computer system 900. In FIG. 25, in addition to the CD-ROM drive 905 and the FD drive 906, the computer 901 includes an MPU (Micro Processing Unit) 911, a ROM 912 for storing programs such as a boot-up program, a RAM (Random Access Memory) 913 that is connected to the MPU 911, temporarily stores the instructions of application programs, and provides a temporary storage space, a hard disk 914 that stores application programs, system programs, and data, and a bus 915 that interconnects the MPU 911, the ROM 912, and the like. The computer 901 may include a network card (not shown) that provides a connection to a LAN.

The program that causes the computer system 900 to execute the functions of the sign language video synthesizing device according to the above embodiments may be stored on a CD-ROM 921 or an FD 922, which is inserted into the CD-ROM drive 905 or the FD drive 906, and may be transferred to the hard disk 914. Alternatively, the program may be transmitted to the computer 901 via a network (not shown) and stored on the hard disk 914. The program is loaded into the RAM 913 when executed. The program may also be loaded directly from the CD-ROM 921, the FD 922, or the network.

The program need not include an operating system (OS), a third-party program, or the like that causes the computer 901 to execute the functions of the sign language video synthesizing device according to the above embodiments. The program may include only those instructions that call the appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 900 operates is well known, and a detailed description is omitted.

Needless to say, the present invention is not limited to the above embodiments; various modifications are possible, and they are also included within the scope of the present invention.

As described above, the sign language video synthesizing device and the like according to the present invention can generate and output a composite video in which a sign language video is displayed at a position adjacent to a person area in a program video, and are useful as a device or the like that combines a program video with a sign language video.

1, 3, 6: Sign language video synthesizing device
5, 7: Sign language display position setting device
11: Program video receiving unit
12: Sign language video receiving unit
13: Person area specifying unit
14: Speaker specifying unit
15: Display position setting unit
16: Display determination unit
17: Program-related information receiving unit
18: Correspondence information storage unit
19: Acquisition unit
20, 62: Video synthesizing unit
21: Video output unit
31: Correspondence specifying unit
51: Output unit
61: Position information receiving unit

Claims

A sign language video synthesizing device comprising:
a program video receiving unit that receives a program video, which is a video of a program;
a sign language video receiving unit that receives a sign language video, which is a video of sign language corresponding to the program video;
a person area specifying unit that specifies a person area in the program video;
a display position setting unit that sets a display position of the sign language video at a position adjacent to the person area specified by the person area specifying unit;
a video synthesizing unit that generates a composite video by combining the sign language video at the display position set by the display position setting unit in the program video;
a video output unit that outputs the composite video; and
a display determination unit that determines that the sign language video is not to be displayed when there is no movement in the sign language video,
wherein the video synthesizing unit does not combine a sign language video that the display determination unit has determined is not to be displayed.

The sign language video synthesizing device according to claim 1, wherein
the person area specifying unit specifies a plurality of person areas,
the device further comprises a speaker specifying unit that specifies, among the plurality of person areas specified by the person area specifying unit, the area of the person who is the speaker, and
the display position setting unit sets the display position of the sign language video at a position adjacent to the person area corresponding to the speaker specified by the speaker specifying unit.

The sign language video synthesizing device according to claim 1, wherein
the sign language video receiving unit receives a plurality of sign language videos,
the person area specifying unit specifies a plurality of person areas,
the device further comprises a correspondence specifying unit that associates each sign language video with a person area,
the display position setting unit sets the display position of each sign language video at a position adjacent to the person area associated with that sign language video by the correspondence specifying unit, and
the video synthesizing unit combines the plurality of sign language videos at the set display positions in the program video.

4. The sign language video synthesizing device according to claim 3, wherein the correspondence specifying unit associates sign language videos and person areas whose movements are close to each other.

The sign language video synthesizing device according to claim 4, wherein the movement of the person area is movement of the mouth in the person area.

The sign language video synthesizing device according to any one of claims 1 to 5, further comprising:
a program-related information receiving unit that receives program-related information, which is information indicating an attribute of the program video received by the program video receiving unit;
a correspondence information storage unit that stores correspondence information, which is information associating program-related information with size information indicating a display size of the sign language video; and
an acquisition unit that acquires the size information associated by the correspondence information with the program-related information received by the program-related information receiving unit,
wherein the video synthesizing unit combines, with the program video, the sign language video at the size indicated by the size information acquired by the acquisition unit.

The sign language video synthesizing device according to claim 6, wherein the program-related information includes information indicating a genre of the program of the program video.

The sign language video synthesizing device according to any one of claims 1 to 7, wherein the display position setting unit sets a predetermined position as the display position of the sign language video when the person area specifying unit has not specified any person area.

A sign language video synthesizing device comprising:
a program video receiving unit that receives a program video, which is a video of a program;
a sign language video receiving unit that receives a sign language video, which is a video of sign language corresponding to the program video;
a position information receiving unit that receives position information, which is information indicating a display position of the sign language video in the program video;
a video synthesizing unit that generates a composite video by combining the sign language video at the display position indicated by the position information in the program video;
a video output unit that outputs the composite video; and
a display determination unit that determines that the sign language video is not to be displayed when there is no movement in the sign language video,
wherein the video synthesizing unit does not combine a sign language video that the display determination unit has determined is not to be displayed.

A sign language video synthesizing method comprising:
a program video receiving step of receiving a program video, which is a video of a program;
a sign language video receiving step of receiving a sign language video, which is a video of sign language corresponding to the program video;
a person area specifying step of specifying a person area in the program video;
a display position setting step of setting a display position of the sign language video at a position adjacent to the person area specified in the person area specifying step;
a display determination step of determining that the sign language video is not to be displayed when there is no movement in the sign language video, and determining that the sign language video is to be displayed when there is movement in the sign language video;
a video synthesizing step of, when it is determined in the display determination step that the sign language video is to be displayed, generating a composite video by combining the sign language video at the display position set in the display position setting step in the program video, and, when it is determined in the display determination step that the sign language video is not to be displayed, not combining the sign language video; and
a video output step of outputting the composite video.

A sign language video synthesizing method comprising:
a program video receiving step of receiving a program video, which is a video of a program;
a sign language video receiving step of receiving a sign language video, which is a video of sign language corresponding to the program video;
a position information receiving step of receiving position information, which is information indicating a display position of the sign language video in the program video;
a display determination step of determining that the sign language video is not to be displayed when there is no movement in the sign language video, and determining that the sign language video is to be displayed when there is movement in the sign language video;
a video synthesizing step of, when it is determined in the display determination step that the sign language video is to be displayed, generating a composite video by combining the sign language video at the display position indicated by the position information in the program video, and, when it is determined in the display determination step that the sign language video is not to be displayed, not combining the sign language video; and
a video output step of outputting the composite video.

A program for causing a computer to function as:
a program video receiving unit that receives a program video, which is a video of a program;
a sign language video receiving unit that receives a sign language video, which is a video of sign language corresponding to the program video;
a person area specifying unit that specifies a person area in the program video;
a display position setting unit that sets a display position of the sign language video at a position adjacent to the person area specified by the person area specifying unit;
a video synthesizing unit that generates a composite video by combining the sign language video at the display position set by the display position setting unit in the program video;
a video output unit that outputs the composite video; and
a display determination unit that determines that the sign language video is not to be displayed when there is no movement in the sign language video,
wherein the video synthesizing unit does not combine a sign language video that the display determination unit has determined is not to be displayed.

A program for causing a computer to function as:
a program video receiving unit that receives a program video, which is a video of a program;
a sign language video receiving unit that receives a sign language video, which is a video of sign language corresponding to the program video;
a position information receiving unit that receives position information, which is information indicating a display position of the sign language video in the program video;
a video synthesizing unit that generates a composite video by combining the sign language video at the display position indicated by the position information in the program video;
a video output unit that outputs the composite video; and
a display determination unit that determines that the sign language video is not to be displayed when there is no movement in the sign language video,
wherein the video synthesizing unit does not combine a sign language video that the display determination unit has determined is not to be displayed.