JP4492299B2 - Video apparatus, video display method, and program

Info

Publication number: JP4492299B2
Application number: JP2004318266A
Authority: JP
Inventors: 雄介鈴木; 晃一竹内
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2004-11-01
Filing date: 2004-11-01
Publication date: 2010-06-30
Anticipated expiration: 2024-11-01
Also published as: JP2006129376A

Description

The present invention relates to a video apparatus, a video display method, and a program, and can be applied to, for example, a video apparatus that displays a continuous motion video formed by connecting a plurality of videos.

Conventionally, one way to dynamically create highly realistic video content is to prepare a plurality of live-action video files that capture, for example, the movements of a real person, and to connect them as needed.

For example, Patent Document 1 discloses a technique that, when a plurality of single-motion videos are connected to create a continuous motion video, corrects the displacement of the subject within each single-motion video and between single-motion videos so that the subject's motion connects smoothly across the connected videos.

Patent Document 1 and Non-Patent Document 1 also disclose a technique called warping, which corrects displacement of feature point positions. Specifically, feature points are set manually on live-action footage of a person, the amount of movement of the feature points within the video file is calculated, and that amount is used in image composition processing to correct the person's position, for example so that it stays constant within the video file, thereby improving video quality.

Patent Document 1: JP 2003-69900 A
Non-Patent Document 1: Satoru Yokoyama and Masaki Hayashi, “Examination of two-dimensional virtual actors by live action”, Proceedings of the National Conference of the Information Processing Society of Japan, March 2001

However, the technique described in Patent Document 1 only corrects subject displacement within and between single-motion videos. When highly realistic video frames are simply connected piecewise, differences in the subject's shape, motion speed, and so on at the joined portions cause inconsistencies there, such as a shifted subject position, even after the displacement has been corrected, and the result can look unnatural.

In addition, as Non-Patent Document 1 itself points out, the technique it describes applies processing such as correcting the person's position to every frame in the video file, that is, to the pixels of the entire screen, so composition requires a large amount of processing time.

To solve the above problems, the present invention provides a video apparatus, a video display method, and a program that use configuration information about the displayed video within the motion videos when composing a continuous motion video, thereby reducing the processing required at composition time while creating and displaying high-quality composite video.

To solve this problem, the video apparatus of the first aspect of the present invention is a video apparatus that connects a plurality of motion videos, each representing a motion corresponding to certain information, and outputs a continuous motion video, comprising:
(1) motion video storage means for storing a plurality of motion video files, each composed of a plurality of consecutive frames;
(2) connection reference information storage means for storing, in association with each piece of information, readout information for a motion video file and one or more pieces of connection reference information displayed in that motion video file;
(3) motion video acquisition means for searching the connection reference information storage means for the readout information of the corresponding motion video files on the basis of a plurality of input pieces of information having a temporal order, and for retrieving the plurality of motion video files from the motion video storage means on the basis of that readout information;
(4) intermediate data creation means for acquiring, for the motion video files retrieved by the motion video acquisition means, the last frame of the temporally preceding motion video file and the first frame of the following motion video file, and for creating intermediate data of the video displayed in the last frame and the first frame on the basis of the connection reference information of the last frame and the connection reference information of the first frame read from the connection reference information storage means;
(5) reference frame search means for searching the connection reference information storage means for a reference frame whose connection reference information approximates the intermediate data created by the intermediate data creation means;
(6) composition means for composing, using the reference frame found by the reference frame search means, a motion video in which an intermediate motion video between the last frame of the preceding motion video file and the first frame of the following motion video file is interpolated; and
(7) output means for outputting the motion video composed by the composition means,
wherein, when the search for a given piece of input information yields a plurality of results, the motion video acquisition means selects the motion video file corresponding to that input information by comparing at least the connection reference information of the motion video file for the immediately preceding piece of information with the connection reference information of each motion video file found.
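As an illustration of steps (4) and (5), the following minimal Python sketch creates intermediate data from the connection reference information of two frames and then searches for the stored frame that best approximates it. The `FrameMeta` layout, the use of hand positions alone, linear interpolation, and the summed Euclidean distance are all assumptions made for this sketch; the text above does not specify how the intermediate data is computed or how closeness is measured.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class FrameMeta:
    """Connection reference information for one frame (hypothetical layout)."""
    file_name: str
    frame_index: int
    right_hand: Point
    left_hand: Point


def make_intermediate(last: FrameMeta, first: FrameMeta,
                      t: float = 0.5) -> Tuple[Point, Point]:
    """Interpolate both hand positions between the last frame of the preceding
    file and the first frame of the following file (assumed linear rule)."""
    def lerp(a: Point, b: Point) -> Point:
        return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)
    return lerp(last.right_hand, first.right_hand), lerp(last.left_hand, first.left_hand)


def find_reference_frame(target: Tuple[Point, Point],
                         stored: List[FrameMeta]) -> FrameMeta:
    """Return the stored frame whose hand positions are closest to the target
    (assumed metric: sum of the two Euclidean distances)."""
    right, left = target
    return min(stored, key=lambda f: dist(f.right_hand, right) + dist(f.left_hand, left))
```

Because only per-frame metadata is compared, a search of this kind never touches pixel data, which is consistent with the stated aim of reducing the processing needed at composition time.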

The video display method of the second aspect of the present invention corresponds to the video apparatus of the first aspect. That is, it is a video display method for connecting a plurality of motion videos, each representing a motion corresponding to certain information, and outputting a continuous motion video, using (1) motion video storage means for storing a plurality of motion video files, each composed of a plurality of consecutive frames, and (2) connection reference information storage means for storing, in association with each piece of information, readout information for a motion video file and one or more pieces of connection reference information displayed in that motion video file, wherein:
(3) motion video acquisition means searches the connection reference information storage means for the readout information of the corresponding motion video files on the basis of a plurality of input pieces of information having a temporal order, and retrieves the plurality of motion video files from the motion video storage means on the basis of that readout information;
(4) intermediate data creation means acquires, for the motion video files retrieved by the motion video acquisition means, the last frame of the temporally preceding motion video file and the first frame of the following motion video file, and creates intermediate data of the video displayed in the last frame and the first frame on the basis of the connection reference information of the last frame and the connection reference information of the first frame read from the connection reference information storage means;
(5) reference frame search means searches the connection reference information storage means for a reference frame whose connection reference information approximates the intermediate data created by the intermediate data creation means;
(6) composition means composes, using the reference frame found by the reference frame search means, a motion video in which an intermediate motion video between the last frame of the preceding motion video file and the first frame of the following motion video file is interpolated;
(7) output means outputs the motion video composed by the composition means; and
when the search for a given piece of input information yields a plurality of results, the motion video acquisition means selects the motion video file corresponding to that input information by comparing at least the connection reference information of the motion video file for the immediately preceding piece of information with the connection reference information of each motion video file found.

Further, the program of the third aspect of the present invention connects a plurality of motion videos, each representing a motion corresponding to certain information, and outputs a continuous motion video. It causes a video apparatus that comprises motion video storage means for storing a plurality of motion video files, each composed of a plurality of consecutive frames, and connection reference information storage means for storing, in association with each piece of information, readout information for a motion video file and one or more pieces of connection reference information displayed in that motion video file, to function as:
(1) motion video acquisition means for searching the connection reference information storage means for the readout information of the corresponding motion video files on the basis of a plurality of input pieces of information having a temporal order, and for retrieving the plurality of motion video files from the motion video storage means on the basis of that readout information;
(2) intermediate data creation means for acquiring, for the motion video files retrieved by the motion video acquisition means, the last frame of the temporally preceding motion video file and the first frame of the following motion video file, and for creating intermediate data of the video displayed in the last frame and the first frame on the basis of the connection reference information of the last frame and the connection reference information of the first frame read from the connection reference information storage means;
(3) reference frame search means for searching the connection reference information storage means for a reference frame whose connection reference information approximates the intermediate data created by the intermediate data creation means;
(4) composition means for composing, using the reference frame found by the reference frame search means, a motion video in which an intermediate motion video between the last frame of the preceding motion video file and the first frame of the following motion video file is interpolated; and
(5) output means for outputting the motion video composed by the composition means,
and further causes the motion video acquisition means, when the search for a given piece of input information yields a plurality of results, to select the motion video file corresponding to that input information by comparing at least the connection reference information of the motion video file for the immediately preceding piece of information with the connection reference information of each motion video file found.

With the video apparatus, video display method, and program of the present invention, using configuration information about the displayed video within the motion videos when composing a continuous motion video reduces the processing required at composition time and makes it possible to create and display high-quality composite video.

Embodiments of the video apparatus, video display method, and program according to the present invention will be described below with reference to the drawings.

(A) First Embodiment
This embodiment describes a case in which the invention is applied to a video display device that displays, as continuous video, sign language expressing the meaning of an input sentence.

(A-1) Configuration of the First Embodiment
FIG. 1 is a functional block diagram of the video display device of this embodiment. In hardware terms, the video display device is implemented by an information processing device such as a workstation or personal computer comprising an input/output device, a central processing unit, a storage device, and the like; for convenience in explaining its functional configuration, FIG. 1 shows it as functional blocks.

In FIG. 1, the video display device 1 includes an input unit 10, a morphological analysis unit 11, a translation unit 12, a data selection unit 13, a word dictionary 14, a video file group 15, a selected video file 16, and a display unit 17.

The input unit 10 takes in a sentence (Japanese in this embodiment) whose meaning the user wants displayed as video, and passes it to the morphological analysis unit 11. The input unit 10 may be realized in various forms, for example by accepting character input from a keyboard, by accepting speech from a microphone converted into text, or by accepting a data stream carried by broadcast radio waves captured by an antenna.

The morphological analysis unit 11 receives the input sentence from the input unit 10, performs morphological analysis on it to decompose it into morphemes, and assigns a part of speech to each morpheme using a dictionary (not shown). It then passes the analyzed words to the translation unit 12.

The translation unit 12 receives the words analyzed by the morphological analysis unit 11 and, according to each word and its part-of-speech information, performs processing such as changing the word order, removing unnecessary words, and adding necessary words. It passes the result to the data selection unit 13.

When the data selection unit 13 receives the adjusted words from the translation unit 12, it retrieves from the word dictionary 14 the information associated with each word (described later; for example, the video file name, the metadata about the video, and the motion duration) and, using the video file name, metadata, and so on as search keys, searches the video file group 15 for the video file needed to express the meaning of the word. The data selection unit 13 stores the video files it finds in the selected video file 16. When an input word is not registered in the word dictionary 14, the data selection unit 13 instead searches for video showing the fingerspelling (finger characters) that expresses the sound of the word in sign language, and passes it to the selected video file 16.

Here, a video file is footage of a person performing sign language that has been divided at the portions expressing each word meaning and given a name. A video file consists of a sequence of still images, each of which is called a frame.

The word dictionary 14 is a storage area that stores, for each word, the name of the video file containing the video that expresses the word's meaning, the duration of the motion in the video file, the metadata about the video, and so on, in association with one another. Entries in the word dictionary 14 can be newly registered, added, deleted, and changed.

FIG. 2 shows a configuration example of the word dictionary 14 of this embodiment. As shown in FIG. 2, the word dictionary 14 associates each word with a reading, a part of speech, a file name, a creation date and time, and a duration, as well as with the metadata of each frame constituting the video file. The metadata consists of data indicating the position, shape, and orientation of the hands of the person shown in the video.

The word name is the name of the registered word. The reading is the reading of that word. The part of speech is the name of the word's part of speech. The file name is the name of the video file that represents the word. The creation date and time is the date and time at which the data for the created video file was registered in the word dictionary 14. The duration is the duration of the motion in the video file; in FIG. 2 its unit is seconds.

Metadata is data, extracted frame by frame from the content of a video file, that is needed for video composition and video connection. The structure of the metadata can vary depending on the content of the video file and of the video to be output. The metadata is an example of the connection reference information recited in the claims.

This connection reference information is, at a minimum, information about the displayed video that is used, when composing the connected video, to make adjustments that eliminate jitter in the displayed video between the connected frames.

In this embodiment the output video is sign language video, so the word dictionary 14 uses as metadata, for each frame of the sign language video expressing the meaning of an input word, the position of the person's hands, information (a symbol) indicating the hand shape, information (a symbol) indicating the hand orientation, a motion velocity vector, and the like.

In FIG. 2, the start point is the metadata of the first frame of the video file, and the end point is the metadata of the last frame. The metadata of the other frames has the same structure as the start point and end point.
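The dictionary layout described for FIG. 2 could be modeled roughly as follows. This is only a sketch: the class and field names are hypothetical, FIG. 2 itself is not reproduced here, and the concrete encoding of the shape and orientation symbols is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HandState:
    """Metadata for one hand in one frame."""
    position: Tuple[float, float]  # centroid of the hand region (x, y)
    shape: str                     # hand-shape symbol (encoding assumed)
    orientation: str               # hand-direction and palm-direction symbols combined


@dataclass
class FrameMetadata:
    """Per-frame connection reference information."""
    right: HandState
    left: HandState


@dataclass
class DictionaryEntry:
    """One word dictionary record, following the fields listed for FIG. 2."""
    word: str
    reading: str
    part_of_speech: str
    file_name: str
    created: str        # creation date and time, kept as text in this sketch
    duration_s: float   # duration of the motion, in seconds
    frames: List[FrameMetadata] = field(default_factory=list)

    @property
    def start_point(self) -> FrameMetadata:
        """Metadata of the first frame of the video file."""
        return self.frames[0]

    @property
    def end_point(self) -> FrameMetadata:
        """Metadata of the last frame of the video file."""
        return self.frames[-1]
```

Exposing the start point and end point as properties of the frame list, rather than storing them separately, reflects the statement that all frames share the same metadata structure.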

Next, the items in the metadata will be described with reference to the drawings.

The hand position is the centroid of the region occupied by the person's hand in the video, expressed as two-dimensional coordinates in the video. Although not shown in FIG. 2, a motion velocity vector obtained by differentiating the hand position with respect to time may also be provided.
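The two quantities above can be sketched as follows. The sketch assumes the hand region is already available as a list of pixel coordinates (how the region is extracted is not described here) and approximates the time derivative with a forward difference between consecutive frames at a known frame rate; both function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def centroid(pixels: Iterable[Tuple[int, int]]) -> Point:
    """Centroid of a hand region given as (x, y) pixel coordinates."""
    pts = list(pixels)
    if not pts:
        raise ValueError("hand region is empty")
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)


def velocities(positions: List[Point], fps: float) -> List[Point]:
    """Approximate velocity vectors (pixels per second) by forward differences
    between the hand positions of consecutive frames."""
    return [((x1 - x0) * fps, (y1 - y0) * fps)
            for (x0, y0), (x1, y1) in zip(positions, positions[1:])]
```

A sequence of N positions yields N - 1 velocity vectors, so the last frame of a file would need either a backward difference or a copy of the preceding value if every frame is to carry one.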

In the screen example shown in FIG. 3, the vertical direction is the y-axis, the horizontal direction is the x-axis, and the intersection of the two axes is the reference point (0, 0). The position of the right hand in the frame can then be written as coordinates (x, y) and that of the left hand as (x0, y0).

The hand shape is the shape formed by the fingers as the signer extends or bends them. Sign language usually distinguishes about 80 hand shapes, so in this embodiment a symbol is agreed in advance for each distinguished hand shape and that symbol is used. FIG. 4 shows examples of hand shapes and their corresponding symbols.

The hand orientation is expressed as a combination of the direction in which the hand points and the direction in which the palm faces. The direction in which the hand points is the direction of a straight line drawn from the person's elbow to the wrist. Sign language usually distinguishes about 20 to 28 hand directions and about 6 to 8 palm directions, so each of these is represented by a symbol and the combination of the symbols is used.

FIGS. 5, 6, and 7 show examples of the correspondence between hand directions and symbols, and FIG. 8 shows examples of the correspondence between palm directions and symbols.

FIG. 5 shows directions in the plane seen when the person is viewed from directly in front, the so-called frontal plane. For example, the direction in which the person's right hand points in FIG. 5 is denoted a8. The x-axis and y-axis in FIG. 5 are the same axes as described for FIG. 3. FIG. 6 shows six directions in the plane seen when the person is viewed from the side, the so-called sagittal plane. The direction in which the person's right hand points in FIG. 6 is denoted a10. The z-axis in FIG. 6 is perpendicular to the x-axis and y-axis of FIG. 5 and runs from the person's body in the direction the face is facing. FIG. 7 shows four directions in the plane seen when the person is viewed from directly above, the so-called horizontal plane. The direction in which the person's right hand points in FIG. 7 is denoted a16. The axes in FIG. 7 are the same as those in FIGS. 5 and 6. The axis v in FIG. 8 indicates the hand direction described above. The thin arrows indicate the directions in which the palm, lying in a plane perpendicular to that axis, can face. The direction in which the palm faces in FIG. 8 is denoted aG.

Because this embodiment is applied to a video display device that displays sign language video, the metadata is configured as described above. However, part of the information carried by the video itself, such as information about the content the video expresses, video frequency information, or a color histogram, may also be used as metadata.

Returning to FIG. 1, the video file group 15 is the set of video files that serve as the elements of the continuous video ultimately displayed on the display unit 17. Each file contains footage of a motion corresponding either to a Japanese word to be expressed or to a fingerspelling motion, that is, a motion for expressing a Japanese sound in sign language.

The selected video file 16 receives the video file for each word found by the data selection unit 13 and holds these video files.

The display unit 17 is a functional unit that displays the video files in the selected video file 16 one after another.

(A-2) Operation of the First Embodiment
Next, the operation of the video display device 1 of this embodiment will be described with reference to the drawings. FIG. 9 is a flowchart of the video display operation of this embodiment.

In FIG. 9, the user first inputs to the input unit 10 the sentence to be expressed in sign language video (S90, S91, S92). The user can enter the sentence using a text input means such as a keyboard (S90) or by speaking into a microphone (S91). In the case of voice input, the input speech is converted to text (S92).

When the input unit 10 takes in the input sentence, the morphological analysis unit 11 decomposes it into morphemes (S93). The morphological analysis unit 11 then assigns a part of speech to each morpheme using a general Japanese dictionary (not shown), and the morphemes with assigned parts of speech (referred to as words) are passed to the translation unit 12 (S94).

When the words of the input sentence are passed to the translation unit 12, the translation unit 12 parses the input sentence, now divided into words (S95), and, as in existing research, reorders the words according to a grammar expressed as state transitions between words (S96).

Once the translation unit 12 has adjusted the structure of the input sentence, the data selection unit 13 reads the words one at a time, searches the word dictionary 14 using each word name and its assigned part of speech as keys, and checks whether the word is registered in the word dictionary 14 (S97).

If the word is not registered in the word dictionary 14 in S97, the process proceeds to S98, where a fingerspelling search process is performed to find video files showing the fingerspelling that expresses the sound of the word in sign language (S98).

The fingerspelling search process is described here with reference to the flowchart of FIG. 10.

First, the sound of the input word to be fingerspelled is broken down into characters (S981), and the characters making up the word are read out one at a time (S982). The name of the file showing the fingerspelling motion for the character is looked up in the word dictionary 14 (S983), and that file name is accumulated in the fingerspelling data (S984).

Steps S982 to S984 are repeated until all characters have been processed (S985), and the file names to be called are finally determined from the fingerspelling data (S986). The files are then read from the video file group 15 (S987), and the read video files are added to the selected video file (S913).

このようにして、入力単語の音を手話で表現した指文字を示す映像を検索することができる。 In this way, it is possible to search for an image showing a finger character expressing the sound of the input word in sign language.
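The finger character search of S981 to S986 can be sketched as follows. This is only an illustration under stated assumptions: the table `FINGER_CHARACTER_FILES`, its file names, and the function `finger_character_lookup` are hypothetical and do not appear in the patent, and raising an error for an unregistered character is a choice made for the sketch.

```python
from typing import Dict, List

# Hypothetical table mapping one character to the clip of its finger character.
# The file names are illustrative only.
FINGER_CHARACTER_FILES: Dict[str, str] = {
    "あ": "yubi_a.avi",
    "す": "yubi_su.avi",
    "か": "yubi_ka.avi",
}


def finger_character_lookup(reading: str, table: Dict[str, str]) -> List[str]:
    """Return the clip file name for each character of `reading`, in order."""
    file_names: List[str] = []        # plays the role of the finger character data (S984)
    for ch in reading:                # S981-S982: split into characters, read one by one
        if ch not in table:           # assumption: an unregistered character is an error
            raise KeyError(f"no finger character clip registered for {ch!r}")
        file_names.append(table[ch])  # S983-S984: look up the file name and accumulate it
    return file_names                 # S986: the file names to load


clips = finger_character_lookup("あすか", FINGER_CHARACTER_FILES)
```

The returned list preserves character order, so the clips can be appended to the selected video file in the order in which the word is spelled.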

Returning to FIG. 9, if the input word is registered in the word dictionary 14 in S97, the number of entries registered for the input word is determined (S99).

If exactly one entry is registered for the input word, the process proceeds to S912. The file name field of the search result is used to read the video file from the video file group 15 (S912), and the read video file is added to the selected video file 16 (S913).

一方、入力単語の登録件数が複数ある場合、単語辞書１４中のメタデータを利用して候補の中から読み出すファイルを決定する（Ｓ９１０、Ｓ９１１）。 On the other hand, when there are a plurality of registered input words, a file to be read from the candidates is determined using the metadata in the word dictionary 14 (S910, S911).

まず、データ選択部１３は１つ前の単語を表す映像ファイルのメタデータを参照する（Ｓ９１０）。 First, the data selection unit 13 refers to the metadata of the video file representing the previous word (S910).

As a concrete example of the case where the metadata can be referenced, consider FIG. 2. Suppose that the current word is "meet" and the previous word is "tomorrow". The video file name of the previous word "tomorrow" is "01.avi", so the video file indicated by this file name is the one selected for the previous word.

図２のように、「会う」を表す映像ファイルの候補は複数あるため、どのファイルを使用するかファイルを選択する必要が生じる。このとき、まず、前の単語である「明日」を現す映像ファイル「０１．ａｖｉ」のメタデータのうち終了時の右手の位置と左手の位置の座標を得る（Ｓ９１０）。 As shown in FIG. 2, since there are a plurality of video file candidates representing “Meet”, it is necessary to select which file to use. At this time, first, the coordinates of the right hand position and the left hand position at the end of the metadata of the video file “01.avi” representing the previous word “Tomorrow” are obtained (S910).

そして、「会う」を表す複数候補のファイルのメタデータから始動時の右手の位置と左手の位置の座標を得る。 Then, the coordinates of the right-hand position and the left-hand position at the start are obtained from the metadata of a plurality of candidate files representing “Meet”.

For each candidate, the Euclidean distance between the hand position at the end of the previous file and the hand position at the start of the candidate is computed for each hand, the left and right distances are summed, and the file with the smallest sum is chosen as the selected file (S911).

In this example, when "01.avi" and "04.avi" are compared, the sum of the Euclidean distances of the left and right hands is
Right hand: $\sqrt{(50-50)^2 + (50-60)^2} = 10$
Left hand: $\sqrt{(100-80)^2 + (90-60)^2} = 36.06$
Therefore, 10 + 36.06 = 46.06.

Similarly, when "01.avi" and "02.avi" are compared, the sum of the Euclidean distances of the left and right hands is
Right hand: $\sqrt{(50-100)^2 + (50-60)^2} = 50.99$
Left hand: $\sqrt{(100-130)^2 + (90-60)^2} = 58.21$
Therefore, 50.99 + 58.21 = 109.30.

Therefore, as a result of these comparisons, "04.avi", which has the smaller sum of left and right hand Euclidean distances, is selected.

Through this processing, the video file whose hand position at the start of the current word deviates least from the hand position at the end of the previous word is selected as the next file, so the displayed video shows little hand displacement at the junction between words.
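The selection rule of S910 and S911 can be sketched as below. The coordinates are the ones used in the worked example for "01.avi", "04.avi", and "02.avi"; the data layout (a dictionary with "right" and "left" keys) and the names `hand_gap` and `pick_next_clip` are assumptions made for illustration, not structures defined in the patent.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]
HandPair = Dict[str, Point]  # assumed layout: {"right": (x, y), "left": (x, y)}


def hand_gap(prev_end: HandPair, start: HandPair) -> float:
    """Sum over both hands of the Euclidean distance between end and start positions."""
    return sum(math.dist(prev_end[h], start[h]) for h in ("right", "left"))


def pick_next_clip(prev_end: HandPair, candidates: Dict[str, HandPair]) -> str:
    """Return the candidate file name whose start pose is closest to prev_end (S911)."""
    return min(candidates, key=lambda name: hand_gap(prev_end, candidates[name]))


# End pose of "01.avi" (the previous word) and start poses of the candidates,
# using the coordinates that appear in the worked example.
PREV_END: HandPair = {"right": (50, 50), "left": (100, 90)}
CANDIDATES: Dict[str, HandPair] = {
    "02.avi": {"right": (100, 60), "left": (130, 60)},
    "04.avi": {"right": (50, 60), "left": (80, 60)},
}

chosen = pick_next_clip(PREV_END, CANDIDATES)
```

Using `min` with a key function keeps the rule a single pass over the candidates, which is sufficient when each word has only a handful of registered clips.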

If the metadata of the selected file cannot be referenced, the candidate with the most recent update date and time may be output instead.

以降、データ選択部１３は、選択された映像ファイルを映像ファイル群１５から読み出し（Ｓ９１２）、その読み出した映像ファイルを選択映像ファイル１６に追加する（Ｓ９１３）。入力文を構成するすべての単語についてＳ９７〜Ｓ９１３の処理を繰り返す。 Thereafter, the data selection unit 13 reads the selected video file from the video file group 15 (S912), and adds the read video file to the selected video file 16 (S913). The processes of S97 to S913 are repeated for all words constituting the input sentence.

そして、表示部１７は、選択映像ファイル１６に保持されている映像ファイルを読み出して連続的な映像を出力する。 The display unit 17 reads the video file held in the selected video file 16 and outputs a continuous video.

FIG. 11 shows an example of the display screen of the display unit 17. In FIG. 11, field 1901 displays the input sentence entered by the user. After entry, pressing the start button 1903, for example, starts the video display operation described above, and the continuous sign language video for the input sentence is displayed in display area 1902.

In this embodiment, the processing above is used to display sign language video. However, part of the information carried by the video itself, such as information on the content the video expresses, video frequency information, or a color histogram, may also be used as metadata, and similar videos may be found and used by comparing those values.

It is also possible to compare, for two video files, the velocity vector near the end of one with the motion velocity vector near the start of the other, and to select the candidate for which the two are most similar.
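The text does not say how the similarity of two velocity vectors is measured. As one possible reading, the sketch below uses cosine similarity, which is an assumption introduced here; the function names and the sample vectors are likewise hypothetical.

```python
import math
from typing import Dict, Tuple

Vector = Tuple[float, float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two 2D vectors; 0.0 if either has zero length."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / norm


def pick_by_velocity(end_velocity: Vector, start_velocities: Dict[str, Vector]) -> str:
    """Return the candidate whose starting velocity best matches end_velocity."""
    return max(
        start_velocities,
        key=lambda name: cosine_similarity(end_velocity, start_velocities[name]),
    )


# Hypothetical example: the previous clip ends while the hand moves to the right.
best = pick_by_velocity((1.0, 0.0), {"a.avi": (0.0, 1.0), "b.avi": (2.0, 0.0)})
```

Cosine similarity ignores speed and compares direction only; a measure that also penalizes differences in speed, such as the Euclidean distance between the two vectors, would be an equally plausible choice.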

(A-3) Effects of the First Embodiment
As described above, according to this embodiment, an input sentence is broken down into words, and, by referring to the dictionary, the video file representing each corresponding sign language word can be determined from a group of video files of a person performing sign language.

In doing so, meta information recorded in the dictionary, such as the hand position at the end of the person's motion in the preceding video file, is used as an index for choosing the video file that represents the next word, so that the misalignment at the junction between consecutive videos is reduced when they are combined.

Furthermore, according to this embodiment, when there are multiple candidate video files to be connected and displayed, the metadata of the video files can be used to select the one with the smallest discontinuity in the person's movement between the preceding and following files, so the connected output video becomes smoother and easier to watch. When files are joined with reference to position information and velocity vector information, words that were originally consecutive when the video data was recorded are more likely to be joined to each other, which allows connections with almost no gap.

(B) Second Embodiment
Next, a second embodiment of the video apparatus, video display method, and program of the present invention will be described with reference to the drawings.

本実施形態は、第１の実施形態と同様に手話映像を表示する映像表示装置に適用した場合について説明する。また、本実施形態は、選択映像ファイルの一部を使用し、２つの映像ファイルの間に単語の間の人物の動作を補完する映像ファイルを合成することにより、より見易い映像の効果が得られるシステムの構成について説明する。 In the present embodiment, a case will be described in which the present invention is applied to a video display device that displays a sign language video as in the first embodiment. In addition, in this embodiment, by using a part of the selected video file and synthesizing a video file that complements a person's motion between words between the two video files, it is possible to obtain an easier-to-view video effect. The system configuration will be described.

(B-1) Configuration of the Second Embodiment
FIG. 12 is a functional block diagram of the video display device of the second embodiment. The video display device 2 of the second embodiment differs from the first embodiment in that a synthesis unit 101 is newly added.

Accordingly, in FIG. 12, components that are the same as or correspond to those in FIG. 1 are denoted by corresponding reference numerals. A detailed functional description of those components is omitted because it was given in the first embodiment.

合成部１０１は、蓄積された選択映像ファイル１６から前後する２つの映像ファイルを抜き出し、２つの映像ファイルの中間の映像ファイルを合成する処理を行なうものである。 The synthesizing unit 101 performs a process of extracting two preceding and following video files from the stored selected video file 16 and synthesizing an intermediate video file between the two video files.

(B-2) Operation of the Second Embodiment
Next, the operation of the video display device 2 of this embodiment will be described with reference to the drawings. Basic operations such as input of the input sentence, morphological and syntactic analysis of the input sentence, and selection of video files are the same as in the first embodiment, so the description here focuses on the creation and synthesis of intermediate frames by the synthesis unit 101 of the video display device 2.

図１３（Ａ）は合成部１０１の動作フローチャートを示し、図１３（Ｂ）は動作フローに対応したイメージを示す。 FIG. 13A shows an operation flowchart of the synthesis unit 101, and FIG. 13B shows an image corresponding to the operation flow.

図１３（Ａ）において、合成部１０１は、選択映像ファイル１６に保存されている映像ファイルのうち、まだ処理されていない先行する単語を表わす先映像ファイル１１０７とそのすぐ後の後映像ファイル１１０９とを抜き出す（Ｓ１１０１）。 In FIG. 13A, the compositing unit 101 includes a preceding video file 1107 representing a preceding word that has not yet been processed among the video files stored in the selected video file 16, and a subsequent video file 1109 immediately thereafter. Is extracted (S1101).

次に、合成部１０１は、抜き出した先映像ファイル１１０７の最終フレーム１１０８と、後映像ファイル１１０９の先頭フレーム１１１０とをそれぞれ取り出す（Ｓ１１０２）。 Next, the synthesizing unit 101 extracts the last frame 1108 of the extracted previous video file 1107 and the first frame 1110 of the subsequent video file 1109, respectively (S1102).

最終フレーム１１０８と先頭フレーム１１１０を取り出すと、合成部１０１は、最終フレーム１１０８と先頭フレーム１１１０とに基づいて、所定のモーフィング処理を行ない、複数の中間フレーム１１１１を作成する（Ｓ１１０３）。 When the final frame 1108 and the leading frame 1110 are extracted, the synthesis unit 101 performs a predetermined morphing process based on the final frame 1108 and the leading frame 1110 to create a plurality of intermediate frames 1111 (S1103).

ここで、モーフィング処理とは、２つのフレーム間の映像の割合を徐々に変化させることで２フレーム間の中間フレームを作成する処理であって、例えば、既存の技術であるクロスディゾルブやワーピングなどの方法を用いることが可能である。 Here, the morphing process is a process for creating an intermediate frame between two frames by gradually changing the ratio of the video between the two frames. For example, the existing techniques such as cross dissolve and warping are used. It is possible to use a method.
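Of the two techniques named above, cross dissolve is the simpler one, and a minimal sketch of it follows. It assumes grayscale frames stored as nested lists of pixel intensities and evenly spaced blend weights; both are assumptions for illustration, and the function name `cross_dissolve` is hypothetical. Warping is not covered.

```python
from typing import List

Frame = List[List[float]]  # assumed: grayscale image as rows of pixel intensities


def cross_dissolve(last: Frame, first: Frame, count: int) -> List[Frame]:
    """Create `count` intermediate frames that fade from `last` to `first`."""
    frames: List[Frame] = []
    for k in range(1, count + 1):
        t = k / (count + 1)  # blend weight, strictly between 0 and 1
        frames.append(
            [
                [(1 - t) * a + t * b for a, b in zip(row_last, row_first)]
                for row_last, row_first in zip(last, first)
            ]
        )
    return frames


# One-row, two-pixel frames: a single intermediate frame is the even blend.
middle = cross_dissolve([[0.0, 100.0]], [[100.0, 0.0]], 1)
```

Because the weights exclude 0 and 1, none of the generated frames duplicates the final frame of the preceding clip or the first frame of the subsequent clip.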

複数の中間フレームが作成されると、その作成された中間フレームは、作成枚数に応じた時間情報が追加され、最終フレーム１１０８と先頭フレーム１１１０との間に補完するための映像ファイル１１０６が作成される（Ｓ１１０４）。 When a plurality of intermediate frames are created, time information corresponding to the number of created intermediate frames is added to the created intermediate frames, and a video file 1106 for complementation is created between the last frame 1108 and the first frame 1110. (S1104).

Once the video file 1106 has been created, it is added to the selected video file 16 (S1105). It is stored in the selected video file 16 as a video file to be inserted between the preceding video file 1107 and the subsequent video file 1109.

なお、この処理は、選択映像ファイル１６内のすべての映像ファイルについて処理が完了するまで反復される。 This process is repeated until the process is completed for all the video files in the selected video file 16.

(B-3) Effects of the Second Embodiment
As described above, according to this embodiment, adding a process that synthesizes intermediate video between the video files to be connected and displayed makes the connected output video smoother and easier to watch, even when there is a relatively large gap between the two videos being joined.

(C) Third Embodiment
Next, a third embodiment of the video apparatus, video display method, and program of the present invention will be described with reference to the drawings.

As in the first and second embodiments, this embodiment is applied to a video display device that displays sign language video. In this embodiment, when video files are synthesized, the metadata of the word dictionary is used to retrieve suitable reference frames, as needed, from the video files stored in the video file group, and those reference frames are also used for the video synthesis.

(C-1) Configuration of the Third Embodiment
FIG. 14 is a functional block diagram of the video display device of the third embodiment. In FIG. 14, components that are the same as or correspond to those in FIG. 1 are denoted by corresponding reference numerals. A functional description of the components described in the first embodiment is omitted.

図１４において、合成部１２０１の機能が、第１及び第２の実施形態と異なる。合成部１２０１は、蓄積された選択映像ファイルから前後する２つの映像ファイルを抜き出し、２つの映像ファイルの中間を埋める映像ファイルを合成する処理を行なうものである。このとき、合成部１２０１は、単語辞書１４内のメタデータを用いて映像ファイル群を検索し、映像ファイル中の適当なフレームを取得し、合成に用いるものである。 In FIG. 14, the function of the synthesis unit 1201 is different from those of the first and second embodiments. The synthesizing unit 1201 performs processing for extracting two preceding and following video files from the stored selected video file and synthesizing a video file that fills the middle of the two video files. At this time, the synthesis unit 1201 searches the video file group using the metadata in the word dictionary 14, acquires an appropriate frame in the video file, and uses it for synthesis.

(C-2) Operation of the Third Embodiment
Next, the operation of the video display device 3 of this embodiment will be described with reference to the drawings. Basic operations such as input of the input sentence, morphological and syntactic analysis of the input sentence, and selection of video files are the same as in the first embodiment, so the description here focuses on the creation and synthesis of intermediate frames by the synthesis unit 1201 of the video display device 3.

図１５（Ａ）は合成部１２０１の動作フローチャートを示し、図１５（Ｂ）は動作フローに対応したイメージを示す。 FIG. 15A shows an operation flowchart of the synthesizing unit 1201, and FIG. 15B shows an image corresponding to the operation flow.

なお、図１５において、図１３と対応する処理については対応符号を付して示し、その処理の詳細な説明は省略する。 In FIG. 15, processes corresponding to those in FIG. 13 are denoted by corresponding reference numerals, and detailed description thereof is omitted.

In FIG. 15A, as in the second embodiment, the synthesis unit 1201 extracts the preceding video file 1107 and the subsequent video file 1109 from the selected video file 16 (S1101), and takes out the final frame 1108 and the first frame 1110 (S1102).

この処理と同時に、合成部１２０１は、先映像ファイル１１０７及び後映像ファイル１１０９のファイル名をキーとして単語辞書１４を検索し（Ｓ１３０１）、当該ファイル名に対応するメタデータを取り出す（Ｓ１３０２）。 Simultaneously with this processing, the synthesis unit 1201 searches the word dictionary 14 using the file names of the previous video file 1107 and the subsequent video file 1109 as keys (S1301), and extracts metadata corresponding to the file names (S1302).

各ファイル名に対応するメタデータを取得すると、合成部１２０１は、そのメタデータを用いて所定処理を施すことにより検索データを作成する（Ｓ１３０３）。この検索データとは、参照フレームを検索するためのデータをいう。 When the metadata corresponding to each file name is acquired, the synthesis unit 1201 creates search data by performing predetermined processing using the metadata (S1303). This search data refers to data for searching for a reference frame.

ここで、検索データを作成するための所定処理は、種々の方法が考えられるが、本実施形態では、手話動作の滑らかな動きを図るため、最終フレーム１１０８と先頭フレーム１１１０との中間部分に近いフレームを参照フレームとする。従って、このような参照フレームを検索するために、次のような方法により検索データを作成するものとする。 Here, although various methods can be considered as the predetermined processing for creating the search data, in this embodiment, in order to achieve a smooth movement of the sign language operation, it is close to an intermediate portion between the final frame 1108 and the first frame 1110. Let the frame be the reference frame. Therefore, in order to search for such a reference frame, search data is created by the following method.

例えば、合成部１２０１は、最終フレーム１１０８のメタデータと先頭フレームのメタデータのうち、手話を行なう人物の左右の手の位置、手の向きのデータを取得し（Ｓ１３０１、Ｓ１３０２）、これらフレーム間での手の位置の中間位置を検索データとして求める（Ｓ１３０３）。 For example, the synthesizing unit 1201 acquires the data of the left and right hand positions and hand orientations of the person performing sign language from the metadata of the last frame 1108 and the metadata of the first frame (S1301, S1302). The intermediate position of the hand position is obtained as search data (S1303).

まず、最終フレームの右手の座標位置が（ｘ，ｙ）であり、先頭フレームの右手の座標位置が（ｘ１，ｙ１）であるときには、検索データの右手の座標位置（ｍ，ｎ）は、（（ｘ＋ｘ１）／２、（ｙ＋ｙ１）／２）とする。左手の位置も同様にして座標を求める。 First, when the coordinate position of the right hand of the last frame is (x, y) and the coordinate position of the right hand of the first frame is (x1, y1), the coordinate position (m, n) of the right hand of the search data is ( (X + x1) / 2, (y + y1) / 2). The left hand position is similarly determined.

次に、手の向きについて、最終フレームの右の手の向きが図５でのａ３で、先頭フレームの右の手の向きがａ５であるときには、検索データの手の向きはおおまかにａ４と推定する。このようにして、本実施形態の検索データを作成する。 Next, regarding the orientation of the hand, when the orientation of the right hand of the final frame is a3 in FIG. 5 and the orientation of the right hand of the first frame is a5, the orientation of the hand of the search data is roughly estimated as a4. To do. Thus, the search data of this embodiment is created.
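The construction of the search data in S1303 can be sketched as follows. The hand position follows the midpoint formula given above. For the hand orientation, the sketch assumes that the orientation symbols of FIG. 5 are numbered consecutively so that averaging the numeric suffixes yields the in-between direction, as in the a3, a5, a4 example; that numbering, the lack of wrap-around handling, and the function names are assumptions made here.

```python
from typing import Tuple

Point = Tuple[float, float]


def midpoint(p: Point, q: Point) -> Point:
    """Midpoint ((x + x1) / 2, (y + y1) / 2) of two hand positions."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def middle_orientation(last_symbol: str, first_symbol: str) -> str:
    """Estimate the in-between orientation symbol, e.g. "a3" and "a5" give "a4".

    Assumes symbols consist of a one-letter prefix and a consecutive number,
    and that the two symbols share the same prefix.
    """
    prefix = last_symbol[0]
    i, j = int(last_symbol[1:]), int(first_symbol[1:])
    return f"{prefix}{(i + j) // 2}"


# Hypothetical right-hand data for the final frame and the first frame.
search_position = midpoint((50, 50), (100, 60))
search_orientation = middle_orientation("a3", "a5")
```

The same two functions would be applied to the left hand to complete the search data.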

Ｓ１３０３において検索データが作成されると、合成部１２０１は、その作成された検索データをキーとして単語辞書１４を検索し（Ｓ１３０４）、検索データに最も近いデータを持つフレームを映像ファイル群１５から１つ取り出す（Ｓ１３０５）。なお、この取り出したフレームを参照フレーム１３０７とする。 When the search data is created in S1303, the synthesizing unit 1201 searches the word dictionary 14 using the created search data as a key (S1304), and selects a frame having data closest to the search data from the video file group 15 as one. (S1305). This extracted frame is referred to as a reference frame 1307.
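The search of S1304 and S1305 amounts to a nearest-neighbor lookup over frame metadata. The text does not define what "closest" means, so the sketch below assumes the sum of the Euclidean distances of both hands as the measure; the record layout and the name `nearest_frame` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
FrameMeta = Dict[str, Point]  # assumed layout: {"right": (x, y), "left": (x, y)}


def nearest_frame(query: FrameMeta, frames: List[FrameMeta]) -> int:
    """Return the index of the frame whose hand positions are closest to `query`."""

    def gap(frame: FrameMeta) -> float:
        # Assumed measure: sum of per-hand Euclidean distances to the search data.
        return sum(math.dist(frame[h], query[h]) for h in ("right", "left"))

    return min(range(len(frames)), key=lambda i: gap(frames[i]))


# Hypothetical search data and per-frame metadata.
query = {"right": (75.0, 55.0), "left": (90.0, 75.0)}
frames = [
    {"right": (10.0, 10.0), "left": (20.0, 20.0)},
    {"right": (74.0, 56.0), "left": (91.0, 74.0)},
    {"right": (150.0, 90.0), "left": (160.0, 95.0)},
]
reference_index = nearest_frame(query, frames)
```

Hand orientation could be added to the measure as an extra term, but how position and orientation should be weighted against each other is not specified in the text.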

合成部１２０１が参照フレーム１３０７を取得すると、合成部１２０１は、最終フレーム１１０８、参照フレーム１３０７、先頭フレーム１１１０を用いて、モーフィング処理によって、複数の中間フレーム１１１１を作成する（Ｓ１３０６）。 When the synthesizing unit 1201 acquires the reference frame 1307, the synthesizing unit 1201 creates a plurality of intermediate frames 1111 by morphing processing using the final frame 1108, the reference frame 1307, and the first frame 1110 (S1306).

このとき、合成部１２０１は、最終フレーム１１０８から参照フレーム１３０７へと徐々に変化する複数の中間フレームを作成すると共に、参照フレーム１３０７から先頭フレーム１１１０へと徐々に変化する複数の中間フレームをそれぞれ作成する。なお、中間フレームの作成方法は、第２の実施形態と同様に、モーフィング、クロスディゾルブなどを適用し得る。 At this time, the synthesizing unit 1201 creates a plurality of intermediate frames that gradually change from the final frame 1108 to the reference frame 1307, and also creates a plurality of intermediate frames that gradually change from the reference frame 1307 to the first frame 1110. To do. Note that morphing, cross dissolve, and the like can be applied to the method of creating the intermediate frame, as in the second embodiment.

以降の動作は、第１及び第２の実施形態と同様であるので省略する。 Subsequent operations are the same as those in the first and second embodiments, and are therefore omitted.

(C-3) Effects of the Third Embodiment
As described above, according to this embodiment, when intermediate frames are synthesized, a search is performed using the result of processing the frame metadata of the preceding and subsequent video files, and a reference frame containing actually photographed video data is added to the frames used for synthesis. When the hand position or hand orientation changes markedly between the final frame and the first frame, ordinary synthesis tends to produce unnatural artifacts in the synthesized frames, such as the distortion characteristic of composite images; this approach reduces that effect.

(D) Fourth Embodiment
Next, a fourth embodiment of the video apparatus, video display method, and program of the present invention will be described with reference to the drawings.

This embodiment is also applied to a video display device that displays sign language video. In this embodiment, an element called the dictionary creation unit 1401 is added to the system. It analyzes and segments a video file of a person expressing a sign language sentence, cuts it into video files corresponding to individual words, stores them in the video file group, and adds the metadata needed for searching to the word dictionary.

With the system configuration of this embodiment, the necessary metadata and the like can be added to the word dictionary simply by inputting a video file of sign language into the system.

(D-1) Configuration of the Fourth Embodiment
FIG. 16 is a functional block diagram of the video display device of this embodiment. The fourth embodiment differs from the first embodiment in that a dictionary creation unit 1401 is newly added. Accordingly, in FIG. 16, components that are the same as or correspond to those in FIG. 1 are denoted by corresponding reference numerals. A functional description of the components described in the first embodiment is omitted.

辞書作成部１４０１は、手話の文章を表現している人物を撮影した映像ファイルを解析、文節することにより、単語に相当する映像ファイルに切り分け映像ファイル群に格納し、単語辞書に検索に必要なメタデータを追加することを特徴とする機能部である。 The dictionary creation unit 1401 analyzes and parses a video file that captures a person representing a sign language sentence, divides it into video files corresponding to words, stores them in a video file group, and stores them in the word dictionary. It is a functional unit characterized by adding metadata.

(D-2) Operation of the Fourth Embodiment
Next, the operation of the video display device 4 of this embodiment will be described with reference to FIG. 17. Basic operations such as input of the input sentence, morphological and syntactic analysis of the input sentence, and selection of video files are the same as in the first embodiment, so the description here focuses on the dictionary creation processing performed by the dictionary creation unit 1401 of the video display device 4.

図１７において、映像ファイル１５１０は、ある文章を手話で表現している人物を撮影した映像ファイルである。以下では、この映像ファイル１５１０を単語毎に切り分けて、映像ファイル群１５及び単語辞書１４に追加する場合について説明する。 In FIG. 17, a video file 1510 is a video file obtained by photographing a person expressing a certain sentence in sign language. Hereinafter, a case will be described in which the video file 1510 is cut into words and added to the video file group 15 and the word dictionary 14.

まず、辞書作成部１４０１は、追加対象の映像ファイル１５１０を取り込み、映像ファイル１５１０を構成する複数の映像フレームについて、所定のフレーム群１５１１に分割する（Ｓ１５００）。このフレーム群１５１１は、複数の静止画像をまとめたものであり、例えば、映像ファイルを所定時間ごとに分割した時系列的なフレーム群とすることができる。 First, the dictionary creation unit 1401 takes in a video file 1510 to be added, and divides a plurality of video frames constituting the video file 1510 into a predetermined frame group 1511 (S1500). The frame group 1511 is a collection of a plurality of still images, and can be, for example, a time-series frame group obtained by dividing a video file every predetermined time.
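The time-based division of S1500 can be sketched as below. The sketch assumes each frame carries a timestamp in milliseconds and that a group covers a fixed time window; the function name `split_into_groups` and the sample values are hypothetical.

```python
from typing import Dict, List


def split_into_groups(frame_times_ms: List[int], window_ms: int) -> List[List[int]]:
    """Group frame indices into consecutive time windows of `window_ms` milliseconds."""
    groups: Dict[int, List[int]] = {}
    for index, t in enumerate(frame_times_ms):
        # Frames whose timestamps fall in the same window share one group.
        groups.setdefault(t // window_ms, []).append(index)
    return [groups[key] for key in sorted(groups)]


# Five frames at roughly 30 frames per second, grouped into 100 ms windows.
frame_groups = split_into_groups([0, 33, 66, 100, 133], 100)
```

Returning frame indices rather than pixel data keeps the sketch independent of any particular video decoding library.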

When the video file 1510 has been divided into a plurality of frame groups 1511, the dictionary creation unit 1401 takes out one frame group (S1501) and treats it as the processing frame 1512.

また、辞書作成部１４０１は、処理フレーム１５１２中の人物の手や肘の位置などを認識する（Ｓ１５０２）。この画面上での手や肘などの位置認識方法として、例えば、既存のアルゴリズムである色検出や重心検出処理などを用いて処理フレーム１５１２中の人物の手や、肘の位置などを画面上の二次元座標として認識する方法が考えられる。これにより、処理フレーム１５１２中の位置データ１５１３を作成できる。 Further, the dictionary creation unit 1401 recognizes the positions of human hands and elbows in the processing frame 1512 (S1502). As a method for recognizing the position of the hand or elbow on the screen, for example, the position of the person's hand or elbow in the processing frame 1512 using the existing algorithm such as color detection or centroid detection processing is displayed on the screen. A method of recognizing as two-dimensional coordinates is conceivable. Thereby, the position data 1513 in the processing frame 1512 can be created.

各処理フレームの位置データを作成すると、辞書作成部１４０１は、位置データ１５１３の情報を用いて、当該処理フレーム１５１２中の人物の手と肘の位置関係から手の向きを認識し、その認識した手の向きについて記号分類する（Ｓ１５０３）。 When the position data of each processing frame is created, the dictionary creation unit 1401 recognizes the orientation of the hand from the positional relationship between the hand and the elbow of the person in the processing frame 1512 using the information of the position data 1513, and recognizes the recognition. The direction of the hand is classified (S1503).

次に、認識した手の位置周辺の一定面積の画素情報を取得し、その画素情報に基づいて手の形を認識し、その認識した手の形について記号分類する（Ｓ１５０４）。このとき、手の形を認識する方法として、例えば、手の位置周辺の一定面積の画素に対して、既存の画像認識アルゴリズムである、ニューラルネットワークや高次局所自己相関特徴の計算などの手法を適用できる。 Next, pixel information of a certain area around the recognized hand position is acquired, the shape of the hand is recognized based on the pixel information, and the recognized hand shape is classified into symbols (S1504). At this time, as a method for recognizing the shape of the hand, for example, a method such as neural network or high-order local autocorrelation feature calculation, which is an existing image recognition algorithm, for pixels of a certain area around the hand position. Applicable.

このようにして、辞書作成部１４０１は、映像ファイル１５１０中における処理フレーム１５１２の表示時間を示す情報を加えて、手の位置、手の向き、手の形のデータを時系列データに追加する（Ｓ１５０５）。 In this way, the dictionary creation unit 1401 adds information indicating the display time of the processing frame 1512 in the video file 1510 and adds hand position, hand orientation, and hand shape data to the time-series data ( S1505).

ここで、図１８に時系列データの構造例を示す。図１８に示すように、時系列データは、所定時間（図１８では時間単位はミリ秒とする）ごとのメタデータ（手の位置、手の向き、手の形）からなる。 Here, FIG. 18 shows a structural example of time-series data. As shown in FIG. 18, the time-series data is composed of metadata (hand position, hand orientation, hand shape) for each predetermined time (in FIG. 18, the time unit is milliseconds).

なお、Ｓ１５０１〜Ｓ１５０５の処理をすべてのフレーム群１５１１が終了するまで繰り返す。 Note that the processing of S1501 to S1505 is repeated until all the frame groups 1511 are completed.

すべてのフレーム群１５１１について処理が終了すると、辞書作成部１４０１は、作成した時系列データに対し、手の位置やその時間微分である動作速度ベクトル、手の向きなどを入力とする隠れマルコフモデル（ＨＭＭ）などの既存の認識方法を用いて手話認識処理を行い、時系列データを手話単語に相当する部分ごとに区切る（Ｓ１５０７）。 When the processing is completed for all the frame groups 1511, the dictionary creation unit 1401 applies a hidden Markov model (for example, a hand position, an operation speed vector that is a time derivative thereof, a hand orientation, etc.) to the created time-series data. Sign language recognition processing is performed using an existing recognition method such as HMM), and time-series data is divided into portions corresponding to sign language words (S1507).

また、辞書作成部１４０１は、区切った時系列データに認識結果の手話単語名のラベルをつけ（Ｓ１５０８）、映像ファイル１５１０を読み込み、時系列データに付されたラベルごとに時系列データの時間情報を用いて映像ファイルをラベルごとに分割する（Ｓ１５０９）。 Further, the dictionary creation unit 1401 attaches a label of the sign language word name of the recognition result to the divided time series data (S1508), reads the video file 1510, and time information of the time series data for each label attached to the time series data. Is used to divide the video file into labels (S1509).
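Once each sample of the time-series data carries a recognized word label, the cut points for S1509 follow from the time information. The sketch below assumes the recognition step has already produced a label per sample (the recognition itself, for example by an HMM, is not shown) and that samples are `(time_ms, label)` pairs; the function name `label_spans` is hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Sample = Tuple[int, str]      # assumed: (time in milliseconds, recognized word label)
Span = Tuple[str, int, int]   # (label, first sample time, last sample time)


def label_spans(samples: List[Sample]) -> List[Span]:
    """Collapse consecutive samples with the same label into one time span."""
    spans: List[Span] = []
    for label, run in groupby(samples, key=lambda s: s[1]):
        times = [t for t, _ in run]
        spans.append((label, times[0], times[-1]))
    return spans


# Hypothetical labeled time series for a two-word sentence.
spans = label_spans(
    [(0, "tomorrow"), (100, "tomorrow"), (200, "meet"), (300, "meet")]
)
```

Each span gives the time range to cut from the source video for one word; because `groupby` only merges adjacent samples, a word that occurs twice in a sentence yields two separate spans.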

Then the dictionary creation unit 1401 names the divided pieces of the video file 1510 and saves them in the video file group 15, and adds each video file name together with its segmented time-series data to the word dictionary 14 as a new record (S1514).

(D-3) Effects of the Fourth Embodiment
As described above, according to this embodiment, the same effects as those of the first to third embodiments can be obtained.

In addition, because the dictionary creation unit 1401 is provided, metadata can easily be assigned to video files, which spares the operator the labor of assigning metadata to a large number of video files.

(E) Fifth Embodiment
Next, a fifth embodiment of the video apparatus, video display method, and program of the present invention will be described with reference to the drawings.

本実施形態も、手話映像を表示する映像表示装置に適用した場合である。また、本実施形態は、辞書作成部が作成した映像ファイルに関するデータついて、データの類似度などを用いて単語辞書に登録されているか否かを判断し、辞書内容をメンテナンスする点に特徴がある。これにより、同一単語の過剰登録を回避することができ、単語辞書検索負担の軽減ができるので、結果として映像ファイルの選択性を高めることができる。 This embodiment is also applied to a video display device that displays a sign language video. In addition, the present embodiment is characterized in that the data relating to the video file created by the dictionary creation unit is determined whether or not it is registered in the word dictionary using the similarity of the data and the like, and the dictionary contents are maintained. . Thereby, excessive registration of the same word can be avoided, and the word dictionary search burden can be reduced. As a result, the selectivity of the video file can be improved.

(E-1) Configuration of the Fifth Embodiment
FIG. 19 is a functional block diagram of the video display device of this embodiment. The fifth embodiment differs from the fourth embodiment in that a dictionary maintenance unit 1701 connected to the dictionary creation unit 1401 is newly added. Accordingly, in FIG. 19, components that are the same as or correspond to those in FIGS. 1 and 16 are denoted by corresponding reference numerals. A functional description of the components described in the first and fourth embodiments is omitted.

When the dictionary creation unit 1401 adds new data, derived from a video file and position data, to the word dictionary 14 and the video file group 15, the dictionary maintenance unit 1701 compares the word name of the new data with the data already in the word dictionary, and then adds the new data, discards it, or merges it with the existing data.

(E-2) Operation of Fifth Embodiment
Next, the operation of the video display device 5 of this embodiment will be described with reference to FIG. 20. Basic operations such as input of the input sentence, morphological and syntactic analysis of the input sentence, and selection of video files are the same as in the first embodiment, so the description here focuses on the maintenance processing that the dictionary maintenance unit 1701 of the video display device 5 performs on the word dictionary 14.

FIG. 20 shows the operation of the dictionary maintenance unit 1701 when the video file data that the dictionary creation unit 1401 produces for each sign language word is treated as new data.

First, the new data output by the dictionary creation unit 1401 is given to the dictionary maintenance unit 1701 (S1801). On receiving it, the dictionary maintenance unit 1701 searches the word dictionary 14 using the word name of the new data as the search key (S1802) and acquires the existing data (S1803).

Here, new data is a pair consisting of a segmented video file output from the dictionary creation unit 1401 and the metadata to be added to the word dictionary 14. The existing data may be empty when no matching entry exists.

The dictionary maintenance unit 1701 checks how many video files the video file group 15 contains in total (S1804). If the count is a certain number or more, the process proceeds to S1805, where the fitness between the new data and the existing data is calculated.

On the other hand, if the number of video files in the video file group is a certain number or fewer, the dictionary maintenance unit 1701 checks the number of existing data items in the word dictionary 14 acquired in S1803 (S1804b). If that number is a certain number or fewer, the process proceeds to S1810 and the dictionary maintenance unit 1701 adds the data (S1810). The data is likewise added as is when the word dictionary 14 contains no existing data (S1810). If, in S1804b, the number of existing data items in the word dictionary 14 is a certain number or more, the process proceeds to S1805, where the fitness between the new data and the existing data is calculated.

Here, the fitness is a value that expresses how consistent the new data is with the existing data. Any calculation method may be used as long as it allows the new data to be compared with the existing data.

As a first example, a method that calculates the fitness from duration will be described. In this method, the absolute value of the rate of change between the duration of the new data and the average duration of the existing data is used as the fitness.

$$\text{fitness} = \left| \frac{n - m}{m} \right| \tag{1}$$
where $n$ is the duration of the new data and $m$ is the average duration of the existing data.
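Formula (1) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `duration_fitness`, the use of seconds as the unit, and the error handling for empty or zero-length data are all assumptions made for the example. Note that the value of formula (1) as written grows as the new duration departs from the average; how that value is mapped onto a "low" fitness is left to the caller here.

```python
def duration_fitness(new_duration, existing_durations):
    """Return |(n - m) / m| as in formula (1).

    new_duration       -- n, duration of the new data (hypothetical unit: seconds)
    existing_durations -- durations of the existing data for the same word
    """
    if not existing_durations:
        # Assumption: with no existing data there is nothing to compare against.
        raise ValueError("no existing data to compare against")
    m = sum(existing_durations) / len(existing_durations)  # average duration
    if m == 0:
        raise ValueError("average duration of existing data is zero")
    return abs((new_duration - m) / m)
```

For instance, a new clip of 1.5 s compared against existing clips that average 1.0 s gives a relative deviation of 0.5.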

In this case, if the motion in the new data lasts markedly shorter or longer than the average duration of the existing data, the fitness of the new data becomes low.

As another example, the case where the average acceleration is used for the fitness will be described. First, the differences between frames of the motion velocity vector of the new data are taken to compute acceleration vectors. The magnitude of the acceleration vector is then computed for every frame and averaged. The average acceleration magnitude is obtained for the existing data in the same way.

The absolute value of the rate of change between the average acceleration of the new data and the average acceleration of the existing data is then used as the fitness.

In this case, if the amount of change in motion of the new data is larger than that of the existing data, the fitness of the new data becomes low.
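The acceleration-based calculation can be sketched as follows. The text does not give an explicit formula for this variant, so the sketch assumes the rate of change takes the same form as formula (1), that velocities are supplied as one vector per frame, and that the existing data's average is the mean over all existing entries. The names `mean_acceleration` and `acceleration_fitness` are hypothetical.

```python
import math

def mean_acceleration(velocities):
    """Average magnitude of frame-to-frame velocity differences.

    velocities -- list of per-frame velocity vectors, e.g. [(vx, vy), ...]
    """
    if len(velocities) < 2:
        raise ValueError("need at least two frames to take a difference")
    magnitudes = []
    for prev, cur in zip(velocities, velocities[1:]):
        accel = [c - p for p, c in zip(prev, cur)]  # acceleration vector
        magnitudes.append(math.sqrt(sum(a * a for a in accel)))
    return sum(magnitudes) / len(magnitudes)

def acceleration_fitness(new_velocities, existing_velocity_sets):
    """Absolute rate of change of average acceleration (assumed form of (1))."""
    new_mean = mean_acceleration(new_velocities)
    existing_means = [mean_acceleration(v) for v in existing_velocity_sets]
    existing_mean = sum(existing_means) / len(existing_means)
    if existing_mean == 0:
        raise ValueError("existing data has zero average acceleration")
    return abs((new_mean - existing_mean) / existing_mean)
```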

Next, the fitness calculated in S1805 is judged against a threshold (S1806). If the fitness is lower than the threshold, the dictionary maintenance unit 1701 discards the new data (S1809). The threshold is, for example, 80% of the average fitness of the current existing data.

On the other hand, if the fitness of the new data is equal to or greater than the threshold, the similarity between the new data and the existing data is obtained (S1807).

Here, the similarity is a value that expresses how closely the new data resembles the existing data.

For example, it can be calculated from a value such as the average d, obtained by summing the absolute value of the difference between the hand position in each frame of the new data and the hand position in the same frame of the existing data being compared, from the starting point of both data until either one reaches its end point, and dividing the sum by the number of frames summed.

If the reciprocal of d is taken as the similarity S, a larger value of S indicates more similar data.

$$d = \frac{|r_1 - R_1| + |r_2 - R_2| + \cdots + |r_{n-1} - R_{n-1}| + |r_n - R_n|}{n} \tag{2}$$
Here, $r_t$ is the hand position of the new data in frame $t$ (expressed as a vector), $R_t$ is the hand position of the existing data in frame $t$ (expressed as a vector), and $n$ is the smaller of the number of frames in the existing data and the number of frames in the new data.

$$S = \frac{1}{d} \tag{3}$$
Once the similarity S between the new data and the existing data has been obtained in S1807, it is judged against a threshold (S1808). The threshold is, for example, 80% of the average similarity of the current existing data.
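Formulas (2) and (3) can be sketched as below. The sketch assumes that the absolute value of a difference of position vectors means its Euclidean norm, and it returns infinity when the two sequences coincide exactly (d = 0), a case the text does not address. The function names are hypothetical.

```python
import math

def mean_hand_distance(new_positions, existing_positions):
    """Average per-frame hand-position difference d, as in formula (2).

    Each argument is a list of per-frame position vectors, e.g. [(x, y), ...].
    Only the first n frames are compared, n being the shorter length.
    """
    n = min(len(new_positions), len(existing_positions))
    if n == 0:
        raise ValueError("both sequences need at least one frame")
    total = 0.0
    # zip stops at the shorter sequence, which matches the definition of n.
    for r, R in zip(new_positions, existing_positions):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(r, R)))
    return total / n

def similarity(new_positions, existing_positions):
    """Similarity S = 1 / d, as in formula (3)."""
    d = mean_hand_distance(new_positions, existing_positions)
    # Assumption: identical trajectories are treated as infinitely similar.
    return math.inf if d == 0 else 1.0 / d
```

Because S is the reciprocal of an average distance, its scale depends on the units of the position data, which is presumably why the threshold is defined relative to the average similarity of the existing data rather than as a fixed constant.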

If the similarity is greater than the threshold, the new data can be judged to closely resemble the existing data, so the dictionary maintenance unit 1701 discards the new data (S1809).

On the other hand, if the similarity is equal to or less than the threshold, it can be judged that no existing data resembles the new data, so the dictionary maintenance unit 1701 adds the new data to the word dictionary 14 and the video file group 15 (S1810).
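The branching from S1804 through S1810 can be summarized in a short sketch. It takes the fitness and similarity as precomputed numbers and the limits and thresholds as parameters. Several points are assumptions: the name `maintain`, the parameter names, the choice to check for empty existing data first, and the handling of the boundary cases (the text uses "or more" and "or fewer" for both branches, so the sketch sends a count exactly at the limit on to the fitness check).

```python
from enum import Enum

class Decision(Enum):
    ADD = "add"          # S1810
    DISCARD = "discard"  # S1809

def maintain(total_files, existing_count, fitness, similarity, *,
             file_limit, entry_limit, fitness_threshold, similarity_threshold):
    """Decide whether new data is added to or kept out of the dictionary."""
    # No existing data for this word: add as is (S1810).
    if existing_count == 0:
        return Decision.ADD
    # S1804 / S1804b: while the video file group and this word's entries
    # are both small, add without further checks.
    if total_files < file_limit and existing_count < entry_limit:
        return Decision.ADD
    # S1805 / S1806: discard data whose fitness is below the threshold.
    if fitness < fitness_threshold:
        return Decision.DISCARD
    # S1807 / S1808: discard data that closely resembles existing data.
    if similarity > similarity_threshold:
        return Decision.DISCARD
    return Decision.ADD
```

In the example thresholds given in the text, `fitness_threshold` and `similarity_threshold` would each be 80% of the corresponding average over the current existing data.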

(E-3) Effects of Fifth Embodiment
As described above, according to this embodiment, providing the dictionary maintenance unit 1701 makes it possible, using the similarity, the fitness, and the like, to keep the candidate video files for each word at a certain number and a certain quality. This prevents the video data from growing larger than necessary and the search speed from dropping.

(F) Other Embodiments
(F-1) In the first to fifth embodiments, a keyboard or similar device is used as the input unit, but another input device such as voice input, a GUI-style interface, buttons, or a barcode reader may be used instead.

(F-2) In the first to fifth embodiments, the video files 15 and the word dictionary 14 are described as separate entities, but the metadata may instead be embedded in, for example, the header of each video file, with the headers of a specific group of video files searched as needed in place of the word dictionary.

(F-3) In the first to fifth embodiments, the metadata of a video file consists of hand position, hand orientation, and hand shape in order to express sign language, but it is not limited to these; other data such as facial expression may also be used.

(F-4) In the first embodiment, the evaluation criterion for selecting a video file is described in terms of Euclidean distance, but another value that yields the same effect may be used.

(F-5) In the second embodiment, cross-dissolve and warping are given as examples of morphing methods, but they are not mutually exclusive: either one may be used alone, or both in combination. Other synthesis methods may also be used.

(F-6) In the fourth embodiment, the position data 1513 is obtained by image processing, but it may instead be captured as separate data at shooting time using a motion capture device or the like and synchronized with the frame data at analysis time.

(F-7) The fifth embodiment gives example methods for calculating the fitness and similarity of new data with respect to existing files, but other calculation methods that yield the same effect may be used.

(F-8) The first to fifth embodiments describe a video display device that displays sign language video. The invention is not limited to sign language, however: it can be applied to any device that concatenates multiple videos, each containing imagery whose meaning can be associated with a word, sentence, or symbol, and displays the result as continuous motion video. Examples include systems that display video of motions expressing a specific meaning, such as flag semaphore, pantomime, and dance.

(F-9) In the first to fifth embodiments, dictionary search is performed in Japanese, but by changing the dictionary data the system can also serve as a sign language display system for other languages.

(F-10) In the first to fifth embodiments, each dictionary is shown in table form, but a different data structure such as a tree may be used.

(F-11) The fourth and fifth embodiments are described as adding the dictionary creation unit 1401 and the dictionary maintenance unit 1701 to the video display device 1 of the first embodiment, but they can also be applied to the video display devices 2 and 3 of the second and third embodiments.

(F-12) The components constituting the present invention may all reside on a single PC, or some or all of them, such as the databases, may reside on a server or the like on a network.

FIG. 1 is a functional block diagram of the video display device of the first embodiment.
FIG. 2 is a data structure diagram of the word dictionary of the first embodiment.
FIG. 3 is an explanatory diagram showing the position of a person's hand on the screen in the first embodiment.
FIG. 4 is an explanatory diagram of the relationship between hand shapes and symbols in the first embodiment.
FIG. 5 is an explanatory diagram of the relationship between hand orientations and symbols in the first embodiment.
FIG. 6 is an explanatory diagram of the relationship between hand orientations and symbols in the first embodiment.
FIG. 7 is an explanatory diagram of the relationship between hand orientations and symbols in the first embodiment.
FIG. 8 is an explanatory diagram of the relationship between hand orientations and symbols in the first embodiment.
FIG. 9 is an operation flowchart of the video display device of the first embodiment.
FIG. 10 is an operation flowchart of the finger spelling search of the first embodiment.
FIG. 11 shows an example of the display screen of the display unit in the first embodiment.
FIG. 12 is a functional block diagram of the video display device of the second embodiment.
FIG. 13 is an operation flowchart of the video display device of the second embodiment.
FIG. 14 is a functional block diagram of the video display device of the third embodiment.
FIG. 15 is an operation flowchart of the video display device of the third embodiment.
FIG. 16 is a functional block diagram of the video display device of the fourth embodiment.
FIG. 17 is an operation flowchart of the video display device of the fourth embodiment.
FIG. 18 is a diagram showing an example structure of the time-series data of the fourth embodiment.
FIG. 19 is a functional block diagram of the video display device of the fifth embodiment.
FIG. 20 is an operation flowchart of the video display device of the fifth embodiment.

Explanation of symbols

1, 2, 3, 4, 5 ... video display device, 10 ... input unit, 11 ... morphological analysis unit,
12 ... translation unit, 13 ... data selection unit, 14 ... word dictionary, 15 ... video file group,
16 ... selected video file, 17 ... display unit, 101, 1201 ... synthesis unit,
1401 ... dictionary creation unit, 1701 ... dictionary maintenance unit.

Claims

A video apparatus that concatenates a plurality of motion videos representing motions corresponding to certain information and outputs a continuous motion video, the video apparatus comprising:
motion video storage means for storing a plurality of motion video files each composed of a plurality of consecutive frames;
connection reference information storage means for storing, in association with the information, read information of the motion video file and one or more pieces of connection reference information on what is displayed in the motion video file;
motion video acquisition means for searching the connection reference information storage means, based on a plurality of pieces of input information input in time order, for the read information of the corresponding motion video files, and retrieving the plurality of motion video files from the motion video storage means based on that read information;
intermediate data creating means for obtaining, for the plurality of motion video files retrieved by the motion video acquisition means, the last frame of the preceding motion video file and the first frame of the following motion video file among files that precede and follow each other in time, and creating intermediate data of the video displayed in the last frame and the first frame based on the connection reference information of the last frame and the connection reference information of the first frame read from the connection reference information storage means;
reference frame search means for searching the connection reference information storage means for a reference frame whose connection reference information approximates the intermediate data created by the intermediate data creating means;
synthesis means for synthesizing, using the reference frame found by the reference frame search means, a motion video in which an intermediate motion video is interpolated between the last frame of the preceding motion video file and the first frame of the following motion video file; and
output means for outputting the motion video synthesized by the synthesis means,
wherein, when the search for a piece of input information yields a plurality of results, the motion video acquisition means selects the motion video file corresponding to that input information by comparing at least the connection reference information of the motion video file immediately preceding that input information with the connection reference information of each of the motion video files found.

The video apparatus according to claim 1, further comprising:
frame group generating means for dividing an input motion video file, input as a target of addition, at predetermined time intervals to generate a frame group composed of a plurality of processing frames;
time-series data generating means for recognizing the position, orientation, and shape of what is displayed in each processing frame produced by the frame group generating means, and generating time-series data of the position, orientation, and shape at each predetermined time;
file dividing means for dividing the frame group, by predetermined information recognition processing using the time-series data generated by the time-series data generating means, into portions each corresponding to one piece of information, associating each portion and the time-series data constituting it with that information, and dividing the input motion video file by piece of information; and
registration means for attaching read information to the input motion video file and adding it to the motion video storage means, and for adding the time-series data of each portion to the connection reference information storage means as the connection reference information, in association with the read information of the input motion video file.

The video apparatus according to claim 2, further comprising registration maintenance means for determining, when the registration means performs an addition, whether existing information equivalent to the information to be added is present in the connection reference information storage means, and, if it is, discarding the information to be added without registering it.

The video apparatus according to claim 1, wherein the connection reference information is metadata indicating configuration information of the display video shown in the motion video.

The video apparatus according to any one of claims 1 to 4, wherein the motion video file is a sign language video file expressing the meaning of a word, and
the connection reference information is metadata including the hand position, hand orientation, and hand shape of the person signing.

A video display method for concatenating a plurality of motion videos representing motions corresponding to certain information and outputting a continuous motion video, the method using:
motion video storage means for storing a plurality of motion video files each composed of a plurality of consecutive frames; and
connection reference information storage means for storing, in association with the information, read information of the motion video file and one or more pieces of connection reference information on what is displayed in the motion video file,
wherein motion video acquisition means searches the connection reference information storage means, based on a plurality of pieces of input information input in time order, for the read information of the corresponding motion video files, and retrieves the plurality of motion video files from the motion video storage means based on that read information,
intermediate data creating means obtains, for the plurality of motion video files retrieved by the motion video acquisition means, the last frame of the preceding motion video file and the first frame of the following motion video file among files that precede and follow each other in time, and creates intermediate data of the video displayed in the last frame and the first frame based on the connection reference information of the last frame and the connection reference information of the first frame read from the connection reference information storage means,
reference frame search means searches the connection reference information storage means for a reference frame whose connection reference information approximates the intermediate data created by the intermediate data creating means,
synthesis means synthesizes, using the reference frame found by the reference frame search means, a motion video in which an intermediate motion video is interpolated between the last frame of the preceding motion video file and the first frame of the following motion video file,
output means outputs the motion video synthesized by the synthesis means, and
when the search for a piece of input information yields a plurality of results, the motion video acquisition means selects the motion video file corresponding to that input information by comparing at least the connection reference information of the motion video file immediately preceding that input information with the connection reference information of each of the motion video files found.

A program for a video device that concatenates a plurality of motion videos representing motions corresponding to certain information and outputs a continuous motion video, the video device comprising:
motion video storage means for storing a plurality of motion video files each composed of a plurality of consecutive frames; and
connection reference information storage means for storing, in association with the information, read information of the motion video file and one or more pieces of connection reference information on what is displayed in the motion video file,
the program causing the video device to function as:
motion video acquisition means for searching the connection reference information storage means, based on a plurality of pieces of input information input in time order, for the read information of the corresponding motion video files, and retrieving the plurality of motion video files from the motion video storage means based on that read information;
intermediate data creating means for obtaining, for the plurality of motion video files retrieved by the motion video acquisition means, the last frame of the preceding motion video file and the first frame of the following motion video file among files that precede and follow each other in time, and creating intermediate data of the video displayed in the last frame and the first frame based on the connection reference information of the last frame and the connection reference information of the first frame read from the connection reference information storage means;
reference frame search means for searching the connection reference information storage means for a reference frame whose connection reference information approximates the intermediate data created by the intermediate data creating means;
synthesis means for synthesizing, using the reference frame found by the reference frame search means, a motion video in which an intermediate motion video is interpolated between the last frame of the preceding motion video file and the first frame of the following motion video file; and
output means for outputting the motion video synthesized by the synthesis means,
wherein, when the search for a piece of input information yields a plurality of results, the program causes the motion video acquisition means to select the motion video file corresponding to that input information by comparing at least the connection reference information of the motion video file immediately preceding that input information with the connection reference information of each of the motion video files found.