JP7009997B2

JP7009997B2 - Video generation system and video display system

Info

Publication number: JP7009997B2
Application number: JP2017553773A
Authority: JP
Inventors: 健夫五十嵐; 伸樹依田
Original assignee: Nidec Corp
Current assignee: Nidec America Corp
Priority date: 2015-12-04
Filing date: 2016-11-18
Publication date: 2022-01-26
Anticipated expiration: 2036-11-18
Also published as: JPWO2017094527A1; WO2017094527A1

Description

The present application relates to a video generation system and a video display system. It also relates to a data structure used in the video generation system and to a robot system that includes the video generation system.

Computer-based "embodied conversational agents" have been developed. An embodied conversational agent displays a human or a human-like character on the screen of a display device (display) and converses with a user in response to the user's voice or keyboard input. A video is shown on the display screen, and the facial expression (in particular the mouth movement) and posture of the subject image change to match the content of the speech uttered by the subject, such as a human. The video is generated by a video generation system.

The main video generation systems that display a photorealistic human on a display screen fall into two types. In the first type, a lip-sync animation is synthesized from two-dimensional (2D) photographs or video. Such a composite video can be generated by morphing still images (Non-Patent Document 1) or by preparing a large number of mouth-region images and displaying them in an optimal order (Non-Patent Document 2). In the second type, a three-dimensional (3D) model of a human and its speech animation are created and rendered. Modeling and animation of the human face, which this rendering requires, have been studied (Non-Patent Document 3). Photorealistic rendering requires combining various techniques, such as hair modeling, animation, and rendering, and skin rendering.

Patent Document 1: U.S. Pat. No. 6,636,220

Patent Document 2: Japanese Unexamined Patent Publication No. 2006-323774

Non-Patent Document 1: T. Ezzat and T. Poggio, "Visual speech synthesis by morphing visemes", International Journal of Computer Vision, 38(1):45-57, 2000.

Non-Patent Document 2: L. Wang, X. Qian, W. Han, and F. K. Soong, "Synthesizing photo-real talking head via trajectory-guided sample selection", In INTER-SPEECH, Vol. 10, pp. 446-449, 2010.

Non-Patent Document 3: Z. Deng and J. Noh, "Computer facial animation: A survey", In Data-driven 3D facial animation, pp. 1-28. Springer, 2008.

Non-Patent Document 4: A. Schodl, R. Szeliski, D. H. Salesin, and I. Essa, "Video Textures", In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, pp. 489-498, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.

Of the conventional techniques above, the first method only moves the mouth to match the utterance, so the result is far from a realistic change in human facial expression. Generating a more natural animation requires combining it with a 3D model. With the second method, it is difficult to render a realistic dialogue scene by combining the various techniques such as hair modeling, animation, and rendering.

Embodiments of the present disclosure provide a video generation system and a video display system capable of displaying realistic changes in human facial expression or posture.

An exemplary video generation system of the present disclosure includes a recording device on which a video clip, which is a sequence of a plurality of frames containing a subject image, is recorded, and a computer that reconstructs the plurality of frames to generate composite video data based on the similarity between each frame in a frame group selected from the plurality of frames and each of the other frames. The similarity is defined based on a feature quantity of the subject image in each frame. As the (N+1)-th frame, which follows the N-th frame of the composite video (N is a positive integer), the computer determines one frame from among a plurality of candidate frames selected from the frame group based on their similarity to the N-th frame.

According to embodiments of the present disclosure, it is possible to generate a video in which the frame sequence of a video clip obtained by shooting is reconstructed based on the similarity of the subject images.

FIG. 1 is a diagram showing an example of the basic configuration of a video generation system according to the present disclosure.
FIG. 2 is a diagram schematically showing a configuration example of a table that defines the relationship between a given frame and the "plurality of candidate frames" associated with that frame.
FIG. 3 is a diagram showing a configuration example of a non-limiting exemplary embodiment of the video generation system according to the present disclosure.
FIG. 4 is a diagram showing a configuration example of a video clip in which standby sections and utterance sections alternate.
FIG. 5 is a diagram showing a plurality of feature points of the subject image automatically extracted by the imaging device.
FIG. 6 is a diagram schematically showing the feature points of the subject image in the i-th frame i and the j-th frame j of the video clip.
FIG. 7 is a diagram schematically showing an example of a matrix that defines the distance D_i,j between frame i and frame j, obtained by precomputation.
FIG. 8 is a diagram schematically showing an example of a matrix that defines the transition cost C_i,j between frame i and frame j, obtained by precomputation.
FIG. 9 is a diagram showing an example of an inter-frame transition when cross-dissolve processing is used for frame interpolation.
FIG. 10 is a diagram schematically showing an example in which predetermined frames are excluded from the candidate frames.
FIG. 11 is a diagram showing a modified example of the video display system.
FIG. 12 is a diagram schematically showing an exemplary embodiment of a robot system according to the present disclosure.
FIG. 13 is a flowchart showing an example of the processing procedure performed in advance.
FIG. 14 is a flowchart showing an example of the processing procedure for video generation and video display.
FIG. 15 is a diagram showing an example of the internal configuration of the computer 20.

<Findings of the Inventors>
As a technique capable of generating photorealistic video of a human, the "Video Textures" technique is disclosed in Non-Patent Document 4 and Patent Document 1. "Video Textures" takes a finite-length video clip obtained by shooting, which inherently has a beginning and an end, and makes it possible to play back a video synthesized from that clip continuously and without end. Rather than simply joining the end of the clip to its start and looping forever, this technique makes the transition from one frame of the displayed video to the next probabilistically, based on the similarity between frames. This makes it possible to generate a frame sequence without simple repetition.

In "Video Textures", a plurality of mutually similar frames are selected in advance as candidate frames from the many frames that make up the video clip obtained by shooting. When the frames are rearranged to synthesize a video, the "transition" from the current frame to the next is made according to a probability distribution that depends on similarity. More specifically, one frame is selected from a plurality of candidate frames similar to the current frame, according to that probability distribution. The distribution is defined by a function whose variable is the inter-frame similarity, so transitions to highly similar frames occur more frequently.

In "Video Textures", the similarity between frames is given by the distance between the two frames of interest (the frame-to-frame distance). This "distance" is defined by summing the difference (absolute value or square) between the pixel values of the two frames over all the pixels that make up a frame. The smaller the "distance" so defined, the higher the similarity between the frames.
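The pixel-based distance described above can be sketched as follows. This is only an illustration under stated assumptions: frames are taken to be small grayscale images stored as nested lists, and the name `pixel_distance` and its `squared` flag are hypothetical, not taken from the cited documents.

```python
from typing import Sequence

# Assumption: a frame is a grid of grayscale pixel values (rows of numbers).
Frame = Sequence[Sequence[float]]


def pixel_distance(a: Frame, b: Frame, squared: bool = True) -> float:
    """Sum the per-pixel difference (squared or absolute) over the whole frame."""
    total = 0.0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            diff = pa - pb
            total += diff * diff if squared else abs(diff)
    return total


frame_a = [[0, 10], [20, 30]]
frame_b = [[0, 12], [17, 30]]
print(pixel_distance(frame_a, frame_b))                 # squared differences
print(pixel_distance(frame_a, frame_b, squared=False))  # absolute differences
```

Every pixel of both frames is visited once, so the cost of a single comparison grows with the frame resolution, and comparing all pairs of frames multiplies that cost by the square of the frame count.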

According to the inventors' study, an attempt to build a video generation system using the "Video Textures" technique runs into two problems: the amount of computation needed for the inter-frame similarity is enormous, and transitions between highly similar frames repeat unnaturally.

The inventors found that by using a similarity defined based on the feature quantity of the subject image within the frame, rather than the similarity of the entire frame, a natural video usable for dialogue can be synthesized with less computation. They also found that unnatural repetition of sequences in the composite video can be suppressed by restricting (not adopting) part of the range of candidate frames selected by similarity, depending on what has been displayed in the past. Note that Patent Document 2 discloses detecting the positions of a plurality of feature points placed on facial parts. The apparatus of Patent Document 2 establishes correspondences between feature positions across captured face images and analyzes how each feature position moves and how the image pattern changes during vocalization.

In the present disclosure, similarity is defined based on the "feature quantity of the subject image" within a frame. A typical subject is a human face, and the subject may include part or all of the human body. The subject may also be an animal or a robot that can be anthropomorphized. In a preferred embodiment, the subject image includes an image of a face, in particular the eyes and mouth, which defines the expression of a human or the like. Such a subject image is identified by various feature quantities that can be used in the field of pattern recognition. In one embodiment, the feature quantity may be defined based on a plurality of points (feature points) extracted to represent the expression and orientation of the face, that is, the appearance of the face. For example, the position coordinates of the feature points of the subject image within the frame may themselves be the feature quantity. In this case, the sum of the differences between the position coordinates of corresponding feature points in two frames corresponds to the distance that defines the similarity. The positions of the feature points may be three-dimensional coordinates in the space in which the subject is located. The pixel value at each feature point of the subject image, for example chromaticity or lightness, may also be included in the feature quantity.

Thus, in the present disclosure, the inter-frame similarity is defined based on the feature quantity of the subject image contained in the frame rather than on the entire frame, so the similarity can be obtained with less computation.
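A feature-point distance of this kind might look like the sketch below. The text only says that the differences between corresponding feature-point coordinates are summed; measuring each difference as a Euclidean displacement is an assumption made here for illustration, and the name `feature_distance` is hypothetical.

```python
import math
from typing import Sequence, Tuple

# Feature-point coordinates: 2D (x, y) within the frame, or 3D (x, y, z).
Point = Tuple[float, ...]


def feature_distance(points_i: Sequence[Point], points_j: Sequence[Point]) -> float:
    """Sum of displacements between corresponding feature points of two frames.

    Assumption: each displacement is the Euclidean distance between the points.
    """
    if len(points_i) != len(points_j):
        raise ValueError("frames must have the same number of feature points")
    return sum(math.dist(p, q) for p, q in zip(points_i, points_j))


# Two made-up frames with two feature points each.
frame_i = [(10.0, 20.0), (30.0, 40.0)]
frame_j = [(13.0, 24.0), (30.0, 40.0)]
print(feature_distance(frame_i, frame_j))
```

The work per comparison depends on the number of feature points rather than the number of pixels, which is where the reduction in computation comes from.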

According to another aspect of the present disclosure, some frames are excluded from the plurality of candidate frames selected based on similarity, depending on the order of the frames that make up the composite video. Excluding some frames suppresses the repeated use of the same highly similar frame in the composite video within a short period. For example, when a frame appears K_L or more times (K_L is an integer of 1 or more) among the predetermined number T_L of frames (T_L is an integer of 2 or more) preceding the N-th frame, that frame can be selected as one of the "some frames" above and excluded from the candidate frames.
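The exclusion rule in the example above can be sketched as a filter over the candidate list. The function name `filter_candidates` and the representation of the display history as a list of frame indices are assumptions for illustration; `t_l` and `k_l` stand for T_L and K_L.

```python
from collections import Counter
from typing import List, Sequence


def filter_candidates(candidates: Sequence[int], history: Sequence[int],
                      t_l: int, k_l: int) -> List[int]:
    """Drop candidates that appear k_l or more times in the last t_l frames.

    `history` holds the frames already placed before the N-th frame of the
    composite video, oldest first.
    """
    recent = Counter(history[-t_l:])
    return [c for c in candidates if recent[c] < k_l]


# Made-up history: frame 4 has been used three times in the last five frames.
history = [3, 4, 5, 4, 11, 4]
print(filter_candidates([4, 5, 11], history, t_l=5, k_l=2))
```

With these made-up values, frame 4 is removed and frames 5 and 11 remain available for the next transition.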

<Explanation of the Principle of Video Generation According to the Present Disclosure>
Before the embodiments are described, the basic principle of the video generation system according to the present disclosure is explained below.

First, refer to FIG. 1. FIG. 1 shows an example of the basic configuration of a video generation system according to the present disclosure. The illustrated video generation system 100 includes a recording device 10 and a computer 20.

A video clip obtained by shooting in advance is recorded on the recording device 10. The video clip is a sequence of a plurality of frames containing a subject image, for example a human facing the front.

The computer 20 reconstructs the frames based on the similarity between frames in the frame group of the video clip and generates composite video data. As described above, this "similarity" is defined based on the feature quantity of the subject image in each frame.

The composite video generated by the video generation system 100 is a sequence of a plurality of frames whose display order has been reconstructed. That is, the composite video consists of frames whose order in the captured video clip has been rearranged, at least in part, into a new order. When synthesizing the video, the computer 20 selects the (N+1)-th frame, which follows the N-th frame of the composite video (N is a positive integer), from a "plurality of candidate frames". The "plurality of candidate frames" is selected from the frame group of the video clip based on similarity.

From the plurality of candidate frames selected in this way, the computer 20 determines one frame (for example, a frame 12 containing a subject image 14) as the (N+1)-th frame of the composite video. There can be various ways of determining one frame from the plurality of candidate frames. If the frame with the highest similarity is always chosen, the same specific frame will always be selected from the candidates. In a preferred embodiment, the frame is selected probabilistically based on a probability distribution function that takes the similarity as a parameter.

In a composite video generated in this way, the change in the feature quantity of the subject image between consecutive frames is small. For example, the video is synthesized so that the positions and sizes of the face, eyes, and mouth within the frame change naturally. Moreover, the frame that follows a given frame is not fixed to a single frame but varies probabilistically, so the monotony of the subject's expression or movement can be reduced even when the video is displayed for a long time.
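One way such a probabilistic choice could be realized is sketched below. The text does not give the form of the probability distribution function; the exponential weighting exp(-d / sigma) used here is an assumed example, and the names `transition_probabilities`, `pick_next_frame`, and `sigma` are hypothetical.

```python
import math
import random
from typing import Dict


def transition_probabilities(distances: Dict[int, float], sigma: float) -> Dict[int, float]:
    """Turn candidate distances into probabilities; smaller distance, higher probability.

    Assumption: weights follow exp(-distance / sigma), normalized to sum to 1.
    """
    weights = {j: math.exp(-d / sigma) for j, d in distances.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def pick_next_frame(distances: Dict[int, float], sigma: float,
                    rng: random.Random) -> int:
    """Draw the (N+1)-th frame from the candidates according to those probabilities."""
    probs = transition_probabilities(distances, sigma)
    frames = list(probs)
    return rng.choices(frames, weights=[probs[j] for j in frames], k=1)[0]


# Made-up distances from the current frame to three candidate frames.
candidate_distances = {4: 1.0, 5: 2.0, 11: 3.0}
print(transition_probabilities(candidate_distances, sigma=1.0))
print(pick_next_frame(candidate_distances, sigma=1.0, rng=random.Random(0)))
```

In this sketch a larger `sigma` flattens the distribution toward a uniform random choice, while a smaller `sigma` approaches always picking the most similar candidate.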

The "plurality of candidate frames" is determined in advance for each frame. The composition of the "plurality of candidate frames" associated with a frame can differ depending on the content of that frame (the appearance of the subject image). In a preferred embodiment, the relationship between a frame and its associated "plurality of candidate frames" can be expressed, for example, by a table.

FIG. 2 schematically shows a configuration example of such a table. The table is configured to store, for 12 frames selected from the video clip, values corresponding to similarity (for example, the "distance" between frames). In practice, a table may be prepared for an enormous number of frames, exceeding tens of thousands. The 12 frames are typically 12 consecutive frames obtained by shooting. However, the order of the frames does not necessarily have to match the order in which they were captured.

Here, let i and j be integers from 1 to 12. Row i, column j of the table stores a numerical value corresponding to the similarity between frame i and frame j. The values in this example are assumed to indicate how far the coordinates of the feature points on the face of the person who is the subject shift from frame to frame. For simplicity, the table in FIG. 2 shows example similarity values only in the row for i = 3. The smaller the value, the higher the similarity of the subject images. In other words, the smaller the value, the smaller the positional change of the feature points on the subject's face when the frame transitions.

Excluding the frame j = 3, which is the same frame, the frame j with the highest similarity to frame i = 3 is the frame j = 4. As three candidate frames with relatively high similarity, the frames j = 4, 5, and 11 are selected, for example. Such candidate frames can be selected in advance for each of the frames i = 1 to 12, provided that the similarity between the frames i = 1 to 12 and the frames j = 1 to 12 has been computed beforehand.
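The precomputed candidate selection for one table row can be sketched as follows. The distance values are invented for illustration and are not the values shown in FIG. 2; they are merely chosen so that frames 4, 5, and 11 come out as the three closest frames to frame 3. The name `select_candidates` is hypothetical.

```python
from typing import Dict, List


def select_candidates(row: Dict[int, float], i: int, count: int = 3) -> List[int]:
    """Return the `count` frames closest to frame i, excluding frame i itself."""
    others = [j for j in row if j != i]
    return sorted(others, key=lambda j: row[j])[:count]


# Hypothetical row i = 3 of a 12-frame distance table (smaller = more similar).
row_3 = {1: 9.0, 2: 6.0, 3: 0.0, 4: 1.5, 5: 2.5, 6: 7.0,
         7: 8.0, 8: 9.5, 9: 8.5, 10: 5.0, 11: 3.0, 12: 6.5}
print(select_candidates(row_3, i=3))
```

Running the same selection over every row ahead of time yields the per-frame candidate lists, so nothing needs to be sorted while the composite video is being played.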

The operation of determining one frame from the plurality of candidate frames is executed by the computer when the composite video is generated. As described above, in a preferred embodiment this determination is made probabilistically. For example, when frame i = 3 is the N-th frame of the composite video, one frame can be selected as the (N+1)-th frame from the plurality of candidate frames, that is, from the frames j = 4, 5, and 11. If the frame with the highest subject-image similarity (the smallest positional change of the feature points) were always selected, frame i = 3 would always transition to frame j = 4, and the variety of changes in the subject image in the composite video would be lost. To secure such variety, it is preferable that the frame be selected probabilistically from the candidate frames j = 4, 5, and 11. However, when the candidate frames all have relatively high similarity, a frame may be selected from them at random regardless of similarity, or according to some other rule.
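The difference between always taking the most similar candidate and choosing among the candidates can be seen in a small synthesis loop. This is a sketch only: the candidate table below is invented (each list is assumed to be ordered from most to least similar), the uniform random choice is the "random regardless of similarity" option mentioned above, and the name `synthesize` is hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence

Chooser = Callable[[Sequence[int]], int]


def synthesize(table: Dict[int, List[int]], start: int, length: int,
               choose: Chooser) -> List[int]:
    """Build a frame sequence by repeatedly choosing among the current frame's candidates."""
    sequence = [start]
    while len(sequence) < length:
        sequence.append(choose(table[sequence[-1]]))
    return sequence


# Hypothetical candidate table: frame -> candidates, most similar first.
table = {3: [4, 5, 11], 4: [5, 3, 11], 5: [4, 11, 3], 11: [3, 5, 4]}

# Always taking the most similar candidate locks into a short cycle.
greedy = synthesize(table, 3, 8, lambda c: c[0])
print(greedy)  # [3, 4, 5, 4, 5, 4, 5, 4]

# Choosing at random among the candidates breaks the cycle.
varied = synthesize(table, 3, 8, random.Random(0).choice)
print(varied)
```

Every step in both sequences is still a transition to one of the precomputed candidates, so the frame-to-frame change stays small in either case; only the variety differs.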

The typical subject image in a video clip is a human, but it may be an anthropomorphized animal or a robot. "Reconstruction" covers not only generating a composite video using the actually captured frames as they are, but also generating a composite video that additionally uses frames other than those obtained by shooting. A typical example of the former is generating a composite video by rearranging the display order of the captured frames. An example of a "frame other than those obtained by shooting" is a frame generated from captured frames, for instance by morphing.

<Embodiments>
Embodiments of the present disclosure are described in detail below with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed description of well-known matters and redundant description of substantially identical configurations may be omitted. This is to keep the following description from becoming unnecessarily redundant and to make it easier for those skilled in the art to understand. The inventors provide the accompanying drawings and the following description so that those skilled in the art can fully understand the present disclosure; they are not intended to limit the subject matter recited in the claims.

1. Basic configuration
First, refer to FIG. 3. FIG. 3 shows a configuration example of a non-limiting exemplary embodiment of the video generation system according to the present disclosure.

The illustrated video generation system 100 includes a recording device 10 on which a video clip is recorded, and a computer 20 that reconstructs the plurality of frames making up the video clip to generate composite video data.

The video generation system 100 displays the composite video on the screen of a display 50, such as a liquid crystal display or an organic EL display. The display 50 may be a projector that projects the video onto a screen. The display 50 may be connected to the video generation system 100 by wire or wirelessly, either at all times or as needed. One or both of the recording device 10 and the computer 20 of the video generation system 100 may be located away from the display 50. The display 50 may or may not be part of the video generation system 100. The display 50 in this embodiment has a built-in speaker (not shown) that outputs audio.

The computer 20 need only have a known configuration and may typically be a commercially available general-purpose computer containing a central processing unit (CPU) and memory. The memory stores a program containing a plurality of instructions that define the order of processing by the computer 20. In operation, the computer 20 runs according to the instructions of this program.

As described above, the computer 20 determines one frame from the plurality of candidate frames as the (N+1)-th frame, which follows the N-th frame of the composite video (N is a positive integer). In a preferred embodiment, this "determination" can be made continuously in a mode in which a composite video containing the subject image in a standby state is shown on the display 50. When the displayed frame transitions from one frame of the composite video (the N-th frame) to the next (the (N+1)-th frame), the transition occurs between two frames whose similarity satisfies a predetermined relationship. As a result, the subject image in the composite video appears to the user to move naturally.

FIG. 3 shows a video display system 200 that includes the video generation system 100. In addition to the video generation system 100, the video display system 200 includes an interface device 60, such as a microphone or keyboard, and a dialogue engine 70. The dialogue engine 70 receives some input from the user via the interface device 60 and causes the computer 20 to carry out a dialogue in response to that input, for example giving the user an appropriate reply. For example, when the user says "konnichiwa" ("hello"), the interface device 60 converts the user's voice into a digital signal, which is input to the dialogue engine 70. The dialogue engine 70 uses speech recognition to determine what the user said and replies according to the result. Specifically, in response to a reply instruction from the dialogue engine 70, the computer 20 reads the appropriate portion of the video clip from the recording device 10 and plays it back.

The dialogue engine 70 may be implemented by a program installed on the computer 20. The dialogue engine 70 may be located away from the display 50. The interface device 60 is typically placed near the display 50. Alternatively, the interface device 60 may be integrated with the display 50. In one aspect, the moving image display system 200 includes a housing that accommodates the recording device 10, the computer 20, the display 50, and the interface device 60.

The moving image clip recorded in the recording device 10 includes moving image portions for replies, as described later. Thus, when the speech "Konnichiwa" ("Hello"), for example, is emitted from the speaker of the display 50, a moving image of a person saying "Konnichiwa" is displayed on the screen of the display 50 in time with the speech. That is, a moving image showing the changes in facial appearance corresponding to the sounds "ko", "n", "ni", "chi", and "wa" is displayed in synchronization with the audio. Such a moving image can be generated most simply by playing back, in the order in which they were shot, a series of previously captured frames in which the subject says "Konnichiwa".

Unlike the prior art described above, the moving image generation system according to the present embodiment does not generate an utterance animation from an input character string or voice. The moving image generation system according to the present embodiment is well suited to dialogue systems that select an appropriate reply from predetermined lines, such as simple visitor reception and reservation systems.

While there is no input from the user, the subject image in a standby state is shown on the display 50. Such a period is called a "standby period", and the operating state of the moving image generation system 100 during the standby period is called the "standby mode". In contrast, the period during which the audio and moving image for replying to the user are played back is called an "utterance period", and the operating state of the moving image generation system 100 during the utterance period is called the "utterance mode".

According to the embodiment of the present disclosure, the subject image displayed on the display 50 during the standby period is not a still image but a moving image synthesized by the computer 20. As described with reference to FIG. 1, this composite moving image is a new sequence of frames that the computer 20 generates by rearranging the frames of the moving image clip. The order of the frames making up the moving image displayed during the standby period is therefore determined, as described above, by probabilistic inter-frame transitions. According to the moving image generation system of the present disclosure, the data of the composite moving image for the standby period and the data of the played-back moving image for the utterance period are sent to the display 50 alternately. The length of the standby period depends on the timing of the user's input, whereas the length of the utterance period is determined by the length of the moving image portion selected according to the content of the utterance. The "moving image clip" in the present embodiment includes moving image portions in which the subject speaks predetermined lines and moving image portions showing the subject before and after speaking those lines.

Thus, the computer 20 in the present embodiment is programmed to perform, according to the user's input, either the generation of composite moving image data or the selection of a partial clip that is a part of the moving image clip. At the direction of the computer 20, the display 50 displays the generated composite moving image or the selected partial clip.

2. Acquisition of the moving image clip
The recording system used to acquire the moving image clip includes, for example, one computer and one moving image capture device. The capture device used in the present embodiment is a Kinect (registered trademark) manufactured by Microsoft (registered trademark). A capture device that does not acquire depth information, such as an ordinary camera, may also be used.

The person (subject) who serves as the model for the dialogue agent reads out and performs lines prepared in advance, and the performance is recorded with the camera. The time at which the subject starts reading each line is recorded. The recording can be carried out by two people: the actor serving as the model and a person who operates the computer (the operator). Before recording, the operator prepares the lines for the actor to perform and registers their text and audio in the system.

In the present embodiment, immediately after recording starts, the actor waits without doing anything so that the standby state is recorded (standby section). When the operator selects a line, the actor speaks the selected line (utterance section). For example, when the line "Konnichiwa" ("Hello") is selected, the actor says "Konnichiwa" aloud. The operator inputs to the system the time at which the actor finishes performing the line. Recording of the standby state then continues for a while, and when the operator selects a line again, recording of the next line begins.

Repeating the above procedure yields a moving image clip in which standby sections and utterance sections alternate, as illustrated in FIG. 4. This increases the number of candidate frames at the transition positions before and after each utterance state, which makes it easier to achieve natural frame transitions when generating the moving image. The same line may be recorded more than once. If several utterance sections containing the same line have been acquired, playing exactly the same moving image over and over can be avoided even when the same line is spoken repeatedly in response to user input. In addition, the more utterance-state recordings of the same line there are, the more frames are available to choose from when transitioning from the various standby frames to the first frame of an utterance state, which makes natural inter-frame transitions easier to achieve.

It is not strictly necessary to make all recordings in one continuous session. Recording continuously, however, keeps small changes in appearance, such as hairstyle and clothing, to a minimum. The information stored by the recording system consists of the audio, the RGB data from the camera, the face recognition results, the posture recognition results, and the utterance start and end times entered by the operator.

As shown in FIG. 5, according to the present embodiment, the imaging device 80 can automatically extract a plurality of feature points of the subject image and acquire face orientation and posture information. FIG. 5 shows some of the positions f_{i,k} (1 ≤ k ≤ F) of the facial feature points of the subject image in the i-th frame (frame i) of the moving image clip. It also shows some of the positions s_{i,k} (1 ≤ k ≤ S) of the posture feature points of the subject image. Here, "F" and "S" are, respectively, the maximum number of facial feature points and the maximum number of posture feature points used for processing.

When the orientation of the subject's face, the facial expression, or the subject's posture changes, the positions f_{i,k} and s_{i,k} of these feature points change. Even for the same corresponding feature point, the position can differ from frame to frame. In the present embodiment, when the moving image clip is acquired by shooting the subject, a predetermined number of feature point positions is acquired for each frame.

In the example shown in FIG. 5, the position of the head of the human subject is indicated by the intersection of the dash-dot line 15X extending in the horizontal direction (X-axis direction) and the dash-dot line 15Y extending in the vertical direction (Y-axis direction). The positions indicating the posture of the human subject are indicated by the intersections of the single broken line 16X extending in the horizontal direction with each of the two broken lines 16Y extending in the vertical direction. With the imaging device 80, the difference in posture relative to the start of recording can be displayed on the display of the recording system. If the difference becomes too large during shooting, the operator can interrupt the recording; the display can also be used to set the posture when resuming after a break.

3. Pre-calculation
Next, a pre-calculation is performed to determine the candidate frames for the frame transitions made when the composite moving image is generated.

In the present embodiment, the pre-calculation computes the inter-frame transition cost for every pair of frames in the recorded clip and obtains a cost matrix. The transition cost quantifies the unnaturalness of a transition from one frame in the recorded clip to another frame. This calculation can follow the method disclosed in Non-Patent Document 4, the entire contents of which are incorporated herein by reference. In the present embodiment, however, the similarity between frames is evaluated using the feature points from face recognition rather than a pixel-by-pixel comparison.

Using the cost matrix thus obtained, a plurality of frames that achieve low-cost transitions is extracted for each of two cases: transitions from a standby state to a standby state, and transitions from a standby state to an utterance state. The frames that are low-cost transition destinations are the candidate frames. The pre-calculation also cuts out the audio that was recorded in parallel with the shooting of the candidate frames and converts the RGB data of each frame into a form suitable for playback.

3.1. Similarity between frames
FIG. 6 schematically shows the feature points of the subject image in the i-th frame (frame i) and the j-th frame (frame j) of the moving image clip, where 1 ≤ i < j. In frame i of FIG. 6, arrows indicate, by way of example, the positions f_{i,15} and s_{i,2} of two feature points. The position f_{i,15} is that of the 15th facial feature point, and s_{i,2} is that of the second posture feature point. In frame j of FIG. 6, arrows indicate the positions f_{j,15} and s_{j,2} of the two feature points corresponding to those in frame i. When a transition occurs from frame i to frame j, a distance (difference) can arise, for the 15th facial feature point, between the position f_{i,15} in frame i and the position f_{j,15} in frame j. If such a distance (difference) is large, the user may notice the jump of the feature point when the transition from frame i to frame j occurs during moving image generation.

In the present embodiment, the distance (difference) between feature points of the subject image is used as the value indicating the similarity between frames. The smaller this distance, the higher the similarity. First, the distances between feature points of the subject image are obtained from the face detection and posture information provided by Kinect (registered trademark). The sum of the distances over the plurality of feature points is then defined as the "inter-frame distance". An example of this inter-frame distance is described below.

According to the face recognition specification of Kinect (registered trademark), the number F of facial feature points is 121. Of the 20 posture feature points defined by Kinect (registered trademark), those that are always visible when the upper body is filmed (head, right shoulder, left shoulder, and shoulder center) can be used. In this case, the number S of posture feature points is 4. With the weighting parameters α and β introduced, the distance between the i-th frame and the j-th frame is expressed by the following equation.

[Equation 1]

In the present embodiment, the distances are normalized by their averages, and the face and posture distances are used in a ratio of 7 to 3, as shown in Equations 2 and 3 below.

[Equations 2 and 3]
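Equations 1 to 3 are not reproduced in this text, so the following Python sketch is only one possible reading of the description above. It assumes that the inter-frame distance is a weighted sum of the summed Euclidean distances between corresponding facial and posture feature points, that each term is normalized by its mean over all frame pairs, and that the weights are 0.7 and 0.3. The function names and the exact normalization are assumptions, not taken from the source.

```python
import math

def summed_distance(points_a, points_b):
    """Sum of Euclidean distances between corresponding feature points."""
    return sum(math.dist(p, q) for p, q in zip(points_a, points_b))

def distance_matrix(faces, poses, face_weight=0.7, pose_weight=0.3):
    """Build D[i][j] from per-frame feature points (needs at least two frames).

    faces[i] and poses[i] are lists of coordinate tuples for frame i.
    """
    n = len(faces)
    raw_f = [[summed_distance(faces[i], faces[j]) for j in range(n)] for i in range(n)]
    raw_s = [[summed_distance(poses[i], poses[j]) for j in range(n)] for i in range(n)]

    def mean_off_diagonal(m):
        vals = [m[i][j] for i in range(n) for j in range(n) if i != j]
        return sum(vals) / len(vals)

    # Assumed normalization: divide each term by its mean over all frame pairs.
    # A zero mean (all frames identical) falls back to 1.0 to avoid dividing by zero.
    mean_f = mean_off_diagonal(raw_f) or 1.0
    mean_s = mean_off_diagonal(raw_s) or 1.0
    return [[face_weight * raw_f[i][j] / mean_f + pose_weight * raw_s[i][j] / mean_s
             for j in range(n)] for i in range(n)]
```

With this normalization, the average off-diagonal face term and posture term are both 1, so the 7:3 weighting is not distorted by the different numbers of facial (121) and posture (4) feature points.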

The distance D_{i,j} between frame i and frame j obtained by this calculation can be expressed as a matrix, as shown in FIG. 7.

3.2. Transition cost
In the embodiment of the present disclosure, a transition between non-adjacent frames is interpolated by cross-dissolve processing. The number of frames to be cross-dissolved is fixed and is denoted d. (The present embodiment uses d = 3, which corresponds to 0.25 seconds.) The smaller the distance between edges in the images, the more natural the interpolation. This edge distance is assumed to be well approximated by the feature point distance. The cost of a transition from frame i to frame j can therefore be defined using Equation 4 below.

[Equation 4]

The transition cost C_{i,j} between frame i and frame j obtained by this calculation can be expressed as a matrix, as shown in FIG. 8.

The cost of a transition from the i-th frame to the (i+1)-th frame is zero. To take into account not only the static difference between frames but also the difference in the motion expressed over several frames, it is preferable to sum the distances over a plurality of adjacent frames.
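Equation 4 is not reproduced in this text. As a hedged illustration, the sketch below assumes the form C_{i,j} = sum over k = 1..d of D_{i+k, j+k-1}. That form was chosen because it gives zero cost for the transition from i to i+1 and because it pairs exactly the frames that are blended during the cross-dissolve described in "4.1. Frame interpolation at transitions". The function name and the handling of clip boundaries are assumptions.

```python
def transition_cost(D, i, j, d=3):
    """Assumed cost of jumping from frame i to frame j with a d-frame cross-dissolve.

    D is the inter-frame distance matrix. Returns None when the frames that
    would be blended run past the end of the clip.
    """
    n = len(D)
    if i + d >= n or j + d - 1 >= n:
        return None
    # Frame i+k (old branch) is blended with frame j+k-1 (new branch) for k = 1..d.
    return sum(D[i + k][j + k - 1] for k in range(1, d + 1))
```

For j = i + 1 every term is a diagonal entry D_{m,m}, so the cost is zero, in agreement with the statement above.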

When the cost of a transition from one frame to another is determined using a plurality of frames (a frame set), it is preferable to take the order of the frames within the frame set into account as well. The reason is that if the similarity determined by a frame set is simply taken as the sum of the similarities of the individual frames, the moving image may become unnatural. For example, consider generating a composite moving image of a mouth opening from N frames (a frame set). If only the sum of the per-frame similarities is used, a set of N frames in which the mouth is closing might be selected, and a moving image played back in that order would look very unnatural. The order of the frames within the frame set may therefore be added as a parameter when calculating the transition cost.

For example, when a frame set is used to calculate the cost of a transition from frame i to frame j, the transition cost can be obtained as a linear combination of the similarities between frames in corresponding positions of two sets: a first set of consecutive frames containing frame i and a second set of consecutive frames containing frame j.

3.3. Transitions between standby states
Non-Patent Document 1 discloses two methods of selecting, from the transition cost matrix described above, the transitions that are actually used: selecting the minimum-cost transitions for each frame, and selecting the transitions whose cost is at or below a certain threshold. The embodiment of the present disclosure uses the former, but the latter may be used in order to select only the minimum necessary number of higher-quality transitions.

As described above, the moving image clip in the embodiment of the present disclosure contains standby sections and utterance sections in alternation. Therefore, if the Video Textures technique is applied as is, only transitions within the same section may occur. In the embodiment of the present disclosure, transitions within the same section (local transitions) are distinguished from transitions to a different section (global transitions), and a specified number of the lowest-cost transitions is selected for each type. Specifically, the following calculation is performed for each frame in the standby state.
(1) From the transitions to the same section, select the N_L transitions with the lowest cost.
(2) From the transitions to different sections, select the N_G transitions with the lowest cost.
However, the selection is made so that the destination sections do not overlap.

For example, N_L = 10 and N_G = 5 can be used. If there are frames that should not be used, for example because of mistakes during recording, the candidate frames may be selected from the set of frames that excludes them.
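The selection in steps (1) and (2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the representation of sections as integer ids, and the use of None to mark unusable frames are all assumptions.

```python
def select_candidates(cost_row, section_of, i, n_local=10, n_global=5):
    """Pick low-cost transition targets for standby frame i.

    cost_row[j] is the cost of the transition i -> j, or None if frame j
    must not be used (for example, a failed take).
    section_of[j] is the id of the section that contains frame j.
    Returns (local_targets, global_targets), each ordered by increasing cost.
    """
    ranked = sorted((c, j) for j, c in enumerate(cost_row) if c is not None)
    local, global_, used_sections = [], [], set()
    for _cost, j in ranked:
        if section_of[j] == section_of[i]:
            if len(local) < n_local:
                local.append(j)
        elif section_of[j] not in used_sections and len(global_) < n_global:
            # Keep at most one global target per destination section.
            global_.append(j)
            used_sections.add(section_of[j])
    return local, global_
```

Keeping at most one global target per section is how this sketch interprets the requirement that destination sections do not overlap.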

3.4. Transition to the utterance state
A transition from the standby state to the utterance state is triggered by an instruction from the dialogue engine. The transition does not have to occur immediately after the instruction; it is natural for the subject image to take about one second to "react". A time allowance may therefore be provided between the instruction and the actual transition. This increases the number of candidate transition timings, so that a more natural transition can be chosen. In the present embodiment, k denotes the maximum number of frames that may be spent from the time of the instruction until the first frame of the utterance state is reached. When a transition to the utterance state beginning at the j-th frame is instructed while the i-th frame is being displayed, the transition proceeds as shown below.

[Equation 5]

Here, a + b ≤ k. In the pre-calculation, the values of a and b that minimize C_{i+a,j-b} are obtained for each frame and each utterance section.
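The search for a and b can be illustrated with a short Python sketch. It assumes that playback continues normally from frame i to frame i+a, jumps to frame j-b, and then continues normally to frame j, which is one reading of the description since the equation itself is not reproduced here. The function name and the convention that C may contain None for unusable transitions are assumptions.

```python
def best_utterance_entry(C, i, j, k):
    """Find (cost, a, b) with a + b <= k that minimizes C[i + a][j - b].

    C is the transition cost matrix, i the frame being displayed, j the first
    frame of the requested utterance section, and k the frame budget.
    Returns None if no valid transition exists.
    """
    best = None
    for a in range(k + 1):
        for b in range(k + 1 - a):  # guarantees a + b <= k
            if i + a >= len(C) or j - b < 0:
                continue
            cost = C[i + a][j - b]
            if cost is None:
                continue
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best
```

Because the result depends only on i, j, and k, it can be tabulated for every standby frame and every utterance section ahead of time.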

In this way, the data structure used by the moving image generation system of the present embodiment can be prepared in advance from the information obtained by the pre-calculation. Such a data structure has a moving image clip, which is a sequence of a plurality of frames containing a subject image, and similarity information indicating the similarity between each frame in a group of frames selected from the plurality of frames and each of the other frames, the similarity being defined on the basis of feature quantities of the subject image in each frame.

4. Moving image generation
During moving image generation, the dialogue agent is rendered using the recorded data and the results of the pre-calculation. When the moving image generation system of the present embodiment is running, it keeps generating the standby-state composite moving image unless it receives an instruction from the dialogue engine.

In the standby mode, the moving image generation system accepts instructions to speak any of the registered lines. On receiving one, it selects the best transition to the requested line, makes the transition, and plays back the moving image. When playback of the utterance section ends, the system returns to the standby mode. Whether the transition is to a standby state or to an utterance state, frame interpolation makes the transition smoother.

4.1. Frame interpolation at transitions
Frame interpolation is performed when transitioning between non-adjacent frames. Possible interpolation methods include cross-dissolve, morphing, methods based on optical flow, and further developments of these.

With morphing, rendering can be done quickly in real time as long as a few feature lines are prepared for each frame. Cross-dissolve is the simplest method: it performs linear interpolation pixel by pixel. The present embodiment uses cross-dissolve for frame interpolation.

In the present embodiment, transitions between frames are performed as shown in FIG. 9. Consider the case where frame i is displayed at time t and a transition to frame j (j ≠ i+1) is made at time t+1. In this case, at time t+k, frame i+k and frame j+k-1 are blended in the ratio (d+1-k) : k and displayed. Here, d is the value defined in "3.2. Transition cost", and k satisfies 1 ≤ k ≤ d.
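As a worked illustration of this blending schedule, the Python sketch below lists, for each k from 1 to d, the two source frames and their weights, and blends a single RGB pixel linearly. The function names and the per-pixel tuple representation are assumptions made for the example.

```python
def cross_dissolve_schedule(i, j, d=3):
    """Return (k, old_frame, new_frame, old_weight, new_weight) for k = 1..d.

    At time t + k, frame i + k and frame j + k - 1 are mixed in the ratio
    (d + 1 - k) : k, so the weights are divided by d + 1 to sum to 1.
    """
    return [(k, i + k, j + k - 1, (d + 1 - k) / (d + 1), k / (d + 1))
            for k in range(1, d + 1)]

def blend(pixel_a, pixel_b, w_a, w_b):
    """Per-channel linear interpolation of two pixels."""
    return tuple(w_a * a + w_b * b for a, b in zip(pixel_a, pixel_b))
```

With d = 3, the weight of the old branch falls from 3/4 to 1/2 to 1/4 over three displayed frames, after which the new branch is shown unblended.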

4.2. Standby state
The standby state is divided into two kinds of section: sections in which a cross-dissolve transition is in progress and sections in which no transition is in progress. A "section in which no transition is in progress" includes, for example, the section in which the moving image clip is played back before a transition and the section in which it is played back after a transition. Suppose that playback is at the i-th frame in the clip playback section before a transition (time t), where t is expressed as a frame count at the same frame rate as the recorded data. The transition destination is then chosen from the N_L + N_G candidates selected in the pre-calculation so that the probability of transitioning to the j-th frame at time t+1 is given by the following equation.

[Equation 6]

The smaller σ_0 is, the more the probability is concentrated on low-cost transitions. Up to this point, the method is exactly the same as that disclosed in the non-patent document. When a frame j (j ≠ i+1) is selected, cross-dissolve processing is performed from time t+1 to t+d, and playback returns to the normal state at t+d+1. Because frame transitions occur frequently in the standby state, the audio captured during recording is not played back.

Equation 6 above is one example of a monotonically decreasing function relating to the similarity. Other monotonically decreasing functions may be used.
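Equation 6 is not reproduced in this text. The sketch below assumes an exponential form, with the probability proportional to exp(-C_{i,j} / σ), as in the Video Textures technique that this section says it follows. It is one example of a monotonically decreasing function of the cost, not necessarily the exact equation in the source. The function names are hypothetical.

```python
import math
import random

def transition_probabilities(costs, sigma):
    """Map candidate costs to probabilities proportional to exp(-cost / sigma)."""
    weights = [math.exp(-c / sigma) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def choose_next(candidates, costs, sigma, rng=random):
    """Draw one candidate frame according to the assumed distribution."""
    probs = transition_probabilities(costs, sigma)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

Lowering sigma sharpens the distribution toward the cheapest candidate, which matches the remark that a smaller σ_0 biases the choice toward low-cost transitions.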

4.3. Transition to the utterance state
When the moving image generation system of the present embodiment receives an instruction from the dialogue engine to speak a line, it switches from the standby mode to the utterance mode. Because the same line is contained in several moving image portions of the clip, a parameter σ_1 that adjusts the probability distribution is introduced, and the probability is calculated in the same way using Equation 6. The transition cost used here is the value described above. After the cross-dissolve transition, the utterance state is simply played back from start to end with audio. In the present embodiment, a standby state is recorded after each utterance state, so playback moves naturally into the standby state.

In some embodiments, an HTTP (Hypertext Transfer Protocol) server may be implemented in the moving image generation system. In that case, the utterance instruction can be given as an HTTP request.

4.4. Countermeasure against local repetition in the standby state
According to the present embodiment, the configuration described above can generate an infinite sequence in which the moving image as a whole does not repeat. Locally, however, a finite number of unnatural short repetitions may occur. For example, if the most probable transition from a certain frame goes back d frames, playback is likely to loop over that interval many times.

There are two ways to avoid this: removing such transitions in the pre-calculation, or avoiding them during moving image generation. The present embodiment avoids them during generation. That is, when the destination of a standby-state transition is chosen, any frame that has been visited K_L or more times within the past T_L frames is excluded from the candidates. For example, T_L can be set to the number of frames corresponding to 10 seconds, and K_L to 2. FIG. 10 schematically shows a case in which, when transitioning from, for example, the (N+X)-th frame (X is an integer of 2 or more) to the next frame, the candidate frames include a frame that has been displayed twice in the past 10 seconds. In this case, that frame (marked with an X in FIG. 10) is excluded from the candidate frames. However, if this would leave no transition candidates at all, the exclusion is not applied. For each frame, a history of the times of its past K_L visits is kept.
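A minimal Python sketch of this filter follows. The class and method names are hypothetical. Time is counted in frames, so a 10-second window at the 12 FPS used in the Example section would be 120 frames.

```python
from collections import deque

class VisitHistory:
    """Tracks the last max_visits display times of each frame."""

    def __init__(self, window_frames, max_visits=2):
        self.window = window_frames      # T_L, in frames
        self.max_visits = max_visits     # K_L
        self.visits = {}

    def record(self, frame, t):
        """Remember that `frame` was displayed at time t (a frame count)."""
        self.visits.setdefault(frame, deque(maxlen=self.max_visits)).append(t)

    def filter(self, candidates, t):
        """Drop candidates visited K_L or more times in the last T_L frames."""
        def overused(frame):
            recent = [v for v in self.visits.get(frame, ()) if t - v <= self.window]
            return len(recent) >= self.max_visits
        kept = [c for c in candidates if not overused(c)]
        # If every candidate would be excluded, skip the exclusion entirely.
        return kept or list(candidates)
```

Storing only the last K_L visit times per frame is sufficient, because a frame is excluded exactly when all K_L of its most recent visits fall inside the window.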

４．５動画全体の有効利用
4.5 Effective Use of the Entire Moving Image
事前計算で十分な数の大域遷移を用意すれば、確率的にはどの待機区間も再生される可能性がある。しかし、実際に再生してみると、いくつかの待機区間を往来し続け、使われる区間が極端に偏る場合がある。これを回避する方法も、事前計算と再生時の２通りが考えられるが、本実施形態では、後者を採用している。すなわち、「過去Ｔ_G回の大域遷移の中で、Ｋ_G回以上遷移先に選ばれた区間は、大域遷移先から除外する」というルールを適用する。たとえばＴ_G＝１０、Ｋ_G＝２が設定され得る。 If a sufficient number of global transitions are prepared in the pre-calculation, every standby section has, in terms of probability, a chance of being played. In actual playback, however, the system may keep moving back and forth among a few standby sections, so that the sections used become extremely biased. This could likewise be avoided either in the pre-calculation or at playback time; this embodiment adopts the latter. That is, the following rule is applied: "a section that has been selected as the destination K_G times or more among the past T_G global transitions is excluded from the global transition destinations." For example, T_G = 10 and K_G = 2 can be set.
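The global-transition rule can be illustrated with a short sketch. The function name `allowed_sections` and the use of string labels for standby sections are assumptions for the example. The text does not say what happens if every section is excluded, so the sketch simply returns an empty list in that case and leaves the decision to the caller.

```python
from collections import deque

T_G = 10  # number of past global transitions that are remembered
K_G = 2   # a section chosen this many times among them is excluded


def allowed_sections(sections, recent_destinations, k_g=K_G):
    """Return the sections still allowed as global transition destinations."""
    counts = {}
    for section in recent_destinations:
        counts[section] = counts.get(section, 0) + 1
    return [s for s in sections if counts.get(s, 0) < k_g]


# A bounded deque forgets destinations older than the last T_G global transitions.
recent = deque(maxlen=T_G)
for destination in ["A", "B", "A", "C"]:
    recent.append(destination)

remaining = allowed_sections(["A", "B", "C", "D"], recent)
```

With the history above, section "A" has been chosen twice among the remembered transitions, so `remaining` no longer contains it.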

＜実施例＞
<Example>
本実施例では、人間（俳優）を被写体としてＫｉｎｅｃｔ（登録商標）のカメラを用いた撮影を行った。１２８０×９６０ピクセルのＲＧＢデータを１２ＦＰＳで取得するとともに、近接モードで６４０×４８０ピクセルの深度データを３０ＦＰＳで取得した。録画時のプレビューや再生には、１２８０×９６０ピクセルのうち中央の６４０×４８０を切り抜いて用いた。これは、カメラと俳優の距離が小さいと、姿勢認識や顔認識が機能しないためである。 In this example, a human (an actor) was recorded as the subject using a Kinect (registered trademark) camera. RGB data of 1280 × 960 pixels was acquired at 12 FPS, and depth data of 640 × 480 pixels was acquired at 30 FPS in the proximity mode. For the preview during recording and for playback, the central 640 × 480 pixels of the 1280 × 960 frame were cropped and used. This is because posture recognition and face recognition do not work when the distance between the camera and the actor is small.

読み上げる台詞は「おはようございます」および「はじめまして」などの挨拶、デモンストレーション用の対話に用いる台詞など、合計６２個を用意し、あらかじめシステムに入力した。どの台詞も２～３回ずつ録画し、失敗も含めて合計１７０個の発話状態を録画した。各台詞の発話後はそれぞれ５～１０秒ほど待機状態として録画した。録画は、休憩を挟みながら３回に分けて行われた。時間はそれぞれ、７分４６秒、２５分４２秒、１０分２０秒で、合計４３分４８秒間であった。フレーム数は合計３１３１５フレームであった。 A total of 62 lines were prepared and entered into the system in advance, including greetings such as "Good morning" and "Nice to meet you", and lines used for demonstration dialogue. Each line was recorded 2-3 times, and a total of 170 utterances including failures were recorded. After each line was spoken, it was recorded as a standby state for about 5 to 10 seconds. The recording was divided into three times with a break in between. The time was 7 minutes 46 seconds, 25 minutes 42 seconds, 10 minutes 20 seconds, respectively, for a total of 43 minutes 48 seconds. The total number of frames was 31315 frames.
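As a quick consistency check on the figures above, the three session lengths can be added and compared with the total frame count. The sketch below uses only numbers stated in the text; the comparison with the 12 FPS RGB capture rate is an observation, not a claim made by the source.

```python
# Session lengths as (minutes, seconds), taken from the text.
sessions = [(7, 46), (25, 42), (10, 20)]

total_seconds = sum(m * 60 + s for m, s in sessions)
minutes, seconds = divmod(total_seconds, 60)

total_frames = 31315
average_fps = total_frames / total_seconds

print(f"total: {minutes} min {seconds} s, average rate: {average_fps:.2f} FPS")
```

The sum reproduces the stated total of 43 minutes 48 seconds, and the average rate comes out slightly below 12 FPS, which is in line with the RGB capture rate.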

各休憩後の再開時には、図５を参照しながら説明した方法で姿勢を決定した。姿勢データの奥行方向についてもディスプレイの画面に表示して俳優に見せたり、姿勢制御に実際の録画画像も用いたりすることにより、被写体像の姿勢変化を更に小さくすることができる。本実施例では、大きめのサイズで録画して中央部分を切り抜いてフレームを形成している。切り抜きの対象とする中央部分の範囲を、姿勢データに基づいて調整してもよい。 At the time of resumption after each break, the posture was determined by the method described with reference to FIG. By displaying the depth direction of the posture data on the screen of the display and showing it to the actor, or by using the actual recorded image for the posture control, the posture change of the subject image can be further reduced. In this embodiment, a frame is formed by recording with a large size and cutting out the central portion. The range of the central portion to be cut out may be adjusted based on the posture data.

本実施例のシステムにおけるプログラムは基本的にはＣ♯で書かれている。本実施例では、コスト行列がコンピュータ内のメモリに格納されておらず、他の記録装置として動作するハード・ディスク・ドライブ（ＨＤＤ）に格納されている。このため、ディスクアクセスの多い処理が増えている。事前計算を行ったコンピュータは、３．２ＧＨｚで動作する中央演算処理ユニット（ＣＰＵ）および８ＧＢのメモリを搭載し、７２００ｒｐｍ、５００ＧＢのＨＤＤに接続されている。画像の変換に５２分、巨大な単一ファイルである録画データをコピーして分解するのに８２分、後述の不要なフレームの除去作業に２分２０秒、コスト行列の計算と書き出しに６時間６分、最終的な遷移候補の選択に４分、合計すると約８時間３０分かかった。 The programs in the system of this example are basically written in C#. In this example, the cost matrix is stored not in the computer's memory but on a hard disk drive (HDD) operating as a separate recording device. As a result, much of the processing involves heavy disk access. The computer used for the pre-calculation has a central processing unit (CPU) operating at 3.2 GHz and 8 GB of memory, and is connected to a 7200 rpm, 500 GB HDD. Image conversion took 52 minutes; copying and splitting the recorded data, which is a single huge file, took 82 minutes; removing unnecessary frames, described below, took 2 minutes 20 seconds; calculating and writing out the cost matrix took 6 hours 6 minutes; and selecting the final transition candidates took 4 minutes, for a total of about 8 hours 30 minutes.

録画時、待機状態の録画中に俳優が話してしまうなどのミスがあったため、待機状態の録画クリップを構成する一部のフレームを遷移先の候補フレームから除外した。本実施例では、顔認識の上唇と下唇の特徴点が一定以上離れているフレームとその周辺のフレームを機械的に検出し、待機状態の約２１５５１フレーム中、４７０８フレームを取り除いた。 During recording, there was a mistake such as the actor talking while recording in the standby state, so some frames that make up the recording clip in the standby state were excluded from the candidate frames for the transition destination. In this embodiment, the frames in which the feature points of the upper lip and the lower lip of face recognition are separated by a certain distance or more and the frames around them are mechanically detected, and 4708 frames are removed from the approximately 21551 frames in the standby state.
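The mechanical detection described above can be sketched as a threshold on the lip distance followed by an expansion to neighbouring frames. The function name, the 2D point representation, and the `threshold` and `margin` values are assumptions for illustration; the text gives neither the threshold nor the number of surrounding frames that were removed.

```python
import math


def frames_to_exclude(upper_lip, lower_lip, threshold, margin):
    """Return indices of frames whose lips are apart, plus `margin` frames on each side.

    upper_lip and lower_lip are per-frame (x, y) positions of one feature point each.
    """
    n = len(upper_lip)
    excluded = set()
    for i in range(n):
        if math.dist(upper_lip[i], lower_lip[i]) >= threshold:
            lo = max(0, i - margin)
            hi = min(n - 1, i + margin)
            excluded.update(range(lo, hi + 1))
    return excluded


# Six frames; the mouth is open only in frame 2.
upper = [(0.0, 0.0)] * 6
lower = [(0.0, 1.0), (0.0, 1.0), (0.0, 5.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
removed = frames_to_exclude(upper, lower, threshold=3.0, margin=1)
```

In the example reported in the text, removing 4708 of the approximately 21551 standby frames leaves roughly 16843 frames as transition destinations.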

本実施例の再生システムは、Ｗｉｎｄｏｗｓ８．１（登録商標）がインストールされているタブレット（１．３３ＧＨｚで動作するＣＰＵ、２ＧＢのメモリ、および３２ＧＢのストレージメディアを備える）によって実現されている。本実施例では、人間を被写体とするビデオ映像のフレームをそのまま用いて合成動画を生成しているため、従来技術に比べると、被写体像が自然である。また、フレームの遷移も、特に注意していなければ気にならなった。 The playback system of this example is implemented on a tablet on which Windows 8.1 (registered trademark) is installed (with a CPU operating at 1.33 GHz, 2 GB of memory, and 32 GB of storage media). In this example, the synthetic moving image is generated by using the frames of video footage of a human subject as they are, so the subject image is more natural than with the conventional techniques. The frame transitions were also not noticeable unless the viewer paid particular attention to them.

変形例１
Modification 1
本開示の動画生成システムは、被写体の視線を追跡し、この視線に関する視線情報を出力する視線追跡装置をさらに備えていてもよい。この場合、コンピュータは、視線情報に基づいて特定される被写体の視線方向が近いほど小さく、被写体の視線方向が遠いほど大きくなる追加コストを算出する。このような追加コストを利用して前述の遷移コストを修正する。このような変形例によれば、被写体像の視線方向が不連続に変化するような合成画像の生成が防止され得る。
The moving image generation system of the present disclosure may further include a line-of-sight tracking device that tracks the line of sight of the subject and outputs line-of-sight information about it. In this case, the computer calculates an additional cost that is smaller the closer the subject's line-of-sight directions, as identified from the line-of-sight information, are in the frames being compared, and larger the farther apart they are. The transition cost described above is then corrected using this additional cost. This modification can prevent the generation of a synthetic image in which the line-of-sight direction of the subject image changes discontinuously.
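One way to picture the corrected transition cost is sketched below. The source only requires an additional cost that grows as the line-of-sight directions move apart; using the angle between two gaze vectors, adding it to the transition cost, and scaling it by a `weight` parameter are all assumptions made for this example.

```python
import math


def gaze_angle(g1, g2):
    """Angle in radians between two non-zero 3D gaze direction vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.acos(cos)


def corrected_cost(transition_cost, gaze_i, gaze_j, weight=1.0):
    """Transition cost plus an additional cost that increases with gaze difference."""
    return transition_cost + weight * gaze_angle(gaze_i, gaze_j)


same = corrected_cost(2.0, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
side = corrected_cost(2.0, (0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
```

A transition between frames with identical gaze keeps its original cost, while a transition to a frame in which the subject looks sideways becomes more expensive and is therefore less likely to be selected.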

変形例２
Modification 2
上記の実施形態では、被写体像の類似度の算出に特徴点の位置を用いているが、被写体の類似度は、この例に限定されない。被写体像は、あらかじめ用意された複数の基底画像の各々と、複数の係数の各々との積を線形的に結合して表現され得る。被写体が人の顔である場合、このような基底画像は「固有顔」と呼ばれることがある。基底画像の線形結合によって被写体像を表現するとき、被写体像の特徴量は、これらの複数の係数の組み合わせによって定義され得る。たとえば、ｉ番目のフレームに現れる「顔」を複数の顔画像の線形的な重ねあわせによって表現する場合、ｉ番目のフレームにおける顔を規定する重み係数のセットと、ｊ番目のフレームにおける顔を規定する重み係数のセットとの距離または差異によってフレーム間の「類似度」を定義しても良い。
In the above embodiment, the positions of feature points are used to calculate the similarity of the subject image, but the measure of similarity is not limited to this example. The subject image can be represented as a linear combination of the products of each of a plurality of base images prepared in advance and each of a plurality of coefficients. When the subject is a human face, such base images are sometimes called "eigenfaces". When the subject image is represented by a linear combination of base images, the feature amount of the subject image can be defined by the combination of these coefficients. For example, when the "face" appearing in the i-th frame is represented by a linear superposition of a plurality of face images, the "similarity" between frames may be defined by the distance or difference between the set of weighting coefficients that defines the face in the i-th frame and the set of weighting coefficients that defines the face in the j-th frame.
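The coefficient-based similarity can be sketched in a few lines. The function names, the flat-list image representation, and the choice of Euclidean distance are assumptions for illustration; the text allows any distance or difference between the coefficient sets.

```python
import math


def reconstruct(bases, coeffs):
    """Linear combination of base images; each base image is a flat list of pixel values."""
    return [sum(c * b[p] for c, b in zip(coeffs, bases)) for p in range(len(bases[0]))]


def coefficient_distance(coeffs_i, coeffs_j):
    """Euclidean distance between the coefficient sets of frame i and frame j."""
    if len(coeffs_i) != len(coeffs_j):
        raise ValueError("coefficient sets must have the same length")
    return math.dist(coeffs_i, coeffs_j)


# Two toy base images with three pixels each.
bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
image = reconstruct(bases, [2.0, 3.0])
distance = coefficient_distance([0.0, 0.0], [3.0, 4.0])
```

A smaller distance between coefficient sets means the two frames show more similar faces, so the distance can take the place of the feature-point distance when the transition cost is computed.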

変形例３
Modification 3
図１１は、動画表示システムの変形例を示す図である。この例における動画表示システムでは、動画生成システム１００は、生成した合成動画の一部または全部を、通信網３００を介して、複数のモバイル端末装置５００に配信することができる。
FIG. 11 is a diagram showing a modification of the moving image display system. In the moving image display system of this example, the moving image generation system 100 can distribute part or all of the generated synthetic moving image to a plurality of mobile terminal devices 500 via the communication network 300.

動画生成システム１００は、コンピュータ２０によって生成された合成動画のデータを送信するシステム側通信回路１２０を備えている。またモバイル端末装置５００は、動画生成システム１００から送信された合成動画のデータを受信する装置側通信回路５２０と、合成動画を表示するディスプレイ５５０とを備えている。 The moving image generation system 100 includes a system-side communication circuit 120 that transmits data of a synthetic moving image generated by the computer 20. Further, the mobile terminal device 500 includes a device-side communication circuit 520 that receives data of a synthetic moving image transmitted from the moving image generation system 100, and a display 550 that displays the synthetic moving image.

システム側通信回路１２０は、たとえば通信網３００を介して、外部から送信された指示信号を受信することが可能である。通信網３００の一部はインターネット回線であってもよい。指示信号は、たとえば不図示の対話エンジンにより、ユーザの音声入力などに応答して送出される。対話エンジンは、動画生成システム１００内に設けられていてもよい。システム側通信回路１２０が指示信号を受け取るまで、コンピュータ２０は合成動画のデータを生成する。指示信号を受信した後、コンピュータ２０は、指示信号に応じて、動画クリップの一部である部分クリップを選択する。システム側通信回路１２０は、通信網３００を介して、合成動画および部分クリップのデータを送信する。モバイル端末装置５００のディスプレイ５５０には、装置側通信回路５２０によって受け取られた合成動画および部分クリップが表示される。 The system-side communication circuit 120 can receive an instruction signal transmitted from the outside, for example, via the communication network 300. A part of the communication network 300 may be an internet line. The instruction signal is transmitted in response to a user's voice input or the like by, for example, a dialogue engine (not shown). The dialogue engine may be provided in the moving image generation system 100. The computer 20 generates synthetic moving image data until the system-side communication circuit 120 receives the instruction signal. After receiving the instruction signal, the computer 20 selects a partial clip that is a part of the moving image clip according to the instruction signal. The system-side communication circuit 120 transmits the composite moving image and the partial clip data via the communication network 300. The display 550 of the mobile terminal device 500 displays a synthetic moving image and a partial clip received by the device-side communication circuit 520.

変形例４
Modification 4
図１２は、本開示によるロボットシステムの例示的な実施形態を模式的に示す図である。この図に示されるロボットシステム４００は、前述した動画生成システム１００と、表情および姿勢の少なくとも一方を変化させるための複数のアクチュエータ（電動モータ）９２を備えたロボット９０とを備えている。ロボット９０は、合成動画における被写体像の表情および姿勢の少なくとも一方の変化に追従して複数のアクチュエータ９２を駆動することにより、ロボット９０の表情および姿勢の少なくとも一方を変化させる駆動回路９４を備えている。駆動回路９４は、合成動画の被写体像がたとえば口を開くとき、ロボット９０の口が開くように適切なアクチュエータ９２に電力を供給する。ロボット９０には、発話に用いられる不図示のスピーカが搭載されており、対話エンジン７０が選択した台詞が音声として発せられる。
FIG. 12 is a diagram schematically showing an exemplary embodiment of the robot system according to the present disclosure. The robot system 400 shown in this figure includes the moving image generation system 100 described above and a robot 90 provided with a plurality of actuators (electric motors) 92 for changing at least one of its facial expression and posture. The robot 90 includes a drive circuit 94 that changes at least one of the facial expression and posture of the robot 90 by driving the plurality of actuators 92 so as to follow changes in at least one of the facial expression and posture of the subject image in the synthetic moving image. For example, when the subject image in the synthetic moving image opens its mouth, the drive circuit 94 supplies power to the appropriate actuators 92 so that the mouth of the robot 90 opens. The robot 90 is equipped with a speaker (not shown) used for utterance, and the line selected by the dialogue engine 70 is output as speech.

このようなロボットシステム４００によれば、ディスプレイに表示される映像ではなく、立体的な構造を有するロボット９０がユーザと対話する。ロボット９０は、人間型である必要はなく、たとえば動物型であってもよい。 According to such a robot system 400, the robot 90 having a three-dimensional structure interacts with the user instead of the image displayed on the display. The robot 90 does not have to be humanoid, and may be animal, for example.

フローチャート
Flowchart
図１３は、事前に行う処理手順の一例を示すフローチャートである。 FIG. 13 is a flowchart showing an example of a processing procedure performed in advance.

この例では、まず、ステップＳ１０において、待機区間および発話区間を含む動画クリップのデータが記録装置に格納される。この記録装置は、典型的には、動画生成システムが備える記録装置とは異なる記録装置であり得る。ステップＳ１２において、動画クリップを構成する多数のフレームのうち、不要なフレームを除いたフレーム群が選択される。この選択は、典型的には、人間が不要なフレームを選択して指定することによって行われ得る。次に、ステップＳ１４において、事前計算に用いるコンピュータが前述したコスト行列を計算する。計算結果は、図８を参照して説明したようにテーブルの形式で記録装置に格納される。 In this example, first, in step S10, data of a moving image clip including standby sections and utterance sections is stored in a recording device. This recording device may typically be different from the recording device included in the moving image generation system. In step S12, a frame group that excludes unnecessary frames is selected from the many frames constituting the moving image clip. This selection can typically be made by a person selecting and designating the unnecessary frames. Next, in step S14, the computer used for the pre-calculation calculates the cost matrix described above. The calculation result is stored in the recording device in the form of a table, as described with reference to FIG. 8.

こうして、本開示の動画生成システムおよび動画表示システムで利用される動画クリップと、コスト行列とを有するデータが用意される。 In this way, data having a moving image clip used in the moving image generation system and the moving image display system of the present disclosure and a cost matrix are prepared.

次に、図１４を参照する。図１４は、動画生成および動画表示の処理手順の一例を示すフローチャートである。 Next, refer to FIG. 14. FIG. 14 is a flowchart showing an example of the processing procedure for moving image generation and moving image display.

この例では、処理の開始後、ステップＳ２０において、入力装置がユーザからの入力を受けたか否かが判定される。入力がない場合、ステップＳ３０に進み、待機モードの動作を実行する。待機モードでは、あらかじめ用意されて記録装置に格納されているコスト行列を参照し、現在の表示フレームから遷移する次のフレームを含む複数の候補フレームを選択する。ステップＳ３２において、候補フレームから１つのフレームを前述した原理に従って確率的に決定する。ステップＳ３４において、決定した次のフレームをディスプレイに表示させる。この後、ステップＳ２０に進む。こうして、確率的なフレーム間遷移が実現する。 In this example, after the start of the process, in step S20, it is determined whether or not the input device has received the input from the user. If there is no input, the process proceeds to step S30 to execute the operation in the standby mode. In the standby mode, the cost matrix prepared in advance and stored in the recording device is referred to, and a plurality of candidate frames including the next frame transitioning from the current display frame are selected. In step S32, one frame is stochastically determined from the candidate frames according to the above-mentioned principle. In step S34, the next frame determined is displayed on the display. After that, the process proceeds to step S20. In this way, a probabilistic inter-frame transition is realized.

ステップＳ２０において、入力装置がユーザからの入力を受けたとされたとき、ステップＳ４０に進み、発話モードの動作を実行する。ステップＳ４０において、入力に応じた発話区間を選択する。複数の発話区間がある場合、いずれかの発話区間における先頭フレームが待機モードの動作と同様の動作により、確率的に選択される。ステップＳ４２において、選択された先頭フレームがディスプレイに表示される。ステップＳ４２において、この先頭フレームを含む発話区間の動画が再生される。この動画は、動画クリップにおけるフレームシーケンスの通りにフレームが遷移する。発話内容に即した動画の再生が完了すると、ステップＳ２０に戻る。 When it is determined that the input device has received an input from the user in step S20, the process proceeds to step S40 to execute the operation of the utterance mode. In step S40, the utterance section corresponding to the input is selected. When there are a plurality of utterance sections, the first frame in any of the utterance sections is stochastically selected by the same operation as the operation in the standby mode. In step S42, the selected first frame is displayed on the display. In step S42, the moving image of the utterance section including the first frame is played. In this moving image, the frames change according to the frame sequence in the moving image clip. When the reproduction of the moving image according to the utterance content is completed, the process returns to step S20.
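The control flow of FIG. 14 can be sketched as a simple loop. Everything named here is hypothetical: `poll_input`, `candidates_for`, `utterance_frames`, and `show` stand in for the input device, the cost-matrix lookup, the utterance-section lookup, and the display. The weighting `exp(-cost / sigma)` is only one possible probability distribution that decreases with the transition cost, chosen for illustration, and the stochastic choice of the first utterance frame is omitted for brevity.

```python
import math
import random


def choose_next(candidates, costs, sigma=1.0, rng=random):
    """Pick one candidate frame; lower transition cost means higher probability."""
    weights = [math.exp(-costs[f] / sigma) for f in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def run(steps, poll_input, candidates_for, utterance_frames, show, start_frame=0, rng=random):
    """Run `steps` iterations of the standby/utterance loop and return the last frame shown."""
    current = start_frame
    for _ in range(steps):
        request = poll_input()                                  # S20: was there user input?
        if request is None:
            costs = candidates_for(current)                     # S30: candidate frame -> cost
            current = choose_next(list(costs), costs, rng=rng)  # S32: stochastic choice
            show(current)                                       # S34: display the frame
        else:
            for frame in utterance_frames(request):             # S40: select utterance section
                show(frame)                                     # S42: play in recorded order
                current = frame
    return current
```

In the standby branch every displayed frame is the result of one stochastic transition, whereas in the utterance branch the frames follow the order of the recorded clip, as stated in the text.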

図１５は、コンピュータ２０の内部構成例を示す図である。コンピュータ２０は、ＣＰＵ２２と、メモリ２４と、ＧＰＵ２６と、インタフェース（Ｉ／Ｆ）端子２８とを備えており、これらはバスを介して相互に通信可能に接続されている。 FIG. 15 is a diagram showing an example of the internal configuration of the computer 20. The computer 20 includes a CPU 22, a memory 24, a GPU 26, and an interface (I / F) terminal 28, which are connected to each other via a bus so as to be communicable with each other.

図１５には、コンピュータ２０に接続された記録装置１０も示されている。記録装置１０は、たとえばソリッド・ステート・ドライブ（ＳＳＤ）またはＨＤＤである。 FIG. 15 also shows a recording device 10 connected to the computer 20. The recording device 10 is, for example, a solid state drive (SSD) or an HDD.

ＣＰＵ２２は、一つのチップ上に集積された演算回路である。メモリ２４は、プロセッサ２２による処理を規定する指令から構成されたコンピュータブログラムを格納する、ＲＯＭおよび／またはＲＡＭである。コンピュータブログラムは、上述した処理（たとえば図１４）を実現するための命令群である。コンピュータブログラムの実行時には、ＲＯＭに格納されていたコンピュータブログラムがＲＡＭに展開される。ＣＰＵ２２はコンピュータブログラムに含まれる命令をＲＡＭから逐次読み出し、解釈して実行する。 The CPU 22 is an arithmetic circuit integrated on one chip. The memory 24 is a ROM and / or RAM that stores a computer program composed of instructions defining processing by the processor 22. A computer program is a group of instructions for realizing the above-mentioned processing (for example, FIG. 14). When the computer program is executed, the computer program stored in the ROM is expanded in the RAM. The CPU 22 sequentially reads the instructions included in the computer program from the RAM, interprets them, and executes them.

ＧＰＵ２６は、いわゆるグラフィックス・プロセッシング・ユニットと呼ばれる画像処理回路である。ＧＰＵ２６は、上述した動画を生成する際の画像処理、または動画を表示する際の画像処理を行う。ＧＰＵ２６は、処理の結果得られた画像をディスプレイ５０（図３）に表示するための処理を行ってもよい。 The GPU 26 is an image processing circuit called a so-called graphics processing unit. The GPU 26 performs image processing when generating the above-mentioned moving image or image processing when displaying the moving image. The GPU 26 may perform a process for displaying the image obtained as a result of the process on the display 50 (FIG. 3).

Ｉ／Ｆ端子２８は、コンピュータ２０が、コンピュータ２０の外部に設けられた機器との間で情報の授受を行う接続端子である。図３に示す例で説明する。コンピュータ２０の外部に、記録装置１０、ディスプレイ５０および対話エンジン７０が設けられているとする。コンピュータ２０と記録装置１０との接続に関しては、Ｉ／Ｆ端子２８は、たとえばシリアルＡＴＡ規格の接続端子であり得る。コンピュータ２０とディスプレイ５０との接続に関しては、Ｉ／Ｆ端子２８は、たとえばＤｉｓｐｌａｙＰｏｒｔ（登録商標）、ＨＤＭＩ（登録商標）またはＤＶＩの映像接続端子であり得る。また、コンピュータ２０と対話エンジン７０との接続に関しては、Ｉ／Ｆ端子２８は、たとえばＥｔｈｅｒｎｅｔ（登録商標）規格またはＩＥＥＥ１３９４規格の有線通信を行うことが可能な通信端子、またはＷｉ－ｆｉ（登録商標）規格等の無線通信を行うことが可能な通信回路である。 The I/F terminal 28 is a connection terminal through which the computer 20 exchanges information with devices provided outside the computer 20. This is described using the example shown in FIG. 3. Assume that the recording device 10, the display 50, and the dialogue engine 70 are provided outside the computer 20. For the connection between the computer 20 and the recording device 10, the I/F terminal 28 may be, for example, a connection terminal conforming to the Serial ATA standard. For the connection between the computer 20 and the display 50, the I/F terminal 28 may be, for example, a DisplayPort (registered trademark), HDMI (registered trademark), or DVI video connection terminal. For the connection between the computer 20 and the dialogue engine 70, the I/F terminal 28 is, for example, a communication terminal capable of wired communication conforming to the Ethernet (registered trademark) standard or the IEEE 1394 standard, or a communication circuit capable of wireless communication conforming to the Wi-Fi (registered trademark) standard or the like.

本開示の動画生成システムおよび動画表示システムは、たとえばコンピュータを用いて動作する来客対応システムおよび予約受付システムなどにおける擬人化対話エージェントの動画生成および表示に利用され得る。また、本開示の動画生成システムは、ロボットが合成動画における被写体像の表情および姿勢の少なくとも一方の変化に追従して動作するシステムにも利用され得る。 The moving image generation system and the moving image display system of the present disclosure can be used for moving image generation and display of an anthropomorphic dialogue agent in, for example, a visitor response system and a reservation reception system operated by using a computer. Further, the moving image generation system of the present disclosure can also be used in a system in which a robot operates by following a change in at least one of a facial expression and a posture of a subject image in a synthetic moving image.

１０…記録装置、２０…コンピュータ、５０…ディスプレイ、６０…インタフェース装置、７０…対話エンジン、８０…撮影装置、９０…ロボット、９２…アクチュエータ、９４…駆動回路、１００…動画生成システム、１２０…システム側通信回路、２００…動画表示システム、３００…通信網、４００…ロボットシステム、５００…端末装置、５２０…装置側通信回路、５５０…ディスプレイ 10 ... recording device, 20 ... computer, 50 ... display, 60 ... interface device, 70 ... dialogue engine, 80 ... shooting device, 90 ... robot, 92 ... actuator, 94 ... drive circuit, 100 ... video generation system, 120 ... system Side communication circuit, 200 ... Movie display system, 300 ... Communication network, 400 ... Robot system, 500 ... Terminal device, 520 ... Device side communication circuit, 550 ... Display

Claims

A moving image generation system comprising:
a recording device that records a moving image clip that is a sequence of a plurality of frames each including a subject image; and
a computer that generates data of a synthetic moving image by reconstructing the plurality of frames based on the similarity between each frame included in a frame group selected from the plurality of frames and each of the other frames,
wherein the similarity is defined based on a feature amount of the subject image in each frame,
the computer determines, as the (N+1)th frame following the Nth frame (N is a positive integer) of the synthetic moving image, one frame from among a plurality of candidate frames selected from the frame group based on the similarity with the Nth frame, and
the computer selects the (N+1)th frame according to a probability distribution that includes, as a parameter, a transition cost for transitioning from the Nth frame to each of the candidate frames.

A moving image generation system comprising:
a recording device that records a moving image clip that is a sequence of a plurality of frames each including a subject image; and
a computer that generates data of a synthetic moving image by reconstructing the plurality of frames based on the similarity between each frame included in a frame group selected from the plurality of frames and each of the other frames,
wherein the similarity is defined based on a feature amount of the subject image in each frame,
the computer determines, as the (N+1)th frame following the Nth frame (N is a positive integer) of the synthetic moving image, one frame from among a plurality of candidate frames selected from the frame group based on the similarity with the Nth frame,
the plurality of candidate frames are selected according to a transition cost calculated based on the similarity,
the system further comprises a line-of-sight tracking device that tracks the line of sight of the subject image and outputs line-of-sight information about the line of sight, and
the computer calculates an additional cost that is smaller the closer, and larger the farther apart, the line-of-sight directions of the subject image identified from the line-of-sight information are, and corrects the transition cost using the additional cost.

The moving image generation system according to claim 1 or 2, wherein the computer determines one frame from a plurality of candidate frames selected from the frame group based on the similarity.

The moving image generation system according to claim 1 or 2, wherein the computer stochastically determines one frame from a plurality of candidate frames selected from the frame group based on a monotonically decreasing function of the similarity.

The moving image generation system according to any one of claims 1 to 4, wherein the feature amount of the subject image is the position of a plurality of feature points included in the subject image.

The moving image generation system according to claim 5, wherein the feature amount of the subject image further includes at least one of the luminance and the color of the plurality of feature points.

The moving image generation system according to claim 5 or 6, wherein the similarity between any first frame included in the frame group and a second frame, other than the first frame, included in the frame group is defined by the positional relationship between each feature point of the subject image in the first frame and the corresponding feature point in the second frame.

The moving image generation system according to claim 5 or 6, wherein the similarity between any first frame included in the frame group and a second frame, other than the first frame, included in the frame group is a function of the distance between the position of each feature point of the subject image in the first frame and the position of the corresponding feature point in the second frame.

The moving image generation system according to claim 7 or 8, wherein the transition cost for transitioning from the first frame to the second frame is a linear combination of the similarities between frames that correspond in order in a first set of a predetermined number of consecutive frames including the first frame in the frame group and a second set of the predetermined number of consecutive frames including the second frame in the frame group.

The moving image generation system according to any one of claims 1 to 9, wherein the frame group includes a plurality of interpolation frames prepared in advance in addition to a frame selected from the plurality of frames.

The moving image generation system according to any one of claims 1 to 10, wherein the computer displays the synthesized moving image on a display.

The moving image generation system according to claim 11, wherein the computer performs cross-dissolving according to the degree of similarity between the Nth frame and the N + 1th frame.

The moving image generation system according to any one of claims 1 to 12, wherein at least a part of the subject image includes at least one of a human face and a body.

The subject image is represented by linearly combining the products of each of a plurality of base images prepared in advance and each of a plurality of coefficients.
The moving image generation system according to any one of claims 1 to 4, wherein the feature amount is defined by a combination of the plurality of coefficients.

A moving image generation system comprising:
a recording device that records a moving image clip that is a sequence of a plurality of frames each including a subject image; and
a computer that generates data of a synthetic moving image by reconstructing the plurality of frames based on the similarity between each frame included in a frame group selected from the plurality of frames and each of the other frames,
wherein the computer selects, as the (N+1)th frame following the Nth frame (N is a positive integer) of the synthetic moving image, one frame from among a plurality of candidate frames selected from the frame group based on the similarity with the Nth frame, after excluding some frames from the candidate frames.

The moving image generation system according to claim 15, wherein the computer selects some of the frames according to the order of the frames constituting the synthesized moving image and excludes them from the candidate frames.

The moving image generation system according to claim 16, wherein, when there is a frame that is included a predetermined number K_L of times or more (K_L is an integer of 1 or more) among a predetermined number T_L of frames (T_L is an integer of 2 or more) preceding the Nth frame, the computer selects that frame as one of the some frames and excludes it from the candidate frames.

The moving image generation system according to any one of claims 15 to 17, wherein the computer determines one frame from a plurality of candidate frames selected from the frame group based on the similarity.

13. Video generation system.

The moving image generation system according to any one of claims 1 to 19, further comprising a display for displaying the synthesized moving image.

The moving image generation system according to claim 20, further comprising an interface device that accepts input from a user,
wherein the computer performs, in response to the user's input, one of generating the data of the synthetic moving image and selecting a partial clip that is a part of the moving image clip, and
the display displays the generated synthetic moving image or the selected partial clip.

The moving image generation system according to claim 21, further comprising a housing that accommodates the recording device, the computer, the display, and the interface device.

A moving image display system comprising:
the moving image generation system according to any one of claims 1 to 19; and
a terminal device,
wherein the moving image generation system further includes a system-side communication circuit that transmits the data of the synthetic moving image generated by the computer, and
the terminal device includes:
a device-side communication circuit that receives the data of the synthetic moving image transmitted from the moving image generation system; and
a display that displays the synthetic moving image.

The moving image display system according to claim 23, wherein
the system-side communication circuit is capable of receiving an instruction signal transmitted from outside,
the computer generates the data of the synthetic moving image until the instruction signal is received,
after the instruction signal is received, the computer selects, in response to the instruction signal, a partial clip that is a part of the moving image clip,
the system-side communication circuit transmits the data of the synthetic moving image and the partial clip, and
the display of the terminal device displays the synthetic moving image and the partial clip received by the device-side communication circuit.

A robot system comprising:
the moving image generation system according to any one of claims 1 to 12 and 15 to 19;
a robot provided with a plurality of actuators for changing at least one of a facial expression and a posture; and
a drive circuit that changes at least one of the facial expression and the posture of the robot by driving the plurality of actuators so as to follow a change in at least one of the facial expression and the posture of the subject image in the synthetic moving image.