JP2016070891A

JP2016070891A - Video data processor and video data processing program

Info

Publication number: JP2016070891A
Application number: JP2014203430A
Authority: JP
Inventors: 康輔高橋; Kosuke Takahashi; 志織杉本; Shiori Sugimoto; 豊國田; Yutaka Kunida; 明小島; Akira Kojima
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-10-01
Filing date: 2014-10-01
Publication date: 2016-05-09

Abstract

PROBLEM TO BE SOLVED: To provide a video data processor capable of robustly estimating the positions and orientations of cameras, or the mutual positional relationships between the cameras, from video data and sensor data, and of visualizing them.
SOLUTION: A video data processor includes: data input means for inputting a plurality of sets of video data of an imaged video and sensor data from which the position and orientation of the imaging means that imaged the video can be estimated; position and orientation estimation means for estimating the position and orientation of each of a plurality of imaging means from the plurality of sensor data; position and orientation diagram creation means for creating and outputting, from the estimated positions and orientations, a position and orientation diagram indicating the position and orientation of each of the plurality of imaging means; display means for displaying the position and orientation diagram and the video data on a screen; and video selection means for selecting the video data of the imaging means selected on the position and orientation diagram and outputting that video data to the display means.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a video data processing apparatus and a video data processing program.

With the spread of sensor-equipped camera devices such as smartphones, videos of events such as live shows and street performances shot by audience members are increasingly shared on the web. As a result, anyone can freely view a huge number of videos of an event shot from various viewpoints. Multi-viewpoint viewing services have also appeared that group these videos by event so that they can be viewed more comfortably. Many of these services display the captured videos as a set of thumbnail images (or thumbnail videos), and the user selects a video from an arbitrary viewpoint among them and views it.

However, for a user to select a video from a desired viewpoint out of such a thumbnail group, the user must estimate, one by one, the viewpoint from which each thumbnail was shot. When the target video group runs to tens or hundreds of videos, quickly selecting the desired viewpoint is therefore difficult. To address this problem, attempts have been made to have the system perform "visualization of the positional relationship and the position and orientation of viewpoints", that is, to show which video was shot from which position and in what orientation, or how the cameras are positioned relative to one another.

This is considered an effective solution because it spares the user the effort of estimating viewpoints from the videos. The positional relationship mentioned here includes a network topology in which each camera is regarded as a node, such as which cameras are adjacent to each other and on which side they lie. The position means the position, in the world coordinate system, of the origin of the camera's coordinate system, and is represented by a translation vector. The orientation means the direction of each axis of the camera's coordinate system, and is represented by a rotation matrix. The position and orientation refers to this translation vector and rotation matrix together.

Meanwhile, methods for estimating the position and orientation of each camera device from video have been proposed, as in Non-Patent Document 1.

Non-Patent Document 1: Noah Snavely, Steven M. Seitz, Richard Szeliski, "Photo Tourism: Exploring Photo Collections in 3D", SIGGRAPH Conference Proceedings, 2006, ISBN 1-59593-364-6, pp. 835-846, ACM Press

However, these methods are subject to the constraint that the shooting areas of the camera devices must overlap. Arbitrarily shot videos are unlikely to have overlapping areas in every frame. In addition, the video may be degraded by blur (subject blurring) or encoding noise. In these situations, the feature point correspondences needed for position and orientation estimation often cannot be obtained. In other words, for arbitrarily shot videos, frames to which this position and orientation estimation method cannot be applied make up the majority.

The present invention has been made in view of these circumstances, and its object is to provide a video data processing apparatus and a video data processing program capable of robustly estimating the positions and orientations of cameras, or their mutual positional relationships, from video data and sensor data.

The present invention is characterized by comprising: data input means for inputting a plurality of sets of video data of a captured video and sensor data from which the position and orientation of the imaging means that captured the video can be estimated; position and orientation estimation means for estimating the position and orientation of each of the plurality of imaging means from the plurality of sensor data; position and orientation diagram creation means for creating and outputting, from the estimated positions and orientations, a position and orientation diagram indicating the position and orientation of each of the plurality of imaging means; display means for displaying the position and orientation diagram and the video data on a screen; and video selection means for selecting the video data of the imaging means selected on the position and orientation diagram and outputting it to the display means.

The present invention is characterized by further comprising relative position and orientation estimation means for estimating the relative position and orientation of the imaging means from the video data, wherein, when a mismatch arises between the relative position and orientation and the position and orientation of the imaging means estimated from the sensor data, the position and orientation estimation means corrects the position and orientation of the imaging means estimated from the sensor data.

The present invention is characterized by further comprising a data processing unit that performs processing for synchronizing the plurality of video data, and the plurality of sensor data, input by the data input means.

The present invention is characterized in that the data processing unit removes noise from the sensor data before performing the synchronization processing.

The present invention is characterized in that the data processing unit interpolates sensor data values for times at which no sensor data was acquired before performing the synchronization processing.

The present invention is a video data processing program for causing a computer to function as the video data processing apparatus.

According to the present invention, the position and orientation of each camera, or the mutual positional relationship of the cameras, can be robustly estimated using the video data obtained by the cameras and the corresponding sensor data.

FIG. 1 is a block diagram showing the configuration of a video data processing apparatus 1 according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the video data processing apparatus shown in FIG. 1.
FIG. 3 is a conceptual diagram of a method of estimating the position and orientation of data acquisition devices using video data and sensor data.
FIG. 4 is a flowchart showing the processing operation of estimating the position and orientation between data acquisition devices using video data and sensor data.
FIG. 5 is a block diagram showing the detailed configuration of the display unit 15 shown in FIG. 1.
FIG. 6 is a flowchart showing the operation of the display unit 15 shown in FIG. 5.
FIG. 7 is a diagram showing an example of a window that presents the video displayed on the display device and a window that presents the positional relationship (bird's-eye view) of the data acquisition devices Di.

Hereinafter, a video data processing apparatus according to an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the video data processing apparatus 1 according to the embodiment. As shown in FIG. 1, the video data processing apparatus 1 includes a data input unit 11, a data storage unit 12, a data processing unit 13, a position and orientation estimation unit 14, and a display unit 15.

The data input unit 11 inputs the video data to be processed and the sensor data recorded when the video was captured. The data storage unit 12 stores the input video data and sensor data. The data processing unit 13 applies preprocessing such as noise removal, interpolation, and synchronization to the video data and sensor data. The position and orientation estimation unit 14 uses the video data and sensor data to estimate the position and orientation of the data acquisition device that acquired them. Here, a data acquisition device is a device that houses, in a single casing, an image sensor that acquires video data and various sensors such as a gyro sensor and an acceleration sensor. Specific examples include smartphones and tablet terminals. The display unit 15 displays the video together with a schematic diagram showing the estimated positions and orientations of the data acquisition devices.

Next, the processing operation of the video data processing apparatus 1 shown in FIG. 1 will be described with reference to FIG. 2. FIG. 2 is a flowchart showing the operation of the video data processing apparatus shown in FIG. 1. In the present embodiment, the video data and sensor data record the same scene with a plurality of data acquisition devices. Video refers to data comprising an image sequence and audio acquired by a camera and a microphone. Sensor data refers to data, acquired by sensors other than the camera such as an acceleration sensor, a gyro sensor, and a geomagnetic sensor, from which the position and orientation of the camera can be estimated. In the present embodiment, each video corresponds one-to-one to a set of sensor data; the two are recorded by the same data acquisition device, and their positions and orientations are assumed to coincide exactly.

First, the data input unit 11 inputs the video data and sensor data acquired by the plurality of data acquisition devices and stores them in the data storage unit 12 (step S1). Here, the video data and sensor data of all data acquisition devices at all times are input and stored, but the data may instead be input, stored, and processed a few frames at a time. The data may also be split into groups of several data acquisition devices.

When the video data and sensor data have been stored, the data processing unit 13 applies noise removal processing and interpolation processing in the time direction to the sensor data of each data acquisition device (step S2). The sensor data obtained is a discrete record of time stamps and the sensor values at those times. In general, the values recorded by a sensor often contain noise, so a noise removal filter is applied to these values to obtain values with less noise. Any processing may be used as the noise removal filter. As common methods, a low-pass filter may be applied to remove spike-like fluctuations, or a moving average filter may be applied for smoothing. If noise in the sensor data is acceptable, the noise removal processing may be omitted.
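As an illustration of the moving average filter mentioned above, the following is a minimal sketch and not the embodiment's actual filter. The function name `moving_average` and the `window` parameter are assumptions introduced here; the source leaves the filter unspecified.

```python
def moving_average(values, window=3):
    """Smooth a list of sensor readings with a centered moving average.

    `window` is a hypothetical parameter (a positive odd integer). Near the
    edges only the available samples are averaged, so the output has the
    same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


readings = [0.0, 0.0, 9.0, 0.0, 0.0]  # a single spike at index 2
print(moving_average(readings, window=3))
```

A wider window suppresses spikes more strongly but also blurs genuine fast motion, so the window size would have to be tuned to the sensor's sampling rate.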

The rate at which the sensor data is recorded does not match the frame rate of the video and audio. Therefore, the sensor data value at the time of each video frame is generated from the recorded sensor data values by interpolation (step S3). Any interpolation method may be used; common methods include linear interpolation and spline interpolation. If the sensor data rate matches the frame rate of the video and audio, if the rates differ but sensor data was recorded at each video frame time, or if the times differ but the discrepancy is acceptable, the sensor data at the nearest time may be used as is without interpolation.
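The linear interpolation option can be sketched as follows. This is an illustrative example under stated assumptions: the names `interpolate_at` and `resample_to_frames` are hypothetical, time stamps are assumed sorted in ascending order, and frame times outside the recorded range are clamped to the nearest sample, a choice the source does not address.

```python
from bisect import bisect_left


def interpolate_at(timestamps, values, t):
    """Linearly interpolate a scalar sensor value at time t.

    Assumes `timestamps` is sorted ascending and has the same length as
    `values`. Times outside the recorded range return the nearest sample.
    """
    if not timestamps or len(timestamps) != len(values):
        raise ValueError("timestamps and values must be non-empty and equal in length")
    if t <= timestamps[0]:
        return values[0]
    if t >= timestamps[-1]:
        return values[-1]
    k = bisect_left(timestamps, t)  # timestamps[k-1] < t <= timestamps[k]
    t0, t1 = timestamps[k - 1], timestamps[k]
    w = (t - t0) / (t1 - t0)
    return values[k - 1] + w * (values[k] - values[k - 1])


def resample_to_frames(timestamps, values, frame_times):
    """Return one interpolated sensor value per video frame time."""
    return [interpolate_at(timestamps, values, t) for t in frame_times]


print(resample_to_frames([0.0, 1.0, 2.0], [0.0, 10.0, 20.0], [0.5, 1.0, 2.5]))
```

Each sensor axis would be resampled separately with the same frame times.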

Next, the data processing unit 13 synchronizes time between the data acquisition devices that acquired the video data and sensor data (step S4). This step may be skipped if the data was recorded in synchronization beforehand or has already been synchronized. Any synchronization method may be used. One common method assumes that audio and images are recorded in synchronization within each video and synchronizes the videos using their audio data (see, for example, Reference 1).
Reference 1: Nils Hasler, Bodo Rosenhahn, Thorsten Thormählen, Michael Wand, Juergen Gall, Hans-Peter Seidel, "Markerless Motion Capture with Unsynchronized Moving Cameras", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009, pp. 224-231, ISSN 1063-6919

There are also methods that perform synchronization by image processing, for example when a strobe (flash) fires during recording. Alternatively, synchronization may be performed manually by inspecting the image data of the videos.
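One simple way to realize audio-based synchronization is to search for the sample offset that maximizes the cross-correlation of two audio tracks. The sketch below illustrates that general idea only; it is not the method of Reference 1, and the function name `best_lag` and the `max_lag` parameter are assumptions introduced here.

```python
def best_lag(a, b, max_lag):
    """Return the lag (in samples) that maximizes sum(a[n] * b[n + lag]).

    A positive result means a shared sound appears `lag` samples later in
    track `b` than in track `a`. This is a brute-force search meant for
    short illustrative signals, not for full-length recordings.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(a):
            m = n + lag
            if 0 <= m < len(b):
                score += x * b[m]
        if score > best_score:
            best, best_score = lag, score
    return best


track_a = [0, 0, 1, 2, 1, 0, 0, 0]
track_b = [0, 0, 0, 0, 1, 2, 1, 0]  # same pulse, delayed by two samples
print(best_lag(track_a, track_b, max_lag=4))
```

Dividing the resulting lag by the audio sampling rate gives the time offset to apply to one device's time stamps.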

Next, the position and orientation estimation unit 14 obtains, from the video data and sensor data, the position and orientation of each data acquisition device in the world coordinate system at all times (step S5). The position and orientation in the world coordinate system here means the position and orientation when the system made up of all the data acquisition devices is taken as the world coordinate system, and it need not correspond to physical quantities. Any method may be used to obtain the position and orientation.

As a common method, when the videos have overlapping areas, the relative positions and orientations among a plurality of cameras can be estimated using Structure from Motion (see, for example, Non-Patent Document 1). If, for each camera, the relative position and orientation with respect to at least one other camera can be estimated, and the relative position of every camera with respect to the others is thereby known at least indirectly, then taking any one camera as the reference determines the positions and orientations of all cameras in a world coordinate system centered on the reference camera.

However, arbitrarily shot videos do not necessarily have overlapping areas at all times. When the videos have no overlapping area, the position and orientation are therefore obtained using the sensor data. For a single data acquisition device, the change in orientation from a given time can be obtained using the gyro sensor. Alternatively, a geomagnetic sensor and an acceleration sensor can always provide the orientation in the East-North-Up coordinate system, so these may be used instead.

By using these sensors to estimate the transition of the position and orientation for frames in which it cannot be estimated from the video, the position and orientation of the data acquisition device can be obtained at all times. However, such sensor data tends to accumulate error, so estimating the position and orientation from sensor data alone over a long period may significantly reduce the estimation accuracy. It is therefore desirable to reset the accumulated sensor error periodically. For example, when the position and orientation can be estimated from the video in a certain frame, the error can be reset by making the position and orientation obtained from the sensor data agree with it. As for the position of the data acquisition device, a GPS sensor may be used if an error on the order of tens of meters is acceptable.

Next, the display unit 15 presents the obtained positions and orientations of the data acquisition devices as a schematic diagram (position and orientation diagram) (step S6). The display unit 15 also presents the video of the selected data acquisition device (step S7). These presentations are the output of the video data processing apparatus 1.

Next, the method of estimating the position and orientation of the data acquisition devices (the processing of step S5) will be described with reference to FIGS. 3 and 4. FIG. 3 is a conceptual diagram of the method of estimating the position and orientation of the data acquisition devices using video data and sensor data. D0, D1, D2, and D3 denote the data acquisition devices (for the case of four devices), and D0 denotes the reference data acquisition device. The symbol t denotes time. A dotted line between devices Di (i = 0, 1, ...) or along the t direction indicates that the corresponding relative position and orientation has not yet been estimated, and a solid line indicates that the estimation is complete. The goal of the position and orientation estimation unit is to use the video data and sensor data to turn every dotted line between the devices Di into a solid line.

FIG. 4 is a flowchart showing the processing operation of estimating the position and orientation between data acquisition devices using video data and sensor data. First, the data acquisition device that serves as the reference D0 is determined (step S11). D0 may be determined by any method. For example, it may be chosen manually, or the data acquisition device whose displacement over time was smallest according to the sensor data may be selected automatically.
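The automatic selection of D0 can be illustrated with a short sketch. It assumes a hypothetical motion measure, the sum of absolute per-sample sensor magnitudes for each device; the source does not define how displacement is measured, and the name `pick_reference` is introduced here for illustration.

```python
def pick_reference(motion_by_device):
    """Return the id of the device with the smallest total motion.

    `motion_by_device` maps a device id to a list of per-sample motion
    magnitudes (for example gyro magnitudes). This measure is an
    assumption made for the example.
    """
    if not motion_by_device:
        raise ValueError("at least one device is required")
    return min(
        motion_by_device,
        key=lambda device: sum(abs(v) for v in motion_by_device[device]),
    )


motion = {
    "Da": [0.2, 0.3, 0.1],
    "Db": [0.01, 0.02, 0.01],  # nearly stationary, e.g. on a tripod
    "Dc": [1.0, 0.8, 1.2],
}
print(pick_reference(motion))
```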

Next, the relative position and orientation of each Di (i = 1, 2, ...) with respect to the reference D0 is obtained from the video (step S12). Structure from Motion may be used to obtain the relative position and orientation from the video. However, Structure from Motion relies on feature point correspondences between videos, so when the videos have no overlapping area, or when correct correspondences cannot be obtained because of blur, encoding noise, or the like, the estimation accuracy may drop significantly or estimation may fail altogether. This corresponds to the solid line between D0 and Di not being obtained at some time in FIG. 3.

By performing the processing of step S12 for all times (step S13), the relative position and orientation between D0 and Di is obtained at some of the times. In FIG. 3, this means that some of the dotted lines between D0 and Di have become solid lines. Even when the relative position and orientation with respect to D0 cannot be obtained directly, as for D2 in FIG. 3, it can be obtained by way of D1 if the relative position and orientation between D2 and D1 is known and that of D1 with respect to D0 has been obtained directly. Data acquisition devices whose relative position and orientation with respect to D0 cannot be obtained by way of any other device are excluded in the present embodiment.
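Obtaining a pose by way of an intermediate device amounts to composing relative transforms along a path from D0. The sketch below illustrates this with a breadth-first search; it is an example under stated assumptions, not the embodiment's implementation. A pose of b in a's frame is represented as a pair (R, t) with p_a = R p_b + t, and all function names are hypothetical.

```python
from collections import deque


def matmul(A, B):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(A, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def compose(pose_ab, pose_bc):
    """Pose of c in a's frame from the poses of b in a and of c in b."""
    R_ab, t_ab = pose_ab
    R_bc, t_bc = pose_bc
    return matmul(R_ab, R_bc), [x + y for x, y in zip(matvec(R_ab, t_bc), t_ab)]


def invert(pose_ab):
    """Pose of a in b's frame, using the transpose as the rotation inverse."""
    R, t = pose_ab
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return Rt, [-x for x in matvec(Rt, t)]


def poses_relative_to(reference, edges):
    """Chain pairwise poses into poses relative to `reference`.

    `edges` maps (a, b) to the pose of b in a's frame. Devices that cannot
    be reached from the reference are omitted from the result, mirroring
    the exclusion described in the text.
    """
    graph = {}
    for (a, b), pose in edges.items():
        graph.setdefault(a, []).append((b, pose))
        graph.setdefault(b, []).append((a, invert(pose)))
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    result = {reference: (identity, [0.0, 0.0, 0.0])}
    queue = deque([reference])
    while queue:
        a = queue.popleft()
        for b, pose_ab in graph.get(a, []):
            if b not in result:
                result[b] = compose(result[a], pose_ab)
                queue.append(b)
    return result


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edges = {("D0", "D1"): (I3, [1.0, 0.0, 0.0]), ("D1", "D2"): (I3, [0.0, 2.0, 0.0])}
print(poses_relative_to("D0", edges)["D2"][1])
```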

Next, the change in position and orientation in the time direction is obtained for each Di (step S14). This can be estimated using the video data or the sensor data, and any estimation method may be used. When video is used, Structure from Motion may be applied in the time direction. When sensor data is used, a gyro sensor or the like may be used. A gyro sensor records the change per unit time about each axis of the sensor's coordinate system. Multiplying these values by the actual elapsed time and accumulating the results gives the change in orientation relative to a given time. Similarly, an acceleration sensor gives the acceleration along each axis per unit time. Multiplying these values by the actual elapsed time and accumulating the results gives the change in position relative to a given time. By performing this processing for all times and all Di (step S15), the change in position and orientation of each Di in the time direction is obtained. In FIG. 3, this means that the dotted lines in the t direction have become solid lines.
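The multiply-by-time-and-accumulate step for the gyro sensor can be sketched as below. This is a simplified illustration: the name `integrate_rates` is hypothetical, each reading is assumed to hold until the next sample, and the axes are accumulated independently, which ignores the fact that 3D rotations do not commute and is therefore only reasonable for small rotations.

```python
def integrate_rates(timestamps, rates):
    """Accumulate per-axis rates into a change relative to timestamps[0].

    `rates[k]` is the (x, y, z) reading at `timestamps[k]`, assumed constant
    until the next sample. Returns one cumulative (x, y, z) tuple per time
    stamp, starting at (0, 0, 0).
    """
    total = [0.0, 0.0, 0.0]
    history = [tuple(total)]
    for k in range(1, len(timestamps)):
        dt = timestamps[k] - timestamps[k - 1]
        for axis in range(3):
            total[axis] += rates[k - 1][axis] * dt
        history.append(tuple(total))
    return history


times = [0.0, 0.5, 1.0]
gyro = [(2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(integrate_rates(times, gyro))
```

Because every step adds the sensor's measurement error to the running total, the result drifts over time, which is why the periodic reset against video-derived estimates matters.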

Note that the position and orientation obtained from these sensor data are expressed in the sensor's coordinate system, which does not necessarily coincide with the camera's coordinate system. However, in a system such as a smartphone, where the camera and the sensors are mounted in the same casing, their relative position is fixed, and the offset between them is negligible, the change in position and orientation given by the sensor data may be regarded as the change in position and orientation of the camera.

Next, mismatches in position and orientation are corrected (step S16). As shown in FIG. 3, when the positions and orientations of D0 and D1 have been obtained from the video at times ti and tj, the position and orientation obtained from the video at time tj does not necessarily agree with the value estimated by applying the sensor data to the position and orientation obtained from the video at time ti. This is largely due to the accumulated error of the sensors. These values do not have to be made to agree, but doing so makes it possible to reset the accumulated sensor error. Any method may be used to make them agree; for example, the following method may be used.

Let the angle representation of the orientation obtained from the video at time t be (xm(t), ym(t), zm(t)), and let the angle representation of the relative orientation in the time direction obtained from the sensor data between time t0 and time t1 be (xs(t0, t1), ys(t0, t1), zs(t0, t1)). Suppose that the orientation (xm(ti), ym(ti), zm(ti)) is obtained from the video at time ti and that the orientation (xm(tj), ym(tj), zm(tj)) is then obtained from the video at time tj (> ti). To make (xm(tj), ym(tj), zm(tj)) agree with (xm(ti), ym(ti), zm(ti)) + (xs(ti, tj), ys(ti, tj), zs(ti, tj)), the values (xs(ti, t), ys(ti, t), zs(ti, t)) at each time t are multiplied by the following coefficients:
ax = (xm(tj) - xm(ti)) / xs(ti, tj)
ay = (ym(tj) - ym(ti)) / ys(ti, tj)
az = (zm(tj) - zm(ti)) / zs(ti, tj)

Finally, the values (xs(ti, t), ys(ti, t), zs(ti, t)) multiplied by these coefficients are converted into a rotation matrix. The same correction method can also be applied to the position. This processing is performed for all mismatches (step S17).
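The per-axis scaling described above can be checked with a small numeric sketch. It assumes angles are plain per-axis numbers, and it leaves an axis unscaled when the sensor reports no change on it, a case the source does not address. The function names are hypothetical.

```python
def correction_coefficients(m_i, m_j, s_ij):
    """Per-axis coefficient a = (m_j - m_i) / s_ij.

    `m_i` and `m_j` are video-derived angles at times ti and tj, and `s_ij`
    is the sensor-derived change between them. When the sensor change on an
    axis is zero, the coefficient defaults to 1.0 (an assumption made here).
    """
    coeffs = []
    for mi, mj, s in zip(m_i, m_j, s_ij):
        coeffs.append(1.0 if s == 0 else (mj - mi) / s)
    return coeffs


def corrected_angles(m_i, s_it, coeffs):
    """Video-derived angles at ti plus the scaled sensor change up to time t."""
    return [mi + a * s for mi, a, s in zip(m_i, coeffs, s_it)]


m_i = [10.0, 0.0, 5.0]   # from video at ti
m_j = [20.0, 2.0, 5.0]   # from video at tj
s_ij = [8.0, 4.0, 0.0]   # sensor change from ti to tj (has drifted)

a = correction_coefficients(m_i, m_j, s_ij)
print(a)
print(corrected_angles(m_i, s_ij, a))             # agrees with m_j at tj
print(corrected_angles(m_i, [4.0, 2.0, 0.0], a))  # an intermediate time
```

With the coefficients applied, the sensor-based estimate at tj coincides with the video-based estimate, and intermediate times are rescaled proportionally.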

Next, the relative position and orientation of each Di with respect to D0 at each time is obtained (step S18). At this point, the position and orientation of each Di relative to D0 is known at one or more times, and the change in position and orientation within each Di over time is also known. Therefore, even if the position and orientation between D0 and Di has not been obtained from the video at a certain time, it suffices to go back to a time at which it was obtained from the video and, starting from that position and orientation, add the changes in position and orientation for each subsequent time. This processing is performed for all times and all Di (step S19). This completes the processing operation for obtaining the position and orientation.
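The propagation from the most recent video-derived estimate can be sketched as follows. This is an illustration under stated assumptions: poses are simplified to additive 3-component lists, time is a sample index, and times before the first video-derived estimate are left as `None`, a case the source does not discuss. The name `fill_from_anchors` is hypothetical.

```python
def fill_from_anchors(anchors, deltas):
    """Fill in a pose for every time index.

    `anchors` maps a time index to a pose obtained from the video.
    `deltas[k]` is the sensor-derived change from index k-1 to index k
    (`deltas[0]` is unused). At anchor indices the video pose is used as is;
    elsewhere the change is added to the previous pose.
    """
    poses = []
    current = None
    for k in range(len(deltas)):
        if k in anchors:
            current = list(anchors[k])
        elif current is not None:
            current = [c + d for c, d in zip(current, deltas[k])]
        poses.append(None if current is None else list(current))
    return poses


anchors = {1: [10.0, 0.0, 0.0], 3: [20.0, 0.0, 0.0]}
deltas = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
print(fill_from_anchors(anchors, deltas))
```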

Next, the processing operation in which the display unit 15 presents the obtained position and orientation diagram of the data acquisition devices and the video as the output of the multi-view video viewing system (the processing of steps S6 and S7) will be described with reference to FIGS. 5, 6, and 7. First, the detailed configuration of the display unit 15 shown in FIG. 1 will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the detailed configuration of the display unit 15 shown in FIG. 1. As shown in FIG. 5, the display unit 15 includes a display unit data input unit 151, a position and orientation diagram creation unit 152, a presentation video determination unit 153, and an external input reception unit 154. The display unit data input unit 151 receives the video acquired by the data acquisition devices (the output of the data processing unit 13) and their positions and orientations (the output of the position and orientation estimation unit 14). The position and orientation diagram creation unit 152 creates a schematic diagram representing the positions and orientations of the data acquisition devices. The presentation video determination unit 153 determines the video to be presented. The external input reception unit 154 accepts the user's operations as external input.

Next, the operation of the display unit 15 shown in FIG. 5 will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the operation of the display unit 15 shown in FIG. 5. First, the display unit data input unit 151 receives the video acquired by the data acquisition devices and the position and orientation of each data acquisition device Di (step S21). Subsequently, the position and orientation diagram creation unit 152 generates, from the input position and orientation data, a position and orientation diagram (schematic diagram) showing where each data acquisition device is located at each time (step S22). The diagram may be drawn in any manner. For example, a bird's-eye view of the data acquisition devices is created as shown in FIG. 7(b). FIG. 7 shows an example of a window presenting the video to be displayed on the display device (FIG. 7(a)) and a window presenting the positional relationship of the data acquisition devices Di (bird's-eye view, FIG. 7(b)).
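
One possible way to derive a bird's-eye marker for a device from its estimated rotation matrix and translation vector is sketched below. The description leaves the drawing method open, so everything here is an assumption: the pose is taken to map world to camera coordinates as x_cam = R·x_world + t, the optical axis is the camera's +Z axis, the ground plane is the world X-Z plane, and `birds_eye_marker` is a hypothetical helper name.

```python
import math

def birds_eye_marker(R, t):
    """Return (x, z, heading) of a camera projected onto the X-Z ground plane.

    R: 3x3 rotation matrix as nested lists, t: translation (3 values),
    with the assumed convention x_cam = R @ x_world + t.
    """
    # Camera centre in world coordinates: C = -R^T t.
    center = [-sum(R[r][c] * t[r] for r in range(3)) for c in range(3)]
    # Optical axis in world coordinates: R^T e_z, i.e. the third row of R.
    direction = R[2]
    # Heading measured from the world +Z axis toward the world +X axis.
    heading = math.atan2(direction[0], direction[2])
    return center[0], center[2], heading
```

Each device can then be drawn at (x, z) with an arrow or wedge rotated by `heading`, which is one way to obtain a diagram of the kind shown in FIG. 7(b).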

Next, the video of the selected data acquisition device (hereinafter also referred to as a viewpoint) and the position and orientation diagram are presented (step S23). They may be presented in any manner. For example, two windows are presented as shown in FIG. 7, one showing the video and the other showing the position and orientation diagram. When no viewpoint has been selected (for example, in the initial state), a viewpoint may be assigned automatically.

Next, the external input reception unit 154 accepts a viewpoint selection (step S24). The viewpoint may be selected either through external input by a user operation or automatically, and either approach may be used in this embodiment. In the automatic case, the selected viewpoint may be changed after a fixed time has elapsed. Alternatively, a tracking technique may be used to select the viewpoint automatically so that a particular person always remains in view. For external input, the desired camera may be selected on the position and orientation diagram using an external interface such as a keyboard or a mouse. It is then determined whether a viewpoint selection has been input (step S25); if so, the process returns to step S23 and is repeated. If no viewpoint selection has been input and an end condition has been input, the processing operation of the system ends (step S26).
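
The loop of steps S23 to S26 can be illustrated with a small event-driven sketch. It is not taken from the embodiment: the event tuples `("select", (x, y))` and `("quit",)`, the rule of picking the marker nearest to the clicked point, and the function names are hypothetical choices made only for illustration.

```python
import math

def nearest_camera(markers, click_xy):
    """Return the id of the camera whose diagram marker is closest to the click."""
    cx, cy = click_xy
    return min(
        markers,
        key=lambda cam: math.hypot(markers[cam][0] - cx, markers[cam][1] - cy),
    )

def run_viewer(markers, events, initial=None):
    """Return the sequence of viewpoints presented while processing events.

    markers: dict of camera id -> (x, y) position on the diagram.
    events:  iterable of ("select", (x, y)) or ("quit",) tuples.
    """
    # With no viewpoint selected, one is assigned automatically.
    current = initial if initial is not None else next(iter(markers))
    shown = [current]                      # step S23: present the viewpoint
    for event in events:                   # step S24: accept input
        if event[0] == "select":           # step S25: a selection was input
            current = nearest_camera(markers, event[1])
            shown.append(current)          # back to step S23
        elif event[0] == "quit":           # step S26: end condition
            break
    return shown
```

Automatic selection, such as switching after a fixed time or following a tracked person, could be modelled by generating `"select"` events from a timer or a tracker instead of from the mouse.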

In this way, it is possible to build a multi-view video viewing system that robustly estimates the position and orientation using the video and sensor data acquired by a plurality of data acquisition devices and provides a more comfortable viewing experience.

In the embodiment described above, a position and orientation consisting of a rotation matrix and a translation vector is estimated for each camera and used for visualization. However, the relationship between the cameras may instead be described in another way, and that relationship may be estimated. For example, a graph structure with each camera as a node may be constructed, the physical distance between cameras may be assigned as the distance between nodes, and that physical distance may be estimated. The display device may visualize the graph and realize viewpoint switching by, for example, designating a node position on the graph or moving to an adjacent node along an edge.
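
The graph-based variant can be sketched as follows. This is a hypothetical illustration: the description does not say how edges are chosen or how the physical distance is estimated, so here the distance is simply computed from assumed camera positions, cameras closer than a threshold `max_edge` are connected, and switching moves to the nearest adjacent node.

```python
import math
from itertools import combinations

def build_camera_graph(positions, max_edge):
    """Build an undirected graph: node = camera, edge weight = physical distance.

    positions: dict of camera id -> coordinate tuple (assumed to be known).
    max_edge:  cameras farther apart than this are not connected (assumption).
    """
    graph = {cam: {} for cam in positions}
    for a, b in combinations(positions, 2):
        distance = math.dist(positions[a], positions[b])
        if distance <= max_edge:
            graph[a][b] = distance
            graph[b][a] = distance
    return graph

def step_to_nearest_neighbor(graph, current):
    """Move along the shortest edge from the current node; stay put if isolated."""
    neighbors = graph[current]
    if not neighbors:
        return current
    return min(neighbors, key=neighbors.get)
```

A viewer could call `step_to_nearest_neighbor` when the user requests the adjacent viewpoint, or look up a node directly when the user designates it on the visualized graph.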

As described above, the aim is to realize a multi-view video viewing system that robustly estimates the position and orientation of each camera, or the positional relationship among the cameras, by using the video from camera devices such as smartphones together with data from the sensors attached to those devices, and that visualizes the result.

According to the present invention, the position and orientation of each camera, or the positional relationship among the cameras, can be estimated robustly using the video data obtained by the cameras and the corresponding sensor data. The present invention can also provide a multi-view video viewing system that visualizes these positions and orientations and uses them for viewing. This allows the user to quickly select the video of a desired viewpoint from among a large number of videos.

The video data processing apparatus in the embodiment described above may be realized by a computer. In that case, a program for realizing its functions may be recorded on a computer-readable recording medium, and the program recorded on this recording medium may be read into a computer system and executed. Here, the "computer system" includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" may also include a medium that holds a program dynamically for a short time, such as a communication line used when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds a program for a certain period of time, such as a volatile memory inside a computer system serving as a server or a client in that case. The program may realize only part of the functions described above, may realize those functions in combination with a program already recorded in the computer system, or may be realized using hardware such as a PLD (Programmable Logic Device) or an FPGA (Field Programmable Gate Array).

Although an embodiment of the present invention has been described above with reference to the drawings, the embodiment is merely an example of the present invention, and it is clear that the present invention is not limited to it. Therefore, components may be added, omitted, replaced, or otherwise modified without departing from the technical idea and scope of the present invention.

The present invention is applicable to uses in which it is essential to robustly estimate the position and orientation of each camera, or the positional relationship among the cameras, from video data and sensor data and to visualize them.

DESCRIPTION OF SYMBOLS
1 ... video data processing apparatus, 11 ... data input unit, 12 ... data storage unit, 13 ... data processing unit, 14 ... position and orientation estimation unit, 15 ... display unit, 151 ... display unit data input unit, 152 ... position and orientation diagram creation unit, 153 ... presentation video determination unit, 154 ... external input reception unit

Claims

1. A video data processing apparatus comprising:
data input means for inputting a plurality of sets of video data of captured video and sensor data from which the position and orientation of the imaging means that captured the video can be estimated;
position and orientation estimation means for estimating the position and orientation of each of a plurality of the imaging means from the plurality of sensor data;
position and orientation diagram creation means for creating and outputting, from the estimated position and orientation, a position and orientation diagram indicating the position and orientation of each of the plurality of imaging means;
display means for displaying the position and orientation diagram and the video data on a screen; and
video selection means for selecting the video data of the imaging means selected on the position and orientation diagram and outputting the selected video data to the display means.

2. The video data processing apparatus according to claim 1, further comprising relative position and orientation estimation means for estimating a relative position and orientation of the imaging means from the video data,
wherein, when a mismatch occurs between the relative position and orientation and the position and orientation of the imaging means estimated based on the sensor data, the position and orientation estimation means corrects the position and orientation of the imaging means estimated based on the sensor data.

3. The video data processing apparatus according to claim 1, further comprising a data processing unit configured to perform a process of synchronizing a plurality of the video data input by the data input unit and a plurality of the sensor data.

4. The video data processing apparatus according to claim 3, wherein the data processing unit performs noise removal of the sensor data before performing the synchronization process.

5. The video data processing apparatus according to claim 3, wherein the data processing unit interpolates values for times at which the sensor data could not be acquired before performing the synchronization process.

6. A video data processing program for causing a computer to function as the video data processing apparatus according to any one of claims 1 to 5.