JP2022553846A

JP2022553846A - Generating Arbitrary Views

Info

Publication number: JP2022553846A
Application number: JP2022525979A
Authority: JP
Inventors: チュイ・クラレンス; パーマー・マヌ; アディシシャ・アモー・スバクリシュナ; グプタ・ハーシュル; ウプルリ・アヴィナッシュ・ヴェンカタ
Original assignee: Outward Inc
Current assignee: Outward Inc
Priority date: 2019-11-08
Filing date: 2020-11-06
Publication date: 2022-12-26
Also published as: KR20220074959A; JP2022553845A; EP4055525A1; EP4055525A4; WO2021092455A1; EP4055524A4; KR20220078651A; WO2021092454A1; JP7410289B2; EP4055524A1

Abstract

[Solution] A machine learning based image processing and generation framework is disclosed. In some embodiments, depth values of an object or asset in a received input image are determined at least in part using a machine learning based framework that is constrained to a known, predetermined environment. The determined depth values facilitate generating other views of the object or asset. [Selected drawing] FIG. 10

Description

CROSS REFERENCE TO OTHER APPLICATIONS
This application is a continuation-in-part of U.S. Patent Application No. 16/523,888, filed July 26, 2019, entitled "ARBITRARY VIEW GENERATION," which is a continuation-in-part of U.S. Patent Application No. 16/181,607, filed November 6, 2018, entitled "ARBITRARY VIEW GENERATION," which is a continuation of U.S. Patent Application No. 15/721,426 (now U.S. Patent No. 10,163,250), filed September 29, 2017, entitled "ARBITRARY VIEW GENERATION," which claims priority to U.S. Provisional Patent Application No. 62/541,607, filed August 4, 2017, entitled "FAST RENDERING OF ASSEMBLED SCENES," and is a continuation-in-part of U.S. Patent Application No. 15/081,553 (now U.S. Patent No. 9,996,914), filed March 25, 2016, entitled "ARBITRARY VIEW GENERATION," all of which are incorporated herein by reference for all purposes.

This application claims priority to U.S. Provisional Patent Application No. 62/933,258, filed November 8, 2019, entitled "FAST RENDERING OF IMAGE SEQUENCES FOR PRODUCT VISUALIZATION," and to U.S. Provisional Patent Application No. 62/933,261, filed November 8, 2019, entitled "SYSTEM AND METHOD FOR ACQUIRING IMAGES FOR SPACE PLANNING APPLICATIONS," both of which are incorporated herein by reference for all purposes.

Existing rendering techniques face a trade-off between the competing objectives of quality and speed. High-quality rendering requires significant processing resources and time. However, slow rendering techniques are unacceptable in many applications, such as interactive, real-time applications. In general, lower-quality but faster rendering techniques are favored for such applications. For example, rasterization is commonly employed by real-time graphics applications for relatively fast rendering, at the expense of quality. Thus, improved techniques that do not significantly compromise either quality or speed are needed.

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

A high-level block diagram illustrating an embodiment of a system for generating an arbitrary view of a scene.

A diagram illustrating an example of a database asset.

A flowchart illustrating an embodiment of a process for generating an arbitrary perspective.

A flowchart illustrating an embodiment of a process for generating reference images or views of an asset from which an arbitrary view of the asset may be generated.

A flowchart illustrating an embodiment of a process for providing a requested view of a scene.

A high-level block diagram illustrating an embodiment of a machine learning based image processing framework for learning attributes associated with image datasets.

A flowchart illustrating an embodiment of a process for populating a database with images associated with an asset that can be used to generate other arbitrary views of the asset.

A flowchart illustrating an embodiment of a process for generating an image or frame.

A high-level flowchart illustrating an embodiment of a process for generating an arbitrary or novel view or perspective of an object or asset.

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or as a specific component that is manufactured to perform the task. As used herein, the term "processor" refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Techniques for generating an arbitrary view of a scene are disclosed. The examples described herein involve very low processing or computational overhead while still providing a high-definition output, effectively eliminating the challenging trade-off between rendering speed and quality. The disclosed techniques are especially useful for very quickly generating a high-quality output with respect to interactive, real-time graphics applications. Such applications rely on substantially immediately presenting a preferably high-quality output in response to and in accordance with user manipulations of a presented interactive view or scene.

FIG. 1 is a high-level block diagram illustrating an embodiment of a system 100 for generating an arbitrary view of a scene. As depicted, arbitrary view generator 102 receives a request for an arbitrary view as input 104, generates the requested view based on existing database assets 106, and provides the generated view as output 108 in response to the input request. In various embodiments, arbitrary view generator 102 may comprise a processor such as a central processing unit (CPU) or a graphics processing unit (GPU). The depicted configuration of system 100 in FIG. 1 is provided for the purposes of explanation. Generally, system 100 may comprise any other appropriate number and/or configuration of interconnected components that provide the described functionality. For example, in other embodiments, arbitrary view generator 102 may comprise a different configuration of internal components 110-116, arbitrary view generator 102 may comprise a plurality of parallel physical and/or virtual processors, database 106 may comprise a plurality of networked databases or a cloud of assets, and so on.

Arbitrary view request 104 comprises a request for an arbitrary perspective of a scene. In some embodiments, the requested perspective of the scene does not already exist in assets database 106, which includes other perspectives or viewpoints of the scene. In various embodiments, arbitrary view request 104 may be received from a process or a user. For example, input 104 may be received from a user interface in response to user manipulation of a presented scene or a portion thereof, such as user manipulation of the camera viewpoint of the presented scene. As another example, arbitrary view request 104 may be received in response to a specification of a path of movement or travel within a virtual environment, such as a fly-through of a scene. In some embodiments, the possible arbitrary views of a scene that may be requested are at least in part constrained. For example, a user may not be able to manipulate the camera viewpoint of a presented interactive scene to any random position but rather is constrained to certain positions or perspectives of the scene.

Database 106 stores a plurality of views of each stored asset. In the given context, an asset refers to a specific scene whose specification is stored in database 106 as a plurality of views. In various embodiments, a scene may comprise a single object, a plurality of objects, or a rich virtual environment. Specifically, database 106 stores a plurality of images corresponding to different perspectives or viewpoints of each asset. The images stored in database 106 comprise high-quality photographs or photorealistic renderings. Such high-definition, high-resolution images that populate database 106 may be captured or rendered during offline processes or obtained from external sources. In some embodiments, corresponding camera characteristics are stored with each image stored in database 106. That is, camera attributes such as relative location or position, orientation, rotation, depth information, focal length, aperture, and zoom level are stored with each image. Furthermore, camera optical information such as shutter speed and exposure may also be stored with each image stored in database 106.
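As an illustration only, the per-image record described above might be sketched as follows. The class names, field names, and units (`CameraInfo`, `StoredView`, `AssetDatabase`, millimeters, seconds) are assumptions made for this sketch and do not appear in the specification, which does not prescribe any storage layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CameraInfo:
    """Hypothetical container for the camera attributes stored with an image."""
    position: Tuple[float, float, float]         # relative location of the camera
    orientation_deg: Tuple[float, float, float]  # orientation / rotation angles
    focal_length_mm: float
    aperture_f: float
    zoom: float = 1.0
    shutter_s: float = 1.0 / 60.0                # optical information (optional)
    exposure_ev: float = 0.0


@dataclass
class StoredView:
    """One existing view of an asset: an image plus its camera information."""
    image_path: str
    camera: CameraInfo


@dataclass
class AssetDatabase:
    """In-memory stand-in for the asset database; maps asset id to its views."""
    views: Dict[str, List[StoredView]] = field(default_factory=dict)

    def add_view(self, asset_id: str, view: StoredView) -> None:
        self.views.setdefault(asset_id, []).append(view)

    def views_of(self, asset_id: str) -> List[StoredView]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self.views.get(asset_id, []))
```

Per-pixel depth information, which the specification also associates with each image, is omitted here for brevity; in a sketch like this it could be carried as an additional field on `StoredView`.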

In various embodiments, any number of different perspectives of an asset may be stored in database 106. FIG. 2 illustrates an example of a database asset. In the given example, seventy-three views corresponding to different angles around a chair object are captured or rendered and stored in database 106. The views may be captured, for example, by rotating a camera around the chair or by rotating the chair in front of a camera. Relative object and camera location and orientation information is stored with each generated image. FIG. 2 specifically illustrates views of a scene comprising a single object. Database 106 may also store a specification of a scene comprising a plurality of objects or a rich virtual environment. In such cases, multiple views corresponding to different locations or positions in the scene or three-dimensional space are captured or rendered and stored along with corresponding camera information in database 106. Generally, images stored in database 106 may be two-dimensional or three-dimensional and may comprise stills or frames of an animation or video sequence.

In response to a request 104 for an arbitrary view of a scene that does not already exist in database 106, arbitrary view generator 102 generates the requested arbitrary view from a plurality of other existing views of the scene stored in database 106. In the example configuration of FIG. 1, asset management engine 110 of arbitrary view generator 102 manages database 106. For example, asset management engine 110 may facilitate storage and retrieval of data in database 106. In response to request 104 for an arbitrary view of a scene, asset management engine 110 identifies and obtains a plurality of other existing views of the scene from database 106. In some embodiments, asset management engine 110 retrieves all existing views of the scene from database 106. Alternatively, asset management engine 110 may select and retrieve a subset of the existing views, e.g., the views that are closest to the requested arbitrary view. In such cases, asset management engine 110 is configured to intelligently select a subset of existing views from which pixels may be harvested to generate the requested arbitrary view. In various embodiments, multiple existing views may be retrieved by asset management engine 110 together or as and when they are needed by other components of arbitrary view generator 102.
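The selection of the existing views closest to a requested view can be sketched as follows, under the simplifying assumption that each stored view is identified only by its azimuth angle around the object, as in a turntable capture. The function names `angular_distance` and `nearest_views` are hypothetical; the specification does not state how closeness is measured.

```python
from typing import List, Sequence


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two angles on a circle, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def nearest_views(requested_deg: float,
                  stored_degs: Sequence[float],
                  k: int = 2) -> List[float]:
    """Return the k stored view angles closest to the requested angle.

    Closeness is measured by azimuth only, which is an assumption of this
    sketch; a fuller metric could also weigh camera position, focal length,
    and other stored camera attributes.
    """
    ranked = sorted(stored_degs, key=lambda s: angular_distance(requested_deg, s))
    return ranked[:k]
```

For example, with stored views at 0, 90, 180, and 270 degrees, a request at 350 degrees ranks the 0 degree view first and the 270 degree view second, because the distance wraps around the circle.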

The perspective of each existing view retrieved by asset management engine 110 is transformed into the perspective of the requested arbitrary view by perspective transformation engine 112 of arbitrary view generator 102. As previously described, precise camera information is known and stored with each image stored in database 106. Thus, a perspective change from an existing view to the requested arbitrary view comprises a simple geometric mapping or transformation. In various embodiments, perspective transformation engine 112 may employ any one or more appropriate mathematical techniques to transform the perspective of an existing view into the perspective of an arbitrary view. If the requested view comprises an arbitrary view that is not identical to any existing view, the transformation of an existing view into the perspective of the arbitrary view will comprise at least some unmapped or missing pixels, i.e., pixels at angles or positions introduced in the arbitrary view that are not present in the existing view.
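One common form of such a geometric mapping is reprojection under a pinhole camera model: a pixel with known depth is lifted to a three-dimensional point and projected into the target camera. The sketch below assumes this model, restricts camera rotation to yaw about the vertical axis to stay short, and uses hypothetical names (`Camera`, `unproject`, `project`, `reproject`). The specification leaves the mathematical technique open, so this is one possibility rather than the disclosed method.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Camera:
    """Simplified pinhole camera (assumption: rotation about the y axis only)."""
    focal_px: float   # focal length expressed in pixels
    cx: float         # principal point, x
    cy: float         # principal point, y
    yaw_rad: float    # rotation about the vertical axis
    position: Vec3    # camera center in world coordinates


def _rot_y(v: Vec3, angle: float) -> Vec3:
    """Rotate a vector about the y axis by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x + s * z, y, -s * x + c * z)


def unproject(cam: Camera, u: float, v: float, depth: float) -> Vec3:
    """Lift pixel (u, v) with known depth to a point in world coordinates."""
    p_cam = ((u - cam.cx) * depth / cam.focal_px,
             (v - cam.cy) * depth / cam.focal_px,
             depth)
    xw, yw, zw = _rot_y(p_cam, cam.yaw_rad)
    px, py, pz = cam.position
    return (xw + px, yw + py, zw + pz)


def project(cam: Camera, p_world: Vec3) -> Optional[Tuple[float, float, float]]:
    """Project a world point into the camera; returns (u, v, depth) or None."""
    rel = (p_world[0] - cam.position[0],
           p_world[1] - cam.position[1],
           p_world[2] - cam.position[2])
    x, y, z = _rot_y(rel, -cam.yaw_rad)  # inverse of the camera rotation
    if z <= 0.0:
        return None  # point is behind the camera: no pixel maps to it
    return (cam.focal_px * x / z + cam.cx, cam.focal_px * y / z + cam.cy, z)


def reproject(src: Camera, dst: Camera, u: float, v: float, depth: float):
    """Map a source pixel with known depth to the target camera's image plane."""
    return project(dst, unproject(src, u, v, depth))
```

Target pixels that receive no source pixel under this mapping correspond to the unmapped or missing pixels described above.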

Pixel information from a single perspective-transformed existing view cannot populate all pixels of a different view. However, in many cases, most, if not all, pixels of a requested arbitrary view may be harvested from a plurality of perspective-transformed existing views. Merging engine 114 of arbitrary view generator 102 combines pixels from a plurality of perspective-transformed existing views to generate the requested arbitrary view. Ideally, all pixels comprising the arbitrary view are harvested from existing views. This may be possible, for example, if a sufficiently diverse set of existing views or perspectives of the asset under consideration is available and/or if the requested perspective is not too dissimilar from the existing perspectives.

Any appropriate technique may be employed to combine or merge pixels from a plurality of perspective-transformed existing views to generate the requested arbitrary view. In one embodiment, a first existing view that is closest to the requested arbitrary view is selected and retrieved from database 106 and transformed into the perspective of the requested arbitrary view. Pixels are then harvested from this perspective-transformed first existing view and used to populate corresponding pixels in the requested arbitrary view. To populate pixels of the requested arbitrary view that were not available from the first existing view, a second existing view that includes at least some of these remaining pixels is selected and retrieved from database 106 and transformed into the perspective of the requested arbitrary view. Pixels that were not available from the first existing view are then harvested from this perspective-transformed second existing view and used to populate corresponding pixels in the requested arbitrary view. This process may be repeated for any number of additional existing views until all pixels of the requested arbitrary view have been populated and/or until all existing views have been exhausted or a prescribed threshold number of existing views has been used.
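The iterative fill described in this embodiment can be sketched as a greedy loop. The sketch assumes the views have already been transformed into the requested perspective, are supplied in order of closeness, and are represented as equally sized grids in which `None` marks an unmapped pixel. The function name `merge_views` and the grid representation are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Optional[float]       # None marks a pixel the view could not supply
Image = List[List[Pixel]]


def merge_views(transformed_views: Sequence[Image],
                max_views: Optional[int] = None) -> Tuple[Image, int]:
    """Fill the output from each view in turn, never overwriting a filled pixel.

    Stops when every pixel is filled, when the views are exhausted, or when
    max_views views have been used. Returns the merged image and the number
    of pixels that are still unfilled.
    """
    if not transformed_views:
        raise ValueError("at least one perspective-transformed view is required")
    height, width = len(transformed_views[0]), len(transformed_views[0][0])
    out: Image = [[None] * width for _ in range(height)]
    remaining = height * width
    used = 0
    for view in transformed_views:
        if remaining == 0 or (max_views is not None and used >= max_views):
            break
        for y in range(height):
            for x in range(width):
                if out[y][x] is None and view[y][x] is not None:
                    out[y][x] = view[y][x]
                    remaining -= 1
        used += 1
    return out, remaining
```

Because earlier views take precedence, pixels from the closest view are preferred wherever they exist, and later views only contribute pixels that are still missing.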

In some embodiments, a requested arbitrary view may include some pixels that were not available from any existing view. In such cases, interpolation engine 116 is configured to populate any remaining pixels of the requested arbitrary view. In various embodiments, any one or more appropriate interpolation techniques may be employed by interpolation engine 116 to generate these unpopulated pixels in the requested arbitrary view. Examples of interpolation techniques that may be employed include linear interpolation, nearest neighbor interpolation, and the like. Interpolation of pixels introduces averaging or smoothing. Overall image quality may not be significantly affected by some interpolation, but excessive interpolation may introduce unacceptable blurriness. Thus, interpolation may be desired to be used sparingly. As previously described, interpolation is completely avoided if all pixels of the requested arbitrary view can be obtained from existing views. However, interpolation is introduced if the requested arbitrary view includes some pixels that are not available from any existing view. Generally, the amount of interpolation needed depends on the number of existing views available, the diversity of perspectives of the existing views, and/or how different the perspective of the arbitrary view is in relation to the perspectives of the existing views.
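A minimal sketch of filling the remaining pixels follows. It averages the already filled four-connected neighbors of each hole and repeats until no holes remain. This neighbor-averaging scheme is a stand-in chosen for brevity, not the interpolation technique of the specification, and the name `fill_holes` and the `None` hole marker are assumptions of the sketch.

```python
from typing import List, Optional

Image = List[List[Optional[float]]]


def fill_holes(image: Image) -> List[List[float]]:
    """Fill None pixels with the mean of their filled 4-connected neighbors.

    Each pass reads from a snapshot of the previous pass, so values spread
    inward one pixel per pass. Raises ValueError if nothing is filled at all.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    while any(p is None for row in out for p in row):
        snapshot = [row[:] for row in out]
        progressed = False
        for y in range(height):
            for x in range(width):
                if snapshot[y][x] is not None:
                    continue
                neighbors = [
                    snapshot[ny][nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < height and 0 <= nx < width
                    and snapshot[ny][nx] is not None
                ]
                if neighbors:
                    out[y][x] = sum(neighbors) / len(neighbors)
                    progressed = True
        if not progressed:
            raise ValueError("image has no filled pixels to interpolate from")
    return out
```

The averaging makes the smoothing effect noted above concrete: a hole between pixel values 1.0 and 3.0 is filled with 2.0, and larger holes are filled with progressively more blended values.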

With respect to the example depicted in FIG. 2, seventy-three views around a chair object are stored as existing views of the chair. An arbitrary view around the chair object that is different or unique from any of the stored views may be generated using a plurality of these existing views, with preferably minimal, if any, interpolation. However, generating and storing such an exhaustive set of existing views may not be efficient or desirable. In some cases, a significantly smaller number of existing views covering a sufficiently diverse set of perspectives may instead be generated and stored. For example, the seventy-three views of the chair object may be decimated into a small set of a handful of views around the chair object.

As previously mentioned, in some embodiments, the possible arbitrary views that may be requested may be at least in part constrained. For example, a user may be restricted from moving a virtual camera associated with an interactive scene to certain positions. With respect to the example given in FIG. 2, the possible arbitrary views that may be requested may be limited to arbitrary positions around the chair object but may not, for example, include arbitrary positions under the chair object, since insufficient pixel data exists for the bottom of the chair object. Such constraints on the allowed arbitrary views ensure that a requested arbitrary view can be generated from existing data by arbitrary view generator 102.

Arbitrary view generator 102 generates and outputs the requested arbitrary view 108 in response to input arbitrary view request 104. The resolution or quality of the generated arbitrary view 108 is the same as or similar to the quality of the existing views used to generate it, since pixels from those views are used to generate the arbitrary view. Thus, using high-definition existing views in most cases results in a high-definition output. In some embodiments, the generated arbitrary view 108 is stored in database 106 with other existing views of the associated scene and may subsequently be employed to generate other arbitrary views of the scene in response to future requests for arbitrary views. In cases in which input 104 comprises a request for an existing view in database 106, the requested view does not need to be generated from other views as described above; instead, the requested view is retrieved via a simple database lookup and directly presented as output 108.

Arbitrary view generator 102 may furthermore be configured to generate an arbitrary ensemble view using the described techniques. That is, input 104 may comprise a request to combine a plurality of objects into a single custom view. In such cases, the aforementioned techniques are performed for each of the plurality of objects and combined to generate a single consolidated or ensemble view comprising the plurality of objects. Specifically, existing views of each of the plurality of objects are selected and retrieved from database 106 by asset management engine 110, the existing views are transformed into the perspective of the requested view by perspective transformation engine 112, pixels from the perspective-transformed existing views are used to populate corresponding pixels of the requested ensemble view by merging engine 114, and any remaining unpopulated pixels in the ensemble view are interpolated by interpolation engine 116. In some embodiments, the requested ensemble view may comprise a perspective that already exists for one or more of the objects comprising the ensemble. In such cases, the existing view of an object asset corresponding to the requested perspective is used to directly populate the pixels corresponding to the object in the ensemble view, instead of first generating the requested perspective from other existing views of the object.

As an example of an arbitrary ensemble view comprising a plurality of objects, consider the chair object of FIG. 2 and an independently photographed or rendered table object. The chair object and the table object may be combined using the disclosed techniques to generate a single ensemble view of both objects. Thus, using the disclosed techniques, independently captured or rendered images or views of each of a plurality of objects can be consistently combined to generate a scene comprising the plurality of objects and having a desired perspective. As previously described, depth information of each existing view is known. The perspective transformation of each existing view includes a depth transformation, allowing the plurality of objects to be appropriately positioned relative to one another in the ensemble view.
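One way known per-pixel depth can position objects relative to one another is a depth-buffer composite, in which the nearest surface wins at each pixel. The sketch below assumes each object has already been transformed into the requested perspective and is given as a grid of `(color, depth)` pairs with `None` where the object does not cover the pixel. The function name `composite` and this layer format are hypothetical, and the specification does not state that a depth buffer is used.

```python
import math
from typing import List, Optional, Sequence, Tuple

LayerPixel = Optional[Tuple[str, float]]   # (color label, depth) or None
Layer = List[List[LayerPixel]]


def composite(layers: Sequence[Layer]) -> List[List[Optional[str]]]:
    """Combine per-object layers into one view, keeping the nearest surface."""
    if not layers:
        raise ValueError("at least one layer is required")
    height, width = len(layers[0]), len(layers[0][0])
    out: List[List[Optional[str]]] = [[None] * width for _ in range(height)]
    zbuf = [[math.inf] * width for _ in range(height)]
    for layer in layers:
        for y in range(height):
            for x in range(width):
                px = layer[y][x]
                if px is None:
                    continue
                color, depth = px
                if depth < zbuf[y][x]:   # smaller depth means closer to camera
                    zbuf[y][x] = depth
                    out[y][x] = color
    return out
```

With a chair layer at depth 2.0 and a table layer at depth 3.0 covering the same pixel, the chair occludes the table at that pixel regardless of the order in which the layers are supplied.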

Generating an arbitrary ensemble view is not limited to combining a plurality of single objects into a custom view. Rather, a plurality of scenes having multiple objects or a plurality of rich virtual environments may be similarly combined into a custom ensemble view. For example, a plurality of separately and independently generated virtual environments, possibly from different content generation sources and possibly having different existing individual perspectives, may be combined into an ensemble view having a desired perspective. Thus, generally, arbitrary view generator 102 may be configured to consistently combine or reconcile a plurality of independent assets, possibly comprising different existing views, into an ensemble view having a desired, possibly arbitrary perspective. A perfectly harmonious resulting ensemble view is generated since all combined assets are normalized to the same perspective. The possible arbitrary perspectives of the ensemble view may be constrained based on the existing views of the individual assets available to generate the ensemble view.

FIG. 3 is a flowchart illustrating an embodiment of a process for generating an arbitrary perspective. Process 300 may be employed, for example, by arbitrary view generator 102 of FIG. 1. In various embodiments, process 300 may be employed to generate an arbitrary view of a prescribed asset or an arbitrary ensemble view.

Process 300 starts at step 302, at which a request for an arbitrary perspective is received. In some embodiments, the request received at step 302 may comprise a request for an arbitrary perspective of a prescribed scene that is different from any existing available perspectives of the scene. In such cases, for example, the arbitrary perspective request may be received in response to a requested change in perspective of a presented view of the scene. Such a change in perspective may be prompted by changing or manipulating a virtual camera associated with the scene, such as by panning the camera, changing the focal length, changing the zoom level, and the like. Alternatively, in some embodiments, the request received at step 302 may comprise a request for an arbitrary ensemble view. As one example, such an arbitrary ensemble view request may be received with respect to an application that allows a plurality of independent objects to be selected and provides a consolidated, perspective-corrected ensemble view of the selected objects.

At step 304, a plurality of existing images from which to generate at least a portion of the requested arbitrary perspective is retrieved from one or more associated asset databases. The retrieved images may be associated with a prescribed asset in cases in which the request received at step 302 comprises a request for an arbitrary perspective of a prescribed asset, or may be associated with a plurality of assets in cases in which the request received at step 302 comprises a request for an arbitrary ensemble view.

At step 306, each of the plurality of existing images retrieved at step 304 that has a different perspective is transformed into the arbitrary perspective requested at step 302. Each of the existing images retrieved at step 304 includes associated perspective information. The perspective of each image is defined by the camera characteristics associated with generating that image, such as relative position, orientation, rotation, angle, depth, focal length, aperture, zoom level, lighting information, etc. Since complete camera information is known for each image, the perspective transformation of step 306 comprises a simple mathematical operation. In some embodiments, step 306 optionally further includes an optical transformation so that all images are consistently normalized to the same desired lighting conditions.
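The "simple mathematical operation" of step 306 can be illustrated with a minimal sketch, assuming a pinhole camera model and a known depth value for each pixel. The `Camera` class, its fields, and the function names below are hypothetical and are not taken from the disclosure; an actual embodiment may use a different camera model and would also account for characteristics such as aperture, zoom level, and lighting.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Camera:
    """Hypothetical pinhole camera: intrinsics plus camera-to-world pose."""
    focal: float                 # focal length, in pixels
    cx: float                    # principal point, x
    cy: float                    # principal point, y
    rotation: List[List[float]]  # 3x3 camera-to-world rotation matrix
    position: Vec3               # camera centre in world coordinates


def unproject(cam: Camera, u: float, v: float, depth: float) -> Vec3:
    """Map a pixel with known depth to a point in world space."""
    p_cam = ((u - cam.cx) * depth / cam.focal,
             (v - cam.cy) * depth / cam.focal,
             depth)
    r = cam.rotation
    return tuple(sum(r[i][j] * p_cam[j] for j in range(3)) + cam.position[i]
                 for i in range(3))


def project(cam: Camera, point: Vec3) -> Optional[Tuple[float, float, float]]:
    """Map a world point to (u, v, depth) in a camera, or None if behind it."""
    d = [point[i] - cam.position[i] for i in range(3)]
    r = cam.rotation
    # Multiply by the transpose (inverse) of the camera-to-world rotation.
    p_cam = [sum(r[j][i] * d[j] for j in range(3)) for i in range(3)]
    if p_cam[2] <= 0:
        return None
    return (cam.focal * p_cam[0] / p_cam[2] + cam.cx,
            cam.focal * p_cam[1] / p_cam[2] + cam.cy,
            p_cam[2])


def reproject_pixel(src: Camera, dst: Camera, u: float, v: float, depth: float):
    """Move one pixel from the reference perspective to the requested one."""
    return project(dst, unproject(src, u, v, depth))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reference = Camera(100.0, 50.0, 50.0, IDENTITY, (0.0, 0.0, 0.0))
requested = Camera(100.0, 50.0, 50.0, IDENTITY, (1.0, 0.0, 0.0))
moved = reproject_pixel(reference, requested, 50.0, 50.0, 2.0)
```

In this example the requested camera sits one unit to the right of the reference camera, so the point seen at the image centre of the reference view lands to the left of centre in the requested view. Applying `reproject_pixel` to every pixel of a reference image yields the perspective-transformed image from which pixels are harvested at step 308.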

At step 308, at least a portion of an image having the arbitrary perspective requested at step 302 is populated with pixels harvested from the perspective-transformed existing images. That is, pixels from a plurality of perspective-corrected existing images are employed to generate an image having the requested arbitrary perspective.

At step 310, it is determined whether the generated image having the requested arbitrary perspective is complete. If it is determined at step 310 that the generated image is not complete, it is determined at step 312 whether any more existing images are available from which the remaining unpopulated pixels of the generated image may be obtained. If it is determined at step 312 that more existing images are available, one or more additional existing images are retrieved at step 314, and process 300 continues at step 306.

If it is determined at step 310 that the generated image having the requested arbitrary perspective is not complete, and if it is determined at step 312 that no more existing images are available, all remaining unpopulated pixels of the generated image are interpolated at step 316. Any one or more appropriate interpolation techniques may be employed at step 316.

If it is determined at step 310 that the generated image having the requested arbitrary perspective is complete, or after all remaining unpopulated pixels have been interpolated at step 316, the generated image having the requested arbitrary perspective is output at step 318. Process 300 subsequently ends.
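The control flow of steps 304 through 318 can be sketched as follows. This is a minimal illustration, assuming that images are small two-dimensional grids in which `None` marks an unpopulated pixel, that `transform` stands in for the perspective transformation of step 306, and that step 316 uses a simple neighbor-averaging interpolation. All function names are hypothetical; the description above permits any suitable interpolation technique.

```python
from typing import List, Optional

Image = List[List[Optional[float]]]


def fill_from(target: Image, warped: Image) -> None:
    """Step 308: copy pixels from a transformed image into unfilled slots."""
    for y, row in enumerate(warped):
        for x, value in enumerate(row):
            if target[y][x] is None and value is not None:
                target[y][x] = value


def is_complete(target: Image) -> bool:
    """Step 310: the image is complete when no pixel is unfilled."""
    return all(value is not None for row in target for value in row)


def interpolate_holes(target: Image) -> None:
    """Step 316 (assumed technique): average the filled 4-neighbours."""
    height, width = len(target), len(target[0])
    while not is_complete(target):
        updates = {}
        for y in range(height):
            for x in range(width):
                if target[y][x] is not None:
                    continue
                neighbours = [
                    target[j][i]
                    for j, i in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= j < height and 0 <= i < width
                    and target[j][i] is not None
                ]
                if neighbours:
                    updates[(y, x)] = sum(neighbours) / len(neighbours)
        if not updates:
            raise ValueError("no populated pixels to interpolate from")
        for (y, x), value in updates.items():
            target[y][x] = value


def generate_view(height, width, image_batches, transform):
    """Populate a requested view from successive batches of existing images."""
    target: Image = [[None] * width for _ in range(height)]
    for batch in image_batches:                    # steps 304 and 314
        for image in batch:
            fill_from(target, transform(image))    # steps 306 and 308
        if is_complete(target):                    # step 310
            return target                          # step 318
    interpolate_holes(target)                      # steps 312 and 316
    return target                                  # step 318


batches = [
    [[[1.0, None], [None, None]]],
    [[[None, 3.0], [None, None]]],
]
view = generate_view(2, 2, batches, transform=lambda image: image)
```

Here the two batches together populate only the top row, so the bottom row is filled by interpolation once no further images remain. In this sketch the first image to supply a pixel wins; how competing pixels from several reference images are prioritized or blended is a design choice the description leaves open.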

As described, the disclosed techniques may be used to generate an arbitrary perspective based on other existing perspectives. Because camera information is stored with each existing perspective, different existing perspectives can be normalized into a common, desired perspective. A resulting image having the desired perspective can be constructed by harvesting pixels from the perspective-transformed existing images. The processing associated with generating an arbitrary perspective using the disclosed techniques is not only fast and nearly instantaneous but also produces high-quality output, making the disclosed techniques particularly powerful for interactive, real-time graphics applications.

The aforementioned techniques comprise a uniquely efficient paradigm for generating a desired arbitrary view or perspective of a scene using existing reference views or images whose perspectives differ from the desired perspective. More specifically, the disclosed techniques facilitate the quick generation of a high-definition image having a desired arbitrary perspective from one or more existing reference images, from which most, if not all, of the pixels of the desired arbitrary perspective are harvested. As described, the existing reference images comprise high-quality photographs or photorealistic renderings and may be captured or rendered during offline processes or obtained from external sources. Furthermore, (virtual) camera characteristics are stored as metadata with each reference image and may later be employed to facilitate perspective transformations of the image. Various techniques for generating reference images, such as the images or views stored in asset database 106 of FIG. 1, as well as further details of their associated metadata, are described next.

FIG. 4 is a flow chart illustrating an embodiment of a process for generating reference images or views of an asset from which an arbitrary view or perspective of the asset may be generated. In some embodiments, process 400 is employed to generate the reference images or views of an asset stored in database 106 of FIG. 1. Process 400 may comprise an offline process.

Process 400 starts at step 402, at which an asset is imaged and/or scanned. A plurality of views or perspectives of the asset is captured at step 402, for instance, by rotating an imaging or scanning device around the asset or by rotating the asset in front of such a device. In some cases, an imaging device such as a camera may be employed to capture high-quality photographs of the asset at step 402. In some cases, a scanning device such as a 3D scanner may be employed to collect point cloud data associated with the asset at step 402. Step 402 furthermore includes collecting applicable metadata with the image and/or scan data, such as camera attributes, relative location or position, depth information, lighting information, surface normal vectors, etc. Some of these metadata parameters may be estimated. For instance, normal data may be estimated from depth data. In some embodiments, at least a prescribed set of perspectives of the asset is captured at step 402 that covers most, if not all, areas or surfaces of interest of the asset. Moreover, different imaging or scanning devices having different characteristics or attributes may be employed at step 402 for different perspectives of a given asset and/or for different assets stored in database 106.
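The remark that normal data may be estimated from depth data can be illustrated with a minimal sketch. It assumes an orthographic depth map sampled on a regular grid, a camera looking along the +z axis, and finite differences as the estimator; the function name and the `pixel_size` parameter are hypothetical, and the description does not specify which estimation method is used.

```python
import math
from typing import List, Tuple

Normal = Tuple[float, float, float]


def estimate_normals(depth: List[List[float]],
                     pixel_size: float = 1.0) -> List[List[Normal]]:
    """Estimate a unit surface normal per pixel from a depth map.

    Assumed convention: the surface is z = depth(x, y) and the returned
    normal (dz/dx, dz/dy, -1), normalized, points back toward the camera.
    """
    height, width = len(depth), len(depth[0])
    normals = []
    for y in range(height):
        row = []
        for x in range(width):
            # Central differences inside the map, one-sided at the borders.
            x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
            dz_dx = 0.0
            if x1 > x0:
                dz_dx = (depth[y][x1] - depth[y][x0]) / ((x1 - x0) * pixel_size)
            dz_dy = 0.0
            if y1 > y0:
                dz_dy = (depth[y1][x] - depth[y0][x]) / ((y1 - y0) * pixel_size)
            length = math.sqrt(dz_dx * dz_dx + dz_dy * dz_dy + 1.0)
            row.append((dz_dx / length, dz_dy / length, -1.0 / length))
        normals.append(row)
    return normals


# A plane that recedes from the camera by one unit of depth per pixel in x.
slanted = estimate_normals([[0.0, 1.0, 2.0],
                            [0.0, 1.0, 2.0],
                            [0.0, 1.0, 2.0]])
```

A surface that faces the camera squarely yields normals of (0, 0, -1) under this convention, while the slanted plane in the example yields normals tilted 45 degrees about the y axis. Real scan data is noisy, so a practical estimator would typically smooth the depth map or fit local planes first.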

At step 404, a three-dimensional polygon mesh model of the asset is generated from the image and/or scan data captured at step 402. That is, a fully reconciled three-dimensional mesh model is generated based on the photographs and/or point cloud data, as well as the associated metadata, captured at step 402. In some embodiments, just enough asset data is captured at step 402 to ensure that a complete mesh model can be built at step 404. Portions of the generated mesh model that were not sufficiently captured at step 402 may be interpolated. In some cases, step 404 is not completely automated but entails at least some human intervention to ensure that the generated three-dimensional mesh model is well-ordered.

At step 406, a plurality of reference images or views of the asset is rendered from the three-dimensional mesh model generated at step 404. Any appropriate rendering technique may be employed at step 406, depending on available resources. For example, simpler rendering techniques such as scanline rendering or rasterization may be employed when constraints exist with respect to computational resources and/or render time, although at the expense of render quality. In some cases, more complex rendering techniques such as ray tracing may be employed that consume more resources but produce high-quality, photorealistic images. Each reference image rendered at step 406 comprises relevant metadata determined from the three-dimensional mesh model, which may include parameters such as (virtual) camera attributes, relative location or position, depth information, lighting information, surface normal vectors, etc.

In some embodiments, any source images captured at step 402 comprise a very small subset of the reference images or views of an asset stored in database 106. Rather, most of the images or views of an asset stored in database 106 are rendered using the three-dimensional mesh model of the asset generated at step 404. In some embodiments, the reference images or views of an asset comprise one or more orthographic views of the asset. Such orthographic views of a plurality of different assets may be combined (e.g., stacked together or placed side by side like building blocks) to generate an orthographic view of a composite asset built from, or by combining, a plurality of separately captured or rendered individual assets. The orthographic view of the composite asset can then be collectively transformed into any arbitrary camera view by transforming the orthographic view of each of the individual assets into the desired arbitrary perspective.
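The building-block style of combining orthographic views can be sketched on toy data. The sketch below is only an illustration under strong assumptions: each orthographic view is a small grid of pixel values, all views share the same scale and viewing direction, and `None` marks empty background. The function names are hypothetical, and per-pixel metadata, which a full implementation would combine alongside the pixel values, is omitted.

```python
from typing import Any, List

Grid = List[List[Any]]


def place_side_by_side(left: Grid, right: Grid, background: Any = None) -> Grid:
    """Join two orthographic views horizontally, aligned at the bottom edge."""
    height = max(len(left), len(right))

    def pad_top(view: Grid) -> Grid:
        width = len(view[0])
        missing = height - len(view)
        return [[background] * width for _ in range(missing)] + [list(r) for r in view]

    return [a + b for a, b in zip(pad_top(left), pad_top(right))]


def stack(top: Grid, bottom: Grid, background: Any = None) -> Grid:
    """Stack one orthographic view on another, aligned at the left edge."""
    width = max(len(top[0]), len(bottom[0]))

    def pad_right(view: Grid) -> Grid:
        return [list(r) + [background] * (width - len(r)) for r in view]

    return pad_right(top) + pad_right(bottom)


cabinet = [["C"], ["C"]]          # a tall, narrow asset (2 rows x 1 column)
drawer = [["D"]]                  # a short asset (1 row x 1 column)
shelf = [["S", "S"]]              # a wide, flat asset (1 row x 2 columns)

row_of_assets = place_side_by_side(cabinet, drawer)
composite = stack(shelf, row_of_assets)
```

Bottom alignment is chosen here so that assets of different heights rest on a common ground line; other alignments are equally plausible. Once a composite orthographic view exists, each constituent asset's view can be transformed into the desired arbitrary perspective as described above.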

The three-dimensional mesh model based rendering of process 400 of FIG. 4 is computationally intensive and time consuming. Thus, in most cases, process 400 comprises an offline process. Moreover, although a three-dimensional mesh model of an asset may exist, rendering a high-quality arbitrary perspective directly from such a model cannot be achieved efficiently for many applications, including most real-time or on-demand applications. Rather, despite the existence of an underlying three-dimensional mesh model from which any desired arbitrary perspective of an asset could be rendered, more efficient techniques need to be employed to satisfy speed constraints. For example, the arbitrary view generation techniques described with respect to FIGS. 1-3 may be employed to generate a desired arbitrary view or perspective very quickly from existing reference views or images of the asset while maintaining a quality comparable to that of the reference views. In some embodiments, however, the inefficiencies associated with building the three-dimensional mesh model and rendering reference views from the model may be undesirable or unacceptable despite the option of performing these steps offline. In some such cases, the steps of building a mesh model and employing complex rendering techniques to generate reference views may be eliminated, as further described next.

FIG. 5 is a flow chart illustrating an embodiment of a process for generating reference images or views of an asset from which an arbitrary view or perspective of the asset may be generated. In some embodiments, process 500 is employed to generate the reference images or views of an asset stored in database 106 of FIG. 1. Process 500 may comprise an offline process.

Process 500 starts at step 502, at which an asset is imaged and/or scanned. A plurality of views or perspectives of the asset is captured at step 502, for instance, by rotating an imaging or scanning device around the asset or by rotating the asset in front of such a device. The views captured at step 502 may at least in part comprise orthographic views of the asset. In some embodiments, an image/scan captured at step 502 has a field of view that overlaps with that of at least one other image/scan captured at step 502, and the relative (camera/scanner) pose between the two is known and stored. In some cases, an imaging device such as a DSLR (digital single-lens reflex) camera may be employed to capture high-quality photographs of the asset at step 502. For example, a camera with a long lens may be employed to simulate orthographic views. In some cases, a scanning device such as a 3D scanner may be employed to collect point cloud data associated with the asset at step 502. Step 502 furthermore includes storing applicable metadata with the image and/or scan data, such as camera attributes, relative location or position, lighting information, surface normal vectors, relative pose between images/scans having overlapping fields of view, etc. Some of these metadata parameters may be estimated. For instance, normal data may be estimated from depth data. In some embodiments, at least a prescribed set of perspectives of the asset is captured at step 502 that sufficiently covers most, if not all, areas or surfaces of interest of the asset. Moreover, different imaging or scanning devices having different characteristics or attributes may be employed at step 502 for different perspectives of a given asset and/or for different assets stored in database 106.

At step 504, a plurality of reference images or views of the asset is generated based on the data captured at step 502. Reference views are generated at step 504 simply from the images/scans and associated metadata captured at step 502. That is, with the appropriate metadata and overlapping perspectives captured at step 502, any arbitrary view or perspective of the asset may be generated. In some embodiments, an exhaustive set of reference views of the asset to be stored in database 106 is generated from the images/scans captured at step 502 and their associated metadata. The data captured at step 502 may be sufficient to form fragments of a mesh model, but a unified, fully reconciled mesh model need not be generated. Thus, a complete three-dimensional mesh model of the asset is never generated, nor are complex rendering techniques such as ray tracing employed to render reference images from a mesh model. Process 500 improves efficiency by eliminating the steps of process 400 that consume the most processing resources and time.

The reference images generated at step 504 may facilitate faster generation of arbitrary views or perspectives using the techniques described with respect to FIGS. 1-3. In some embodiments, however, a repository of reference images need not be generated at step 504. Rather, the views captured at step 502 and their associated metadata are sufficient to generate any desired arbitrary view of the asset using the techniques described with respect to FIGS. 1-3. That is, any desired arbitrary view or perspective may be generated simply from a small set of high-quality images/scans with overlapping fields of view that capture most, if not all, areas or surfaces of the asset and that are registered with relevant metadata. The processing associated with generating a desired arbitrary view from only the source images captured at step 502 is fast enough for many on-demand, real-time applications. If further efficiency in speed is desired, however, a repository of reference views may be generated, such as at step 504 of process 500.

As described, each image or view of an asset in database 106 may be stored with corresponding metadata. Metadata may be generated from a three-dimensional mesh model when rendering a view from the model, when imaging or scanning the asset (in which case depth and/or surface normal data may be estimated), or by a combination of both.

A prescribed view or image of an asset comprises pixel intensity values (e.g., RGB values) for each pixel of the image as well as various metadata parameters associated with each pixel. In some embodiments, one or more of the red, green, and blue (RGB) channels or values of a pixel may be employed to encode the pixel metadata. The pixel metadata, for example, may include information about the relative location or position (e.g., x, y, and z coordinate values) of the point in three-dimensional space that projects at that pixel. Furthermore, the pixel metadata may include information about the surface normal vector at that position (e.g., the angles made with the x, y, and z axes). Moreover, the pixel metadata may include texture mapping coordinates (e.g., u and v coordinate values). In such cases, the actual pixel value at a point is determined by reading the RGB values at the corresponding coordinates in a texture image.
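A minimal sketch of the two ideas in this paragraph follows: packing a per-pixel (x, y, z) position into three 8-bit channels, and resolving a pixel's color through (u, v) texture coordinates. The value range `[lo, hi]`, the 8-bit quantization, the nearest-texel lookup, and all function names are assumptions made for illustration; the description does not specify an encoding.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def encode_xyz(point: Tuple[float, float, float], lo: float, hi: float) -> RGB:
    """Quantize each coordinate in [lo, hi] to an 8-bit channel value."""
    scale = 255.0 / (hi - lo)
    return tuple(round((min(max(c, lo), hi) - lo) * scale) for c in point)


def decode_xyz(rgb: RGB, lo: float, hi: float) -> Tuple[float, float, float]:
    """Recover approximate coordinates from the three channel values."""
    step = (hi - lo) / 255.0
    return tuple(lo + channel * step for channel in rgb)


def sample_texture(texture: List[List[RGB]], u: float, v: float) -> RGB:
    """Nearest-texel lookup; u and v are assumed to lie in [0, 1]."""
    height, width = len(texture), len(texture[0])
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture[y][x]


# Position metadata stored as an RGB triple, then read back.
packed = encode_xyz((0.25, 0.5, -1.0), lo=-1.0, hi=1.0)
unpacked = decode_xyz(packed, lo=-1.0, hi=1.0)

# The same (u, v) metadata resolves against whichever texture is referenced.
wood = [[(120, 80, 40), (130, 90, 50)]]
steel = [[(200, 200, 210), (190, 190, 200)]]
wood_pixel = sample_texture(wood, 0.9, 0.5)
steel_pixel = sample_texture(steel, 0.9, 0.5)
```

With 8 bits per channel the decoded position is only accurate to about half a quantization step, so a real encoding would likely use higher bit depths or several channels per coordinate. The texture lookup shows why storing (u, v) rather than final colors is useful: replacing `wood` with `steel` changes the pixel value without touching the stored metadata.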

Surface normal vectors facilitate modifying or varying the lighting of a generated arbitrary view or scene. More specifically, re-lighting a scene comprises scaling pixel values based on how well the surface normal vectors of the pixels match the direction of a newly added, removed, or otherwise altered light source, which may at least in part be quantified, for example, by the dot product of the light direction and the normal vectors of the pixels. Specifying pixel values via texture mapping coordinates facilitates modifying or varying the texture of a generated arbitrary view or scene, or a part thereof. More specifically, the texture can be changed by simply swapping or replacing the referenced texture image with another texture image having the same dimensions.
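The dot-product scaling described above can be sketched as follows. This is a minimal illustration assuming a single directional light, a simple Lambertian-style scaling, and 8-bit RGB values; `relight` and its parameters are hypothetical names, and `light_dir` is assumed to point from the surface toward the light source.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
RGB = Tuple[int, int, int]


def normalize(v: Vec3) -> Vec3:
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def relight(rgb: RGB, normal: Vec3, light_dir: Vec3, intensity: float = 1.0) -> RGB:
    """Scale a pixel by how well its surface normal faces the light."""
    n = normalize(normal)
    l = normalize(light_dir)
    # Dot product of unit vectors: 1 when facing the light, 0 when edge-on.
    alignment = max(0.0, sum(a * b for a, b in zip(n, l)))
    return tuple(min(255, round(c * alignment * intensity)) for c in rgb)


facing = relight((200, 100, 50), normal=(0.0, 0.0, 1.0), light_dir=(0.0, 0.0, 1.0))
oblique = relight((200, 100, 50), normal=(0.0, 0.0, 1.0), light_dir=(0.0, 3.0, 4.0))
edge_on = relight((200, 100, 50), normal=(0.0, 0.0, 1.0), light_dir=(1.0, 0.0, 0.0))
```

A pixel whose normal points straight at the light keeps its value, an obliquely lit pixel is dimmed in proportion to the cosine of the angle, and a pixel lit edge-on or from behind goes dark. Handling several lights, ambient terms, or the removal of an existing light would require a fuller lighting model than this sketch provides.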

As described, reference images or views of an asset may be generated with or without an underlying mesh model of the asset. In the most efficient embodiments, only a small set of source images/scans that capture various (overlapping) views around an asset, together with their associated metadata, is needed to generate any desired arbitrary view of the asset and/or a set of reference views from which a desired arbitrary view may be generated, using the techniques described with respect to FIGS. 1-3. In such embodiments, the most resource-intensive steps of modeling and path tracing based rendering are eliminated. The images or views generated using the disclosed arbitrary view generation techniques may comprise static or dynamic scenes and may comprise stills or frames of an animation or video sequence. In the case of motion capture, a set of images or views of one or more assets may be generated for each time slice. The disclosed techniques are especially useful in applications that demand the quick generation of high-quality arbitrary views, such as gaming applications, virtual/alternative reality applications, CGI (computer-generated imagery) applications, etc.

Existing three-dimensional content frameworks that are based on rendering from three-dimensional models are typically developed and optimized for specific uses and lack scalability across different platforms and applications. As a result, substantial effort and resources need to be invested, and repeated, in generating the same three-dimensional content for different use cases. Moreover, the requirements for three-dimensional content are a moving target that shifts over time. Thus, three-dimensional content needs to be manually regenerated as requirements change. The difficulty of standardizing three-dimensional content formats across different platforms, devices, applications, use cases, and, generally, various quality requirements has thus far impeded the proliferation of three-dimensional content. Therefore, a more scalable format for representing three-dimensional content that can be employed to deliver any desired quality level, as disclosed herein, is needed.

The disclosed techniques comprise a fundamentally novel framework for representing three-dimensional content as two-dimensional content while still providing all of the attributes of traditional three-dimensional frameworks as well as various other features and advantages. As described, three-dimensional content and corresponding information are encoded into a plurality of images from which any desired arbitrary view may be generated without requiring an underlying three-dimensional model of the associated asset. That is, the aforementioned techniques effectively comprise the transformation of three-dimensional source content into two-dimensional content, i.e., images. More specifically, the disclosed techniques result in a two-dimensional platform comprising a set of images associated with an asset that effectively replaces traditional three-dimensional platforms comprising three-dimensional models. As described, the images comprising the two-dimensional platform may be generated from three-dimensional models and/or from a small set of source images or scans. Relevant metadata is stored with respect to each view of an asset and, in some cases, is encoded as pixel values. The image-based views and metadata of the given two-dimensional architecture facilitate employing two-dimensional content as a three-dimensional source. Thus, the disclosed techniques completely displace traditional three-dimensional architectures that rely on rendering using underlying three-dimensional polygon mesh models. Three-dimensional source content, such as a physical asset or a three-dimensional mesh model of the asset, is encoded or transformed into a two-dimensional format comprising a set of views and metadata, which is instead employed to represent and provide features that have traditionally been available only via three-dimensional frameworks, including the ability to generate a plurality of different views or perspectives of the asset. In addition to providing all of the features of traditional three-dimensional frameworks, the disclosed two-dimensional representation provides various additional inherent features, including being amenable to traditional image processing techniques.

In the disclosed two-dimensional framework for representing three-dimensional content, information about an asset is encoded as image data. An image comprises an array having a height, a width, and a third dimension comprising pixel values. The images associated with an asset may comprise various reference views or perspectives of the asset and/or corresponding metadata encoded as pixel values, e.g., as RGB channel values. Such metadata may include, for instance, camera characteristics, textures, uv coordinate values, xyz coordinate values, surface normal vectors, lighting information (such as global illumination values or values associated with a prescribed lighting model), etc. In various embodiments, the images comprising the reference views or perspectives of an asset may be (high-quality) photographs or (photorealistic) renderings.

Various features are supported by the disclosed two-dimensional framework, including the ability to render a desired arbitrary view or perspective of an asset having, for example, arbitrary camera characteristics (such as camera position and lens type), arbitrary asset ensembles or combinations, arbitrary lighting, arbitrary texture variations, etc. Since complete camera information is known for and stored with the reference views of an asset, other novel views of the asset comprising arbitrary camera characteristics may be generated from a plurality of perspective-transformed reference views of the asset. More specifically, a prescribed arbitrary view or perspective of a single object or scene may be generated from a plurality of existing reference images associated with the object or scene, while a prescribed arbitrary ensemble view may be generated by normalizing and consistently combining a plurality of objects or scenes into a consolidated view from sets of reference images associated with the objects or scenes. Reference views of an asset may have lighting modeled by one or more lighting models (such as a global illumination model). Surface normal vectors known for the reference views facilitate arbitrary lighting control, including the ability to re-light an image or scene according to any desired lighting model. Reference views of an asset may have textures specified via texture mapping (uv) coordinates, which facilitates arbitrary texture control by allowing any desired texture to be substituted by simply changing the referenced texture image.
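The lighting and texture control described above can be sketched as follows. The sketch assumes a simple Lambertian (diffuse) lighting model and nearest-neighbor texture lookup, neither of which is prescribed by the disclosure; the function names `relight` and `retexture` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def normalize(v: Sequence[float]) -> Vec3:
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def relight(albedo: List[List[Vec3]], normals: List[List[Vec3]],
            light_dir: Sequence[float]) -> List[List[Vec3]]:
    """Shade each pixel as albedo * max(0, n . l) (assumed Lambertian model)."""
    l = normalize(light_dir)
    out = []
    for albedo_row, normal_row in zip(albedo, normals):
        row = []
        for a, n in zip(albedo_row, normal_row):
            intensity = max(0.0, sum(nc * lc for nc, lc in zip(n, l)))
            row.append(tuple(ch * intensity for ch in a))
        out.append(row)
    return out


def retexture(uv_map, texture):
    """Look up each pixel's (u, v) in [0, 1] in a replacement texture image."""
    tex_h, tex_w = len(texture), len(texture[0])
    out = []
    for uv_row in uv_map:
        row = []
        for u, v in uv_row:
            x = min(int(u * tex_w), tex_w - 1)  # clamp u == 1.0 to last column
            y = min(int(v * tex_h), tex_h - 1)  # clamp v == 1.0 to last row
            row.append(texture[y][x])
        out.append(row)
    return out
```

Because the per-pixel normals and uv coordinates are stored with the reference view, changing the light direction or the referenced texture image does not require the underlying three-dimensional model.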

As described above, the disclosed two-dimensional framework is based on image datasets and is consequently amenable to image processing techniques. Thus, the disclosed image-based two-dimensional framework for representing three-dimensional content is inherently seamlessly scalable and resource adaptive both up and down the computation and bandwidth spectrums. Existing techniques for scaling images, such as image compression techniques, may be advantageously employed to scale the image-based three-dimensional content of the disclosed framework. Images comprising the disclosed two-dimensional framework may be easily scaled in terms of quality or resolution to appropriately conform to the requirements of different channels, platforms, devices, applications, and/or use cases. Image quality or resolution requirements may vary significantly across different platforms (such as mobile versus desktop), different models of devices of a given platform, different applications (such as online viewers versus native applications running locally on a machine), over time, across different network bandwidths, etc. Thus, there exists a need for an architecture, such as the disclosed two-dimensional framework, that comprehensively satisfies the requirements of different use cases and is immune to changes in requirements over time.

Generally, the disclosed two-dimensional framework supports resource adaptive rendering. Furthermore, time-variant quality/resolution adaptation may be provided based on the current or real-time availability of computational resources and/or network bandwidth. Scaling, i.e., the ability to smoothly and seamlessly degrade or upgrade image quality level, is in most cases completely automated. For instance, the disclosed two-dimensional framework provides the ability to automatically downsample an asset (i.e., one or more images comprising the asset) across one or more features, including reference views or perspectives as well as images encoding metadata (e.g., textures, surface normal vectors, xyz coordinates, uv coordinates, lighting values, etc.), without requiring manual intervention. In some such cases, the scaling of an asset need not be uniform across all features of the asset and may vary depending on the type of information comprising or encoded in the images associated with the asset. For example, the actual image pixel values of reference views or perspectives of an asset may be lossily compressed, but images encoding certain metadata, such as depth (i.e., xyz values) and normal values, may not be compressed in the same manner or, in some cases, may not be compressed at all, since loss of such information may not be acceptable when rendering.
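The non-uniform treatment of image types described above might be sketched as follows. The sketch assumes that color values are made lossy by discarding low-order bits before deflate compression, while depth values are packed as 32-bit floats and deflated losslessly. These choices, and the function names, are illustrative assumptions rather than the compression scheme of the disclosure.

```python
import struct
import zlib
from typing import List


def compress_color_lossy(pixels: bytes, keep_bits: int = 4) -> bytes:
    """Discard low-order bits of 8-bit channel values, then deflate (lossy)."""
    shift = 8 - keep_bits
    quantized = bytes((p >> shift) << shift for p in pixels)
    return zlib.compress(quantized)


def compress_depth_lossless(depths: List[float]) -> bytes:
    """Pack depths as little-endian float32 and deflate (lossless)."""
    raw = struct.pack(f"<{len(depths)}f", *depths)
    return zlib.compress(raw)


def decompress_depth(blob: bytes, count: int) -> List[float]:
    """Recover the float32 depth values exactly as they were packed."""
    return list(struct.unpack(f"<{count}f", zlib.decompress(blob)))
```

In this sketch, color data loses precision that is typically tolerable for display, whereas depth values that are representable as 32-bit floats are recovered bit for bit.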

In some embodiments, a master asset (i.e., a set of images comprising the master asset) having the highest available quality or resolution is generated and stored, e.g., in database 106 of FIG. 1. In some such cases, one or more lower quality/resolution versions of the asset are automatically generated from the master asset and stored so that an appropriate version can be selected to generate a requested perspective or view based on the (current) capabilities of the server generating the requested perspective, the requesting client, and/or one or more associated communication networks. Alternatively, in some cases, a single version of an asset, i.e., the master asset, is stored, and the disclosed framework supports streaming or progressive delivery of quality or resolution up to that of the master asset based on the (current) capabilities of the server generating the requested perspective, the requesting client, and/or one or more associated communication networks.
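Selecting among stored versions of an asset might be sketched as follows. The `AssetVersion` record, the size estimate, and the rule of choosing the largest version that fits a transfer budget are hypothetical assumptions for illustration; the disclosure does not specify a selection rule.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AssetVersion:
    name: str
    resolution: int       # edge length of square images, in pixels
    num_images: int       # number of reference/metadata images
    bytes_per_pixel: int  # uncompressed storage per pixel

    def size_bytes(self) -> int:
        return self.resolution ** 2 * self.num_images * self.bytes_per_pixel


def select_version(versions: Sequence[AssetVersion],
                   bandwidth_bytes_per_s: float,
                   latency_budget_s: float) -> AssetVersion:
    """Pick the largest version transferable within the latency budget.

    Falls back to the smallest version when none fits (assumed policy).
    """
    budget = bandwidth_bytes_per_s * latency_budget_s
    fitting = [v for v in versions if v.size_bytes() <= budget]
    if fitting:
        return max(fitting, key=AssetVersion.size_bytes)
    return min(versions, key=AssetVersion.size_bytes)


versions = [
    AssetVersion("master", 2048, 60, 3),
    AssetVersion("medium", 1024, 24, 3),
    AssetVersion("low", 512, 12, 3),
]
```

A server could re-evaluate such a rule per request so that the chosen version tracks the current network conditions.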

FIG. 6 is a flowchart illustrating an embodiment of a process for providing a requested view of a scene. Process 600 may be employed, for example, by arbitrary view generator 102 of FIG. 1. In some embodiments, process 300 of FIG. 3 is a part of process 600. In various embodiments, process 600 may be employed to generate an arbitrary view of a scene comprising one or more assets, i.e., a prescribed asset or an arbitrary ensemble of assets.

Process 600 starts at step 602, at which a request is received for a desired arbitrary view of a scene that does not already exist and is different from any other existing available views of the scene. Generally, an arbitrary view may comprise any desired view of a scene or asset whose specification is not known in advance prior to being requested. The arbitrary view request of step 602 may be received from a client and may comprise specifications of prescribed camera characteristics (e.g., lens type and pose/perspective), lighting, textures, asset ensemble, etc.

At step 604, the arbitrary view of the scene requested at step 602 is generated or rendered based on available resources. For example, the requested arbitrary view generated at step 604 may be appropriately scaled based on the computational or processing capabilities of the client requesting the arbitrary view, of the server generating the requested arbitrary view, and/or the bandwidth availability of one or more associated communication networks between the client and the server. More specifically, step 604 facilitates resource adaptive rendering by trading off image quality for responsiveness by scaling or tuning along one or more associated axes, which are described next.

The quality of an image comprising a requested view that is generated or rendered at step 604 using the disclosed techniques may be based at least in part on the number of existing perspective-transformed reference images used to generate the requested view. In many cases, employing more reference images results in higher quality, and employing fewer reference images results in lower quality. Thus, the number of reference images having different perspectives that are used to generate a requested view may be adapted or optimized for various platforms, devices, applications, or use cases and may additionally be adapted based on real-time resource availabilities and constraints. As a few examples, a relatively large number of reference images (e.g., 60 images) may be employed to generate a requested view that comprises a still image or that is for a native application on a desktop having a high-speed internet connection, while a relatively small number of reference images (e.g., 12 images) may be employed to generate a requested view that comprises a frame of a video or augmented reality sequence or that is for a web application for a mobile device.
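Scaling along the first axis, the number of reference images, might be sketched as follows. The sketch assumes that reference views are indexed by camera angle around the asset and that a reduced set is chosen by even spacing; the even-spacing rule and the name `pick_reference_views` are illustrative assumptions.

```python
from typing import List, Sequence


def pick_reference_views(available: Sequence[float], count: int) -> List[float]:
    """Choose `count` approximately evenly spaced views from those available."""
    if count <= 0:
        raise ValueError("count must be positive")
    n = len(available)
    if count >= n:
        return list(available)
    # Integer stepping keeps the selection evenly spread over the sequence.
    return [available[(i * n) // count] for i in range(count)]


# 60 reference views spaced 6 degrees apart around an asset (assumed layout).
spin = [i * 6.0 for i in range(60)]
mobile_views = pick_reference_views(spin, 12)  # views 30 degrees apart
```

Here the 60-image and 12-image counts mirror the examples given above for desktop and mobile use cases, respectively.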

The quality of an image comprising a requested view that is generated or rendered at step 604 using the disclosed techniques may be based at least in part on the resolution (i.e., pixel density) of the images comprising the one or more assets (i.e., the images comprising the reference perspectives and associated metadata of the one or more assets) that are used to generate the requested view. Higher resolution versions of the images comprising an asset result in higher quality, while lower resolution versions of the images comprising an asset result in lower quality. Thus, the resolution or pixel density of the images comprising the different perspectives and associated metadata that are used to generate a requested view may be adapted or optimized for various platforms, devices, applications, or use cases and may additionally be adapted based on real-time resource availabilities and constraints. As a few examples, relatively high resolution (e.g., 2K x 2K) versions of the images associated with one or more assets may be employed to generate a requested view that is for a native application on a desktop having a high-speed internet connection, while relatively low resolution (e.g., 512 x 512) versions of the images associated with one or more assets may be employed to generate a requested view that is for a web-based application for a mobile device.

The quality of an image comprising a requested view that is generated or rendered at step 604 using the disclosed techniques may be based at least in part on the bit depth (i.e., bits per pixel) of the images comprising the one or more assets (i.e., the images comprising the reference perspectives and associated metadata of the one or more assets) that are used to generate the requested view. Higher bit depth versions of the images comprising an asset result in higher quality, while lower bit depth versions of the images comprising an asset result in lower quality. Thus, the precision of the pixels of the images comprising the different perspectives and associated metadata that are used to generate a requested view may be adapted or optimized for various platforms, devices, applications, or use cases and may additionally be adapted based on real-time resource availabilities and constraints. As a few examples, higher precision versions of the images associated with one or more assets (e.g., 64 bpp for texture values, float for xyz coordinates and normal vectors) may be employed to generate a higher quality requested view, while lower precision versions of the images associated with one or more assets (e.g., 24 bpp for texture values, 48 bpp for xyz coordinates and normal vectors) may be employed to generate a lower quality requested view.

The disclosed techniques for resource adaptive rendering support discrete and/or continuous scaling along any one or more of three axes (number, resolution, and bit depth) of the images used to generate or render a requested arbitrary view of a scene. The image quality of a requested view may be varied by appropriately scaling and/or selecting different combinations or versions of the images comprising the reference views and metadata that are used to generate or render the requested view. The output image quality of the requested view may be selected at step 604 based on one or more (real-time) considerations and/or constraints. For example, the image quality selected for a requested view may be based on the platform or device type of the requesting client (e.g., mobile versus desktop and/or models thereof), the use case, such as on a web page having a prescribed viewport size and/or fill factor (e.g., a 512 x 512 window versus a 4K window), the application type (e.g., still images versus frames of a video, gaming, or virtual/augmented reality sequence), the network connection type (e.g., mobile versus broadband), etc. Thus, a quality may be selected based on a prescribed use case as well as the capabilities of the client with respect to the prescribed use case.

In some embodiments, the disclosed techniques furthermore support streaming or progressive delivery of quality from low quality up to the maximum quality available or feasible at a client device. In many cases, the scaling or selection of the number of reference images used to generate a requested view depends at least in part on the latency requirements of an associated application. For example, a relatively large number of reference images may be employed to generate a still image, whereas a relatively small number of reference images may be employed to generate a frame for an application in which views change rapidly. In various embodiments, scaling may be the same or different across one or more of the aforementioned axes available for scaling and/or depending on the type of information encoded by various images. For example, the resolution and bit depth of the images used to generate a requested view may be scaled uniformly in direct proportion or may be scaled independently. As one example, resolution may be downsampled while bit depth is not scaled down at all, so as to preserve high dynamic range and color depth in applications in which maintaining tonal quality (lighting, color, contrast) is important. Moreover, the resolution and bit depth of the images used to generate a requested view may be scaled differently depending on the type of information encoded in the images, since loss may be acceptable for some types of data, such as the actual pixel values of reference views, but may not be acceptable for other types of data, including metadata such as depth (xyz coordinates) and surface normal vectors.
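Independent scaling by information type might be sketched as follows. The sketch assumes that color pixels are downsampled by box-filter averaging, whereas xyz metadata is subsampled without averaging so that no positions are invented across depth discontinuities, and that sample precision is left untouched in both cases. These choices are illustrative assumptions and not the method of the disclosure.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Point = Tuple[float, float, float]


def downsample_color(img: List[List[Pixel]], factor: int) -> List[List[Pixel]]:
    """Average each factor x factor block of color pixels (lossy)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h - h % factor, factor):
        row = []
        for x in range(0, w - w % factor, factor):
            block = [img[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(tuple(sum(p[c] for p in block) // len(block)
                             for c in range(3)))
        out.append(row)
    return out


def subsample_metadata(meta: List[List[Point]], factor: int) -> List[List[Point]]:
    """Keep the top-left sample of each block; values are never blended."""
    h, w = len(meta), len(meta[0])
    return [[meta[y][x] for x in range(0, w - w % factor, factor)]
            for y in range(0, h - h % factor, factor)]
```

Both functions reduce resolution by the same factor, but only the color path alters sample values, so every retained xyz sample remains an exact original measurement.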

At step 606, the requested view generated or rendered at step 604 is provided, e.g., to the requesting client, to fulfill the request received at step 602. Process 600 subsequently ends.

As described, the aforementioned two-dimensional framework for generating or rendering desired arbitrary views of a scene comprising an asset or asset ensemble is based on images comprising reference views having different perspectives as well as metadata associated with each reference view or perspective. As a few examples, the metadata associated with each reference view or perspective may associate each pixel of the reference view or perspective with its location in three-dimensional space (xyz coordinate values) as well as the surface normal vector at that location. For images generated via physically based rendering techniques using three-dimensional models, relevant metadata may be captured or generated from the corresponding three-dimensional models and associated with the images. For images for which one or more types of metadata are unknown (e.g., photographs/scans or other renderings), such metadata values may be determined using machine learning based techniques. For example, a neural network may be employed to determine a mapping from image space to metadata space, as further described next.

FIG. 7 is a high level block diagram illustrating an embodiment of a machine learning based image processing framework 700 for learning attributes associated with image datasets. Available three-dimensional (polygon mesh) models of assets as well as a prescribed modeled environment 702 are employed to render extensive image datasets 704, for example, using physically based rendering techniques. In some embodiments, the modeled environment closely matches or substantially simulates an actual physical environment in which physical assets are imaged or photographed. The rendered image datasets 704 may comprise photorealistic renderings and may include a plurality of views or perspectives of assets as well as textures. Moreover, the rendered image datasets 704 are appropriately labeled or tagged or otherwise associated with relevant metadata determined and/or captured during rendering.

The extensive, tagged datasets 704 are well suited for artificial intelligence based learning. Training 706 on datasets 704, for example, using any combination of one or more appropriate machine learning techniques (such as deep neural networks and convolutional neural networks), results in a set of one or more properties or attributes 708 associated with datasets 704 being learned, such as associated metadata values. Such learned attributes may be derived or inferred from the labels, tags, or metadata associated with datasets 704. Image processing framework 700 may be trained with respect to a plurality of different training datasets associated with various assets and asset combinations. In some embodiments, however, at least some of the training datasets are constrained to a prescribed modeled environment. After training on large sets of data to learn various attributes or types of attributes, image processing framework 700 may subsequently be employed to detect or derive similar attributes or combinations thereof in other images for which such attributes are unknown, including other renderings of assets that are rendered with respect to the same or a similar modeled environment as the training data, as well as photographs captured in an actual physical environment that matches or is similar to the environment modeled by the modeled environment of the training data. As one example, a machine learning based framework trained on datasets in which image pixels are tagged with physical xyz location coordinates and with surface normal vectors may be employed to predict locations (i.e., depths or xyz distances from a camera) and surface normal vectors for images for which such metadata values are not known.

The disclosed framework is particularly useful when a known, controlled or constrained physical environment that can be simulated or modeled is employed to image or photograph individual assets or combinations thereof. In one example application, the aforementioned apparatus (e.g., a camera rig) for imaging or photographing objects or items may be employed in a product warehouse of a retailer. In such an application, precise information about the actual physical environment in which objects are imaged or photographed is known, e.g., in some cases, from the viewpoint or perspective of the imaged object from within the imaging apparatus. Known information about the actual physical environment may include, for example, the structure and geometry of the imaging apparatus; the number, types, and poses of the cameras used; the position and intensity of light sources and ambient lighting; etc. Such known information about the actual physical environment is used to specify the modeled environment of the renderings of the training datasets of the machine learning based image processing framework, so that the modeled environment is identical to or at least substantially replicates or simulates the actual physical environment. In some embodiments, for example, the modeled environment comprises a three-dimensional model of the imaging apparatus as well as the same camera configurations and lighting as in the actual physical environment. Metadata values are learned from training datasets tagged with known metadata values so that the disclosed machine learning based framework can then be employed to detect or predict such metadata values for images for which they are not known, such as photographs captured in the actual physical environment. Constraining certain attributes of the environment (e.g., geometry, camera, lighting) to known values facilitates learning and makes it possible to predict other attributes (e.g., depth/location, surface normal vectors).

As described, a machine learning based image processing framework may be employed to learn metadata from renderings for which metadata values are known and that are generated from available three-dimensional models and a prescribed modeled environment, and the machine learning based image processing framework may subsequently be employed to identify metadata values in images for which such metadata values are unknown. Although described with respect to a prescribed physical environment and corresponding modeled environment in some of the given examples, the disclosed techniques may generally be employed and adapted to learn and predict different types of image metadata for different types of assets, modeled environments, and/or combinations thereof. For example, the described machine learning based framework may be trained to determine unknown metadata values for images of any assets that are rendered or captured in any environment, provided that the training datasets are sufficiently exhaustive and span diverse assets and environments.

FIG. 8 is a flowchart illustrating an embodiment of a process for populating a database with images associated with an asset or scene that can be used to generate other arbitrary views of the asset or scene. For example, process 800 of FIG. 8 may be employed to populate asset database 106 of FIG. 1. Process 800 employs a machine learning based framework, such as framework 700 of FIG. 7. In some embodiments, the images of process 800 are constrained to a prescribed physical environment and corresponding modeled environment. More generally, however, process 800 may be employed with respect to any physical or modeled environment.

Process 800 starts at step 802, at which metadata associated with training datasets is learned using machine learning based techniques. In some embodiments, an image dataset used for training comprises an extensive collection of images of an asset or scene rendered from a known three-dimensional model of the asset or scene in a simulated or modeled environment defined by prescribed specifications, e.g., of geometry, cameras, lighting, etc. The learned metadata may comprise different types of image metadata values. The training datasets of step 802 may cover different assets in a prescribed modeled environment or, more generally, may exhaustively cover different assets in different environments.

At step 804, an image is received for which one or more image metadata values are unknown or incomplete. The received image may comprise a rendering, a photograph, or a scan. In some embodiments, the received image is generated or captured with respect to a modeled or physical environment that is the same as or similar to the rendering environment used for at least some of the training image datasets of step 802.

At step 806, the unknown or incomplete metadata values of the received image are determined or predicted using the machine learning based framework of process 800. At step 808, the received image and associated metadata are stored, e.g., in asset database 106 of FIG. 1. Process 800 subsequently ends.
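Steps 804 through 808 might be sketched as follows. The sketch assumes that trained predictors are available as plain callables keyed by metadata type, that the database is an in-memory list, and that xyz coordinates and surface normals are the required metadata types. All names here (`REQUIRED_METADATA`, `to_reference_image`, the record layout) are hypothetical and stand in for the trained framework and asset database.

```python
from typing import Any, Callable, Dict, List

# Metadata types assumed necessary for an image to serve as a reference image.
REQUIRED_METADATA = ("xyz", "normals")


def to_reference_image(image: Dict[str, Any],
                       predictors: Dict[str, Callable[[Any], Any]],
                       database: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fill in unknown metadata (step 806) and store the result (step 808)."""
    metadata = dict(image.get("metadata", {}))
    for key in REQUIRED_METADATA:
        if metadata.get(key) is None:
            # Only unknown values are predicted; known values are preserved.
            metadata[key] = predictors[key](image["pixels"])
    record = {"pixels": image["pixels"], "metadata": metadata}
    database.append(record)
    return record


# Stub predictors standing in for a trained model (assumption).
stub_predictors = {
    "xyz": lambda pixels: [(0.0, 0.0, 1.0) for _ in pixels],
    "normals": lambda pixels: [(0.0, 0.0, 1.0) for _ in pixels],
}
asset_db: List[Dict[str, Any]] = []
photo = {"pixels": [(10, 20, 30), (40, 50, 60)],
         "metadata": {"xyz": [(0.1, 0.2, 0.9), (0.3, 0.4, 0.8)]}}
stored = to_reference_image(photo, stub_predictors, asset_db)
```

In this sketch, the photograph's known xyz values are kept as received, and only the missing normals are supplied by the predictor before the record is stored.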

By determining relevant metadata and associating it with an image (i.e., the image received at step 804 and stored at step 808), process 800 effectively facilitates transforming the image into a reference image or view of an associated asset or scene that can later be used to generate other arbitrary views of the associated asset or scene. In various embodiments, when storing the image as a reference image, the image may be appropriately tagged with corresponding metadata and/or associated with one or more images that encode associated metadata values. Process 800 may generally be employed to transform any image into a reference image by using machine learning based techniques to determine the unknown image metadata values that are needed for the image to become a reference image from which other views of an associated asset or scene, e.g., having arbitrary camera characteristics, textures, lighting, etc., may be generated. Moreover, process 800 is particularly useful for determining or predicting types of metadata for which accuracy is important, such as depth values and surface normal vector values.

As described above, most of the disclosed techniques are based on having available, and leveraging, extensive datasets of existing reference images or views and corresponding metadata. Thus, in many cases, sequences of images or views having different camera perspectives around one or more objects or assets are rendered or generated and stored in a database or repository. For example, a 360° spin may be rendered or generated that includes angles spanning or covering 360° around an object or asset. Such datasets may be built offline, but rigorous physically based rendering techniques are costly in terms of resource consumption and require substantial processing power and time. Some techniques for generating or rendering images or views of objects or assets more efficiently have already been described. Additional techniques for rendering or generating images or views of objects or assets more efficiently are described in detail next.

Substantial redundancy exists with respect to certain types of data or datasets. For example, there is a great deal of redundancy among the images or views of a set comprising a spin around an object or asset, especially between neighboring images or views that differ by only a small camera angle or rotation. Similarly, redundancy exists among the frames of an animation or video sequence, especially between adjacent or nearby frames. As another example, there is a great deal of redundancy among images or views that include the same texture. More generally, then, in a given feature space, many images exhibit the same or very similar features and share significant feature space correlations. For instance, in the examples above, substantially similar texture features may be shared by many images or views. Given the availability of large datasets of existing objects or assets, redundancies with respect to such existing images can be leveraged when rendering or generating new images or views (e.g., of different perspectives or of different object or asset types or shapes). In addition, the inherent redundancy in a sequence of slowly varying images or frames may similarly be leveraged. Machine learning is particularly well suited to learning and detecting features in large datasets that comprise relatively well-defined and constrained feature spaces. Thus, in some embodiments, a machine learning framework (such as a neural network) is used to render or generate new images or views more efficiently by leveraging feature redundancies with respect to other (existing) images or views. In general, any appropriate neural network configuration may be used with the disclosed techniques.

FIG. 9 is a high-level flowchart illustrating an embodiment of a process for generating an image or frame. In some embodiments, process 900 comprises a super-resolution process for upscaling an input image. As further described below, process 900 may be used to generate an output image more efficiently, with substantially less resource consumption than rigorous physically based rendering and other existing techniques, especially when generating photorealistic high-quality or high-definition (HD) images.

Process 900 begins at step 902, at which a feature space is identified or defined. The feature space identified at step 902 may comprise one or more features, such as features of a predetermined texture. In some embodiments, the feature space is identified at step 902 using a neural network-based machine learning framework. In some such cases, for example, a neural network is used to determine or detect one or more features that are intrinsic to a predetermined set of images, which comprises a constrained feature space that is known and well defined for that set of images. That is, the set of images acts as a prior for defining the feature space. The set of images may include, for example, rigorously rendered or generated images (e.g., high or full resolution or definition images) and/or existing images, or portions thereof (e.g., patches of existing images), that were previously rendered or generated.

At step 904, features are detected in an input image. More specifically, the input image is processed by the neural network to determine feature space data values for the input image. The input image of step 904 comprises a low-quality, low-resolution, or small-size image rendered or generated using a technique of lower computational complexity or cost than the techniques used for the image set of step 902. That is, compared with the images that make up the image set of step 902, the input image of step 904 is noisy (e.g., because too few samples were used for convergence) and/or of inferior quality (e.g., of lower resolution and/or size).

At step 906, an output image is generated by replacing the features detected in the input image at step 904 with corresponding (e.g., nearest or most similar matching) features in the feature space identified at step 902. More specifically, a nearest neighbor search over the feature space identified at step 902 is performed for the features detected in the input image at step 904, and each detected feature is replaced with its closest matching feature from that feature space. The feature detection, nearest neighbor search, and feature replacement described above all occur in feature space. Thus, in some embodiments, step 906 includes decoding or transforming from feature space back to image space to generate the resulting output image. The feature space manipulations lead to consistent corresponding pixel-level transformations in image space.
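The nearest neighbor replacement of step 906 can be illustrated with a minimal sketch. It assumes that features are plain numeric vectors and that Euclidean distance is the similarity measure; neither is specified in the description. The names `nearest_feature` and `replace_features` are hypothetical, and the neural network encoding and decoding between image space and feature space is omitted.

```python
import math

def nearest_feature(query, prior_features):
    """Return the prior feature vector closest to `query` (Euclidean distance, an assumption)."""
    return min(prior_features, key=lambda f: math.dist(query, f))

def replace_features(input_features, prior_features):
    """Replace each feature detected in the input with its closest match from the prior set."""
    return [nearest_feature(q, prior_features) for q in input_features]

# Hypothetical 2-D feature vectors: `prior` stands in for the step 902 feature space,
# `noisy` for the features detected in the step 904 input image.
prior = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
noisy = [(0.2, -0.1), (3.6, 4.3), (0.9, 1.2)]
restored = replace_features(noisy, prior)
```

A brute-force scan is used here only for clarity; a real pipeline would more likely rely on an indexed or approximate nearest neighbor search over a much larger feature set.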

In general, process 900 provides an efficient framework for image restoration, upscaling, or modification by leveraging the redundancies and information available from other existing images. For example, process 900 may be used to clean an input image (i.e., to transform a noisy input image into a relatively denoised output image). Similarly, process 900 may be used to improve the quality of an input image (i.e., to transform a relatively low-quality input image into a high-quality output image in terms of, e.g., resolution, size, or bit depth). More specifically, process 900 facilitates imparting the features of an image set to an inferior or degraded input image that shares feature space redundancy with the image set. Because process 900 essentially comprises low computational cost lookup operations combined with relatively simple distance calculations, such as a nearest neighbor search, it offers substantial efficiencies in the image rendering space, especially compared with rigorous physically based rendering techniques. Process 900 is therefore particularly useful in an image rendering or generation pipeline for generating images or frames faster and more efficiently. For example, physically based rendering or another technique of low computational complexity may be used to render or generate a low-quality, low-resolution, or small-size image, and process 900 may then be used to transform that image into a high-quality, full-resolution, or large-size version. Moreover, process 900 may similarly be used to restore, upscale, or otherwise modify an input image comprising a photograph captured in a predetermined physical environment, based on a training dataset constrained to a simulated or modeled version of that physical environment. That is, process 900 may be used with the machine learning-based architecture described in detail above with respect to FIGS. 7 and 8.

Process 900 may be used for, and adapted to, many specific use cases. In some embodiments, process 900 is used to generate a sequence of images, such as reference views comprising a (360°) spin around an object or asset, or frames of a video or animation sequence. In such cases, substantial redundancy exists between neighboring images or frames of the sequence, and process 900 can leverage that redundancy. In one example, some images of the sequence are classified as independent frames (I-frames) and are rendered at high or full definition, resolution, or size. All images of the sequence not classified as I-frames are rendered at lower quality or resolution or at smaller size and are classified as dependent frames (D-frames), since they depend on other frames (i.e., I-frames) for upscaling. With respect to process 900, the I-frames correspond to the set of images of step 902, and each D-frame corresponds to the input image of step 904. In this example, the number of I-frames selected for a sequence may depend on the desired trade-off between speed and quality, with more I-frames selected for better image quality. In some cases, a fixed rule that specifies a predetermined interval may be used to designate I-frames (e.g., every other or every third image in the sequence is an I-frame), or particular thresholds may be set for identifying and selecting I-frames in the sequence. Alternatively, adaptive techniques may be used to select new I-frames as the correlation between D-frames and existing I-frames becomes weaker. In some embodiments, process 900 is used to generate an image comprising a predetermined texture. With respect to process 900, a low-quality, low-resolution, or small-size version of the image comprises the input image of step 904, and a set of images or patches of the predetermined texture comprises the set of images of step 902. More specifically, in this case, the predetermined texture is known and well defined from existing images that include the same texture. In this embodiment, texture patches are generated from one or more existing renders or assets having the predetermined texture, the generated patches are sub-sampled in an appropriate manner (e.g., by clustering in feature space to find and select patches with more diverse feature content), and the selected patches are then stored so that the stored set of patches can be used as the prior (i.e., the set of images of step 902). The output image of step 906 from either of the two examples above comprises a higher-quality, higher-resolution, larger-size, or denoised version of the input image of step 904. Although a few specific examples have been described, process 900 may generally be adapted to any applicable application in which sufficient redundancy exists.
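The two I-frame selection strategies mentioned for image sequences, a fixed interval rule and an adaptive rule driven by weakening correlation, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `similarity` callback, the threshold value, and the use of camera angles as stand-ins for frames are all hypothetical and are not taken from the description.

```python
def fixed_interval_iframes(num_frames, interval):
    """Indices designated as I-frames when every `interval`-th frame is an I-frame."""
    return [i for i in range(num_frames) if i % interval == 0]

def adaptive_iframes(frames, similarity, threshold):
    """Promote a frame to an I-frame once its similarity to the latest I-frame drops below `threshold`."""
    iframes = [0]  # the first frame is always rendered at full quality
    for i in range(1, len(frames)):
        if similarity(frames[i], frames[iframes[-1]]) < threshold:
            iframes.append(i)
    return iframes

# Toy stand-in: each "frame" is just its camera angle in degrees, and similarity
# decays linearly with angular distance (a purely illustrative measure).
angles = [0, 10, 20, 30, 40, 50, 60]
toy_similarity = lambda a, b: 1 - abs(a - b) / 100

fixed = fixed_interval_iframes(len(angles), 3)
adaptive = adaptive_iframes(angles, toy_similarity, threshold=0.75)
```

Frames whose indices are not returned would be treated as D-frames, rendered cheaply, and upscaled against the I-frames. Lowering the interval or raising the threshold yields more I-frames, trading speed for image quality.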

In some embodiments, one or more machine learning-based techniques may be used to generate an arbitrary or novel view or perspective of an object or asset. In some such cases, the associated machine learning-based framework is constrained to a known and well-defined feature space. For example, the images processed by such a machine learning-based framework may be constrained to a predetermined environment and/or one or more known textures. As described in detail with respect to FIGS. 7 and 8, for example, the training dataset may be constrained to a predetermined model environment that simulates the actual physical environment in which input images (i.e., photographs) of physical assets are captured. In such cases, the input images themselves may not be associated with any image metadata values, or at least not with very accurate ones. However, one or more neural networks may be used to learn metadata values in simulation from a synthetic training dataset that includes accurate metadata values. The networks may then be applied to real camera-captured input images (i.e., photographs) to predict or determine associated metadata values and/or to generate corresponding reference images or views that can later be used to generate other views or images, as described for the disclosed arbitrary view generation framework.

FIG. 10 is a high-level flowchart illustrating an embodiment of a process for generating an arbitrary or novel view or perspective of an object or asset. As described in further detail below, process 1000 may be used to transform an image or photograph of an object or asset captured in a known physical environment into an arbitrary view or perspective of that object or asset. Many of the steps of process 1000 are facilitated by a machine learning-based framework (e.g., one or more associated neural networks that learn from a training dataset constrained to a predetermined model environment that simulates the physical environment). The feature space may further be constrained to known textures for which extensive training datasets exist.

Process 1000 begins at step 1002, at which an input image of an object or asset is received. In some embodiments, the input image comprises a photograph of the object or asset captured in a known physical environment, such as a predetermined imaging apparatus (e.g., a camera rig) for photographing objects or assets. In some embodiments, the input image comprises a plurality of images (e.g., images from different cameras or camera angles). For example, the input image may comprise a stereo pair of left and right images captured by the left and right cameras of a predetermined imaging apparatus or camera rig.

At step 1004, the background of the input image received at step 1002 is removed so that only the subject of the input image (i.e., the object or asset) remains. In general, any one or more appropriate image processing techniques for background removal may be used at step 1004. In some embodiments, background removal is facilitated by image segmentation. In some such cases, a neural network may be used to facilitate the image segmentation. For example, during training, a convolutional neural network or other appropriate neural network may learn image features (edges, corners, shapes, sizes, etc.) at a lower resolution (e.g., 128×128 or 256×256), and those learned features may be combined to create an upscaled segmentation mask.
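How an upscaled segmentation mask is used to remove the background can be sketched as follows. The neural network that would predict the low-resolution mask is not shown, and nearest neighbor upscaling is swapped in here as a simple stand-in for combining learned features; the description does not specify the upscaling method. The function names and the nested-list image representation are hypothetical.

```python
def upscale_mask(mask, factor):
    """Nearest neighbor upscaling of a 2-D binary mask by an integer factor (a stand-in technique)."""
    return [[v for v in row for _ in range(factor)]
            for row in mask for _ in range(factor)]

def remove_background(image, mask, fill=0):
    """Keep pixels where the mask is truthy and replace all other pixels with `fill`."""
    return [[px if keep else fill for px, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

# A 2x2 mask, as if predicted at low resolution, applied to a 4x4 grayscale image.
low_res_mask = [[0, 1],
                [1, 1]]
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

full_mask = upscale_mask(low_res_mask, 2)
subject_only = remove_background(image, full_mask)
```

In practice the fill value would more likely be an alpha channel than a pixel intensity, so that a removed background cannot be confused with dark subject pixels.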

At step 1006, depth values of the object or asset in the input image are determined. The depth values are determined at step 1006 on a per-pixel basis. Step 1006 may include determining a depth estimate and/or refining the determined depth estimate. For example, the depth estimate may be determined from the left and right stereo pair comprising the input image and/or predicted using a neural network. The determined depth estimate may subsequently be cleaned or refined, e.g., using a neural network and/or other techniques.
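For the stereo case, a per-pixel depth estimate is commonly obtained from disparity with the standard rectified-stereo relation Z = f · B / d, where f is the focal length in pixels, B is the camera baseline, and d is the disparity in pixels. The sketch below assumes a rectified stereo pair and a precomputed disparity map. The description does not state which stereo method is used, and the function name and parameter values are hypothetical.

```python
def depth_from_disparity(disparity, focal_px, baseline_m):
    """Per-pixel depth in meters; None where disparity is non-positive (no valid match)."""
    return [[focal_px * baseline_m / d if d > 0 else None for d in row]
            for row in disparity]

# Hypothetical rig: 800 px focal length, 0.5 m between the left and right cameras.
disparity_map = [[100, 200],
                 [0, 50]]
depth_map = depth_from_disparity(disparity_map, focal_px=800, baseline_m=0.5)
```

Invalid pixels are left as `None` so that a later refinement stage, such as the neural network cleanup mentioned above, can fill or smooth them.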

At step 1008, an output image comprising a predetermined arbitrary perspective of the object or asset that differs from the perspective of the input image is generated by performing a perspective transformation based on the depth values determined at step 1006. In general, the predetermined arbitrary perspective may comprise any desired or requested camera view of the object or asset. For example, the predetermined arbitrary perspective may comprise an orthographic view (e.g., a top-down or bird's-eye view) of the object or asset. Step 1008 may include determining a perspective transformation estimate and/or refining the determined perspective transformation estimate. For example, the perspective transformation estimate may be determined directly from a mathematical transformation and/or predicted indirectly using a neural network, such as a generative adversarial network (GAN). The determined perspective transformation estimate may subsequently be cleaned or refined, e.g., using a neural network, such as a restoration network as described with respect to FIG. 9, or a GAN. In some cases, the determined perspective transformation estimate may alternatively or additionally be refined using traditional techniques such as denoising and inpainting.
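One way a perspective transformation estimate could be computed directly from a mathematical transformation is to back-project each pixel into 3-D using its depth and then re-project the points orthographically from above. The sketch below assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and image coordinates in which y points downward. These assumptions, the function names, and the grid-cell representation of the output view are all hypothetical and not taken from the description.

```python
import math

def unproject(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map into 3-D camera-space points (pinhole model)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None:  # background or invalid depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def top_down_view(points, cells_per_unit=1.0):
    """Orthographic projection onto the X-Z plane, keeping the topmost point per grid cell."""
    view = {}
    for x, y, z in points:
        cell = (math.floor(x * cells_per_unit), math.floor(z * cells_per_unit))
        # y points downward, so the smallest y is nearest to an overhead camera.
        if cell not in view or y < view[cell]:
            view[cell] = y
    return view

depth = [[2.0, 2.0],
         [None, 4.0]]
points = unproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
view = top_down_view(points)
```

A view produced this way is typically sparse and contains holes where surfaces were occluded in the input image, which is where the refinement stages described above (a restoration network, a GAN, or inpainting) would come in.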

Process 1000 then ends. As described with respect to process 1000, multiple stages and/or layers of neural network-based techniques may be employed to generate an arbitrary view or perspective of an object or asset.

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims

1. A method, comprising:
removing a background of an input image of an object or asset;
determining depth values of the object or asset in the input image; and
generating an output image comprising a predetermined perspective of the object or asset that differs from a perspective of the input image by performing a perspective transformation based on the determined depth values;
wherein a machine learning-based framework is utilized for one or more steps of the method, and the machine learning-based framework is constrained to a known predetermined environment.

2. The method of claim 1, wherein the known predetermined environment comprises a physical environment in which the input image is captured and a model environment that simulates the physical environment for a training dataset of the machine learning-based framework.

3. The method of claim 1, wherein the machine learning-based framework is constrained to one or more known textures.

4. The method of claim 1, wherein the input image comprises a photograph of the object or asset captured by a camera.

5. The method of claim 1, wherein the input image comprises a plurality of images from different cameras or camera angles.

6. The method of claim 1, wherein removing the background is based at least in part on neural network-based image segmentation.

7. The method of claim 1, wherein the depth values are determined on a per-pixel basis.

8. The method of claim 1, wherein determining the depth values comprises one or both of determining a depth estimate and refining the determined depth estimate.

9. The method of claim 8, wherein the depth estimate is determined from a left and right stereo pair comprising the input image.

10. The method of claim 8, wherein the depth estimate is predicted using a neural network.

11. The method of claim 8, wherein the determined depth estimate is refined using a neural network.

12. The method of claim 1, wherein the predetermined perspective comprises an orthographic view.

13. The method of claim 1, wherein performing the perspective transformation comprises one or both of determining a perspective transformation estimate and refining the determined perspective transformation estimate.

14. The method of claim 13, wherein the perspective transformation estimate is determined from a mathematical transformation.

15. The method of claim 13, wherein the perspective transformation estimate is predicted using a neural network.

16. The method of claim 13, wherein the perspective transformation estimate is refined using a neural network.

17. The method of claim 1, wherein the machine learning-based framework comprises one or more neural networks.

18. The method of claim 1, wherein the machine learning-based framework comprises a generative adversarial network (GAN).

19. A system, comprising:
a processor configured to:
remove a background of an input image of an object or asset;
determine depth values of the object or asset in the input image; and
generate an output image comprising a predetermined perspective of the object or asset that differs from a perspective of the input image by performing a perspective transformation based on the determined depth values; and
a memory coupled to the processor and configured to provide the processor with instructions;
wherein a machine learning-based framework is utilized for one or more steps performed by the processor, and the machine learning-based framework is constrained to a known predetermined environment.

20. A computer program product embodied in a non-transitory computer-readable storage medium and comprising:
computer instructions for removing a background of an input image of an object or asset;
computer instructions for determining depth values of the object or asset in the input image; and
computer instructions for generating an output image comprising a predetermined perspective of the object or asset that differs from a perspective of the input image by performing a perspective transformation based on the determined depth values;
wherein a machine learning-based framework is utilized for one or more steps of the computer program product, and the machine learning-based framework is constrained to a known predetermined environment.