JP2020042759A

JP2020042759A - Real time overlay arrangement in videos for augmented reality applications

Info

Publication number: JP2020042759A
Application number: JP2019040829A
Authority: JP
Inventors: ヘクデスリニディ; Hegde Srinidhi; ヘッバラグッペラムヤ; Hebbalaguppe Ramya
Original assignee: Tata Consultancy Services Ltd
Current assignee: Tata Consultancy Services Ltd
Priority date: 2018-09-06
Filing date: 2019-03-06
Publication date: 2020-03-19
Anticipated expiration: 2039-03-06
Also published as: EP3621039A1; KR102218608B1; CA3035482A1; CN110881109B; CN110881109A; US10636176B2; CA3035482C; US20200082574A1; JP6811796B2; KR20200028317A; AU2019201358A1; AU2019201358B2

Abstract

To provide a system and method for optimally placing contextual information in Augmented Reality (AR) applications, overcoming occlusion of the objects/scenes of interest through optimal placement of labels that aids better interpretation of the scene. SOLUTION: A processor-implemented method calculates an updated overlay position for placing a label in a video by combining a saliency map computed for each frame of the input video with a Euclidean distance, computed for each frame, between the current overlay position and the previous overlay position, based on the initial overlay position of the label. The overlay position is formulated as an objective function that minimizes visual saliency around the object of interest and minimizes temporal jitter, facilitating coherence in real-time AR applications. SELECTED DRAWING: Figure 2

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the complete specification of Indian Patent Application No. 201821033541, filed in India on September 6, 2018, entitled "Real-Time Overlay Placement in Videos for Augmented Reality Applications."

The present disclosure relates generally to video analysis, and more particularly to systems and methods for real-time overlay placement in videos for augmented reality applications.

Augmented reality (AR), together with virtual reality (VR), is considered the fourth wave of technology after the personal computer (PC), the Internet, and mobile. AR augments a real-world scene by overlaying virtual information to enable better situational awareness and to enhance human cognition and perception. This contextual information can take the form of, but is not limited to, text, 3D objects, GPS coordinates, and audio. Placement of such contextual information is a significant contribution to scene understanding, which is a major problem in artificial intelligence. Spatial placement of labels is a challenging task because of the constraints that labels (i) must not occlude the object/scene of interest and (ii) must be optimally placed for better interpretation of the scene. State-of-the-art techniques for optimally positioning text labels work only on images and are often inefficient for real-time implementation on devices (e.g., mobile communication devices such as smartphones and tablets).

Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, a processor-implemented method for real-time overlay placement in videos for augmented reality applications is provided. The method comprises: receiving, in real time, (i) an input video comprising a plurality of frames and an object of interest in the plurality of frames, and (ii) a label for which an initial overlay position is pre-computed for placement on a center frame of the input video; computing, in real time, a saliency map for each of the plurality of frames to obtain a plurality of saliency maps; computing, in real time, for each of the plurality of frames, a Euclidean distance between a current overlay position and a previous overlay position based on the initial overlay position of the label, to obtain a plurality of Euclidean distances; and calculating, in real time, an updated overlay position of the label for placement in the input video based on the plurality of saliency maps and the plurality of Euclidean distances.

In one embodiment, the updated overlay position of the label can be computed by combining the plurality of saliency maps and the plurality of Euclidean distances.

In one embodiment, the Euclidean distance for each of the plurality of frames is computed to control, in real time, temporal jitter in the position of the label placed in the input video. In one embodiment, the method can further comprise shifting the label from the initial overlay position to the updated overlay position to minimize occlusion of the view of the object of interest.

In one embodiment, a plurality of pixels for which the Euclidean distance between the current overlay position and the previous overlay position lies within a predetermined threshold range are selected for shifting the label from the initial overlay position to the updated overlay position.

In another aspect, a system for real-time overlay placement in videos for augmented reality applications is provided. The system comprises a memory storing instructions, one or more communication interfaces, and one or more hardware processors coupled to the memory via the one or more communication interfaces. The one or more hardware processors are configured by the instructions to: receive, in real time, (i) an input video comprising a plurality of frames and an object of interest in the plurality of frames, and (ii) a label for which an initial overlay position is pre-computed for placement on a center frame of the input video; compute, in real time, a saliency map for each of the plurality of frames to obtain a plurality of saliency maps; compute, in real time, for each of the plurality of frames, a Euclidean distance between a current overlay position and a previous overlay position based on the initial overlay position of the label, to obtain a plurality of Euclidean distances; and calculate, in real time, an updated overlay position of the label for placement in the input video based on the plurality of saliency maps and the plurality of Euclidean distances.

In one embodiment, the updated overlay position of the label is computed by combining the plurality of saliency maps and the plurality of Euclidean distances. In one embodiment, the Euclidean distance for each of the plurality of frames is computed to control, in real time, temporal jitter in the position of the label placed in the input video.

In one embodiment, the one or more hardware processors are further configured to shift the label from the initial overlay position to the updated overlay position to minimize occlusion of the view of the object of interest. In one embodiment, a plurality of pixels for which the Euclidean distance between the current overlay position and the previous overlay position lies within a predetermined threshold range are selected for shifting the label from the initial overlay position to the updated overlay position.

In yet another aspect, one or more non-transitory machine-readable information storage media are provided, comprising one or more instructions which, when executed by one or more hardware processors, cause a method for real-time overlay placement in videos for augmented reality applications to be performed. The instructions cause: receiving, in real time, (i) an input video comprising a plurality of frames and an object of interest in the plurality of frames, and (ii) a label for which an initial overlay position is pre-computed for placement on a center frame of the input video; computing, in real time, a saliency map for each of the plurality of frames to obtain a plurality of saliency maps; computing, in real time, for each of the plurality of frames, a Euclidean distance between a current overlay position and a previous overlay position based on the initial overlay position of the label, to obtain a plurality of Euclidean distances; and calculating, in real time, an updated overlay position of the label for placement in the input video based on the plurality of saliency maps and the plurality of Euclidean distances.

In one embodiment, the Euclidean distance for each of the plurality of frames is computed to control, in real time, temporal jitter in the position of the label placed in the input video. In one embodiment, the instructions, when executed by the one or more hardware processors, further cause shifting of the label from the initial overlay position to the updated overlay position to minimize occlusion of the view of the object of interest.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.

FIG. 1 is an exemplary block diagram of a system for real-time overlay placement in videos for augmented reality applications, according to an embodiment of the present disclosure.
FIG. 2 is an exemplary flow diagram of a method for real-time overlay placement in videos for augmented reality (AR) applications using the system of FIG. 1, according to an embodiment of the present disclosure.
FIG. 3 is a block diagram for real-time overlay placement in an input video by computing saliency maps and Euclidean distances, according to an embodiment of the present disclosure.
FIG. 4 is a graph showing, as a contour plot, the variation of the average label occlusion over saliency (LOS) score with changes in λ and Σ, according to an embodiment of the present disclosure.

Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of the disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. The following detailed description is to be considered exemplary only, with the true scope and spirit being indicated by the appended claims.

As mentioned above, augmented reality (AR), together with virtual reality (VR), is considered the fourth wave of technology after the PC, the Internet, and mobile. Overlaying virtual information on a real-world scene is considered very important for enabling better situational awareness and enhancing human cognition and perception. Placement of such contextual information is a significant contribution to scene understanding, which is a major problem in artificial intelligence.

Some applications related to optimal placement of text labels are as follows. (i) Optimal placement of advertisements in indoor/outdoor scenes and videos is a crucial advertising strategy for capturing the viewer's visual attention. (ii) Labels identifying the names of nearby monuments and buildings help tourists with better situational awareness. (iii) Various conventional applications provide real-time translation on various operating systems (e.g., Android (registered trademark) devices) by using the device camera. Note that AR applications on mobile phones help users perform tasks faster, more accurately, more efficiently, and with lower cognitive load. Another example where optimal overlay placement can be useful is a soldier using a head-mounted device (HMD). An overlay of the GPS coordinates of unit members on a battlefield map displayed on the HMD should not occlude the real view of the scene at critical moments. In addition, optimal placement of subtitles in videos helps avoid diverting the viewer's gaze. Smart label placement also helps make videos more engaging through comic-style overlay placement.

Placement of these 2D text labels is challenging because the contextual information must be overlaid so that the overlays do not occlude the object/scene of interest and are positioned to aid better interpretation. Few studies have noted that label placement for AR applications is not trivial when placement must work in real time. Even for the simple task of placing labels on a static image, the number of possible label positions grows exponentially with the number of items to be labeled. Other challenges include an insufficient understanding of the cognitive and perceptual issues surrounding label placement in AR applications.

That said, overlay placement around the object/scene of interest has received little attention in the computer vision community compared to object detection and segmentation. Recently, label placement has attracted considerable attention as demand has grown for AR applications that overlay text labels in real time. Related work exists on geometry-based layouts and image-based layouts for rendering labels, on aesthetic rules, and on optimal text label placement based on adaptive overlays.

In geometry-based layout approaches, point-feature label placement has been shown to be an NP-hard problem, and simulated annealing and gradient descent have been proposed as solutions. Image aesthetics-based (or image-based layout) approaches were developed to account for the visual aesthetics of computer interfaces as a strong determinant of user satisfaction. They use general design principles such as spatial layout rules, symmetry, and balance among elements, as well as color schemes and harmony, with photo-book generation as a use case. However, the above approaches operate on images and are not suitable for real-time camera streams (or real-time video streams).

Several other studies have focused on image-driven view management for AR browsers, placing labels on video streams using a combination of saliency maps and edge maps. Such work faces major limitations when applied to video streams on mobile devices. First, such techniques have been observed to apply mainly when camera movement is slight; for large movements, they fall back to a static layout for the labels. For AR-based applications, this is clearly not feasible. Second, running a visual saliency algorithm involves computationally expensive matrix operations. This problem is especially pronounced on mobile devices, where compute resources and memory are limited. Moreover, these studies and other conventionally known text overlay approaches are computationally heavy, mostly operate on images on desktop computers, lack real-time performance, and are not suited to overlays on videos. Overlays also pose their own challenges due to occlusion, poorly lit scenarios, and scene changes in the live field of view.

Embodiments of the present disclosure provide systems and methods for strategic placement of contextual labels for AR applications. The systems and methods of the present disclosure provide a label placement technique that works in real time even on low-end Android devices such as smartphones and tablets. In the present disclosure, label placement is formulated as an objective function parameterized by image saliency and temporal jitter. To measure the effectiveness of overlay placement, the present disclosure computes a label occlusion over saliency (LOS) score.

Referring now to the drawings, and more particularly to FIGS. 1 through 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and/or method.

FIG. 1 illustrates an exemplary block diagram of a system 100 for real-time overlay placement in videos for augmented reality applications, according to an embodiment of the present disclosure. The system 100 is also referred to as an "overlay placement system," and the two terms are used interchangeably hereinafter. In one embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104 can be one or more software processing modules and/or hardware processors. In one embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In one embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, handheld devices, workstations, mainframe computers, servers, a network cloud, and the like.

The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, and can facilitate multiple communications within a wide variety of networks (N/W) and protocol types, including wired networks such as LAN and cable, and wireless networks such as WLAN, cellular, and satellite. In one embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.

The memory 102 can include any computer-readable medium known in the art, including, for example, volatile memory such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory such as read-only memory (ROM), erasable programmable ROM, flash memory, hard disks, optical disks, and magnetic tapes. In one embodiment, a database 108 can be stored in the memory 102. The database 108 can include, but is not limited to, information on the input video, frames, the object of interest, the label, the initial overlay position of the label, the label width and height, saliency map outputs, Euclidean distance output(s), and the updated overlay position for placement in the video. More specifically, it includes information pertaining to the input video, such as pixel information, the current and previous overlay positions for each frame, temporal jitter, the predetermined threshold range, and the like. In one embodiment, the memory 102 can store one or more technique(s) (e.g., saliency map computation technique(s) and Euclidean distance computation technique(s)) which, when executed by the one or more hardware processors 104, perform the methodology described herein. The memory 102 can further include information pertaining to the input(s)/output(s) of each step performed by the systems and methods of the present disclosure.

FIG. 2, with reference to FIG. 1, illustrates an exemplary flow diagram of a method for real-time overlay placement in videos for augmented reality (AR) applications using the system 100 of FIG. 1, according to an embodiment of the present disclosure. In one embodiment, the system(s) 100 comprises one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104 and configured to store instructions for execution of the steps of the method by the one or more processors 104. The steps of the method of the present disclosure are now described with reference to the components of the system 100 as shown in FIG. 1 and the flow diagram as shown in FIG. 2. Before receiving the input video in real time (also written "realtime"; the two forms may be used interchangeably hereinafter), the system 100 and the associated method take as input several user-specified parameters, namely k, λ, Σ, O_h, and O_w, where:
1) k is the number of frames for which processing is skipped. The technique/method of the present disclosure runs once every k frames. If k = 1, the method runs on every frame. Similarly, if k = 2, the method runs on every other frame.
2) λ controls the temporal coherence of subsequent overlays. A small value of λ means that the overlay is more likely to be placed in a less salient region, but it will also be subject to more jitter. A higher value of λ reduces jitter but also restricts movement of the overlay.
3) Σ is the search space sampling parameter. It uniformly samples pixels in the 2D image space. For example, let u_w and u_h be the frame width and height, respectively; these are the dimensions of the search space in the context of the present invention. Then u_h/Σ and u_w/Σ pixels are skipped in the respective image dimensions.
4) O_h and O_w are the overlay height and overlay width, respectively.
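The parameters above can be illustrated with a minimal Python sketch. It assumes that Σ sets the grid stride to the frame dimension divided by Σ and that candidate positions are top-left corners at which the overlay fits entirely inside the frame; the function names `sample_search_space` and `frames_to_process` are hypothetical and do not come from the disclosure.

```python
from typing import List, Tuple


def sample_search_space(frame_w: int, frame_h: int, sigma: int,
                        overlay_w: int, overlay_h: int) -> List[Tuple[int, int]]:
    """Return candidate top-left overlay positions on a uniform grid.

    Assumption of this sketch: the stride is frame_w // sigma horizontally
    and frame_h // sigma vertically, and only positions where the overlay
    fits fully inside the frame are kept.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    step_x = max(1, frame_w // sigma)
    step_y = max(1, frame_h // sigma)
    return [(x, y)
            for y in range(0, frame_h - overlay_h + 1, step_y)
            for x in range(0, frame_w - overlay_w + 1, step_x)]


def frames_to_process(num_frames: int, k: int) -> List[int]:
    """Indices of the frames on which placement is recomputed (every k-th frame)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return list(range(0, num_frames, k))
```

With k = 1 every frame index is returned, and with k = 2 every other index, which matches the description of k. A larger Σ gives a finer grid and therefore more candidate positions to evaluate per processed frame.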

It may not be feasible for the technique or method of the present disclosure to search over all pixel values to compute the best overlay position. The saliency map has discrete values, so optimization techniques such as stochastic gradient descent may not be usable. A linear search through all pixels is prohibitively expensive. The present disclosure therefore takes a uniform sampling approach. Some other intermediate variables computed by the method and system 100 of the present disclosure are as follows:
a) X_P; Y_P is the optimal position of the overlay in the previous iteration. It is initialized to the center of the frame.
b) X; Y is the optimal position of the overlay computed in the current iteration.
c) SM is the saliency map computed using conventional computation technique(s) (see, e.g., Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk, "Frequency-tuned salient region detection," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1597-1604, also referred to as Achanta et al., or conventional visual saliency techniques; these terms may be used interchangeably herein).
d) P is the set of pixels sampled from the search space.
e) F_w, F_h are the width and height of the video frame, respectively.
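One way to read these variables together is as inputs to a per-frame cost evaluated over the sampled set P. The exact objective is not stated in this passage, so the additive form below (summed saliency under the overlay plus λ times the Euclidean distance from the previous position) is an assumption made for illustration, and `overlay_saliency` and `best_overlay_position` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[int, int]


def overlay_saliency(sm: Sequence[Sequence[float]], x: int, y: int,
                     overlay_w: int, overlay_h: int) -> float:
    """Sum of saliency values covered by an overlay whose top-left corner is (x, y)."""
    return sum(sm[row][col]
               for row in range(y, y + overlay_h)
               for col in range(x, x + overlay_w))


def best_overlay_position(sm: Sequence[Sequence[float]],
                          candidates: List[Position],
                          prev: Position,
                          overlay_w: int, overlay_h: int,
                          lam: float) -> Position:
    """Pick the candidate (X, Y) in P with the lowest assumed cost.

    Assumed cost: covered saliency + lam * distance to (X_P, Y_P).
    Ties are resolved in favor of the earliest candidate in the list.
    """
    def cost(pos: Position) -> float:
        x, y = pos
        return overlay_saliency(sm, x, y, overlay_w, overlay_h) + lam * math.dist(pos, prev)

    return min(candidates, key=cost)
```

Under this assumed form, a small `lam` lets the overlay move freely to the least salient region, while a large `lam` keeps it near the previous position, which mirrors the trade-off described for λ.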

The above description is better understood by way of the following steps depicted in FIG. 2. In one embodiment of the present disclosure, at step 202, the one or more hardware processors 104 receive, in real time, (i) an input video comprising a plurality of frames and an object of interest in the plurality of frames, and (ii) a label for which an initial overlay position is pre-computed for placement on a center frame of the input video. In one embodiment, the label comprises a label height and a label width. In one embodiment of the present disclosure, the input video is depicted in FIG. 3. The label having the initial overlay position (e.g., the label is on, or is to be placed at, the center frame of the input video) is also received as input (not shown in FIGS. 2 and 3). Upon receiving the input video and the label, at step 204, the one or more hardware processors compute, in real time, a saliency map for each of the plurality of frames to obtain a plurality of saliency maps. An exemplary saliency map is depicted in FIG. 3. In the present disclosure, the system 100 computes a saliency map for each frame present in the input video. In other words, there is one saliency map for each corresponding frame of the input video. The saliency map computation is therefore performed iteratively until the last frame of the input video to obtain the plurality of saliency maps.
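The per-frame saliency computation of step 204 can be sketched as follows. This is a deliberately simplified variant in the spirit of frequency-tuned saliency (distance between the global image mean and a blurred copy of the image): it assumes grayscale frames given as nested lists and substitutes a 3x3 box blur for the Gaussian blur and color-space processing of the cited technique, so it is not the implementation used by the disclosure. All function names are hypothetical.

```python
from typing import List

Gray = List[List[float]]


def box_blur3(img: Gray) -> Gray:
    """3x3 box blur; border pixels average only the neighbors that exist."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def saliency_map(img: Gray) -> Gray:
    """Saliency at each pixel = |global mean intensity - blurred intensity|."""
    h, w = len(img), len(img[0])
    mean = sum(map(sum, img)) / (h * w)
    blurred = box_blur3(img)
    return [[abs(mean - blurred[y][x]) for x in range(w)] for y in range(h)]


def saliency_maps(frames: List[Gray]) -> List[Gray]:
    """One saliency map per input frame, in frame order."""
    return [saliency_map(frame) for frame in frames]
```

A uniform frame yields an all-zero map, while a frame with a bright spot yields higher values around that spot than in the far corners.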

At step 206, the one or more hardware processors 104 compute, in real time, a Euclidean distance between a current overlay position and a previous overlay position, based on the initial overlay position of the label, to obtain a plurality of Euclidean distances. The Euclidean distance computation is performed iteratively until the last frame of the input video; in other words, as with the saliency map computation, one Euclidean distance is computed for each corresponding frame of the input video. An exemplary Euclidean distance computation is depicted in FIG. 3. In the present disclosure, the Euclidean distance is computed for each frame to control temporal jitter in the position of the label to be placed in the input video. The temporal jitter is controlled in real time as the input video is received and processed.
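The per-frame distance computation of step 206 can be illustrated with a minimal Python sketch. The function name `overlay_distances` and the sample positions are hypothetical; the sketch assumes the first entry of the list is the pre-computed initial overlay position and that each later entry is the overlay position chosen for one frame.

```python
import math

def overlay_distances(positions):
    """Euclidean distance between each overlay position and the one before it."""
    return [math.dist(prev, cur) for prev, cur in zip(positions, positions[1:])]

# Hypothetical overlay positions (x, y); the first is the initial position.
positions = [(640, 360), (640, 360), (643, 364), (655, 369)]
distances = overlay_distances(positions)
```

A large value in `distances` marks a frame where the label would jump noticeably, which is the temporal jitter the method aims to keep small.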

Once the plurality of saliency maps and the plurality of Euclidean distances are computed, at step 208, the one or more hardware processors 104 calculate, in real time, an updated overlay position of the label for placement in the input video based on the plurality of saliency maps and the plurality of Euclidean distances. In other words, the updated overlay position of the label is calculated by combining the plurality of saliency maps and the plurality of Euclidean distances, as depicted in FIG. 3. Although steps 204 and 206 are described as being performed sequentially, the steps of (i) computing a saliency map for each of the plurality of frames and (ii) computing, for each of the plurality of frames, the Euclidean distance between the current overlay position and the previous overlay position can be performed concurrently. This can further ensure that the computation takes less time, which in turn may result in better or optimal utilization of resources. Further, once the updated overlay position is calculated, the system 100 (or the one or more hardware processors 104), at step 210, shifts the label from the initial overlay position to the updated overlay position to minimize (or reduce) occlusion of the view of the object of interest. Alternatively, this also ensures that the view is free of occlusion when the label is shifted from the initial overlay position to the updated overlay position. In the present disclosure, a plurality of pixels corresponding to a Euclidean distance, between the current overlay position and the previous overlay position, that is within a predetermined threshold range are selected for shifting the label from the initial overlay position to the updated overlay position. In other words, one or more pixels whose Euclidean distance between the current overlay position and the previous overlay position is within the predetermined threshold range (also referred to as the "predetermined threshold", used interchangeably hereinafter) are selected for shifting the label from its initial overlay position to the updated overlay position calculated by the system 100 in real time. The updated overlay position comprises information about the label, including the label height and the label width (e.g., the width and height may be the same as the initial width and height associated with the initial overlay position, or may vary depending on the selection of the plurality of pixels). An exemplary overlaid frame is depicted in FIG. 3. More specifically, FIG. 3, with reference to FIGS. 1 and 2, is a block diagram of real-time overlay placement in the input video by computing saliency maps and Euclidean distances, in accordance with an embodiment of the present disclosure.
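The shift of step 210 can be sketched as a gradual move rather than a single jump. The following Python fragment is an illustration only: the pseudo-code in this description mentions linear interpolation for the transition, but the function name `interpolate_shift` and the number of intermediate steps are assumptions made here.

```python
def interpolate_shift(start, end, steps):
    """Return `steps` positions moving linearly from `start` to `end` (end included)."""
    (x0, y0), (x1, y1) = start, end
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(1, steps + 1)
    ]

# Move a label from its initial position to an updated one over five frames.
path = interpolate_shift((0, 0), (10, 20), 5)
```

Spreading the move over several frames keeps the per-frame displacement small, which is one simple way to limit visible jitter.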

In a nutshell, steps 202 through 208 are explained below for better understanding.

The method of the present disclosure is executed, for example, every k frames. For a given frame, a visual saliency map (also referred to as a saliency map, used interchangeably hereinafter) is computed using the pseudo-code routine (e.g., SaliencyMapComputation). The system 100 then iterates over the pixel values provided in the search space (e.g., refer to the search space sampling parameter Σ) and sums the saliency values given by the map within a hypothetical box of size O_h, O_w. In the present disclosure, the pixel value with the lowest sum is selected as the ideal candidate, indicating the lowest saliency. The overlay is shifted if the Euclidean distance d between the previous position and the current position, scaled by λ (referred to as the predetermined threshold range or predetermined threshold), is as small as possible. To combine the constraints imposed by both low saliency and temporal jitter, the present disclosure formulates an optimization problem as follows.
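The combination described above can be written compactly. The expression below is one reading of the text (box-summed saliency plus the λ-scaled distance to the previous position, minimized over the sampled pixels) and is not an equation reproduced from the source; the subscripted box symbol L_{x,y} is notation assumed here for the hypothetical box anchored at (x, y), and (X_P, Y_P) denotes the previous overlay position.

```latex
(X, Y) = \operatorname*{arg\,min}_{(x, y) \in P}
  \left[ \sum_{(a, b) \in L_{x,y}} SM(a, b)
  + \lambda \, \sqrt{(x - X_P)^2 + (y - Y_P)^2} \right],
\qquad
L_{x,y} = \{ (a, b) \mid x \le a \le x + O_w,\; y \le b \le y + O_h \}
```

With λ = 0 the label simply follows the least salient region; larger values of λ make the label increasingly reluctant to move away from its previous position.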

The following is exemplary pseudo-code of the technique/method of the present disclosure.
1. (X_P, Y_P) = (frame width / 2, frame height / 2)
2. for every k-th frame
3.   SM = SaliencyMapComputation(frame)
4.   for (x, y) ∈ P
5.     L = {(a, b) | x ≤ a ≤ x + O_w, y ≤ b ≤ y + O_h}
6.     s_{x,y} = Σ_{(a,b) ∈ L} SM(a, b)
7.     d_{x,y} = λ × Distance((X, Y), (X_P, Y_P))
8.   s_min = min(s_{x,y} + d_{x,y})
9.   (X, Y) := arg_min(s_{x,y})
10.  (X_P, Y_P) := (X, Y) // use linear interpolation for the entire transition
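The inner search of the pseudo-code can be sketched in Python for a single frame. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the function name `place_overlay` and its parameters are hypothetical; the search space P is assumed to be a regular grid sampled every `stride` pixels (one possible reading of the sampling parameter Σ); the distance is measured from each candidate (x, y) to the previous position; the selected position minimizes the combined cost s + d; the box is half-open (O_w by O_h pixels); and the summed-area table is an implementation choice made here for speed.

```python
import numpy as np

def place_overlay(saliency, prev_pos, label_w, label_h, stride, lam):
    """Pick the top-left corner (x, y) of the label for one frame.

    Cost of a candidate = saliency summed under the label box
                          + lam * Euclidean distance to prev_pos.
    """
    frame_h, frame_w = saliency.shape
    # Summed-area table with a zero border: any box sum needs four lookups.
    sat = np.zeros((frame_h + 1, frame_w + 1), dtype=np.float64)
    sat[1:, 1:] = saliency.cumsum(axis=0).cumsum(axis=1)

    best_pos, best_cost = prev_pos, float("inf")
    # Candidate corners on a regular grid, keeping the box inside the frame.
    for y in range(0, frame_h - label_h + 1, stride):
        for x in range(0, frame_w - label_w + 1, stride):
            y2, x2 = y + label_h, x + label_w
            s = sat[y2, x2] - sat[y, x2] - sat[y2, x] + sat[y, x]
            d = lam * np.hypot(x - prev_pos[0], y - prev_pos[1])
            if s + d < best_cost:
                best_pos, best_cost = (x, y), s + d
    return best_pos

# Toy saliency map: salient everywhere except the bottom-right quadrant.
toy = np.ones((20, 20))
toy[10:, 10:] = 0.0
free_choice = place_overlay(toy, (0, 0), 5, 5, stride=5, lam=0.0)
sticky_choice = place_overlay(toy, (0, 0), 5, 5, stride=5, lam=100.0)
```

In the toy example, λ = 0 lets the label move to the non-salient quadrant, while a very large λ keeps it at the previous position, which shows the trade-off between occlusion and jitter.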

In the above pseudo-code, to execute the line (or command or program code) "SM = SaliencyMapComputation(frame)", a conventional saliency map computation technique can be referred to (e.g., refer to Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk, "Frequency-tuned salient region detection," Computer Vision and Pattern Recognition, 2009 (CVPR 2009), IEEE Conference on, IEEE, 2009, pp. 1597-1604, also referred to as Achanta et al., which can be found at https://infoscience.epfl.ch/record/135217/files/1708.pdf). More specifically, in one embodiment, for a better understanding of the saliency map computation, Section 3.2 of the above reference by Achanta et al., including equations (1), (2), (3) and (4), can be referred to.
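For orientation, the frequency-tuned idea of Achanta et al. can be sketched as the per-pixel distance between the mean image color and a lightly blurred copy of the image. The Python sketch below is a simplification under stated assumptions: the function name is hypothetical, the blur is a 5-tap separable binomial filter with edge replication, and the input is used in whatever color space the caller supplies, whereas the cited paper works in the CIELAB color space.

```python
import numpy as np

def frequency_tuned_saliency(image):
    """Saliency = distance between the mean color and a lightly blurred image.

    `image` is an H x W x C array; returns an H x W saliency map.
    """
    image = np.asarray(image, dtype=np.float64)
    kernel = np.array([1, 4, 6, 4, 1], dtype=np.float64) / 16.0  # 5-tap binomial
    h, w = image.shape[:2]
    # Separable blur: replicate edges, filter along columns, then along rows.
    padded = np.pad(image, ((2, 2), (2, 2), (0, 0)), mode="edge")
    horiz = sum(kernel[i] * padded[:, i:i + w] for i in range(5))
    blurred = sum(kernel[i] * horiz[i:i + h] for i in range(5))
    mean_color = image.reshape(-1, image.shape[2]).mean(axis=0)
    return np.linalg.norm(blurred - mean_color, axis=2)

# Toy frame: a bright square on a dark background.
frame = np.zeros((16, 16, 3))
frame[6:10, 6:10] = 1.0
sal = frequency_tuned_saliency(frame)
```

The bright square differs most from the mean color, so it receives the highest saliency, and a label placed by the method would tend to avoid it.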

Experiments and Results
The experiments involved subjects (e.g., 25 researchers in the 25-34 age group, 10 women and 15 men, to test the method/pseudo-code) viewing an object under inspection at a 3D printer through a tablet. A set of subjective and objective metrics was captured to evaluate (a) the user experience and (b) the overlay placement. Labels of dimensions 50×50 were used in all experiments; this can be customized according to the user's needs. The experiments were conducted on a Nexus® 6 Android phone and a Nexus® 9 tablet. The users were asked to rate the following parameters on a scale of 1 to 5, after which a mean opinion score was obtained. The metrics used were: (i) position of the overlay, (ii) low jitter in the overlay, (iii) color of the overlay box and text, and (iv) overlay responsiveness.

The present disclosure used the DIEM dataset to evaluate the method of the present disclosure (e.g., refer to Parag K. Mital, Tim J. Smith, Robin L. Hill, and John M. Henderson, "Clustering of gaze during dynamic scene viewing is predicted by motion," Cognitive Computation, vol. 3, no. 1, pp. 5-24, 2011, http://pkmital.com/home/wp-content/uploads/2010/03/Mital_Clustering_of_Gaze_During_Dynamic_Scene_Viewing_is_Predicted.pdf). Videos of resolution 1280×720 were taken from the dataset for the experiments. The dataset consisted of a variety of videos from different genres such as advertisements, trailers, and television series. In addition, the dataset provided detailed eye-fixation saliency annotations obtained from eye movements.

During the experiments, the values of the parameters λ and Σ were found empirically through a grid search (as known in the art) on the DIEM dataset, and their effect on the average label occlusion over saliency (LOS) score (defined and discussed below) of the overlay over the entire video was compared. FIG. 4, with reference to FIGS. 1 through 3, is a graphical representation depicting the variation of the average LOS score with changes in λ and Σ as a contour plot, in accordance with an embodiment of the present disclosure. More specifically, FIG. 4 depicts a contour plot of the average LOS score with respect to λ and Σ. During the experiments, it was observed that the LOS score is independent of λ, that the optimal combination of Σ and λ is (5, 0.021), and that a small Σ is preferable (refer to the line representation with inverted-Y-like symbols between the contours in FIG. 4, denoted by 402).

Results
Subjective Metrics
The following exemplary table (Table 1) shows the subjective metric scores.

From Table 1 above, the present disclosure infers that the position of the overlay, which is critical to prevent the overlay from covering salient regions in the scene, was rated very highly at 4.5. The real-time implementation of the above pseudo-code of the method of the present disclosure, running at approximately 20 frames per second (fps), resulted, in some cases, in a high overlay responsiveness score of 4.7. A simple color scheme was chosen: a white box with black font and vice versa, with the transparency of the box set to α = 0.5. The color of the overlay box depended on a simple adaptive threshold applied to the pixel intensity (luminance channel Y), given by the exemplary equation (or expression) 2 below.

The data-driven threshold Thresh is the average of the difference between the maximum and minimum luminance values of a given scene. If the value is greater than or equal to Thresh, the overlay box uses a black background, and vice versa (note that the overall setup of how the text labels are overlaid was demonstrated through experiments). The sample overlays considered during the experiments showed only contextual information about the entire scene. The demonstrations also showed that the overlay works in real time with little jitter.
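The adaptive choice of box color can be sketched as follows. Several details are assumptions made for illustration, since the equation itself is not reproduced in this text: the luminance Y is computed with the ITU-R BT.601 weights, the compared value is taken to be the mean luminance under the label box, and Thresh follows the literal reading (maximum minus minimum, divided by two); a midpoint reading, (maximum plus minimum) divided by two, would be an equally plausible alternative. The function name `overlay_box_color` is hypothetical.

```python
import numpy as np

def overlay_box_color(frame_rgb, box):
    """Return 'black' or 'white' for the overlay box background.

    `frame_rgb` is an H x W x 3 array; `box` is (x, y, width, height).
    """
    x, y, w, h = box
    rgb = np.asarray(frame_rgb, dtype=np.float64)
    # Luminance Y with BT.601 weights (assumed; the text only names channel Y).
    luma = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    thresh = (luma.max() - luma.min()) / 2.0  # literal reading of the text
    value = luma[y:y + h, x:x + w].mean()     # assumed: mean luminance under the box
    return "black" if value >= thresh else "white"

# Toy frame: dark left half, bright right half.
frame = np.zeros((10, 20, 3))
frame[:, 10:] = 255.0
on_bright = overlay_box_color(frame, (12, 2, 5, 5))
on_dark = overlay_box_color(frame, (2, 2, 5, 5))
```

A black box over a bright region and a white box over a dark region keep the label legible against its surroundings.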

Objective Metrics
The effectiveness of the overlay placement performed by the method/pseudo-code of the present disclosure was compared. The evaluation criterion for this comparison was based on the average LOS score, which measures the occlusion by the label against the saliency ground truth of the video. A lower score indicates an effective overlay placement with less occlusion. The label occlusion over saliency (LOS) score S is defined and expressed by the following equation.

where L is the set of pixels (x, y) occluded by the overlay and G is the ground-truth saliency map. The above pseudo-code for the method of the present disclosure was found to have an average LOS score of 0.042 and to take 0.021 seconds to compute the overlay position.
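A score of this kind can be computed in a few lines. The sketch below assumes that the LOS score is the ground-truth saliency summed over the occluded pixels and divided by the number of occluded pixels; the normalization is an assumption, since the equation is not reproduced in this text, and the function name `los_score` is hypothetical.

```python
import numpy as np

def los_score(ground_truth, box):
    """Mean ground-truth saliency over the pixels covered by the label box.

    `ground_truth` is an H x W saliency map; `box` is (x, y, width, height).
    """
    x, y, w, h = box
    occluded = np.asarray(ground_truth, dtype=np.float64)[y:y + h, x:x + w]
    return float(occluded.sum() / occluded.size)

# Toy ground truth: the top-left 5 x 5 block is fully salient.
gt = np.zeros((10, 10))
gt[:5, :5] = 1.0
worst = los_score(gt, (0, 0, 5, 5))
best = los_score(gt, (5, 5, 5, 5))
partial = los_score(gt, (3, 3, 4, 4))
```

Under this reading the score lies between 0 and 1 for a saliency map normalized to that range, and lower values mean the label hides less of what viewers actually look at.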

Embodiments of the present disclosure provide systems and methods for real-time overlay (contextual information) placement in videos for AR applications. Based on the above experiments and results, it is observed that the present disclosure overcomes the limitation of occlusion of the object/scene of interest by optimally placing labels that aid better interpretation of the scene. The placement of the overlay is formulated as an objective function that minimizes (i) visual saliency around the object of interest and (ii) temporal jitter, thereby facilitating coherence in real-time AR applications (particularly those running on (low-end or high-end) smartphones, tablet(s), AR-based browsers, and the like). Examples of AR applications include, but are not limited to, navigational maps, virtual environment experiences such as can be visualized in gaming applications, and the like. Other examples of AR-based applications include, but are not limited to, live situational awareness in museum exploration tasks, industrial inspection and repair operations, advertising and media, and the tourism industry.

The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.

It is to be understood that the scope of the protection extends to such a program and, in addition, to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method when the program runs on a server, a mobile device, or any suitable programmable device. The hardware device can be any kind of device that can be programmed, including, e.g., any kind of computer such as a server or a personal computer, or any combination thereof. The device may also include means which could be, e.g., hardware means such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.

The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include, but are not limited to, firmware, resident software, microcode, and the like. The functions performed by the various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device.

The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words "comprising," "having," "containing," and "including," and other similar forms, are intended to be equivalent in meaning and to be open-ended, in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, nor meant to be limited to only the listed item or items. It must also be noted that, as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise.

Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term "computer-readable medium" should be understood to include tangible items and exclude carrier waves and transient signals, i.e., to be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD-ROMs, DVDs, flash drives, disks, and any other known physical storage media.

It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of the disclosed embodiments being indicated by the following claims.

REFERENCE SIGNS LIST
100 system, device
102 memory
104 hardware processor
106 input/output (I/O) interface
108 database

Claims

A processor-implemented method, comprising:
receiving, in real time, (i) an input video comprising a plurality of frames and an object of interest in the plurality of frames, and (ii) a label for which an initial overlay position is pre-computed for placement on a center frame of the input video (202);
computing, in real time, a saliency map for each of the plurality of frames to obtain a plurality of saliency maps (204);
computing, in real time, for each of the plurality of frames, a Euclidean distance between a current overlay position and a previous overlay position, based on the initial overlay position of the label, to obtain a plurality of Euclidean distances (206); and
calculating, in real time, an updated overlay position of the label for placement in the input video based on the plurality of saliency maps and the plurality of Euclidean distances (208).

The processor-implemented method of claim 1, wherein the updated overlay position of the label is calculated by combining the plurality of saliency maps and the plurality of Euclidean distances.

The processor-implemented method of claim 1, wherein the Euclidean distance for each of the plurality of frames is computed to control, in real time, temporal jitter in a position of the label placed in the input video.

The processor-implemented method of claim 1, further comprising shifting (210) the label from the initial overlay position to the updated overlay position to minimize occlusion of the view of the object of interest.

The processor-implemented method of claim 1, wherein a plurality of pixels corresponding to a Euclidean distance, between the current overlay position and the previous overlay position, that is within a predetermined threshold range are selected for shifting the label from the initial overlay position to the updated overlay position.

A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive, in real time, (i) an input video comprising a plurality of frames and an object of interest in the plurality of frames, and (ii) a label for which an initial overlay position is pre-computed for placement on a center frame of the input video;
compute, in real time, a saliency map for each of the plurality of frames to obtain a plurality of saliency maps;
compute, in real time, for each of the plurality of frames, a Euclidean distance between a current overlay position and a previous overlay position, based on the initial overlay position of the label, to obtain a plurality of Euclidean distances; and
calculate, in real time, an updated overlay position of the label for placement in the input video based on the plurality of saliency maps and the plurality of Euclidean distances.

The system of claim 6, wherein the updated overlay position of the label is calculated by combining the plurality of saliency maps and the plurality of Euclidean distances.

The system of claim 6, wherein the Euclidean distance for each of the plurality of frames is computed to control, in real time, temporal jitter in a position of the label placed in the input video.

The system of claim 6, wherein the one or more hardware processors are further configured to shift the label from the initial overlay position to the updated overlay position to minimize occlusion of the view of the object of interest.

The system of claim 6, wherein a plurality of pixels corresponding to a Euclidean distance, between the current overlay position and the previous overlay position, that is within a predetermined threshold range are selected for shifting the label from the initial overlay position to the updated overlay position.

One or more non-transitory machine-readable information storage media comprising one or more instructions which, when executed by one or more hardware processors, cause:
receiving, in real time, (i) an input video comprising a plurality of frames and an object of interest in the plurality of frames, and (ii) a label for which an initial overlay position is pre-computed for placement on a center frame of the input video;
computing, in real time, a saliency map for each of the plurality of frames to obtain a plurality of saliency maps;
computing, in real time, for each of the plurality of frames, a Euclidean distance between a current overlay position and a previous overlay position, based on the initial overlay position of the label, to obtain a plurality of Euclidean distances; and
calculating, in real time, an updated overlay position of the label for placement in the input video based on the plurality of saliency maps and the plurality of Euclidean distances.

The one or more non-transitory machine-readable information storage media of claim 11, wherein the updated overlay position of the label is calculated by combining the plurality of saliency maps and the plurality of Euclidean distances.

The one or more non-transitory machine-readable information storage media of claim 11, wherein the Euclidean distance for each of the plurality of frames is computed to control, in real time, temporal jitter in a position of the label placed in the input video.

The one or more non-transitory machine-readable information storage media of claim 11, wherein the instructions, when executed by the one or more hardware processors, further cause shifting the label from the initial overlay position to the updated overlay position to minimize occlusion of the view of the object of interest.

The one or more non-transitory machine-readable information storage media of claim 11, wherein a plurality of pixels corresponding to a Euclidean distance, between the current overlay position and the previous overlay position, that is within a predetermined threshold range are selected for shifting the label from the initial overlay position to the updated overlay position.