
JP2024514813A - Learning ordinal representations for object localization based on deep reinforcement learning

Info

Publication number: JP2024514813A
Application number: JP2023561700A
Authority: JP
Inventors: シャオボハン; レンチャンミン; ティンフェンリ
Original assignee: NEC Laboratories America Inc
Current assignee: NEC Laboratories America Inc
Priority date: 2021-04-08
Filing date: 2022-04-08
Publication date: 2024-04-03
Also published as: DE112022002037T5; WO2022217122A1

Abstract

A reinforcement learning based approach to the problem of query object localization is provided. An agent is trained to localize objects of interest specified by a small exemplary set. A transferable reward signal, formulated using the exemplary set, is learned by ordinal metric learning. Because this enables test-time policy adaptation to new environments where the reward signal is not readily available, the approach outperforms fine-tuning approaches, which are limited to annotated images. Furthermore, the transferable reward allows the trained agent to be reused for new tasks such as annotation refinement and selective localization from multiple common objects across a set of images. Experiments on the corrupted MNIST and CU-Birds datasets demonstrate the effectiveness of the approach. [Selected Figure] Figure 2

Description

This disclosure relates generally to image processing and image recognition. More specifically, it describes systems and methods for learning ordinal representations for object localization based on deep reinforcement learning.

As one skilled in the art will readily appreciate, in many fields it is often important to automatically discover one or more types of common objects in an image or set of images. Fully supervised object detection or localization requires a large amount of human annotation (i.e., bounding boxes around target objects) for training, which is costly and impractical for cost-sensitive applications. In distributed fiber optic sensing and digital pathology, for example, high-quality annotations from experienced experts are quite limited. Weakly supervised object detection or localization (WSOD or WSOL) methods require only image-level annotations (classes). However, the localizations they learn are often partial, covering the most discriminative region of the target object rather than its entire extent. Finally, existing methods for co-localization are unsupervised, so they may output unwanted common objects when an image dataset contains multiple types of common objects.

An advance in the art is made according to aspects of the present disclosure directed to systems and methods that address the above problems. Advantageously, our approach requires only a "seed dataset" with accurate bounding box annotations.

In sharp contrast to conventional fully supervised object detection/localization approaches, our algorithm requires a much smaller seed dataset. Starting from the seed dataset, a large number of perturbed boxes are sampled as the reinforcement learning agent explores the image environment. The preference among these perturbed boxes is naturally determined by their intersection over union (IoU) with the ground truth bounding box of the image. We encode this information into an ordinal representation that is trained together with the reinforcement learning annotation agent. Existing deep reinforcement learning based object localization approaches cannot encode this information and are therefore much less sample efficient.
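The sampling and IoU-based ordering of perturbed boxes described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `(x1, y1, x2, y2)` box format, the edge-jitter scheme in `perturb`, and the names `perturb` and `ranked_perturbations` are all assumptions made for the example.

```python
import random


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def perturb(box, p, rng):
    """Jitter each edge by up to a fraction p of the box size (hypothetical scheme)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    nx1 = x1 + rng.uniform(-p, p) * w
    ny1 = y1 + rng.uniform(-p, p) * h
    nx2 = x2 + rng.uniform(-p, p) * w
    ny2 = y2 + rng.uniform(-p, p) * h
    # Re-order the corners so the result is always a valid box.
    return (min(nx1, nx2), min(ny1, ny2), max(nx1, nx2), max(ny1, ny2))


def ranked_perturbations(gt_box, p, n, seed=0):
    """Sample n perturbed boxes and sort them from highest to lowest IoU."""
    rng = random.Random(seed)
    boxes = [perturb(gt_box, p, rng) for _ in range(n)]
    return sorted(boxes, key=lambda b: iou(b, gt_box), reverse=True)
```

The sorted list gives, for any two perturbed boxes of the same image, which one should be preferred; that ordering is the supervision an ordinal representation would be trained to respect.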

Furthermore, in contrast to WSOD/WSOL methods, our approach explicitly focuses on the similarity between common objects across different images of the same image class, rather than on discrimination between different classes. Image-level class labels can be incorporated but are not required.

More specifically, the ambiguity about the class of the target object that arises in co-localization is avoided by explicitly specifying the target object in the seed dataset. The algorithm works in a human-in-the-loop manner: given an image dataset, a human starts by annotating some of the data, and the reinforcement learning agent automatically labels the remaining data following the human's guidance.

Our framework is motivated by the general challenge of annotating image data in fiber sensing tasks, which is very time-consuming and labor-intensive. However, our approach can also be applied to other data modalities and applications, such as digital pathology images, object tracking in video, and temporal localization for sound event detection.

Operationally, each image is regarded as an environment with which the annotation agent interacts by moving a bounding box. The learned localization strategy is intended to generalize to new environments (images). To facilitate information sharing across multiple learning stages and across different images, the reward is not given directly via IoU but indirectly via distances between learned latent representations.

In our approach, ordinal representation learning and deep reinforcement learning (RL) are trained together and benefit each other. The representation learning model is trained not only on accurately annotated data but also on augmented data with perturbations. Existing representation learning methods cannot directly produce more compact clusters from correctly annotated data, so rewards can only be defined on the original data and not on latent embeddings. In our approach, the latent embedding function is trained to preserve the ordinal relationship between pairs of imperfect annotations of the same image. That is, the embedding of a bounding box with higher IoU is closer to the embedding of the ground truth bounding box than the embedding of a box with lower IoU. As a result, the RL reward can be defined based on embedding distance.
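The ordinal constraint above can be written as a triplet-style hinge loss. The sketch below is only one plausible form: Euclidean distance, the `margin` value, and the function name `ordinal_triplet_loss` are assumptions, and embeddings are plain Python sequences rather than network outputs.

```python
import math


def ordinal_triplet_loss(gt_emb, higher_iou_emb, lower_iou_emb, margin=0.2):
    """Hinge loss that is zero when the higher-IoU box embeds closer to the
    ground truth embedding than the lower-IoU box by at least `margin`.

    `margin` is a hypothetical hyperparameter chosen for illustration.
    """
    d_close = math.dist(gt_emb, higher_iou_emb)
    d_far = math.dist(gt_emb, lower_iou_emb)
    return max(0.0, d_close - d_far + margin)


# The ordering is respected, so the loss is zero.
ok = ordinal_triplet_loss((0.0, 0.0), (1.0, 0.0), (3.0, 0.0))

# The ordering is violated, so the loss is positive and would drive an update.
violated = ordinal_triplet_loss((0.0, 0.0), (3.0, 0.0), (1.0, 0.0))
```

Minimizing such a loss over many box pairs pushes the embedding distance to behave monotonically with IoU, which is what allows a reward to be read off the embedding space.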

If the ordinal embedding is trained separately from the deep RL agent, training is redundant and inefficient, because perturbed samples are generated at random and most of them do not lie on the RL agent's search path. In the proposed joint training scheme, box pairs are sampled while the RL agent explores the embedding space, so the ordinal embedding can be trained more efficiently. Supervision is customized for the different stages of learning. In later stages of training, the model learns to assign priority to pairs of better-annotated boxes.
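One way to picture pair sampling along the agent's search path is to take the IoU values of the boxes the agent actually visited and keep every pair whose order is unambiguous. This is a hedged sketch: the helper name `ordered_pairs` and the `min_gap` threshold are assumptions, and the disclosure does not specify how pairs are selected.

```python
from itertools import combinations


def ordered_pairs(ious, min_gap=0.0):
    """Return (better, worse) index pairs for boxes visited on one trajectory.

    `ious[i]` is the IoU of the i-th visited box with the ground truth box.
    Pairs whose IoU values differ by no more than `min_gap` (a hypothetical
    threshold) are skipped because their order carries little signal.
    """
    pairs = []
    for i, j in combinations(range(len(ious)), 2):
        if abs(ious[i] - ious[j]) <= min_gap:
            continue
        pairs.append((i, j) if ious[i] > ious[j] else (j, i))
    return pairs


# IoU of four boxes visited in sequence by the agent.
trajectory_ious = [0.2, 0.5, 0.5, 0.9]
pairs = ordered_pairs(trajectory_ious)
```

Each returned pair can serve as the "higher IoU" and "lower IoU" members of a triplet anchored at the ground truth box, so the embedding is trained on exactly the regions the agent explores.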

As a by-product, the embedding distance also provides a metric for assessing annotation quality. Given a set of images with both high-quality and low-quality annotations, the well-annotated data fall into a compact cluster in the ordinal embedding space and can therefore be selected. Annotation quality can then be ranked by distance to the cluster centroid of the filtered data.
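The ranking step can be sketched with plain vectors. The function names and the use of Euclidean distance are assumptions for illustration, and `trusted_embeddings` stands in for the filtered, well-annotated cluster mentioned above.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def rank_by_quality(embeddings, trusted_embeddings):
    """Return annotation indices sorted from closest to farthest from the
    centroid of the trusted cluster (closest is treated as best quality)."""
    c = centroid(trusted_embeddings)
    return sorted(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], c))


trusted = [[0.0, 0.0], [2.0, 0.0]]                # centroid is [1.0, 0.0]
candidates = [[5.0, 0.0], [1.0, 0.5], [2.0, 0.0]]
order = rank_by_quality(candidates, trusted)
```

Annotations at the end of the ranking are the natural candidates for correction by the trained agent or for review by a human.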

Finally, our recurrent neural network (RNN) based method allows the search to start from the whole image. This makes our approach applicable to co-localization within a single large image that contains multiple common objects of the same class, even when the target objects differ in size and the image has high resolution. The interaction between the human and the RL annotator works as follows. The human starts the annotation process by labeling one or two target objects of interest. The annotation agent starts by examining the whole image at a coarse resolution and, following a top-down scheme, takes a sequence of iterative actions to locate the objects in the rest of the image. The human can accept or reject the selected objects and/or run the annotator again, until no new objects are found.

A more complete understanding of the present disclosure may be realized by reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a joint training framework for the annotation agent and the data representation, according to aspects of the present disclosure.

FIG. 2 is a schematic flow diagram illustrating a model training process, according to aspects of the present disclosure.

FIG. 3 is a schematic diagram illustrating Application 1, human-guided automatic annotation of fiber sensing datasets, in which properly annotated data can benefit downstream training of event classifiers, according to aspects of the present disclosure.

FIG. 4 is a schematic diagram illustrating Application 2, worker quality assessment and improvement for a crowdsourcing-based image annotation platform, in which the trained agent can identify high-quality annotations and correct low-quality data, according to aspects of the present disclosure.

FIG. 5 is a schematic diagram illustrating ordinal representation learning with an embedding net and triplet loss, according to aspects of the present disclosure.

FIG. 6 is a schematic diagram illustrating the reward based on the ordinal embedding and the action space, according to aspects of the present disclosure.

FIG. 7 is a schematic diagram illustrating a recurrent neural network (RNN) based architecture combining the RL agent and ordinal representation learning, according to aspects of the present disclosure.

FIGS. 8(A), 8(B), and 8(C) are plots illustrating the action sequences and learning convergence of the RL agent, the co-localization of the digit 4 from a cluttered background, and the convergence of the embedding distance to the ground truth, according to aspects of the present disclosure.

FIG. 9 illustrates a dataset comparing a fixed embedding and a trained embedding during RL updates, according to aspects of the present disclosure.

FIG. 10 illustrates a dataset showing an agent trained and tested on the digit 4 and on other, new digits 0-9, according to aspects of the present disclosure.

FIG. 11 is a schematic diagram illustrating RL-based query object localization with the reward signal defined on an exemplary set rather than on bounding boxes, according to aspects of the present disclosure.

FIG. 12 is a schematic diagram illustrating an exemplary RoI encoder and projection head, according to aspects of the present disclosure.

FIG. 13(A) is a dataset showing OrdAcc (%) for random sampling and anchor sampling, according to aspects of the present disclosure. FIG. 13(B) is a dataset showing CorLoc (%) for signed and unsigned IoU rewards, according to aspects of the present disclosure.

FIGS. 14(A) and 14(B) are plots showing a comparison under different training set sizes, according to aspects of the present disclosure.

FIG. 15(A) illustrates a dataset of CorLoc (%), according to aspects of the present disclosure. FIG. 15(B) illustrates a dataset comparing four training strategies depending on the anchors used, according to aspects of the present disclosure.

FIG. 16 illustrates a dataset of performance on different digits depending on the anchors used, according to aspects of the present disclosure.

FIG. 17 shows plots illustrating results before adaptation, after adaptation, and with fine-tuning, depending on the anchors used, according to aspects of the present disclosure.

FIG. 18(A) illustrates performance when going from inaccurately annotated to tightly annotated bounding boxes, according to aspects of the present disclosure. FIG. 18(B) illustrates performance when transferring to other backgrounds, according to aspects of the present disclosure.

FIG. 19 illustrates a listing of Algorithm I for training the localization agent and its reward depending on the anchors used, according to aspects of the present disclosure.

The following merely illustrates the principles of the disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its spirit and scope.

Furthermore, all examples and conditional language recited herein are intended only for pedagogical purposes, to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.

Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, such equivalents are intended to include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.

Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure.

FIG. 1 is a schematic diagram illustrating a joint training framework for the annotation agent and the data representation, according to aspects of the present disclosure.

FIG. 2 is a schematic flow diagram illustrating a model training process, according to aspects of the present disclosure.

FIG. 3 is a schematic diagram illustrating Application 1, human-guided automatic annotation of fiber sensing datasets, in which properly annotated data can benefit downstream training of event classifiers, according to aspects of the present disclosure.

FIG. 4 is a schematic diagram illustrating Application 2, worker quality assessment and improvement for a crowdsourcing-based image annotation platform, in which the trained agent can identify high-quality annotations and correct low-quality data, according to aspects of the present disclosure.

As we will now describe, training in our method/algorithm involves three steps.

Step 1: Identify a set of seed images. These can be obtained from human experts, from pre-selected heuristics, or from a third-party dataset.

Step 2: Pre-train the ordinal embedding. Given the seed dataset, the embedding is pre-trained by randomly perturbing the ground truth bounding boxes at various levels. The level of perturbation is denoted by the parameter p. The ordinal embedding must satisfy the ordinal constraint locally, for each pair of perturbed samples augmented from the same image. FIG. 5 is a schematic diagram illustrating ordinal representation learning with an embedding net and triplet loss, according to aspects of the present disclosure.

Step 3: Reinforcement learning. Given the embedding function, the RL agent starts from the whole image and recurrently samples actions from a discrete action space. FIG. 6 is a schematic diagram illustrating the reward based on the ordinal embedding and the action space, according to aspects of the present disclosure. The reward for an action is computed from the embedding distance. The policy network (action head) is updated together with the embedding network. The neural network architecture is shown in detail in FIG. 7, which is a schematic diagram illustrating an RNN-based architecture combining the RL agent and ordinal representation learning, according to aspects of the present disclosure.
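Step 3 can be illustrated with a toy action space and a reward computed from embedding distance. Everything concrete here is an assumption: the six actions in `ACTIONS`, the `step` fraction, the signed +1/-1 reward, and the `embed` callable that stands in for the learned embedding network.

```python
import math

# Hypothetical discrete action set; the disclosure does not enumerate the actions.
ACTIONS = ("left", "right", "up", "down", "bigger", "smaller")


def apply_action(box, action, step=0.1):
    """Move or rescale an (x1, y1, x2, y2) box by a fraction of its own size."""
    x1, y1, x2, y2 = box
    dx, dy = step * (x2 - x1), step * (y2 - y1)
    if action == "left":
        return (x1 - dx, y1, x2 - dx, y2)
    if action == "right":
        return (x1 + dx, y1, x2 + dx, y2)
    if action == "up":
        return (x1, y1 - dy, x2, y2 - dy)
    if action == "down":
        return (x1, y1 + dy, x2, y2 + dy)
    if action == "bigger":
        return (x1 - dx / 2, y1 - dy / 2, x2 + dx / 2, y2 + dy / 2)
    if action == "smaller":
        return (x1 + dx / 2, y1 + dy / 2, x2 - dx / 2, y2 - dy / 2)
    raise ValueError(f"unknown action: {action}")


def embedding_reward(embed, box, next_box, target_embedding):
    """Signed reward: +1 if the action moved the box embedding closer to the
    target embedding, -1 if it moved it farther away, 0 if unchanged."""
    before = math.dist(embed(box), target_embedding)
    after = math.dist(embed(next_box), target_embedding)
    if after < before:
        return 1.0
    if after > before:
        return -1.0
    return 0.0
```

In this sketch the reward never touches the IoU directly; it depends only on `embed` and a target embedding, which is the property that lets the same reward be evaluated on images without bounding box annotations.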

The effectiveness of the proposed approach is evaluated on the ClutterMNIST benchmark dataset. FIGS. 8(A), 8(B), and 8(C) show the action sequences and learning convergence of the RL agent, the co-localization of the digit 4 from a cluttered background, and the convergence of the embedding distance to the ground truth, according to aspects of the present disclosure. The figures show the benefit of joint training in terms of final localization performance, and show that an agent trained on the co-localization task for a single digit adapts to find new classes of common objects (digits 0-3 and 5-9) that were not seen during training.

The systems and methods of the present invention perform ordinal representation learning and deep reinforcement learning jointly to overcome the scarcity of high-quality annotated data. They are broadly applicable to fully supervised, weakly supervised, and co-localization tasks.

The systems and methods of the present invention adopt a human-in-the-loop paradigm that makes effective use of a limited amount of high-quality, reliable human-annotated data to identify and improve low-quality annotated data.

As one skilled in the art will readily understand and appreciate, the systems and methods of the present invention may benefit many applications: (1) as a tool for automatically annotating unlabeled datasets in cost-sensitive applications, including but not limited to fiber sensing; (2) as a tool for improving the interpretability of deep neural networks, in the manner of class activation map (CAM) methods; (3) as a tool for assessing annotation quality and improving low-quality annotations on crowdsourcing platforms; and (4) as a tool for locating multiple common target objects within the same image, such as crops in satellite images for intelligent farming or cells in whole slide images for digital pathology.

The exemplary embodiments are described more fully by the drawings and the detailed description. Embodiments according to this disclosure may, however, be embodied in various forms and are not limited to the specific or exemplary embodiments described in the drawings and the detailed description.

FIG. 9 is a dataset comparing a fixed embedding and a trained embedding during RL updates, according to aspects of the present disclosure.

FIG. 10 is a dataset showing an agent trained and tested on the digit 4 and on other, new digits 0-9, according to aspects of the present disclosure.

We now describe a reinforcement learning based approach to the problem of query object localization. In this approach, an agent is trained to localize objects of interest specified by a small exemplary set. A transferable reward signal, formulated using the exemplary set, is learned by ordinal metric learning. Because this enables test-time policy adaptation to new environments where the reward signal is not readily available, the approach outperforms fine-tuning approaches, which are limited to annotated images. In addition, the transferable reward allows the trained agent to be reused for new tasks, such as annotation refinement or selective localization from multiple common objects across a set of images. Experiments on the corrupted MNIST and CU-Birds datasets demonstrate the effectiveness of our approach.

In this disclosure, we focus on a reinforcement learning (RL) formulation of the query object localization problem, in which an agent is trained to localize a target object specified by a small set of exemplary images. A vision-based agent can be viewed as an active information gatherer that interacts with the image environment according to a class-specific localization policy, which makes it better suited to robotic manipulation and embodied AI tasks.

At test time, the object queried for localization may be novel, or the background environment may change substantially, which limits the applicability of a class-agnostic agent with a fixed policy. When a reward signal is available, fine-tuning can effectively adapt the agent to the new environment and improve performance. Unlike in the standard RL setting, however, the bounding box annotations are exactly what the localization agent must discover in the test images, so no reward signal is available to the application during testing.

To address this problem, we describe a framework based on ordinal metric learning that learns an implicitly transferable reward signal defined on a small exemplary set. An ordinal embedding network is pre-trained with data augmentation under a loss function designed to be relevant to the RL task. The reward signal allows the controller in the policy network to be updated explicitly through continued training at test time. Compared with fine-tuning approaches, the agent can use test images without restriction and is therefore exposed to a wider range of new environments. By drawing its information from the exemplary set, the agent can adapt flexibly to changes in the localization target.

FIG. 11 is a schematic diagram illustrating RL-based query object localization with the reward signal defined on an exemplary set rather than on bounding boxes, according to aspects of the present disclosure.

Compared with bounding box regression approaches, object localization approaches based on off-policy RL have the advantages of a search path customized to each image environment and of requiring no region proposals. The specificity of the agent depends purely on the class of the bounding boxes used in the reward. An agent can be made class-specific, but a separate agent must then be trained for each class.

Despite the growth of crowdsourcing platforms, obtaining a sufficient number of bounding box annotations remains costly and error-prone. Moreover, annotation quality often varies, and accurate annotation of a particular object class may require special expertise from the annotator. Weakly supervised object localization (WSOL) methods, which use image class labels to obtain bounding box annotations, alleviate this situation. However, WSOL methods are known to rely too heavily on inter-class discriminative features and to generalize poorly to classes unseen during training.

We note that intra-class similarity is a more natural objective for the problem of localizing objects that belong to a target class. A related problem is image co-localization, where the task is to identify the common object within a set of images. Co-localization approaches exploit properties shared across images to locate the object. Because they are unsupervised, ambiguity arises when multiple common objects or common parts (such as the head and body of a bird) are present, and unwanted common objects may be returned as output.

There is an apparent conflict between the goal of training an agent with high task specificity and the goal of training an agent that generalizes well to new situations. The key to reconciling these two goals is to use a small exemplary set. There has been a paradigm shift from training static models defined by their parameters to training models defined jointly with a support set, which has proven very effective in few-shot training.

In addition to efforts to meta-learn implicitly adaptable models, fine-tuning of pre-trained models has also been used to transfer knowledge from data-rich tasks to data-poor tasks. When no reward signal is available, a policy adaptation approach can be employed that fine-tunes the intermediate representation by optimizing a self-supervised auxiliary loss while keeping the controller fixed. The present disclosure shares the motivation of test-time training, but focuses on settings where the controller itself needs to be adapted or reused for new tasks.

In query object localization, a set of images $I$ and a small set of exemplary images $E$ are given. Annotations for the images are available in the form of bounding boxes $g$. The goal is to find, in each image, the location of a bounding box that contains the queried object, without using candidate boxes.

Considering each image $I_i$ as the environment, existing RL methods for object localization use its ground truth object bounding box $g_i$ to define the reward signal:

$$R = \operatorname{sign}\big(\mathrm{IoU}(b_{t+1}, g_i) - \mathrm{IoU}(b_t, g_i)\big) \qquad (1)$$

Here, $\mathrm{IoU}(b_t, g_i)$ denotes the Intersection-over-Union (IoU) between the current window $b_t$ and the corresponding ground truth box $g_i$, that is, the area of their intersection divided by the area of their union, and $b_{t+1}$ is the window after the agent's action. As with bounding box regression methods, which learn a mapping $I \mapsto g$, images and boxes must be paired. However, annotated image-box pairs $(I, g)$ may be scarce in both the training and testing stages. The reward signal in (1) cannot be transferred across training images, let alone to test images with a potential domain shift.
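
The IoU-based reward can be illustrated with a short sketch. It assumes the reward is the sign of the IoU change between consecutive windows, which is how Equation 1 is characterized later in the text (a sign operation with values {+1, -1}). The `(x_min, y_min, x_max, y_max)` box layout and the function names `iou` and `iou_reward` are illustrative choices, not taken from the disclosure.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_reward(b_t: Box, b_next: Box, g: Box) -> int:
    """Sign of the IoU change when the window moves from b_t to b_next."""
    delta = iou(b_next, g) - iou(b_t, g)
    return (delta > 0) - (delta < 0)  # +1, -1, or 0 if the IoU is unchanged
```

Note that this reward needs the ground truth box `g` of the very image being searched, which is why it cannot be evaluated on unannotated test images.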

To address this problem, a natural idea is to define the reward signal based on the distance between the image region cropped by the current window $b_t$ and the region cropped by the ground truth window $g$. Given the $M$-dimensional representations $b_t$ and $g$ produced from $D$-dimensional image feature vectors by an embedding function $\mathbb{R}^D \to \mathbb{R}^M$, a distance function $d: \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ returns the embedding distance $d(b_t, g)$. However, this distance may not decrease monotonically as the agent approaches the ground truth box $g$. As a result, a reward signal based on embedding distance may be less effective than (1).

Furthermore, we propose to use a reward signal based on an ordinal embedding. For any two perturbed boxes $b_j$, $b_k$ derived from $g$ in a constraint set $C$, embeddings $b_j$, $b_k$, $g$ are learned such that the relative preference between any pair of boxes is preserved in Euclidean space:

$$p_j > p_k \iff \lVert b_j - g \rVert < \lVert b_k - g \rVert, \quad \forall (j, k) \in C \qquad (2)$$

Here, $p_j$ and $p_k$ denote the preferences (derived from the IoU with the ground truth box, or from ordinal feedback from a user). This problem was originally posed as non-metric multidimensional scaling. A very simple pairwise approach is applied here, but other variants exist, such as listwise, quadruplet-based, and landmark-based approaches.
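
As a minimal sketch of what the pairwise ordinal constraint asks for, the function below measures how many box pairs are ordered consistently: a box with higher preference should lie closer to the anchor in the embedding space. The function name `ordinal_consistency` and the plain-list representation of embeddings are assumptions made for illustration only.

```python
import math
from itertools import combinations
from typing import Sequence


def ordinal_consistency(
    embeddings: Sequence[Sequence[float]],
    preferences: Sequence[float],
    anchor: Sequence[float],
) -> float:
    """Fraction of pairs (j, k) with different preferences in which the
    higher-preference box is strictly closer to the anchor embedding."""
    dists = [math.dist(e, anchor) for e in embeddings]
    satisfied = total = 0
    for j, k in combinations(range(len(embeddings)), 2):
        if preferences[j] == preferences[k]:
            continue  # no ordering is imposed on ties
        total += 1
        hi, lo = (j, k) if preferences[j] > preferences[k] else (k, j)
        if dists[hi] < dists[lo]:
            satisfied += 1
    return satisfied / total if total else 1.0
```

A value of 1.0 means every sampled pair satisfies the constraint for the given anchor; the training objective pushes the embedding toward that state.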

The anchor $g$ in equation (2) is not restricted to an embedding from the same image. For example, it can be replaced by the prototype embedding of the exemplary set $E$, $\frac{1}{|E|}\sum_{i=1}^{|E|} b_i$, where $b_i$ is the embedding of image $I_i$ cropped by its ground truth box $g_i$. If images of multiple classes are available, the prototype can further be made class-dependent or cluster-based. In some experiments, we found that using the prototype embedding as the anchor can generalize better than using $g$. With this choice, the approach is also suited to few-shot training, in which only a small subset of the training images in each class is annotated. The ordinal reward can be viewed as meta-information. Moreover, even when the exemplary set at test time contains only cropped objects, test-time policy adaptation remains feasible without image-box pairs.
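
The prototype anchor is simply the mean of the exemplar embeddings, and the class-dependent variant keeps one mean per class. The sketch below uses plain Python lists; the names `prototype` and `class_prototypes` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def prototype(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the exemplar embeddings."""
    if not embeddings:
        raise ValueError("need at least one exemplar embedding")
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def class_prototypes(
    embeddings: Sequence[Sequence[float]], labels: Sequence[Hashable]
) -> Dict[Hashable, List[float]]:
    """One prototype per class label (the class-dependent variant)."""
    grouped = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        grouped[label].append(emb)
    return {label: prototype(group) for label, group in grouped.items()}
```

Because only embeddings of cropped objects are needed, the same computation applies when the exemplary set contains crops without their source images.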

During training, the exemplary set $E$ is assumed to contain both images $I$ and boxes $g$. A customized data augmentation scheme, box perturbation, is employed, in which $C$ is constructed by sampling box pairs around $g$. We found that an IoU-based partitioning scheme is more effective than random sampling. This can be regarded as a procedure that strengthens the robustness of the neural network to box perturbations while preserving the specific purpose for which it is used: distinguishing whether the reward increases or decreases. Pre-training with data augmentation also makes the downstream task of training the policy network more efficient.

In this disclosure, $p$ is defined as the IoU of a box $b$ with the ground truth box $g$, i.e., $p = \mathrm{IoU}(b, g)$. An embedding space is learned that is consistent with the local ordinal constraints specified on the image pairs obtained by data augmentation.

We choose to optimize a triplet loss over $(f_a, f_p, f_n)$ to learn the desired embedding. Here, $f_a$ is the "anchor" embedding, $f_p$ is the "positive" embedding, whose box has the larger IoU with the ground truth box $g$, and $f_n$ is the "negative" embedding, whose box has the smaller IoU with $g$. Note that a representation suited to defining the reward is not necessarily a good state representation at the same time; it may not contain enough information for the agent to take the correct action. It has been suggested that adding a projection head between the representation and the contrastive loss substantially improves the quality of the learned representation.
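
A triplet loss over an anchor, a positive, and a negative embedding is commonly written in hinge form with a margin. The sketch below assumes that standard form with Euclidean distance; the margin value `0.2` and the helper `order_pair`, which picks the positive as the box with the larger IoU, are illustrative assumptions and are not specified in the text.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def order_pair(emb_j: Vector, p_j: float, emb_k: Vector, p_k: float) -> Tuple[Vector, Vector]:
    """Return (positive, negative): the box with the larger IoU is the positive."""
    return (emb_j, emb_k) if p_j >= p_k else (emb_k, emb_j)


def triplet_loss(f_a: Vector, f_p: Vector, f_n: Vector, margin: float = 0.2) -> float:
    """Hinge-form triplet loss: the positive should be closer to the anchor
    than the negative by at least `margin` (margin value is hypothetical)."""
    return max(0.0, math.dist(f_a, f_p) - math.dist(f_a, f_n) + margin)
```

The loss is zero once the higher-IoU box sits closer to the anchor than the lower-IoU box by the margin, which is exactly the pairwise ordering the reward later relies on.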

We found that using a projection head is crucial for balancing the two objectives in this task. The network architecture is shown in FIG. 12, in which an MLP projection head is attached after the RoI (Region of Interest) encoder. Given an image and an RoI, the RoI encoder extracts the RoI feature $s$ used in the state representation for localization. The projection head learns the ordinal embedding $b$ used to compute the reward. An RoI align module handles boxes of different sizes. Under a joint loss function that combines the triplet loss on $b$ with the autoencoder's image reconstruction loss, the state representation $s$ can benefit indirectly from the ordinal supervision on $b$, while still being required to yield satisfactory image reconstruction results. As an alternative to the autoencoder scheme, the RoI encoder can also use a pre-trained network.

Localization is formulated as a Markov Decision Process (MDP) in which the raw pixels of each image serve as the environment. As described here, the improvement made by the agent is computed from ordinal embeddings rather than from bounding box coordinates. The reward for an agent moving from state $s'$ to $s$ is determined by the embedding distances of the two states to $a$, where $a$ is the prototype embedding. The ordinal embeddings are extracted by the pre-trained RoI encoder and projection head from the image regions enclosed by the ground truth boxes in $E$, and the prototype is computed as their mean vector. Moreover, we use policy gradient with a recurrent neural network (RNN) (Mnih et al., 2014), rather than a Deep Q-Network with a vector of history actions and states. Starting with the whole image as input, the agent is trained to select, at each step, an action that transforms the current bounding box so as to maximize the sum of discounted rewards. The agent takes the features pooled from the current box as its state, while also maintaining an RNN internal state that encodes information from past observations. The action set consists of discrete actions that facilitate a top-down search: five scaling actions, eight translation actions, and one stop action.
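
The following sketch shows one way a 14-action set (five scalings, eight translations, one stop) and an embedding-distance reward could be laid out. The concrete geometry (shrinking toward a corner or the centre, shifting along eight compass directions), the `shrink` and `step` values, and the signed form of the reward are all assumptions made for illustration; the text states only the action counts and that the reward is computed from embedding distances to the prototype `a`.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)

# Hypothetical action geometry: 5 scalings, 8 translations, 1 stop.
SCALE_ANCHORS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
SHIFT_DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (1, -1), (-1, 1), (1, 1)]
STOP = len(SCALE_ANCHORS) + len(SHIFT_DIRS)  # action index 13
NUM_ACTIONS = STOP + 1  # 14 actions in total


def apply_action(box: Box, action: int, shrink: float = 0.8, step: float = 0.1) -> Box:
    """Transform the current box; the result is not clipped to the image."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    if action < len(SCALE_ANCHORS):  # shrink toward a corner or the centre
        ax, ay = SCALE_ANCHORS[action]
        nw, nh = w * shrink, h * shrink
        nx0, ny0 = x0 + ax * (w - nw), y0 + ay * (h - nh)
        return (nx0, ny0, nx0 + nw, ny0 + nh)
    if action < STOP:  # shift by a fraction of the box size
        dx, dy = SHIFT_DIRS[action - len(SCALE_ANCHORS)]
        return (x0 + dx * step * w, y0 + dy * step * h,
                x1 + dx * step * w, y1 + dy * step * h)
    return box  # stop action leaves the box unchanged


def embedding_reward(b_prev: Sequence[float], b_new: Sequence[float],
                     a: Sequence[float]) -> int:
    """Assumed signed reward: +1 if the new embedding is closer to prototype a."""
    d_prev, d_new = math.dist(b_prev, a), math.dist(b_new, a)
    return (d_prev > d_new) - (d_prev < d_new)
```

Since every scaling action shrinks the window, a search that starts from the whole image proceeds top-down, as the text describes.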

For test-time adaptation, the agent has the option of further updating the policy network at test time using the reward from (4), with $a$ taken as the prototype of the test exemplary set $E_{test}$. To match the test condition, each training batch is split into two groups, and $a$ is computed on a small subset that does not overlap with the training images to be localized. During test-time adaptation, $a$ is the prototype of the exemplary set. The complete algorithm is outlined in Algorithm 1, shown in FIG. 19.
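
The batch split used to mimic the test condition can be sketched as follows: a small exemplar group supplies the prototype, and the remaining, disjoint group supplies the images to be localized. The function name `split_batch` and the random shuffling are assumptions; the text does not state how the subset is chosen.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_batch(
    batch: Sequence[T], n_exemplars: int, rng: random.Random
) -> Tuple[List[T], List[T]]:
    """Split a training batch into (exemplars, images_to_localize), disjoint."""
    if not 0 < n_exemplars < len(batch):
        raise ValueError("n_exemplars must be between 1 and len(batch) - 1")
    order = list(range(len(batch)))
    rng.shuffle(order)
    exemplars = [batch[i] for i in order[:n_exemplars]]
    to_localize = [batch[i] for i in order[n_exemplars:]]
    return exemplars, to_localize
```

The prototype would then be computed from the embeddings of the exemplar group only, so that no image is localized against an anchor derived from its own ground truth box.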

The transferability of the reward signal from training to testing depends critically on the generalization ability of the learned ordinal representation. If the ordinal preferences are not preserved in the test domain, the proposed test-time policy adaptation scheme will not work. Adapting the representation with a self-supervised objective may resolve this issue. The approach does not directly handle the special cases of multiple queried objects or no queried object in an image environment, but it can be readily modified to handle them.

We evaluate the approach on several tasks using the MNIST and CUB birds datasets. For MNIST, the image encoder consists of three convolutional layers, each followed by a ReLU activation, and the decoder used to learn the autoencoder has the same structure, mirrored. An RoI align layer followed by two fully connected layers is then attached as the projection head for ordinal reward learning. For the CUB dataset, the layers before conv5_3 of a VGG-16 pre-trained on ImageNet are used as the encoder. The projection head has the same structure as before, but with more units in each fully connected layer. To evaluate the learned ordinal structure, we use OrdAcc, defined as the percentage of images for which the order of a pair of perturbed boxes is predicted correctly. We also use the CorLoc (Correct Localization) metric, defined as the percentage of images that are correctly localized according to an IoU criterion between $b_p$ and $g$ (the conventional CorLoc threshold is 0.5), where $b_p$ is the predicted box and $g$ is the ground truth box.
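
Both metrics reduce to simple counts. In the sketch below, the IoU threshold of 0.5 is the conventional CorLoc setting and is an assumption here, as are the function names and the per-image tuple layout `(p_j, p_k, d_j, d_k)` used for OrdAcc (preferences and embedding distances to the anchor for one pair of perturbed boxes).

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def corloc(predicted: Sequence[Box], ground_truth: Sequence[Box],
           threshold: float = 0.5) -> float:
    """Percentage of images whose predicted box reaches the IoU threshold."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need equally sized, non-empty box lists")
    hits = sum(1 for b_p, g in zip(predicted, ground_truth) if iou(b_p, g) >= threshold)
    return 100.0 * hits / len(predicted)


def ord_acc(pairs: Sequence[Tuple[float, float, float, float]]) -> float:
    """Percentage of images whose box pair (p_j, p_k, d_j, d_k) is ordered
    correctly: the higher-preference box has the smaller embedding distance."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1 for p_j, p_k, d_j, d_k in pairs
        if (p_j > p_k and d_j < d_k) or (p_j < p_k and d_j > d_k)
    )
    return 100.0 * correct / len(pairs)
```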

We analyze the effectiveness of ordinal embedding in terms of both representation and reward on Cluttered MNIST, in which a 28×28 digit is placed at a random location on an 84×84 cluttered background. For the representation, embeddings trained with the autoencoder only are compared with embeddings trained jointly with the ordinal projection head. For the reward, the IoU-based reward is compared with the embedding-based reward. The agent is trained on a given number of images of the digit 4 and tested on all images in the test set. Results under different training set sizes are shown in FIGS. 13(A) and 13(B). FIG. 13(A) is a dataset showing OrdAcc (%) for random sampling and anchor sampling according to aspects of the present disclosure, and FIG. 13(B) is a dataset showing CorLoc (%) for the IoU reward with and without the sign operation according to aspects of the present disclosure. When ordinal embedding is present in both the representation and the reward ("AE+Ord+Embed"), the model consistently outperforms the other settings, especially when the training set is small.

FIGS. 14(A) and 14(B) are plots showing comparison results under different training set sizes, according to aspects of the present disclosure.

To learn the ordinal reward efficiently, experiments were conducted comparing sampling strategies for generating pairs of augmented bounding boxes. The first strategy is random sampling, in which box pairs are generated completely at random. The other strategy is anchor-wise sampling, which first generates dense anchors at various scales and then divides them into 10 groups according to their IoU with the ground truth box, each group spanning an IoU interval of 0.1. Sampling is first performed at the group level, i.e., two groups are sampled; two boxes are then sampled, one from each group. The sampled boxes can therefore cover more cases than random sampling. The OrdAcc resulting from the two strategies is shown in FIG. 13(A). Anchor sampling yields a better ordinal embedding.
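
Anchor-wise sampling can be sketched in three steps: generate dense anchors, bin them by IoU with the ground truth box into 10 groups of width 0.1, then sample two groups and one box from each. The anchor sizes, the stride, and the function names below are hypothetical; only the 10-group partition and the two-level sampling come from the text.

```python
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dense_anchors(img_w: int, img_h: int, sizes=(20, 28, 40), stride: int = 4) -> List[Box]:
    """Square sliding-window anchors at several scales (sizes are illustrative)."""
    return [(x, y, x + s, y + s)
            for s in sizes
            for y in range(0, img_h - s + 1, stride)
            for x in range(0, img_w - s + 1, stride)]


def group_index(box: Box, g: Box, n_groups: int = 10) -> int:
    """IoU bin of width 1/n_groups; an IoU of exactly 1.0 falls in the last bin."""
    return min(int(iou(box, g) * n_groups), n_groups - 1)


def partition_by_iou(anchors: List[Box], g: Box, n_groups: int = 10) -> List[List[Box]]:
    groups: List[List[Box]] = [[] for _ in range(n_groups)]
    for box in anchors:
        groups[group_index(box, g, n_groups)].append(box)
    return groups


def sample_pair(groups: List[List[Box]], rng: random.Random) -> Tuple[Box, Box]:
    """Sample two distinct non-empty groups, then one box from each."""
    non_empty = [grp for grp in groups if grp]
    first, second = rng.sample(non_empty, 2)
    return rng.choice(first), rng.choice(second)
```

Because the two boxes always come from different IoU groups, each sampled pair carries a clear ordering, whereas purely random pairs are dominated by low-IoU boxes.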

IoU reward {+1, -1}, with or without sign. The IoU-based setting uses equation (1) as the reward for training the agent. However, FIGS. 14(A) and 14(B) show a large gap between this IoU reward and the embedding reward, especially when the training set is small. This is somewhat counterintuitive: because the ordinal reward only approximates the behavior of the IoU in the embedding space, it would be expected to be less accurate than using the IoU itself as the reward. To analyze this issue, the sign operation in equation (1) is removed and the model is trained on images of the digit 4. As shown in FIG. 13(B), the sign operation improves localization accuracy by 3.4% on the digit 4 and by 6.2% on the test set of other digits.

FIG. 15(A) is a dataset showing CorLoc (%) according to aspects of the present disclosure, and FIG. 15(B) is a dataset showing comparison results of four training strategies depending on the anchors used, according to aspects of the present disclosure.

Rather than training the agent with a Deep Q-Network, we apply policy gradient to optimize it. In addition, a top-down search strategy with an RNN is adopted, whereas the Deep Q-Network approaches use a vector of history actions to encode memory. As shown in FIG. 15(A), these design choices are evaluated with models trained on the digit 4 and tested either on the digit 4 or on other digits. As can be seen, the agent achieves the best performance with "PG+RNN". With the history action vector, accuracy drops when the agent is trained with DQN.
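
Policy gradient methods weight each action by the discounted sum of the rewards that follow it. The sketch below computes those returns for one episode; the discount factor `gamma = 0.9` is a hypothetical value, and the function is a generic building block rather than the disclosed training procedure.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.9) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step of an episode."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):  # accumulate from the last step backwards
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

For example, `discounted_returns([1, -1, 1], gamma=0.5)` gives `[0.75, -0.5, 1.0]`: the first action is credited with the later rewards at reduced weight.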

Experiments were conducted on a subset of the CUB dataset to evaluate the effect of ordinal reward learning and of different training strategies on localization. The training and test sets contain 15 and 5 different fine-grained classes, respectively, giving 896 images for training and 294 images for testing. FIG. 15(B) shows OrdAcc and CorLoc for four settings: "Self", "Proto", "Shuffle self", and "Shuffle proto". In each setting, the same kind of anchor is used for both embedding pre-training and agent training. "Self" uses the ground truth of the instance itself as the anchor. "Proto" uses the prototype of a subgroup within the batch that contains the instance. "Shuffle self" uses the ground truth of another instance. "Shuffle proto" uses the prototype of a subgroup within the batch that does not contain the instance. The RoI encoder is trained with $loss_{trip}$ only, so the entire training set can be viewed as a single class. The results show that "Shuffle proto" has a lower OrdAcc than the other settings but achieves the best CorLoc by a large margin. This suggests that this training strategy brings compactness to the training set and builds an ordinal structure around the cluster. Note that OrdAcc is computed using instances as anchors.
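
The four anchor strategies differ only in which ground truth embeddings form the anchor for a given instance. The sketch below makes that choice explicit; the strategy keys, the `subgroup_size` parameter, and the random selection of subgroup members are assumptions for illustration, since the text does not say how subgroups are formed.

```python
import random
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def choose_anchor(gt_embeddings: Sequence[Sequence[float]], i: int, strategy: str,
                  rng: random.Random, subgroup_size: int = 2) -> List[float]:
    """Anchor for instance i, given ground truth crop embeddings of the batch."""
    others = [j for j in range(len(gt_embeddings)) if j != i]
    if strategy == "self":  # the instance's own ground truth
        return list(gt_embeddings[i])
    if strategy == "shuffle_self":  # ground truth of another instance
        return list(gt_embeddings[rng.choice(others)])
    if strategy == "proto":  # prototype of a subgroup containing i
        members = [i] + rng.sample(others, subgroup_size - 1)
    elif strategy == "shuffle_proto":  # prototype of a subgroup without i
        members = rng.sample(others, subgroup_size)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return mean_vector([gt_embeddings[j] for j in members])
```

Under "shuffle_proto" the anchor never depends on the instance being localized, which mirrors the test condition where the exemplary set and the test images are disjoint.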

As will be appreciated by those skilled in the art, a reward based on ordinal representation learning is disclosed for training a localization agent to search for queried objects of interest in potentially new environments. In particular, a small exemplary set is used as a guidance signal for achieving the learning goal, which avoids ambiguity in learning. Meanwhile, the test image environment is used to inform the agent of domain shift at test time, without requiring image-box pairs. The algorithm takes raw image pixels as input, with no need for candidate box proposals.

Our approach is based on feature similarity with an exemplary set and is fundamentally different from bounding box regression approaches and bounding box RL approaches. To generalize across object classes and background scenarios, earlier approaches must be trained as class recognizers on large datasets that cover foreground and background variations. In contrast, the present approach makes it possible to train specialized agents capable of policy adaptation at test time.

Instead of training the localization model jointly with a classification model, we explore learning box annotations from image class labels, in a spirit similar to weakly supervised learning. Given image labels from a classification model, the localization model can identify box regions with enhanced interpretability. Empirically, the approach is shown to work in a transfer learning setting, from a single data-rich source task to a data-poor test task. It also applies to a few-shot learning setting in which limited annotations across many tasks are available during training. Future work includes attribute-based cross-modality or zero-shot queries, and curriculum learning with a designed sequence of targets in the exemplary set.

Annotation collection plays an important role in building machine learning systems. It is one of the tasks that could benefit greatly from automation, especially in cost-sensitive applications. The aim is to reduce human labeling effort in terms of the number of annotated samples per class, the number of annotated classes, and the level of accuracy required. The approach enables objective assessment and iterative improvement of data quality.

FIG. 16 is a dataset showing performance on different digits depending on the anchors used, according to aspects of the present disclosure.

FIG. 17 is a plot showing results before adaptation, after adaptation, and with fine-tuning, depending on the anchors used, according to aspects of the present disclosure.

FIG. 18(A) is a dataset showing performance in refining inaccurate annotations into tight bounding box annotations according to aspects of the present disclosure, and FIG. 18(B) is a dataset showing performance when transferring to other backgrounds according to aspects of the present disclosure.

FIG. 19 is a listing of Algorithm 1 for training and rewarding the localization agent depending on the anchors used, according to aspects of the present disclosure.

Although the present disclosure has been presented using several specific examples, those skilled in the art will recognize that the present teachings are not limited thereto. Accordingly, the present disclosure should be limited only by the scope of the claims appended hereto.

Claims

1. A deep reinforcement learning (RL) method for object localization, comprising:
obtaining a seed dataset containing a set of seed images, each with a ground truth bounding box annotation;
pre-training an ordinal embedding that locally satisfies ordering constraints for each pair of perturbed data augmented from the same image by randomly perturbing the ground truth bounding box at different levels denoted by a parameter p,
wherein the pre-training is performed through a backbone network, a region of interest (RoI) head, and a triplet loss;
constructing an RL agent that uses the embedding function to recursively sample actions from a discrete action space, starting from the entire image, such that rewards are generated;
determining the reward of a sampled action from the embedding distance, and updating a policy network based on the determined reward; and
outputting an annotation policy and the embedding function.

2. The method of claim 1, wherein the bounding box annotation of each seed image is initially provided by a human.