JP2021502645A

JP2021502645A - Target detection methods and devices, training methods, electronic devices and media

Info

Publication number: JP2021502645A
Application number: JP2020526040A
Authority: JP
Inventors: ポーリー; ウェイウー
Original assignee: ベイジンセンスタイムテクノロジーディベロップメントカンパニーリミテッド
Priority date: 2017-11-12
Filing date: 2018-11-09
Publication date: 2021-01-28
Anticipated expiration: 2038-11-09
Also published as: US11455782B2; JP7165731B2; PH12020550588A1; SG11202004324WA; WO2019091464A1; US20200265255A1; CN108230359A; CN108230359B; KR20200087784A

Abstract

The embodiments of the present disclosure disclose a target detection method and apparatus, a training method, an electronic device, and a medium. The target detection method includes: extracting, by a neural network, features of a template frame and features of a detection frame respectively, where the template frame is an image of the detection box of a target object and is smaller in image size than the detection frame; obtaining a classification weight and a regression weight of a local region detector based on the features of the template frame; inputting the features of the detection frame into the local region detector to obtain classification results and regression results of a plurality of candidate boxes output by the local region detector; and obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector. The embodiments of the present disclosure can improve the speed and accuracy of target tracking.

Description

The present disclosure relates to computer vision technology, and in particular to a target detection method and apparatus, a training method, an electronic device, and a medium.
<Cross-reference to related applications>
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 12, 2017, with application number CN201711110587.1 and the invention title "Target detection method and device, training method, electronic device, program and medium", the entire disclosure of which is incorporated herein by reference.

Single-target tracking is an important problem in the field of artificial intelligence and is used in a range of tasks such as autonomous driving and multi-target tracking. The main task of single-target tracking is to designate a target to be tracked in one frame of a video sequence and then to track that designated target continuously in the subsequent frames.

The embodiments of the present disclosure provide a technical solution for target tracking.

According to one aspect of the embodiments of the present disclosure, a target tracking method is provided, including:
extracting, by a neural network, features of a template frame and features of a detection frame respectively, where the template frame is an image of the detection box of a target object and is smaller in image size than the detection frame;
obtaining a classification weight and a regression weight of a local region detector based on the features of the template frame;
inputting the features of the detection frame into the local region detector to obtain classification results and regression results of a plurality of candidate boxes output by the local region detector; and
obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector.

According to another aspect of the embodiments of the present disclosure, a training method for a target detection network is provided, including:
extracting, by a neural network, features of a template frame and features of a detection frame respectively, where the template frame is an image of the detection box of a target object and is smaller in image size than the detection frame;
increasing the channels of the features of the template frame by a first convolution layer and using the resulting first feature as the classification weight of a local region detector, and increasing the channels of the features of the template frame by a second convolution layer and using the resulting second feature as the regression weight of the local region detector;
inputting the features of the detection frame into the local region detector to obtain classification results and regression results of a plurality of candidate boxes output by the local region detector;
obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector; and
taking the obtained detection box of the target object in the detection frame as a predicted detection box, and training the neural network, the first convolution layer, and the second convolution layer based on the labeling information of the detection frame and the predicted detection box.

According to yet another aspect of the embodiments of the present disclosure, a target detection apparatus is provided, including:
a neural network configured to extract features of a template frame and features of a detection frame respectively, where the template frame is an image of the detection box of a target object and is smaller in image size than the detection frame;
a first convolution layer configured to increase the channels of the features of the template frame and to use the resulting first feature as the classification weight of a local region detector;
a second convolution layer configured to increase the channels of the features of the template frame and to use the resulting second feature as the regression weight of the local region detector;
the local region detector, configured to output classification results and regression results of a plurality of candidate boxes according to the features of the detection frame; and
an acquisition unit configured to obtain the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector.

According to a further aspect of the embodiments of the present disclosure, an electronic device is provided that includes the target detection apparatus according to any one of the embodiments of the present disclosure.

According to a further aspect of the embodiments of the present disclosure, another electronic device is provided, including:
a memory configured to store executable instructions; and
a processor configured to communicate with the memory and execute the executable instructions so as to complete the operations of the method according to any one of the embodiments of the present disclosure.

According to a further aspect of the embodiments of the present disclosure, a computer storage medium for storing computer-readable instructions is provided, where the operations of the method according to any one of the embodiments of the present disclosure are implemented when the instructions are executed.

According to a further aspect of the embodiments of the present disclosure, a computer program including computer-readable instructions is provided, where, when the computer-readable instructions run on a device, a processor in the device executes executable instructions for implementing the steps of the method according to any one of the embodiments of the present disclosure.

According to the above embodiments of the present disclosure, features of the template frame and of the detection frame are extracted respectively by a neural network; the classification weight and regression weight of the local region detector are obtained based on the features of the template frame; the features of the detection frame are input into the local region detector to obtain the classification results and regression results of a plurality of candidate boxes output by the local region detector; and the detection box of the target object in the detection frame is obtained according to those classification results and regression results. In the embodiments of the present disclosure, using the same neural network, or neural networks with the same structure, allows similar features of the same target object to be extracted more reliably, so that the features of the target object extracted from different frames vary little, which helps improve the accuracy of the detection result for the target object in the detection frame. In addition, because the classification weight and regression weight of the local region detector are obtained based on the features of the template frame, the local region detector can obtain the classification results and regression results of the plurality of candidate boxes of the detection frame and, from them, the detection box of the target object in the detection frame. Changes in the position and size of the target object can therefore be estimated better and the position of the target object in the detection frame can be determined more precisely, which improves both the speed and the accuracy of target tracking.

The technical solution of the present disclosure is described in further detail below with reference to the drawings and embodiments.

The drawings, which form part of the specification, illustrate the embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description taken with reference to the drawings.
A flowchart of one embodiment of the target detection method of the present disclosure.
A flowchart of another embodiment of the target detection method of the present disclosure.
A flowchart of one embodiment of the training method for the target detection network of the present disclosure.
A flowchart of another embodiment of the training method for the target detection network of the present disclosure.
A schematic structural diagram of one embodiment of the target detection apparatus of the present disclosure.
A schematic structural diagram of another embodiment of the target detection apparatus of the present disclosure.
A schematic structural diagram of yet another embodiment of the target detection apparatus of the present disclosure.
A schematic structural diagram of one application embodiment of the target detection apparatus of the present disclosure.
A schematic structural diagram of another application embodiment of the target detection apparatus of the present disclosure.
A schematic structural diagram of one application embodiment of the electronic device of the present disclosure.

Various exemplary embodiments of the present disclosure are now described in detail with reference to the drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions, and the values set forth in these embodiments do not limit the scope of the present disclosure.

It should further be understood that, in the embodiments of the present disclosure, "a plurality of" may mean two or more, and "at least one" may mean one, two, or more.

Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present application are used only to distinguish different steps, devices, modules, and the like; they neither carry any specific technical meaning nor indicate a necessary logical order.

It should further be understood that any component, data, or structure mentioned in the present disclosure may generally be understood as one or more, unless it is explicitly limited or the context of the specification suggests otherwise.

It should further be understood that the description of each embodiment in the present disclosure emphasizes its differences from the other embodiments; for the same or similar parts, the embodiments may be referred to one another, and for brevity these parts are not described repeatedly.

It should also be understood that, for convenience of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.

The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or use.

Technologies, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such technologies, methods, and devices should be regarded as part of the specification.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item has been defined in one drawing, it need not be discussed further in subsequent drawings.

The embodiments of the present disclosure are applicable to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.

Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer-system-executable instructions (for example, program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types. The computer system or server may be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on storage media of local or remote computing systems, including storage devices.

FIG. 1 is a flowchart of one embodiment of the target detection method of the present disclosure. As shown in FIG. 1, the target detection method of this embodiment includes the following operations.

102. Extract features of the template frame and of the detection frame respectively by a neural network.
Here, the template frame is an image of the detection box of the target object, and the image size of the template frame is smaller than that of the detection frame. The detection frame is either the current frame in which the target object is to be detected, or a region image of the current frame in which the target object may be present. When the detection frame is such a region image, in one implementation of the embodiments of the present disclosure, the region image is larger than the image of the template frame; for example, the region image may be centered on the center point of the image of the template frame and be 2 to 4 times the size of the template frame image.
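The region-image example above can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the function name `search_region`, the `(x0, y0, x1, y1)` box layout, the reading of "2 to 4 times the size" as a scale applied to each side length, and the optional clipping to the frame are all assumptions made for the example.

```python
def search_region(template_box, scale=2.0, frame_size=None):
    """Return a region (x0, y0, x1, y1) centered on the template box.

    template_box: (x0, y0, x1, y1) of the target's detection box.
    scale: assumed side-length multiplier (the text suggests 2 to 4).
    frame_size: optional (width, height) used to clip the region (assumption).
    """
    x0, y0, x1, y1 = template_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * scale / 2.0
    half_h = (y1 - y0) * scale / 2.0
    rx0, ry0, rx1, ry1 = cx - half_w, cy - half_h, cx + half_w, cy + half_h
    if frame_size is not None:
        width, height = frame_size
        # Clip to the frame so the region never leaves the image.
        rx0, ry0 = max(0.0, rx0), max(0.0, ry0)
        rx1, ry1 = min(float(width), rx1), min(float(height), ry1)
    return (rx0, ry0, rx1, ry1)
```

If "size" is instead meant as area, the side lengths would scale by the square root of the factor; the text does not settle this.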

In one implementation of the embodiments of the present disclosure, the template frame is a frame in the video sequence that is detected earlier than the detection frame and in which the detection box of the target object has been determined. It may be the start frame from which target tracking begins in the video sequence; the position of this start frame in the video frame sequence can be set flexibly, for example as the first frame or any intermediate frame of the video frame sequence. The detection frame is a frame in which target tracking is to be performed. After the detection box of the target object has been determined in the image of the detection frame, the image corresponding to that detection box in the detection frame may be used as the image of the template frame for the next detection frame.

In one implementation of the embodiments of the present disclosure, in operation 102, the features of the template frame and of the detection frame may be extracted by the same neural network, or by separate neural networks that have the same structure.

In an optional example, operation 102 may be performed by a processor invoking corresponding instructions stored in a memory, or by a neural network run by the processor.

104. Obtain the classification weight and the regression weight of the local region detector based on the features of the template frame.

In one implementation of the embodiments of the present disclosure, a convolution operation may be performed on the features of the template frame by a first convolution layer, and the first feature obtained by the convolution operation may be used as the classification weight of the local region detector.

For example, in an optional example, the classification weight of the local region detector may be obtained by increasing the number of channels of the features of the template frame with the first convolution layer, so as to obtain a first feature whose number of channels is 2k times the number of channels of the features of the template frame, where k is an integer greater than 0.

In one implementation of the embodiments of the present disclosure, a convolution operation may be performed on the features of the template frame by a second convolution layer, and the second feature obtained by the convolution operation may be used as the regression weight of the local region detector.

For example, in an optional example, the regression weight of the local region detector may be obtained by increasing the number of channels of the features of the template frame with the second convolution layer, so as to obtain a second feature whose number of channels is 4k times the number of channels of the features of the template frame, where k is an integer greater than 0.
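The channel arithmetic in operation 104 can be made concrete with a small sketch. It assumes features stored as nested lists indexed `[channel][row][column]`, and it assumes that a feature with 2k or 4k times the template's channel count is regrouped into 2k or 4k kernels with the template's channel count each. That regrouping, the helper names, and the sizes `C = 3`, `k = 2` are illustrative assumptions, not details stated in the text.

```python
def zeros(channels, height, width):
    """Build a [channels][height][width] nested list filled with 0.0."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def split_into_kernels(feature, base_channels):
    """Regroup a [channels][h][w] feature into kernels of base_channels channels."""
    if len(feature) % base_channels != 0:
        raise ValueError("channel count must be a multiple of base_channels")
    return [feature[i:i + base_channels]
            for i in range(0, len(feature), base_channels)]


C, k = 3, 2                               # hypothetical template channels and k
first_feature = zeros(2 * k * C, 4, 4)    # output of the first convolution layer
second_feature = zeros(4 * k * C, 4, 4)   # output of the second convolution layer

cls_kernels = split_into_kernels(first_feature, C)   # 2k kernels
reg_kernels = split_into_kernels(second_feature, C)  # 4k kernels
```

Under this reading, each of k candidate boxes per position would receive two classification channels and four regression channels; the four regression values match the offsets (dx, dy, dh, dw) that the description associates with the 4k case.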

In an optional example, operation 104 may be performed by a processor invoking corresponding instructions stored in a memory, or by a first convolution layer and a second convolution layer each run by the processor.

106. Input the features of the detection frame into the local region detector to obtain the classification results and regression results of a plurality of candidate boxes output by the local region detector.
Here, the classification results include, for each candidate box, the probability that it is the detection box of the target object, and the regression results include, for each candidate box, its offset from the detection box corresponding to the template frame.

In an optional example of the embodiments of the present disclosure, the plurality of candidate boxes may include K candidate boxes at each position of the detection frame, where K is a preset integer greater than 1. The K candidate boxes have different length-to-width ratios; for example, the ratios may include 1:1, 2:1, 1:2, 3:1, 1:3, and so on. The classification results represent the probability that each of the K candidate boxes at each position is the detection box of the target object.
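As a rough illustration of K candidate boxes with different aspect ratios at one position, the sketch below generates one box per ratio. Several choices are assumptions made for the example: the name `candidate_boxes`, the `(cx, cy, w, h)` box layout, reading each ratio `a:b` as width to height, and keeping the area of every box equal.

```python
import math


def candidate_boxes(center, area, ratios):
    """Return one (cx, cy, w, h) box per ratio, all with the same area.

    center: (cx, cy) position in the detection frame.
    area: assumed common area of the candidate boxes.
    ratios: iterable of (a, b) pairs read as width:height = a:b (assumption).
    """
    cx, cy = center
    boxes = []
    for a, b in ratios:
        w = math.sqrt(area * a / b)   # w / h == a / b
        h = math.sqrt(area * b / a)   # w * h == area
        boxes.append((cx, cy, w, h))
    return boxes


# K = 5 candidate boxes at position (32, 32), using five distinct ratios.
boxes = candidate_boxes((32, 32), 64 * 64, [(1, 1), (2, 1), (1, 2), (3, 1), (1, 3)])
```

Repeating this at every position of the detection frame's feature map gives the full set of candidate boxes that the classification and regression results refer to.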

In an optional embodiment of the target detection method of the present disclosure, after the probabilities that the plurality of candidate boxes are the detection box of the target object have been obtained in operation 106, the method may further include normalizing the classification results so that the probabilities that the candidate boxes are the detection box of the target object sum to 1. This makes it easier to judge whether each candidate box is the detection box of the target object.
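The text does not say which normalization is used. One common choice that makes the values sum to 1 is a softmax over the raw scores, sketched below under that assumption; the function name `normalize_scores` is hypothetical.

```python
import math


def normalize_scores(scores):
    """Softmax: map raw scores to probabilities that sum to 1 (assumed method)."""
    peak = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = normalize_scores([2.0, 1.0, 0.1])
```

After normalization the scores of the candidate boxes are directly comparable, so the most likely detection box can be found by looking for the largest probability.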

In an optional example of the embodiments of the present disclosure, the regression results include the offset of each of the K candidate boxes at each position of the detection frame image from the detection box of the target object in the template frame. The offset may include changes in position and in size, and the position may be the position of the center point, the positions of the four vertices of the reference box, or the like.

When the number of channels of the second feature is 4k times the number of channels of the features of the template frame, the offset of each candidate box from the detection box of the target object in the template frame may include, for example, the offset of the abscissa of the center point (dx), the offset of the ordinate of the center point (dy), the change in height (dh), and the change in width (dw).
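The text names the four offsets (dx, dy, dh, dw) but does not give the formula that turns them into a box. The sketch below uses a parameterization that is common in region-proposal detectors, with center shifts scaled by the box size and size changes applied as exponentials. This formula is an assumption for illustration only, as are the name `apply_offsets` and the `(cx, cy, w, h)` layout.

```python
import math


def apply_offsets(box, offsets):
    """Refine a (cx, cy, w, h) box with (dx, dy, dh, dw) offsets.

    Assumed parameterization (not stated in the text):
      cx' = cx + dx * w,  cy' = cy + dy * h,
      w'  = w * exp(dw),  h'  = h * exp(dh).
    """
    cx, cy, w, h = box
    dx, dy, dh, dw = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


refined = apply_offsets((50.0, 60.0, 20.0, 40.0), (0.5, -0.25, 0.0, 0.0))
```

With all offsets equal to zero the box is returned unchanged, which is a quick sanity check for any parameterization of this kind.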

In one implementation of the embodiments of the present disclosure, operation 106 may include: performing a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes; and performing a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.
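The idea of using template-derived weights as convolution kernels over the detection frame's features can be sketched in plain Python. The example assumes features stored as nested lists indexed `[channel][row][column]`, no padding, stride 1, and the unflipped sliding-window product that deep-learning libraries usually call convolution. The names `correlate` and `detector_outputs` are hypothetical.

```python
def correlate(feature, kernel):
    """Slide a [C][h][w] kernel over a [C][H][W] feature (no padding, stride 1)."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    k_h, k_w = len(kernel[0]), len(kernel[0][0])
    response = []
    for i in range(height - k_h + 1):
        row = []
        for j in range(width - k_w + 1):
            total = 0.0
            for c in range(channels):
                for u in range(k_h):
                    for v in range(k_w):
                        total += feature[c][i + u][j + v] * kernel[c][u][v]
            row.append(total)
        response.append(row)
    return response


def detector_outputs(feature, kernels):
    """Apply every kernel to the feature; one response map per kernel."""
    return [correlate(feature, kernel) for kernel in kernels]


detection_feature = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one channel, 3x3
weight_kernel = [[[1, 0], [0, 1]]]                        # one channel, 2x2
response_map = correlate(detection_feature, weight_kernel)
```

Calling `detector_outputs` once with the classification kernels and once with the regression kernels would give the two sets of response maps; each location in a map corresponds to one position of the detection frame.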

In an optional example, operation 106 may be performed by a processor invoking corresponding instructions stored in a memory, or by a local region detector run by the processor.

108. Obtain the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector.
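Operation 108 is stated here without a selection rule. One simple possibility, shown purely as an assumption, is to take the candidate box with the highest probability and hand it on together with its regression offsets for refinement. The function name `select_candidate` and the list-based inputs are hypothetical.

```python
def select_candidate(candidates, probabilities, offsets):
    """Pick the most probable candidate box and its offsets (assumed rule).

    candidates, probabilities and offsets are parallel lists, one entry
    per candidate box.
    """
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, candidates[best], offsets[best]


index, box, offset = select_candidate(
    [(10, 10, 8, 8), (12, 11, 16, 8), (9, 14, 8, 16)],
    [0.1, 0.7, 0.2],
    [(0, 0, 0, 0), (0.1, 0.0, 0.0, -0.1), (0, 0, 0, 0)],
)
```

A practical tracker might add further criteria, such as penalizing abrupt changes in position or size, but nothing of that kind is described in this passage.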

選択可能な一例において、該操作１０８はプロセッサによりメモリに記憶された対応のコマンドを呼び出して実行されてもよく、プロセッサで作動する取得ユニットによって実行されてもよい。 In a selectable example, the operation 108 may be executed by invoking a corresponding command stored in memory by the processor or by an acquisition unit operated by the processor.

本開示の上記実施例の目標検出方法によれば、ニューラルネットワークによりテンプレートフレームと検出フレームの特徴をそれぞれ抽出し、テンプレートフレームの特徴に基づいて局所領域検出器の分類の重みと回帰の重みを取得し、検出フレームの特徴を局所領域検出器に入力し、局所領域検出器から出力される複数の候補枠の分類結果と回帰結果を取得し、局所領域検出器から出力される複数の候補枠の分類結果と回帰結果により、検出フレームにおける目標対象物の検出枠を取得する。本開示の実施例では、同一のニューラルネットワーク又は同じ構成を有するニューラルネットワークにより同一の目標対象物の類似特徴をよりよく抽出でき、異なるフレームから抽出された目標対象物の特徴変化が小さく、検出フレームにおける目標対象物の検出結果の正確性を高めることに寄与する。また、テンプレートフレームの特徴に基づいて局所領域検出器の分類の重みと回帰の重みを取得することにより、局所領域検出器は検出フレームの複数の候補枠の分類結果と回帰結果を取得し、更に検出フレームにおける前記目標対象物の検出枠を取得することができ、目標対象物の位置と大きさの変化をよりよく推定でき、検出フレームでの目標対象物の位置をより精確に確定でき、目標追跡の速度や正確性が高くなり、追跡効果に優れ、速度が速い。 According to the target detection method of the above embodiment of the present disclosure, the characteristics of the template frame and the detection frame are extracted by the neural network, and the classification weight and the regression weight of the local region detector are obtained based on the characteristics of the template frame. Then, the characteristics of the detection frame are input to the local area detector, the classification result and regression result of the multiple candidate frames output from the local area detector are acquired, and the multiple candidate frames output from the local area detector are acquired. The detection frame of the target object in the detection frame is acquired from the classification result and the regression result. In the embodiment of the present disclosure, similar features of the same target object can be better extracted by the same neural network or a neural network having the same configuration, the feature change of the target object extracted from different frames is small, and the detection frame. Contributes to improving the accuracy of the detection result of the target object in. 
In addition, by acquiring the classification weight and the regression weight of the local area detector based on the characteristics of the template frame, the local area detector acquires the classification result and the regression result of a plurality of candidate frames of the detection frame, and further. The detection frame of the target object in the detection frame can be acquired, the change in the position and size of the target object can be better estimated, the position of the target object in the detection frame can be determined more accurately, and the target can be determined. The tracking speed and accuracy are high, the tracking effect is excellent, and the speed is fast.

In the embodiments of the present disclosure, based on the template frame, the local region detector can quickly generate a large number of candidate boxes from the detection frame and can obtain, for each of the K candidate boxes at each position of the detection frame, its offset from the detection box of the target object in the template frame. Changes in the position and size of the target object can therefore be estimated better, and the position of the target object in the detection frame can be determined more precisely, which improves both the speed and the accuracy of target tracking.

In another embodiment of the target detection method of the present disclosure, the method may further include:
extracting, by the neural network, features of at least one other detection frame that follows the detection frame in time order in the video sequence;
inputting the features of the at least one other detection frame to the local region detector in turn, and obtaining in turn the plurality of candidate boxes in the at least one other detection frame output by the local region detector, together with the classification result and regression result of each candidate box, that is, performing operation 106 in turn on the features of the at least one other detection frame; and
obtaining in turn the detection box of the target object in the at least one other detection frame based on the classification results and regression results of the plurality of candidate boxes of the at least one other detection frame, that is, performing operation 108 in turn on those classification results and regression results.
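
As an illustration only, the per-frame processing of the following detection frames can be sketched as a simple loop. The callables `extract`, `detect`, and `decode` are hypothetical stand-ins for the neural network, the local region detector (operation 106), and the step that obtains the detection box (operation 108); none of these names comes from the disclosure, and the sketch assumes that `detect` already holds the weights obtained from the template frame.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Frame = TypeVar("Frame")
Feature = TypeVar("Feature")
Result = TypeVar("Result")
Box = TypeVar("Box")


def track_following_frames(
    frames: Iterable[Frame],
    extract: Callable[[Frame], Feature],
    detect: Callable[[Feature], Tuple[Result, Result]],
    decode: Callable[[Result, Result], Box],
) -> List[Box]:
    """Process the detection frames that follow the first one, in order."""
    boxes: List[Box] = []
    for frame in frames:
        feature = extract(frame)                      # feature extraction
        cls_result, reg_result = detect(feature)      # stand-in for operation 106
        boxes.append(decode(cls_result, reg_result))  # stand-in for operation 108
    return boxes
```

Passing the three stages in as callables keeps the loop independent of any particular network or framework.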

In yet another embodiment of the target detection method of the present disclosure, when the detection frame is a region image of the current frame in which the target object may be present (the current frame being the frame on which detection of the target object is performed), the method may further include, beforehand, cropping from the current frame a region image that is centered on the center point of the template frame and whose length and/or width is greater than the corresponding length and/or width of the template frame image, and using that region image as the detection frame.
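
The cropping step above can be sketched as a small geometric helper. This is a minimal sketch under stated assumptions: the `(cx, cy, w, h)` box layout, the single `scale` factor applied to both dimensions, and the clipping to the frame border are illustrative choices that the disclosure does not specify.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height); assumed layout


def search_region(template_box: Box, frame_w: int, frame_h: int,
                  scale: float = 2.0) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a crop centered on the template box.

    The crop is `scale` times the template size in each dimension and is
    clipped to the frame, so near a border it may no longer be centered.
    """
    if scale <= 1.0:
        raise ValueError("scale must exceed 1 so the region is larger than the template")
    cx, cy, w, h = template_box
    half_w, half_h = scale * w / 2.0, scale * h / 2.0
    left = max(0, int(round(cx - half_w)))
    top = max(0, int(round(cy - half_h)))
    right = min(frame_w, int(round(cx + half_w)))
    bottom = min(frame_h, int(round(cy + half_h)))
    return left, top, right, bottom
```

An implementation that must keep the crop exactly centered could pad the frame instead of clipping; that variation is likewise an assumption, not part of the disclosure.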

FIG. 2 is a flowchart of another embodiment of the target detection method of the present disclosure. As shown in FIG. 2, the target detection method of this embodiment includes the following operations.

202. Extract features of the template frame and of the detection frame, respectively, by a neural network.
Here, the template frame is an image of the detection box of the target object, and the image size of the template frame is smaller than that of the detection frame. The detection frame is either the current frame on which detection of the target object is performed or a region image of the current frame in which the target object may be present. The template frame is a frame that precedes the detection frame in detection order in the video sequence and in which the detection box of the target object has already been determined.

In one implementation of the embodiments of the present disclosure, in operation 202 the features of the template frame and of the detection frame may be extracted by the same neural network, or by separate neural networks having the same structure.

In an optional example, operation 202 may be performed by a processor invoking corresponding instructions stored in a memory, or by a neural network run by the processor.

204. Perform a convolution operation on the features of the detection frame with a third convolution layer to obtain a third feature whose number of channels equals that of the features of the detection frame, and perform a convolution operation on the features of the detection frame with a fourth convolution layer to obtain a fourth feature whose number of channels equals that of the features of the detection frame.

In an optional example, operation 204 may be performed by a processor invoking corresponding instructions stored in a memory, or by the third convolution layer and the fourth convolution layer, respectively, run by the processor.

206. Obtain the classification weights and regression weights of the local region detector based on the features of the template frame.

Here, the execution order of operations 206 and 204 is not limited: they may be performed simultaneously or in either order.

In an optional example, operation 206 may be performed by a processor invoking corresponding instructions stored in a memory, or by the first convolution layer and the second convolution layer, respectively, run by the processor.

208. Perform a convolution operation on the third feature using the classification weights to obtain classification results for a plurality of candidate boxes, and perform a convolution operation on the fourth feature using the regression weights to obtain regression results for the plurality of candidate boxes.
Here, the classification results include, for each candidate box, the probability that it is the detection box of the target object, and the regression results include, for each candidate box, its offset from the detection box corresponding to the template frame.

In an optional example, operation 208 may be performed by a processor invoking corresponding instructions stored in a memory, or by the local region detector run by the processor.

210. Obtain the detection box of the target object in the detection frame based on the classification results and regression results of the plurality of candidate boxes output by the local region detector.

In an optional example, operation 210 may be performed by a processor invoking corresponding instructions stored in a memory, or by an acquisition unit run by the processor.
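
The convolution in operation 208 uses template-derived weights as the kernel and a detection-frame feature as the input. The sketch below illustrates that idea in plain Python, assuming features are nested lists indexed as `[channel][row][col]` and that each output channel has its own kernel; these layouts, and the names `correlate` and `detector_head`, are assumptions for illustration. As in most deep learning frameworks, the "convolution" is computed as a cross-correlation (the kernel is not flipped).

```python
from typing import List

Map2D = List[List[float]]
Feature = List[Map2D]  # indexed as [channel][row][col]; assumed layout


def correlate(feature: Feature, kernel: Feature) -> Map2D:
    """Slide a multi-channel kernel over a feature map (no padding, stride 1)."""
    if len(feature) != len(kernel):
        raise ValueError("feature and kernel must have the same number of channels")
    fh, fw = len(feature[0]), len(feature[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out: Map2D = []
    for y in range(fh - kh + 1):
        row = []
        for x in range(fw - kw + 1):
            total = 0.0
            for c in range(len(feature)):
                for dy in range(kh):
                    for dx in range(kw):
                        total += feature[c][y + dy][x + dx] * kernel[c][dy][dx]
            row.append(total)
        out.append(row)
    return out


def detector_head(feature: Feature, kernels: List[Feature]) -> List[Map2D]:
    """Apply one template-derived kernel per output channel."""
    return [correlate(feature, k) for k in kernels]
```

Calling `detector_head` once with the third feature and the classification weights, and once with the fourth feature and the regression weights, would give the two response maps from which the per-candidate results are read.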

In one implementation of the embodiments of the present disclosure, operation 108 or 210 may include selecting one candidate box from the plurality of candidate boxes based on the classification results and regression results, and regressing the selected candidate box according to its offset to obtain the detection box of the target object in the detection frame.
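
To make the regression of the selected candidate box concrete, the following sketch applies an offset to a candidate box. The disclosure does not state how the offset is parameterized; the `(dx, dy, dw, dh)` encoding used here (center shifts scaled by box size, log-scale size changes) is borrowed from common anchor-based detectors and is purely an assumption.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]     # (center_x, center_y, width, height); assumed layout
Offset = Tuple[float, float, float, float]  # (dx, dy, dw, dh); assumed encoding


def regress_box(candidate: Box, offset: Offset) -> Box:
    """Shift and rescale a candidate box by its predicted offset."""
    cx, cy, w, h = candidate
    dx, dy, dw, dh = offset
    return (
        cx + dx * w,       # center shift proportional to box width
        cy + dy * h,       # center shift proportional to box height
        w * math.exp(dw),  # log-scale width change
        h * math.exp(dh),  # log-scale height change
    )
```

With a zero offset the candidate box is returned unchanged, which is a convenient sanity check for any alternative encoding.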

In an optional example, when one candidate box is selected from the plurality of candidate boxes based on the classification results and regression results, the selection may use weighting coefficients for the classification result and the regression result. For example, a composite score may be computed for each candidate box as the sum of the product of its probability and the weighting coefficient of the classification result and the product of its offset and the weighting coefficient of the regression result, and one candidate box may then be selected from the plurality of candidate boxes according to the composite scores.
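
The weighted selection can be sketched as follows. The text specifies only a weighted sum of probability and offset; reducing the offset to a single magnitude (the sum of absolute components) and using a negative regression coefficient, so that larger offsets lower the score, are assumptions made here for illustration, as are the default coefficient values.

```python
from typing import Sequence


def composite_score(prob: float, offset: Sequence[float],
                    w_cls: float = 1.0, w_reg: float = -0.1) -> float:
    """Weighted sum of the probability and the offset magnitude (assumed: sum of |components|)."""
    magnitude = sum(abs(v) for v in offset)
    return w_cls * prob + w_reg * magnitude


def select_candidate(probs: Sequence[float], offsets: Sequence[Sequence[float]],
                     w_cls: float = 1.0, w_reg: float = -0.1) -> int:
    """Return the index of the candidate box with the highest composite score."""
    scores = [composite_score(p, o, w_cls, w_reg) for p, o in zip(probs, offsets)]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch a candidate with a slightly lower probability but a much smaller offset can outrank a candidate with the highest raw probability.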

In another optional example, after the regression results are obtained according to the above embodiments, the method may further include adjusting the probabilities of the candidate boxes according to the amount of change in position and size indicated by the regression results. For example, a penalty is applied to the probability of a candidate box whose change in position is large (that is, a large displacement) and whose change in size is large (that is, a large change in shape), so that its probability is lowered. Correspondingly, in this example, when one candidate box is selected from the plurality of candidate boxes based on the classification results and regression results, the selection may be made according to the adjusted classification results, for example by selecting the candidate box with the highest adjusted probability.
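
One possible form of such a penalty is sketched below. The disclosure says only that large changes in position and size lower the probability; the exponential decay, the way the change is measured, and the constant `k` are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Offset = Tuple[float, float, float, float]  # (dx, dy, dw, dh); assumed encoding


def penalized_prob(prob: float, offset: Offset, k: float = 0.5) -> float:
    """Scale a probability down as the position and size change grow."""
    dx, dy, dw, dh = offset
    change = math.hypot(dx, dy) + abs(dw) + abs(dh)  # displacement plus size change
    return prob * math.exp(-k * change)


def pick_best(probs: Sequence[float], offsets: Sequence[Offset],
              k: float = 0.5) -> Tuple[int, List[float]]:
    """Return the index of the highest adjusted probability and all adjusted values."""
    adjusted = [penalized_prob(p, o, k) for p, o in zip(probs, offsets)]
    best = max(range(len(adjusted)), key=adjusted.__getitem__)
    return best, adjusted
```

A candidate with no change in position or size keeps its original probability, so the penalty only reorders candidates that imply abrupt motion or deformation.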

In an optional example, the above operation of adjusting the probabilities of the candidate boxes according to the amount of change in position and size indicated by the regression results may be performed by a processor invoking corresponding instructions stored in a memory, or by an adjustment unit run by the processor.

FIG. 3 is a flowchart of one embodiment of the training method for the target detection network of the present disclosure. The target detection network of the embodiments of the present disclosure includes the neural network, the first convolution layer, and the second convolution layer of the embodiments of the present disclosure. As shown in FIG. 3, the training method of this embodiment includes the following operations.

302. Extract features of the template frame and of the detection frame, respectively, by a neural network.
Here, the template frame is an image of the detection box of the target object, and the image size of the template frame is smaller than that of the detection frame. The detection frame is either the current frame on which detection of the target object is performed or a region image of the current frame in which the target object may be present. The template frame is a frame that precedes the detection frame in detection order in the video sequence and in which the detection box of the target object has already been determined.

In one implementation of the embodiments of the present disclosure, in operation 302 the features of the template frame and of the detection frame may be extracted by the same neural network, or by separate neural networks having the same structure.

In an optional example, operation 302 may be performed by a processor invoking corresponding instructions stored in a memory, or by a neural network run by the processor.

304. Perform a convolution operation on the features of the template frame with a first convolution layer and use the resulting first feature as the classification weights of the local region detector, and perform a convolution operation on the features of the template frame with a second convolution layer and use the resulting second feature as the regression weights of the local region detector.

In an optional example, operation 304 may be performed by a processor invoking corresponding instructions stored in a memory, or by the first convolution layer and the second convolution layer, respectively, run by the processor.

306. Input the features of the detection frame to the local region detector, and obtain the classification results and regression results of a plurality of candidate boxes output by the local region detector.
Here, the classification results include, for each candidate box, the probability that it is the detection box of the target object, and the regression results include, for each candidate box, its offset from the detection box corresponding to the template frame.

In one implementation of the embodiments of the present disclosure, operation 306 may include performing a convolution operation on the features of the detection frame using the classification weights to obtain the classification results of the plurality of candidate boxes, and performing a convolution operation on the features of the detection frame using the regression weights to obtain the regression results of the plurality of candidate boxes.

In an optional example, operation 306 may be performed by a processor invoking corresponding instructions stored in a memory, or by the local region detector run by the processor.

308. Obtain the detection box of the target object in the detection frame based on the classification results and regression results of the plurality of candidate boxes output by the local region detector.

In an optional example, operation 308 may be performed by a processor invoking corresponding instructions stored in a memory, or by an acquisition unit run by the processor.

310. Take the obtained detection box of the target object in the detection frame as a predicted detection box, and train the neural network, the first convolution layer, and the second convolution layer based on the labeling information of the detection frame and the predicted detection box.

In an optional example, operation 310 may be performed by a processor invoking corresponding instructions stored in a memory, or by a training unit run by the processor.
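
Operation 304 uses the output of a convolution over the template features directly as detector weights. One way this can work, sketched below, is for the first (or second) feature to carry a multiple of the detection-frame channel count, so that its channels can be grouped into several kernels. That channel layout is an assumption for illustration and is not stated in this passage; the function name `split_into_kernels` is likewise hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def split_into_kernels(template_feature: List[Map2D],
                       in_channels: int) -> List[List[Map2D]]:
    """Group consecutive channels of a template-derived feature into kernels.

    Each kernel receives `in_channels` channels so that it can be slid over a
    detection-frame feature that has the same number of channels.
    """
    if in_channels <= 0 or len(template_feature) % in_channels != 0:
        raise ValueError("channel count must be a positive multiple of in_channels")
    return [
        template_feature[i:i + in_channels]
        for i in range(0, len(template_feature), in_channels)
    ]
```

Because the kernels are computed from the template frame rather than stored as fixed parameters, the detector is effectively re-specialized for each target while the layers that produce the kernels are the ones being trained.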

According to the training method for the target detection network of the above embodiment of the present disclosure, features of the template frame and of the detection frame are each extracted by a neural network; the classification weights and regression weights of the local region detector are obtained from the features of the template frame; the features of the detection frame are input to the local region detector, which outputs classification results and regression results for a plurality of candidate boxes; the detection box of the target object in the detection frame is obtained from those classification and regression results; and the target detection network is trained based on the labeling information of the detection frame and the predicted detection box. With a target detection network trained according to the embodiments of the present disclosure, using the same neural network, or neural networks having the same structure, makes it easier to extract similar features for the same target object, so that the features of the target object extracted from different frames vary little, which helps improve the accuracy of the detection result for the target object in the detection frame. In addition, because the classification weights and regression weights of the local region detector are obtained from the features of the template frame, the local region detector can obtain classification and regression results for a plurality of candidate boxes of the detection frame and, from them, the detection box of the target object in the detection frame. Changes in the position and size of the target object can therefore be estimated better, and the position of the target object in the detection frame can be determined more precisely, which improves both the speed and the accuracy of target tracking.

In another embodiment of the training method of the present disclosure, the method may further include:
extracting, by the neural network, features of at least one other detection frame that follows the detection frame in time order in the video sequence;
inputting the features of the at least one other detection frame to the local region detector in turn, and obtaining in turn the plurality of candidate boxes in the at least one other detection frame output by the local region detector, together with the classification result and regression result of each candidate box, that is, performing operation 306 in turn on the features of the at least one other detection frame; and
obtaining in turn the detection box of the target object in the at least one other detection frame based on the classification results and regression results of the plurality of candidate boxes of the at least one other detection frame, that is, performing operation 308 in turn on those classification results and regression results.

In yet another embodiment of the training method of the present disclosure, when the detection frame is a region image of the current frame in which the target object may be present (the current frame being the frame on which detection of the target object is performed), the method may further include, beforehand, cropping from the current frame a region image that is centered on the center point of the template frame and whose length and/or width is greater than the corresponding length and/or width of the template frame image, and using that region image as the detection frame.

FIG. 4 is a flowchart of another embodiment of the training method for the target detection network of the present disclosure. The target detection network of the embodiments of the present disclosure includes the neural network, the first convolution layer, the second convolution layer, the third convolution layer, and the fourth convolution layer of the embodiments of the present disclosure. As shown in FIG. 4, the training method of this embodiment includes the following operations.

402. Extract features of the template frame and of the detection frame, respectively, by a neural network.
Here, the template frame is an image of the detection box of the target object, and the image size of the template frame is smaller than that of the detection frame. The detection frame is either the current frame on which detection of the target object is performed or a region image of the current frame in which the target object may be present. The template frame is a frame that precedes the detection frame in detection order in the video sequence and in which the detection box of the target object has already been determined.

In one implementation of the embodiments of the present disclosure, in operation 402 the features of the template frame and of the detection frame may be extracted by the same neural network, or by separate neural networks having the same structure.

In an optional example, operation 402 may be performed by a processor invoking corresponding instructions stored in a memory, or by a neural network run by the processor.

404. Perform a convolution operation on the features of the detection frame with a third convolution layer to obtain a third feature whose number of channels equals that of the features of the detection frame, and perform a convolution operation on the features of the detection frame with a fourth convolution layer to obtain a fourth feature whose number of channels equals that of the features of the detection frame.

In an optional example, operation 404 may be performed by a processor invoking corresponding instructions stored in a memory, or by the third convolution layer and the fourth convolution layer, respectively, run by the processor.

406. Perform a convolution operation on the features of the template frame with a first convolution layer and use the resulting first feature as the classification weights of the local region detector, and perform a convolution operation on the features of the template frame with a second convolution layer and use the resulting second feature as the regression weights of the local region detector.

Here, the execution order of operations 406 and 404 is not limited: they may be performed simultaneously or in either order.

In an optional example, operation 406 may be performed by a processor invoking corresponding instructions stored in a memory, or by the first convolution layer and the second convolution layer, respectively, run by the processor.

408. Perform a convolution operation on the third feature using the classification weights to obtain classification results for a plurality of candidate boxes, and perform a convolution operation on the fourth feature using the regression weights to obtain regression results for the plurality of candidate boxes.
Here, the classification results include, for each candidate box, the probability that it is the detection box of the target object, and the regression results include, for each candidate box, its offset from the detection box corresponding to the template frame.

In an optional example, operation 408 may be performed by a processor invoking corresponding instructions stored in a memory, or by the local region detector run by the processor.

410. Obtain the detection box of the target object in the detection frame based on the classification results and regression results of the plurality of candidate boxes output by the local region detector.

In an optional example, operation 410 may be performed by a processor invoking corresponding instructions stored in a memory, or by a first feature extraction unit 701 run by the processor.

412. Take the obtained detection box of the target object in the detection frame as a predicted detection box, and adjust the weight values of the neural network, the first convolution layer, and the second convolution layer according to the difference between the position and size of the labeled detection box of the target object in the detection frame and the position and size of the predicted detection box.

In an optional example, operation 412 may be performed by a processor invoking corresponding instructions stored in a memory, or by a training unit run by the processor.
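
The difference in position and size used in operation 412 can be turned into a scalar training signal in many ways. The sketch below uses a smooth L1 loss over `(cx, cy, w, h)`; the choice of loss, the box layout, and the function names are assumptions for illustration, since the disclosure does not name a specific loss function.

```python
from typing import Sequence


def smooth_l1(x: float, beta: float = 1.0) -> float:
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def box_loss(predicted: Sequence[float], labeled: Sequence[float]) -> float:
    """Sum the per-coordinate losses between predicted and labeled boxes.

    Both boxes are assumed to be (center_x, center_y, width, height).
    """
    if len(predicted) != len(labeled):
        raise ValueError("boxes must have the same number of coordinates")
    return sum(smooth_l1(p - l) for p, l in zip(predicted, labeled))
```

In a real training setup the gradient of such a loss with respect to the network parameters would be computed by an automatic differentiation framework and used to update the weight values; that step is outside the scope of this sketch.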

In one implementation of the embodiments of the present disclosure, operation 308 or 410 may include selecting one candidate box from the plurality of candidate boxes based on the classification results and regression results, and regressing the selected candidate box according to its offset to obtain the detection box of the target object in the detection frame.

In an optional example, when one candidate box is selected from the plurality of candidate boxes based on the classification results and regression results, the selection may use weighting coefficients for the classification result and the regression result. For example, a composite score may be computed for each candidate box as the sum of the product of its probability and the weighting coefficient of the classification result and the product of its offset and the weighting coefficient of the regression result, and a candidate box with a high probability and a small offset may then be selected from the plurality of candidate boxes according to the composite scores.

別の選択可能な例では、上記各実施例によって回帰結果を取得した後に、更に、回帰結果の位置と大きさの変化量により候補枠の確率値を調整することを含んでよい。例えば、回帰結果の位置と大きさの変化量により候補枠の確率値を調整する。それに対応して、この例では、分類結果と回帰結果により複数の候補枠から１つの候補枠を選択する時に、調整後の分類結果により複数の候補枠から１つの候補枠を選択し、例えば、調整後の確率値により複数の候補枠から確率値が最も高い候補枠を選択するように実現することができる。 Another selectable example may include, after obtaining the regression results by each of the above embodiments, further adjusting the probability value of the candidate frame according to the amount of change in the position and size of the regression results. For example, the probability value of the candidate frame is adjusted according to the amount of change in the position and size of the regression result. Correspondingly, in this example, when one candidate frame is selected from a plurality of candidate frames based on the classification result and the regression result, one candidate frame is selected from the plurality of candidate frames based on the adjusted classification result, for example. It can be realized that the candidate frame having the highest probability value is selected from a plurality of candidate frames according to the adjusted probability value.

In one optional example, the operation of adjusting the probability values of the candidate boxes according to the amount of change in position and size indicated by the regression results may be performed by a processor invoking corresponding instructions stored in a memory, or by an adjustment unit run by the processor.

In one optional example, when one candidate box is selected from the plurality of candidate boxes according to the classification results and the regression results, the selection may be made according to weighting coefficients for the classification results and the regression results. For example, an overall score is computed for each candidate box from its probability value and its offset, using the weighting coefficients of the classification result and the regression result, and one candidate box is selected from the plurality of candidate boxes according to the overall scores.

In the embodiments of the present disclosure, the local region detector may include a third convolutional layer, a fourth convolutional layer, and two convolution operation units. The detector formed by combining the local region detector with the first convolutional layer and the second convolutional layer may also be called a Region Proposal Network.

Any of the target detection methods and the training methods for a target detection network provided in the embodiments of the present disclosure may be performed by any suitable device having data processing capability, including but not limited to a terminal device and a server. Alternatively, any of these methods may be performed by a processor; for example, the processor performs the method by invoking corresponding instructions stored in a memory. Details are not repeated below.

Those skilled in the art will understand that all or some of the steps of the above method embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk. When the program is executed, the steps of the above method embodiments are performed.

FIG. 5 is a schematic structural diagram of one embodiment of the target detection apparatus of the present disclosure. The target detection apparatus of the embodiments of the present disclosure can be used to implement the above target detection method embodiments. As shown in FIG. 5, the target detection apparatus of this embodiment includes a neural network, a first convolutional layer, a second convolutional layer, a local region detector, and an acquisition unit.

The neural network is used to extract features of a template frame and features of a detection frame, respectively. The template frame is an image of a detection box of the target object, and its image size is smaller than that of the detection frame. The detection frame is the current frame in which the target object is to be detected, or an image of a region of the current frame in which the target object may be present. The template frame is a frame that precedes the detection frame in detection order in the video sequence and in which the detection box of the target object has already been determined. The neural network that extracts the features of the template frame and the neural network that extracts the features of the detection frame may be the same neural network, or may be separate neural networks having the same structure.

The first convolutional layer is used to perform a convolution operation on the features of the template frame, the first feature obtained by the convolution operation being used as the classification weight of the local region detector.

The second convolutional layer is used to perform a convolution operation on the features of the template frame, the second feature obtained by the convolution operation being used as the regression weight of the local region detector.
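For the first and second features to serve as convolution weights, they presumably need to be regrouped into one kernel per output channel. The sketch below shows such a regrouping on nested lists. The `[K*C][h][w]` channel layout and the helper name `split_into_kernels` are assumptions made for illustration; the text does not state how the channels are arranged.

```python
def split_into_kernels(feature, in_channels):
    """Regroup a [K*C][h][w] feature into K kernels, each of shape [C][h][w].

    in_channels (C) is the channel count of the detection frame features
    that the kernels will later be applied to (assumed layout).
    """
    if len(feature) % in_channels != 0:
        raise ValueError("channel count must be a multiple of in_channels")
    return [feature[i:i + in_channels] for i in range(0, len(feature), in_channels)]


# Toy "first feature": 6 channels of size 1x1, to be applied to 2-channel features.
first_feature = [[[float(c)]] for c in range(6)]
cls_kernels = split_into_kernels(first_feature, in_channels=2)
```

In this toy case six channels are regrouped into three kernels of two channels each, so a convolution with them would yield three response maps.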

The local region detector is used to output classification results and regression results for a plurality of candidate boxes according to the features of the detection frame. The classification results include, for each candidate box, the probability that it is the detection box of the target object, and the regression results include, for each candidate box, its offset from the detection box corresponding to the template frame.

The acquisition unit is used to acquire the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector.

According to the target detection apparatus of the above embodiments of the present disclosure, features of the template frame and of the detection frame are extracted by a neural network; the classification weight and the regression weight of the local region detector are obtained based on the features of the template frame; the features of the detection frame are input into the local region detector to obtain classification results and regression results for a plurality of candidate boxes; and the detection box of the target object in the detection frame is acquired according to those classification results and regression results. In the embodiments of the present disclosure, using the same neural network, or neural networks having the same structure, allows similar features of the same target object to be extracted more reliably, so that the features of the target object extracted from different frames vary little, which helps improve the accuracy of the detection result for the target object in the detection frame. In addition, because the classification weight and the regression weight of the local region detector are obtained from the features of the template frame, the local region detector can produce classification results and regression results for a plurality of candidate boxes in the detection frame, from which the detection box of the target object is acquired. Changes in the position and size of the target object can thus be estimated better and the position of the target object in the detection frame determined more precisely, which improves both the speed and the accuracy of target tracking.

In one implementation of the embodiments of the target detection apparatus of the present disclosure, the local region detector is used to perform a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes, and to perform a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.
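The convolution of the detection frame features with template-derived weights can be illustrated with a small pure-Python cross-correlation. This is a sketch under assumptions: stride 1, no padding, and nested lists in place of tensors. The function name `correlate` and the toy values are hypothetical.

```python
def correlate(features, kernels):
    """Valid cross-correlation (stride 1, no padding; both are assumptions).

    features: [C][H][W] nested lists (features of the detection frame)
    kernels:  [K][C][h][w] nested lists (weights derived from the template frame)
    returns:  [K][H-h+1][W-w+1] response maps, one per kernel
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    k_h, k_w = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for kernel in kernels:
        response = []
        for i in range(height - k_h + 1):
            row = []
            for j in range(width - k_w + 1):
                total = 0.0
                for c in range(channels):
                    for di in range(k_h):
                        for dj in range(k_w):
                            total += features[c][i + di][j + dj] * kernel[c][di][dj]
                row.append(total)
            response.append(row)
        out.append(response)
    return out


# One-channel 3x3 detection features and a single 2x2 "classification" kernel.
detection_features = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
cls_weight = [[[[1, 0], [0, 1]]]]
cls_map = correlate(detection_features, cls_weight)
```

The same routine would be called a second time with the regression weight to produce the regression response maps.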

Where the detection frame is an image of a region of the current frame in which the target object may be present, another embodiment of the target detection apparatus of the present disclosure may further include a preprocessing unit. The preprocessing unit is used to crop from the current frame, centered on the center point of the template frame, a region image whose length and/or width is larger than the length and/or width of the template frame image, and to use that region image as the detection frame. FIG. 6 is a schematic structural diagram of this other embodiment of the target detection apparatus of the present disclosure.
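The cropping step performed by the preprocessing unit might look like the sketch below. The enlargement factor `scale`, the `(cx, cy, w, h)` box layout, and the clipping at the image border are assumptions; the text only requires that the cropped region be centered on the template frame's center point and be larger than the template frame image.

```python
def search_region(template_box, frame_size, scale=2.0):
    """Return (x0, y0, x1, y1) of a region centered on the template box.

    template_box: (cx, cy, w, h) in current-frame pixel coordinates (assumed layout)
    frame_size:   (width, height) of the current frame
    scale:        enlargement factor relative to the template box (assumption)
    The region is clipped to the frame, so near a border it is no longer centered.
    """
    cx, cy, w, h = template_box
    frame_w, frame_h = frame_size
    half_w = w * scale / 2.0
    half_h = h * scale / 2.0
    x0 = max(0, int(round(cx - half_w)))
    y0 = max(0, int(round(cy - half_h)))
    x1 = min(frame_w, int(round(cx + half_w)))
    y1 = min(frame_h, int(round(cy + half_h)))
    return x0, y0, x1, y1


def crop(frame, region):
    """frame: [H][W] nested list of pixels; returns the sub-image for region."""
    x0, y0, x1, y1 = region
    return [row[x0:x1] for row in frame[y0:y1]]


# Toy 8x10 frame whose pixel value encodes its position as row * 10 + col.
frame = [[r * 10 + c for c in range(10)] for r in range(8)]
region = search_region((5, 4, 2, 2), frame_size=(10, 8))
patch = crop(frame, region)
```

Restricting detection to such a region keeps the feature map small, which is consistent with the speed benefit the disclosure describes.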

Referring again to FIG. 6, still another embodiment of the target detection apparatus of the present disclosure may further include a third convolutional layer for performing a convolution operation on the features of the detection frame to obtain a third feature whose number of channels is the same as that of the features of the detection frame. Correspondingly, in this embodiment, the local region detector is used to perform the convolution operation on the third feature using the classification weight.

Referring again to FIG. 6, yet another embodiment of the target detection apparatus of the present disclosure may further include a fourth convolutional layer for performing a convolution operation on the features of the detection frame to obtain a fourth feature whose number of channels is the same as that of the features of the detection frame. Correspondingly, in this embodiment, the local region detector is used to perform the convolution operation on the fourth feature using the regression weight.

In another implementation of the embodiments of the target detection apparatus of the present disclosure, the acquisition unit is used to select one candidate box from the plurality of candidate boxes according to the classification results and the regression results, and to regress the selected candidate box according to its offset, thereby acquiring the detection box of the target object in the detection frame.
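Regressing the selected candidate box by its offset can be sketched as below. The text does not define how the offset is encoded, so this example assumes the parameterization commonly used with region proposal networks: center shifts relative to the box size and log-scale changes of width and height.

```python
import math


def apply_offsets(candidate, offsets):
    """Shift and rescale a candidate box by its regression offsets.

    candidate: (cx, cy, w, h)
    offsets:   (dx, dy, dw, dh), with dx, dy relative to w, h and dw, dh in
               log scale. This encoding is an assumption, not taken from the text.
    """
    cx, cy, w, h = candidate
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


detection_box = apply_offsets((50.0, 50.0, 20.0, 40.0), (0.1, -0.05, 0.0, 0.0))
```

With zero log-scale offsets the size is unchanged and only the center moves, here by 2 pixels horizontally and 2 pixels vertically.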

For example, when selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, the acquisition unit is used to make the selection according to weighting coefficients for the classification results and the regression results.

Referring again to FIG. 6, yet another embodiment of the target detection apparatus of the present disclosure may further include an adjustment unit for adjusting the classification results according to the regression results. Correspondingly, when selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, the acquisition unit is used to make the selection according to the adjusted classification results.

FIG. 7 is a schematic structural diagram of yet another embodiment of the target detection apparatus of the present disclosure. The target detection apparatus of this embodiment can be used to implement any of the embodiments of the training method for a target detection network shown in FIGS. 3 and 4 of the present disclosure. As shown in FIG. 7, compared with the embodiment shown in FIG. 5 or FIG. 6, the target detection apparatus of this embodiment further includes a training unit. The training unit takes the acquired detection box of the target object in the detection frame as a predicted detection box, and trains the neural network, the first convolutional layer, and the second convolutional layer based on the labeling information of the detection frame and the predicted detection box.

In one implementation, the labeling information of the detection frame includes the labeled position and size of the detection box of the target object in the detection frame. Correspondingly, in this implementation, the training unit is used to adjust the weight values of the neural network, the first convolutional layer, and the second convolutional layer according to the difference between the position and size of the labeled detection box and the position and size of the predicted detection box.
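The difference between the labeled and predicted detection boxes has to be reduced to a scalar before it can drive a weight update. The sketch below uses a smooth L1 loss over `(cx, cy, w, h)`. The choice of loss and the box layout are assumptions; the text only says that the weights are adjusted according to the difference in position and size.

```python
def smooth_l1(diff, beta=1.0):
    """Quadratic near zero, linear for large errors (assumed loss form)."""
    diff = abs(diff)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def box_loss(predicted, labeled):
    """Mean smooth L1 difference over the position and size of two boxes."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, labeled)) / len(labeled)


labeled = (50.0, 50.0, 20.0, 20.0)     # labeled detection box (cx, cy, w, h)
predicted = (50.5, 53.0, 20.0, 20.0)   # predicted detection box
loss = box_loss(predicted, labeled)
```

In a real training loop this value would be back-propagated through the neural network and the first and second convolutional layers by an automatic differentiation framework, which is outside the scope of this sketch.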

According to the above embodiments of the present disclosure, features of the template frame and of the detection frame are extracted by a neural network; the classification weight and the regression weight of the local region detector are obtained based on the features of the template frame; the features of the detection frame are input into the local region detector to obtain classification results and regression results for a plurality of candidate boxes; the detection box of the target object in the detection frame is acquired according to those classification results and regression results; and the target detection network is trained based on the labeling information of the detection frame and the predicted detection box. With a target detection network trained according to the embodiments of the present disclosure, using the same neural network, or neural networks having the same structure, allows similar features of the same target object to be extracted more reliably, so that the features of the target object extracted from different frames vary little, which helps improve the accuracy of the detection result for the target object in the detection frame. In addition, because the classification weight and the regression weight of the local region detector are obtained from the features of the template frame, the local region detector can produce classification results and regression results for a plurality of candidate boxes in the detection frame, from which the detection box of the target object is acquired. Changes in the position and size of the target object can thus be estimated better and the position of the target object in the detection frame determined more precisely, which improves both the speed and the accuracy of target tracking.

FIG. 8 is a schematic structural diagram of one application embodiment of the target detection apparatus of the present disclosure. FIG. 9 is a schematic structural diagram of another application embodiment of the target detection apparatus of the present disclosure. In FIGS. 8 and 9, in the notation LxMxN (for example, 256x20x20), L denotes the number of channels, and M and N denote the height (that is, the length) and the width, respectively.

The embodiments of the present disclosure further provide an electronic device that includes the target detection apparatus of any one of the above embodiments of the present disclosure.

The embodiments of the present disclosure further provide another electronic device, including: a memory for storing executable instructions; and a processor for communicating with the memory to execute the executable instructions, thereby completing the operations of the target detection method or the training method for a target detection network of any one of the above embodiments of the present disclosure.

FIG. 10 is a schematic structural diagram of one application embodiment of the electronic device of the present disclosure, showing an electronic device suitable for implementing a terminal device or a server of the embodiments of the present application. As shown in FIG. 10, the electronic device includes one or more processors, a communication unit, and the like. The one or more processors are, for example, one or more central processing units (CPUs) and/or one or more graphics processing units (GPUs). The processor can perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) or executable instructions loaded from a storage unit into a random access memory (RAM). The communication unit may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory and/or the random access memory to execute the executable instructions, is connected to the communication unit via a bus, and communicates with other target devices via the communication unit, thereby completing the operations corresponding to any of the methods provided in the embodiments of the present disclosure. For example, the operations include: extracting, by a neural network, features of a template frame and features of a detection frame, respectively, the template frame being an image of a detection box of a target object and having a smaller image size than the detection frame; obtaining a classification weight and a regression weight of a local region detector based on the features of the template frame; inputting the features of the detection frame into the local region detector and obtaining classification results and regression results of a plurality of candidate boxes output by the local region detector; and acquiring the detection box of the target object in the detection frame according to those classification results and regression results. As a further example, the operations include: extracting, by a neural network, features of a template frame and features of a detection frame, respectively, the template frame being an image of a detection box of a target object and having a smaller image size than the detection frame; increasing the channels of the features of the template frame by a first convolutional layer and using the resulting first feature as the classification weight of the local region detector; increasing the channels of the features of the template frame by a second convolutional layer and using the resulting second feature as the regression weight of the local region detector; inputting the features of the detection frame into the local region detector and obtaining classification results and regression results of a plurality of candidate boxes output by the local region detector; acquiring the detection box of the target object in the detection frame according to those classification results and regression results; and taking the acquired detection box as a predicted detection box and training the neural network, the first convolutional layer, and the second convolutional layer based on the labeling information of the detection frame and the predicted detection box.

In addition, the RAM may store various programs and data required for the operation of the apparatus. The CPU, the ROM, and the RAM are connected to one another via a bus. Where a RAM is present, the ROM is an optional module. The RAM stores executable instructions, or executable instructions are written into the ROM at run time, and the executable instructions cause the processor to perform the operations corresponding to any of the above methods of the present disclosure. An input/output (I/O) interface is also connected to the bus. The communication unit may be integrated, or may be provided with a plurality of sub-modules (for example, a plurality of IB network cards) that are linked via the bus.

The following components are connected to the I/O interface: an input unit including a keyboard, a mouse, and the like; an output unit including, for example, a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage unit including a hard disk and the like; and a communication unit including a network interface card such as a LAN card or a modem. The communication unit performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive as needed, so that a computer program read from it can be installed in the storage unit as needed.

Note that the architecture shown in FIG. 10 is only one optional implementation. In practice, the number and types of the components in FIG. 10 may be selected, removed, added, or replaced according to actual needs. Components with different functions may be provided separately or in an integrated manner. For example, the GPU and the CPU may be provided separately, or the GPU may be integrated into the CPU; the communication unit may be provided separately, or may be integrated into the CPU or the GPU. All of these alternative implementations fall within the scope of protection of the present application.

The embodiments of the present disclosure further provide a computer storage medium for storing computer-readable instructions which, when executed, implement the operations of the target detection method or the training method for a target detection network of any one of the above embodiments of the present disclosure.

The embodiments of the present disclosure further provide a computer program including computer-readable instructions. When the computer-readable instructions run on a device, a processor in the device executes executable instructions for implementing the operations of the target detection method or the training method for a target detection network of any one of the above embodiments of the present disclosure.

The embodiments of the present disclosure can perform single-target tracking. For example, a multi-target tracking system need not run target detection on every frame. Instead, it may run detection once per predetermined detection interval, for example once every 10 frames, and use single-target tracking to determine the positions of the targets in the 9 intermediate frames. Because the algorithm of the embodiments of the present disclosure is fast, the multi-target tracking system as a whole can complete tracking faster and achieve better results.
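The interleaving of periodic detection with per-frame single-target tracking can be sketched as a simple scheduling loop. The callables `detect` and `track` are hypothetical stand-ins for a full detector and for the single-target tracker; the stub implementations below exist only so that the example runs.

```python
def run_multi_target(frames, detect, track, interval=10):
    """Run full detection every `interval` frames and tracking on the others.

    detect(frame) -> list of boxes; track(frame, box) -> box.
    Both callables are hypothetical placeholders.
    """
    log = []
    boxes = []
    for index, frame in enumerate(frames):
        if index % interval == 0:
            boxes = detect(frame)
            log.append(("detect", list(boxes)))
        else:
            boxes = [track(frame, box) for box in boxes]
            log.append(("track", list(boxes)))
    return log


def fake_detect(frame):
    # Stub detector: pretends one target sits at x = 10 + frame index.
    return [(10 + frame, 10, 5, 5)]


def fake_track(frame, box):
    # Stub tracker: pretends the target moved one pixel to the right.
    x, y, w, h = box
    return (x + 1, y, w, h)


log = run_multi_target(range(20), fake_detect, fake_track)
```

Over 20 frames with an interval of 10, the detector runs twice and the tracker handles the remaining 18 frames.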

The embodiments in this specification are described in a progressive manner, with the description of each embodiment focusing on its differences from the others; for parts that are the same or similar, the embodiments may be referred to one another. Because the system embodiments essentially correspond to the method embodiments, they are described only briefly, and the relevant parts of the method embodiments may be consulted for details.

The methods and apparatuses of the present disclosure can be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the method steps is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise stated. In some embodiments, the present disclosure may also be embodied as programs recorded on a recording medium, the programs including machine-readable instructions for implementing the methods of the present disclosure. Accordingly, the present disclosure also covers a recording medium storing a program for executing the methods of the present disclosure.

The description of the present application is given for purposes of illustration and explanation, and is not intended to be exhaustive or to limit the present application to the form disclosed. Many modifications and variations will be apparent to those skilled in the art. The embodiments were chosen and described in order to better explain the principles and practical applications of the present application, and to enable those skilled in the art to understand the present application and design various embodiments with various modifications suited to particular uses.

Claims

A target detection method, comprising:
extracting, by a neural network, features of a template frame and features of a detection frame respectively, wherein the template frame is an image of a detection box of a target object and is smaller in image size than the detection frame;
obtaining a classification weight and a regression weight of a local region detector based on the features of the template frame;
inputting the features of the detection frame into the local region detector to obtain classification results and regression results of a plurality of candidate boxes output by the local region detector; and
obtaining a detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector.

The method according to claim 1, further comprising:
extracting, by the neural network, features of at least one other detection frame that is located after the detection frame in time sequence in a video sequence;
inputting the features of the at least one other detection frame into the local region detector in sequence, to obtain in sequence a plurality of candidate boxes in the at least one other detection frame output by the local region detector, together with the classification result and regression result of each candidate box; and
obtaining in sequence a detection box of the target object in the at least one other detection frame according to the classification results and regression results of the plurality of candidate boxes of the at least one other detection frame.
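
The reuse described in claims 1 and 2 can be sketched as a loop: the classification and regression weights are derived once from the template frame and then applied to each later detection frame in turn. This is only an illustrative sketch. The function `track` and its four callable parameters (`extract`, `make_weights`, `detect`, `pick_box`) are hypothetical names standing in for the neural network, the weight-generating layers, the local region detector, and the box-selection step; none of them appears in the claims.

```python
def track(template, frames, extract, make_weights, detect, pick_box):
    """Detect the target in each frame, reusing weights built from the template.

    All four callables are hypothetical stand-ins supplied by the caller.
    """
    # The weights come from the template frame and are computed only once.
    cls_weight, reg_weight = make_weights(extract(template))
    boxes = []
    for frame in frames:  # frames are processed in their temporal order
        cls_results, reg_results = detect(extract(frame), cls_weight, reg_weight)
        boxes.append(pick_box(cls_results, reg_results))
    return boxes
```

Because the template branch runs once per target rather than once per frame, the per-frame cost in this sketch is limited to feature extraction on the detection frame plus the detector itself.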

The method according to claim 1 or 2, wherein extracting the features of the template frame and the detection frame by the neural network comprises:
extracting the features of the template frame and the detection frame by the same neural network; or
extracting the features of the template frame and the detection frame respectively by separate neural networks having the same structure.

The method according to any one of claims 1 to 3, wherein the template frame is a frame whose detection timing precedes that of the detection frame in the video sequence and in which the detection box of the target object has been determined.

The method according to any one of claims 1 to 4, wherein the detection frame is a current frame in which the target object is to be detected, or a region image of the current frame in which the target object may be present.

The method according to claim 5, wherein, when the detection frame is a region image of the current frame in which the target object may be present, the method further comprises:
cropping from the current frame, as the detection frame, a region image that is centered on the center point of the template frame and whose length and/or width is larger than the length and/or width, respectively, of the image of the template frame.
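
The cropping step of claim 6 can be illustrated with a small coordinate calculation. The sketch below assumes the template is given as a centre-size tuple `(cx, cy, w, h)` and enlarges it by a factor `scale`; the function name `crop_search_region`, the `scale` parameter, and the clipping to the frame border are all assumptions made for illustration and are not specified in the claims.

```python
def crop_search_region(frame_w, frame_h, template_box, scale=2.0):
    """Return (x0, y0, x1, y1) of a region centred on the template centre.

    template_box is (cx, cy, w, h). scale > 1 is an assumed enlargement
    factor, so the region is wider and taller than the template image.
    """
    cx, cy, w, h = template_box
    half_w, half_h = w * scale / 2.0, h * scale / 2.0
    # Clipping to the frame is an assumption; padding is another option.
    x0 = max(0.0, cx - half_w)
    y0 = max(0.0, cy - half_h)
    x1 = min(float(frame_w), cx + half_w)
    y1 = min(float(frame_h), cy + half_h)
    return x0, y0, x1, y1


region = crop_search_region(640, 480, (320, 240, 100, 50))
print(region)  # (220.0, 190.0, 420.0, 290.0)
```

Note that when the template sits near the frame border, clipping shifts the centre of the cropped region away from the template centre, which is one reason an implementation might pad the frame instead.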

The method according to any one of claims 1 to 6, wherein obtaining the classification weight of the local region detector based on the features of the template frame comprises:
performing a convolution operation on the features of the template frame by a first convolutional layer, and using a first feature obtained by the convolution operation as the classification weight of the local region detector.

The method according to any one of claims 1 to 7, wherein obtaining the regression weight of the local region detector based on the features of the template frame comprises:
performing a convolution operation on the features of the template frame by a second convolutional layer, and using a second feature obtained by the convolution operation as the regression weight of the local region detector.
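
One possible reading of claims 7 and 8 is that the convolutional layer produces a template feature with more channels, which is then cut into a stack of kernels that act as the detector's weights. The sketch below shows only that cutting step, assuming a channel-first feature stored as nested lists. The function name `split_into_kernels` and the choice to split along the channel axis are assumptions for illustration; the claims state only that the convolved feature is used as the weight.

```python
def split_into_kernels(feature, in_channels):
    """Split an (n * in_channels) x h x w feature into n kernels.

    Each returned kernel has shape in_channels x h x w, so it can be slid
    over a detection-frame feature that has in_channels channels.
    """
    if in_channels <= 0 or len(feature) % in_channels != 0:
        raise ValueError("channel count must be a multiple of in_channels")
    return [feature[i:i + in_channels]
            for i in range(0, len(feature), in_channels)]


# Six channels of a 1 x 1 template feature become two 3-channel kernels.
template_feature = [[[float(c)]] for c in range(6)]
kernels = split_into_kernels(template_feature, 3)
print(len(kernels), len(kernels[0]))  # 2 3
```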

The method according to any one of claims 1 to 8, wherein inputting the features of the detection frame into the local region detector to obtain the classification results and regression results of the plurality of candidate boxes output by the local region detector comprises:
performing a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes; and
performing a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.
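
The operation in claim 9 amounts to sliding template-derived kernels over the detection-frame feature. A minimal pure-Python sketch follows, assuming channel-first nested lists, stride 1, and no padding. The names `correlate` and `local_region_detector` are hypothetical, and the number of kernels per branch (for example, some multiple of the number of candidate boxes per position) is left to the caller because the claim does not fix it.

```python
def correlate(feature, kernel):
    """Valid cross-correlation of a C x H x W feature with a C x kh x kw kernel."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(height - kh + 1):
        row = []
        for j in range(width - kw + 1):
            total = 0.0
            for c in range(channels):
                for di in range(kh):
                    for dj in range(kw):
                        total += feature[c][i + di][j + dj] * kernel[c][di][dj]
            row.append(total)
        out.append(row)
    return out


def local_region_detector(det_feature, cls_kernels, reg_kernels):
    """Apply each classification and regression kernel to the same feature."""
    cls_maps = [correlate(det_feature, k) for k in cls_kernels]
    reg_maps = [correlate(det_feature, k) for k in reg_kernels]
    return cls_maps, reg_maps


feature = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]  # 1 x 3 x 3
kernel = [[[1, 0], [0, 1]]]                    # 1 x 2 x 2
print(correlate(feature, kernel))  # [[6.0, 8.0], [12.0, 14.0]]
```

Each position of an output map corresponds to one location in the detection frame, so the maps can be read as per-location scores and offsets for the candidate boxes.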

The method according to claim 9, further comprising, after extracting the features of the detection frame: performing a convolution operation on the features of the detection frame by a third convolutional layer to obtain a third feature whose number of channels is the same as the number of channels of the features of the detection frame,
wherein performing the convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes comprises: performing a convolution operation on the third feature using the classification weight to obtain the classification results of the plurality of candidate boxes.

The method according to claim 9 or 10, further comprising, after extracting the features of the template frame: performing a convolution operation on the features of the detection frame by a fourth convolutional layer to obtain a fourth feature whose number of channels is the same as the number of channels of the features of the detection frame,
wherein performing the convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes comprises: performing a convolution operation on the fourth feature using the regression weight to obtain the regression results of the plurality of candidate boxes.

The method according to any one of claims 1 to 11, wherein obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector comprises:
selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, and regressing the selected candidate box according to its offset, to obtain the detection box of the target object in the detection frame.
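
The select-then-regress step of claim 12 can be sketched as follows. The offset parameterisation used here (centre shifts scaled by box size, log-scale changes for width and height) is the one commonly used in region-proposal detectors and is an assumption; the claim says only that the selected candidate box is regressed according to its offset. Selection by the highest classification score alone is likewise a simplification, and the names `apply_offsets` and `detect_box` are hypothetical.

```python
import math


def apply_offsets(box, offsets):
    """Shift and rescale a (cx, cy, w, h) box by (dx, dy, dw, dh).

    Assumed parameterisation: dx, dy are fractions of the box size and
    dw, dh are log-scale factors.
    """
    cx, cy, w, h = box
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def detect_box(candidates, cls_scores, offsets):
    """Pick the best-scoring candidate and regress it by its own offsets."""
    best = max(range(len(candidates)), key=lambda i: cls_scores[i])
    return apply_offsets(candidates[best], offsets[best])


candidates = [(10, 20, 4, 8), (50, 50, 10, 10)]
scores = [0.2, 0.9]
offsets = [(0, 0, 0, 0), (0.5, -0.25, 0, 0)]
print(detect_box(candidates, scores, offsets))  # (55.0, 47.5, 10.0, 10.0)
```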

The method according to claim 12, wherein selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results comprises:
selecting one candidate box from the plurality of candidate boxes according to weighting coefficients of the classification results and the regression results.

The method according to claim 12, further comprising, after obtaining the regression results: adjusting the classification results according to the regression results,
wherein selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results comprises: selecting one candidate box from the plurality of candidate boxes according to the adjusted classification results.
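
Claims 13 and 14 describe two ways of letting the regression results influence which candidate is chosen. The sketch below gives one hypothetical form of each. In `weighted_pick`, `reg_scores` is an assumed per-candidate score derived from the regression results, and the weights 0.7 and 0.3 are arbitrary. In `adjusted_pick`, the classification score is damped by an exponential penalty on the predicted size change, which is an assumed adjustment rule and not one stated in the claims.

```python
import math


def weighted_pick(cls_scores, reg_scores, w_cls=0.7, w_reg=0.3):
    """Index of the candidate with the highest weighted sum of both scores."""
    combined = [w_cls * c + w_reg * r for c, r in zip(cls_scores, reg_scores)]
    return max(range(len(combined)), key=combined.__getitem__)


def adjusted_pick(cls_scores, offsets, penalty=1.0):
    """Index of the best candidate after damping scores by size change.

    offsets holds (dx, dy, dw, dh) per candidate; a large |dw| + |dh|
    lowers the classification score (assumed rule).
    """
    adjusted = [c * math.exp(-penalty * (abs(dw) + abs(dh)))
                for c, (_, _, dw, dh) in zip(cls_scores, offsets)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


print(weighted_pick([0.9, 0.8], [0.1, 0.9]))                       # 1
print(adjusted_pick([0.9, 0.8], [(0, 0, 1.0, 1.0), (0, 0, 0, 0)]))  # 1
```

In both calls the second candidate wins even though its raw classification score is lower, which is the effect either variant is meant to make possible.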

A method of training a target detection network, comprising:
extracting, by a neural network, features of a template frame and features of a detection frame respectively, wherein the template frame is an image of a detection box of a target object and is smaller in image size than the detection frame;
performing a convolution operation on the features of the template frame by a first convolutional layer and using a first feature obtained by the convolution operation as a classification weight of a local region detector, and performing a convolution operation on the features of the template frame by a second convolutional layer and using a second feature obtained by the convolution operation as a regression weight of the local region detector;
inputting the features of the detection frame into the local region detector to obtain classification results and regression results of a plurality of candidate boxes output by the local region detector;
obtaining a detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector; and
taking the obtained detection box of the target object in the detection frame as a predicted detection box, and training the neural network, the first convolutional layer, and the second convolutional layer based on labeling information of the detection frame and the predicted detection box.

The method according to claim 15, further comprising:
extracting, by the neural network, features of at least one other detection frame that is located after the detection frame in time sequence in a video sequence;
inputting the features of the at least one other detection frame into the local region detector in sequence, to obtain in sequence a plurality of candidate boxes in the at least one other detection frame output by the local region detector, together with the classification result and regression result of each candidate box; and
obtaining in sequence a detection box of the target object in the at least one other detection frame according to the classification results and regression results of the plurality of candidate boxes of the at least one other detection frame.

The method according to claim 15 or 16, wherein extracting the features of the template frame and the detection frame by the neural network comprises:
extracting the features of the template frame and the detection frame by the same neural network; or
extracting the features of the template frame and the detection frame respectively by separate neural networks having the same structure.

The method according to any one of claims 15 to 17, wherein the template frame is a frame whose detection timing precedes that of the detection frame in the video sequence and in which the detection box of the target object has been determined.

The method according to any one of claims 15 to 18, wherein the detection frame is a current frame in which the target object is to be detected, or a region image of the current frame in which the target object may be present.

The method according to claim 19, wherein, when the detection frame is a region image of the current frame in which the target object may be present, the method further comprises:
cropping from the current frame, as the detection frame, a region image that is centered on the center point of the template frame and whose length and/or width is larger than the length and/or width, respectively, of the image of the template frame.

The method according to any one of claims 15 to 20, wherein inputting the features of the detection frame into the local region detector to obtain the classification results and regression results of the plurality of candidate boxes output by the local region detector comprises:
performing a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes; and
performing a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.

The method according to claim 21, further comprising, after extracting the features of the detection frame:
performing a convolution operation on the features of the detection frame by a third convolutional layer to obtain a third feature whose number of channels is the same as the number of channels of the features of the detection frame,
wherein performing the convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes comprises: performing a convolution operation on the third feature using the classification weight to obtain the classification results of the plurality of candidate boxes.

The method according to claim 21, further comprising, after extracting the features of the template frame:
performing a convolution operation on the features of the detection frame by a fourth convolutional layer to obtain a fourth feature whose number of channels is the same as the number of channels of the features of the detection frame,
wherein performing the convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes comprises: performing a convolution operation on the fourth feature using the regression weight to obtain the regression results of the plurality of candidate boxes.

The method according to any one of claims 15 to 23, wherein obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector comprises:
selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, and regressing the selected candidate box according to its offset, to obtain the detection box of the target object in the detection frame.

The method according to claim 24, wherein selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results comprises:
selecting one candidate box from the plurality of candidate boxes according to weighting coefficients of the classification results and the regression results.

The method according to claim 25, further comprising, after obtaining the regression results: adjusting the classification results according to the regression results,
wherein selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results comprises: selecting one candidate box from the plurality of candidate boxes according to the adjusted classification results.

The method according to any one of claims 15 to 26, wherein the labeling information of the detection frame includes the position and size of a labeled detection box of the target object in the detection frame, and
wherein taking the obtained detection box of the target object in the detection frame as the predicted detection box and training the neural network, the first convolutional layer, and the second convolutional layer based on the labeling information of the detection frame and the predicted detection box comprises:
adjusting weight values of the neural network, the first convolutional layer, and the second convolutional layer according to the difference between the position and size of the labeled detection box and the position and size of the predicted detection box.
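
The difference between labeled and predicted position and size in claim 27 would typically be turned into a scalar loss that drives the weight adjustment. The sketch below uses a smooth L1 loss over `(cx, cy, w, h)`, which is an assumed choice; the claim does not name a loss function. The names `smooth_l1` and `box_loss` and the `beta` parameter are hypothetical.

```python
def smooth_l1(x, beta=1.0):
    """Quadratic for small |x|, linear for large |x| (assumed loss form)."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def box_loss(predicted, labeled):
    """Sum of smooth L1 terms over position (cx, cy) and size (w, h)."""
    return sum(smooth_l1(p - l) for p, l in zip(predicted, labeled))


predicted = (10.0, 10.0, 4.0, 4.0)
labeled = (10.0, 12.0, 4.5, 4.0)
print(box_loss(predicted, labeled))  # 1.625
```

In a training loop, the gradient of such a loss with respect to the weights of the neural network and the first and second convolutional layers would be used to update them; the gradient step itself is omitted here.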

A target detection apparatus, comprising:
a neural network configured to extract features of a template frame and features of a detection frame respectively, wherein the template frame is an image of a detection box of a target object and is smaller in image size than the detection frame;
a first convolutional layer configured to increase the number of channels of the features of the template frame, a first feature thus obtained being used as a classification weight of a local region detector;
a second convolutional layer configured to increase the number of channels of the features of the template frame, a second feature thus obtained being used as a regression weight of the local region detector;
the local region detector, configured to output classification results and regression results of a plurality of candidate boxes according to the features of the detection frame; and
an acquisition unit configured to obtain a detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local region detector.

The apparatus according to claim 28, wherein the neural network includes separate neural networks that have the same structure and are used to extract the features of the template frame and the features of the detection frame, respectively.

The apparatus according to claim 28 or 29, wherein the template frame is a frame whose detection timing precedes that of the detection frame in the video sequence and in which the detection box of the target object has been determined.

The apparatus according to any one of claims 28 to 30, wherein the detection frame is a current frame in which the target object is to be detected, or a region image of the current frame in which the target object may be present.

The apparatus according to claim 31, further comprising a preprocessing unit configured to crop from the current frame, as the detection frame, a region image that is centered on the center point of the template frame and whose length and/or width is larger than the length and/or width, respectively, of the image of the template frame.

The apparatus according to any one of claims 28 to 32, wherein the local region detector is configured to perform a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes, and to perform a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.

The apparatus according to claim 33, further comprising a third convolutional layer configured to perform a convolution operation on the features of the detection frame to obtain a third feature whose number of channels is the same as the number of channels of the features of the detection frame,
wherein the local region detector is configured to perform a convolution operation on the third feature using the classification weight.

The apparatus according to claim 33, further comprising a fourth convolutional layer configured to perform a convolution operation on the features of the detection frame to obtain a fourth feature whose number of channels is the same as the number of channels of the features of the detection frame,
wherein the local region detector is configured to perform a convolution operation on the fourth feature using the regression weight.

The apparatus according to any one of claims 28 to 35, wherein the acquisition unit is configured to select one candidate box from the plurality of candidate boxes according to the classification results and the regression results, and to regress the selected candidate box according to its offset, to obtain the detection box of the target object in the detection frame.

The apparatus according to claim 36, wherein, when selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, the acquisition unit is configured to select one candidate box from the plurality of candidate boxes according to weighting coefficients of the classification results and the regression results.

The apparatus according to claim 36, further comprising an adjustment unit configured to adjust the classification results according to the regression results,
wherein, when selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, the acquisition unit is configured to select one candidate box from the plurality of candidate boxes according to the adjusted classification results.

The apparatus according to any one of claims 28 to 38, further comprising a training unit configured to take the obtained detection box of the target object in the detection frame as a predicted detection box, and to train the neural network, the first convolutional layer, and the second convolutional layer based on labeling information of the detection frame and the predicted detection box.

The apparatus according to claim 39, wherein the labeling information of the detection frame includes the position and size of a labeled detection box of the target object in the detection frame, and
wherein the training unit is configured to adjust weight values of the neural network, the first convolutional layer, and the second convolutional layer according to the difference between the position and size of the labeled detection box and the position and size of the predicted detection box.

An electronic device comprising the target detection apparatus according to any one of claims 28 to 40.

An electronic device, comprising:
a memory configured to store executable instructions; and
a processor configured to communicate with the memory and execute the executable instructions so as to complete the operations of the method according to any one of claims 1 to 27.

A computer storage medium storing computer-readable instructions, wherein the operations of the method according to any one of claims 1 to 27 are implemented when the instructions are executed.