JP7292492B2 - Object tracking method and device, storage medium and computer program

Info

Publication number: JP7292492B2
Application number: JP2022504275A
Authority: JP
Inventors: 飛王; 光啓陳; 晨銭
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2020-04-28
Filing date: 2021-04-16
Publication date: 2023-06-16
Anticipated expiration: 2041-04-16
Also published as: TWI769787B; WO2021218671A1; TW202141424A; CN111539991B; CN111539991A; KR20220024986A; JP2022542566A

Description

The present invention relates to the field of computer vision, and in particular to an object tracking method and apparatus, a storage medium, and a computer program.

There is a growing demand for analyzing the motion trajectories of objects using multi-object tracking. Multi-object tracking first obtains the positions of multiple objects through object detection and then performs single-object tracking on each object.

The processing time of such multi-object tracking grows linearly with the number of objects in the scene. For example, if a scene contains N objects, where N is a positive integer, multi-object tracking must run single-object tracking inference N times, so the processing time increases to N times the time required for single-object tracking. The larger N is, the longer multi-object tracking takes. This approach therefore demands high computing power from the device and is time-consuming.

The present invention provides an object tracking method and apparatus, a storage medium, and a computer program.

A first aspect of embodiments of the present invention provides an object tracking method. The method includes: acquiring a plurality of scene images corresponding to the same scene; performing feature extraction processing and target part detection on each of the plurality of scene images to obtain feature information of each scene image and positions of a plurality of target parts in each scene image; obtaining, from the feature information of each scene image, target feature information corresponding to each of the positions of the plurality of target parts; and identifying, based on the obtained target feature information corresponding to each of the positions of the plurality of target parts, a plurality of same objects appearing in the plurality of scene images, where each scene image includes some or all of the plurality of same objects.

In some optional embodiments, performing feature extraction processing and target part detection on each of the plurality of scene images to obtain the feature information of each scene image and the positions of the plurality of target parts in each scene image includes: extracting a first feature map of each of the plurality of scene images; and performing target part detection on the first feature map of each scene image to obtain the positions of the plurality of target parts in each scene image, and performing feature extraction processing on the first feature map of each scene image to obtain a multi-dimensional second feature map. Obtaining, from the feature information of each scene image, the target feature information corresponding to each of the positions of the plurality of target parts includes: obtaining, from the multi-dimensional second feature map, a target feature vector corresponding to each of the positions of the plurality of target parts.
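The lookup of target feature vectors in the second feature map can be illustrated with a minimal sketch. It assumes, purely for illustration, that the second feature map is stored as nested lists indexed `[row][column][channel]`, that each target part position is a center point `(x, y)` in image pixels, and that the feature map is downsampled by a hypothetical `stride`. None of these details are stated in the text above.

```python
from typing import List, Tuple

# Assumed layout: feature_map[row][col] is a list of channel values.
FeatureMap = List[List[List[float]]]


def gather_target_vectors(
    feature_map: FeatureMap,
    positions: List[Tuple[int, int]],
    stride: int = 1,
) -> List[List[float]]:
    """Return one feature vector per target part position.

    positions are (x, y) pixel coordinates; stride is the assumed
    downsampling factor between the image and the feature map.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    vectors = []
    for x, y in positions:
        # Map the pixel position to a feature-map cell and clamp to bounds.
        col = min(max(x // stride, 0), width - 1)
        row = min(max(y // stride, 0), height - 1)
        vectors.append(list(feature_map[row][col]))
    return vectors
```

Because the vectors are read from a feature map computed once for the whole image, the cost of this step is a lookup per target part rather than a separate network inference per object.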

In some optional embodiments, identifying the plurality of same objects appearing in the plurality of scene images based on the obtained target feature information corresponding to each of the positions of the plurality of target parts includes: using the pieces of target feature information respectively corresponding to every two adjacent scene images among the plurality of scene images to obtain the similarity between the target parts in the two adjacent scene images; and identifying, based on the similarity between the target parts in the two adjacent scene images, the plurality of same objects appearing in different scene images.

In some optional embodiments, the two adjacent scene images are a first scene image and a second scene image. Using the pieces of target feature information respectively corresponding to every two adjacent scene images among the plurality of scene images to obtain the similarity between the target parts in the two adjacent scene images includes: determining the similarity between each of N target feature vectors in the first scene image and each of M target feature vectors in the second scene image; and obtaining an N×M similarity matrix based on these similarities, where N and M are positive integers greater than or equal to 2, and each value in the similarity matrix represents the similarity between a first target part in the first scene image and a second target part in the second scene image.
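The construction of the N×M similarity matrix can be sketched as follows. The passage above does not fix a similarity measure, so cosine similarity is used here as an assumption, and the function names are illustrative.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def similarity_matrix(
    first_vectors: List[List[float]],
    second_vectors: List[List[float]],
) -> List[List[float]]:
    """Return an N x M matrix; entry [i][j] compares the i-th target part
    of the first scene image with the j-th target part of the second."""
    return [
        [cosine_similarity(u, v) for v in second_vectors]
        for u in first_vectors
    ]
```

Row i of the result holds the similarities between the i-th target feature vector of the first scene image and all M target feature vectors of the second scene image.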

In some optional embodiments, identifying the plurality of same objects appearing in the different scene images based on the similarity between the target parts in the two adjacent scene images includes: determining, based on the similarity matrix, a maximum similarity value among the similarities between each first target feature vector of the N target feature vectors and the M target feature vectors; when the maximum similarity value is greater than a predetermined threshold, determining, among the M target feature vectors, the second target feature vector corresponding to the maximum similarity value; and treating the object to which the first target part corresponding to the first target feature vector in the first scene image belongs and the object to which the second target part corresponding to the second target feature vector in the second scene image belongs as the same object.
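The row-wise maximum and threshold test described above can be sketched as below. The matrix is assumed to be a list of N rows with M similarities each; the function name and the dictionary return type are illustrative choices, not taken from the text.

```python
from typing import Dict, List


def match_targets(
    similarity: List[List[float]],
    threshold: float,
) -> Dict[int, int]:
    """Map each first-image target index to its best second-image index.

    For every row, the column with the maximum similarity is selected,
    and the pair is kept only if that maximum exceeds the threshold.
    """
    matches: Dict[int, int] = {}
    for i, row in enumerate(similarity):
        if not row:
            continue  # No candidates in the second scene image.
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] > threshold:
            matches[i] = j
    return matches
```

For example, `match_targets([[0.9, 0.2], [0.3, 0.4]], 0.5)` keeps only the pair `0 -> 0`, since the second row's maximum does not exceed the threshold. Note that this sketch follows the per-row rule literally and does not prevent two rows from selecting the same column; a one-to-one constraint would be an additional design choice.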

In some optional embodiments, performing feature extraction processing and target part detection on each of the plurality of scene images to obtain the feature information of each scene image and the positions of the plurality of target parts in each scene image includes: extracting the first feature map of each of the plurality of scene images via a backbone network of a feature detection model; and performing target part detection on the first feature map of each scene image via a part detection branch of the feature detection model to obtain the positions of the plurality of target parts in each scene image, and performing feature extraction processing on the first feature map of each scene image via a feature extraction branch of the feature detection model to obtain the multi-dimensional second feature map.

In some optional embodiments, the method further includes: inputting a plurality of sample scene images corresponding to the same scene into an initial neural network model, and obtaining sample feature vectors, output by the initial neural network model, corresponding to each of the positions of a plurality of target parts in each sample scene image; determining, based on the object identifiers respectively corresponding to the plurality of target parts annotated in each sample scene image, a first similarity between the sample feature vectors corresponding to the positions of target parts having the same object identifier in every two adjacent sample scene images, and/or a second similarity between the sample feature vectors corresponding to the positions of target parts having different object identifiers; and performing supervised training on the initial neural network model, based on the object identifiers respectively corresponding to the plurality of target parts annotated in each sample scene image and on at least one of the first similarity and the second similarity, to obtain the feature detection model.

In some optional embodiments, performing supervised training on the initial neural network model, based on the object identifiers respectively corresponding to the plurality of target parts annotated in each sample scene image and on at least one of the first similarity and the second similarity, to obtain the feature detection model includes: taking the difference between a first similarity reference value and the first similarity as a first loss function; taking the difference between a second similarity reference value and the second similarity as a second loss function; and training the initial neural network model based on at least one of the first loss function and the second loss function to obtain the feature detection model.

The first similarity reference value is the similarity reference value between sample feature vectors corresponding to target parts annotated with the same object identifier in every two adjacent sample scene images, and the second similarity reference value is the similarity reference value between sample feature vectors corresponding to target parts annotated with different object identifiers in every two adjacent sample scene images.
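The two loss terms can be illustrated with a small numeric sketch. Several details here are assumptions that the passage above does not state: the reference values default to 1.0 for same-identifier pairs and 0.0 for different-identifier pairs, the difference is taken as an absolute value, and differences are averaged over all pairs.

```python
from typing import List


def similarity_loss(similarities: List[float], reference: float) -> float:
    """Mean absolute difference between a reference value and each
    observed similarity (absolute value and mean are assumptions)."""
    if not similarities:
        return 0.0
    return sum(abs(reference - s) for s in similarities) / len(similarities)


def total_loss(
    first_similarities: List[float],
    second_similarities: List[float],
    first_reference: float = 1.0,   # assumed reference for same identifier
    second_reference: float = 0.0,  # assumed reference for different identifiers
) -> float:
    """Sum of the first and second loss terms; either list may be empty,
    reflecting training on at least one of the two losses."""
    first = similarity_loss(first_similarities, first_reference)
    second = similarity_loss(second_similarities, second_reference)
    return first + second
```

Under these assumptions, minimizing the first term pushes feature vectors of the same object in adjacent sample scene images toward each other, and minimizing the second term pushes feature vectors of different objects apart.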

In some optional embodiments, the method further includes determining whether the motion trajectory, within a predetermined time period, of at least one of the plurality of same objects appearing in the plurality of scene images matches a target motion trajectory.

In some optional embodiments, the plurality of scene images correspond to a classroom scene, the objects include teaching subjects, and the target motion trajectory includes at least one motion trajectory specified for the teaching subjects in a teaching task.

A second aspect of embodiments of the present invention provides an object tracking apparatus. The apparatus includes: an acquisition module for acquiring a plurality of scene images corresponding to the same scene; a processing module for performing feature extraction processing and target part detection on each of the plurality of scene images to obtain feature information of each scene image and positions of a plurality of target parts in each scene image; a feature information determination module for obtaining, from the feature information of each scene image, target feature information corresponding to each of the positions of the plurality of target parts; and an object determination module for identifying, based on the obtained target feature information corresponding to each of the positions of the plurality of target parts, a plurality of same objects appearing in the plurality of scene images, where each scene image includes some or all of the plurality of same objects.

A third aspect of embodiments of the present invention provides a computer-readable storage medium. The storage medium stores a computer program, and the computer program is used to execute the object tracking method according to any implementation of the first aspect.

A fourth aspect of embodiments of the present invention provides an object tracking apparatus. The object tracking apparatus includes a processor and a memory for storing instructions executable by the processor, where the processor is configured to invoke the executable instructions stored in the memory to implement the object tracking method according to any implementation of the first aspect.

A fifth aspect of embodiments of the present invention provides a computer program. When the computer program is executed by a processor, the object tracking method according to any implementation of the first aspect can be carried out.

The technical solutions according to embodiments of the present invention can provide the following advantageous effects.

In embodiments of the present invention, there is no need to first identify the multiple objects in each of two adjacent scene images and then, for each object in the earlier scene image, run single-object tracking inference against the multiple objects contained in the later scene image. Instead, single-frame inference is performed on each individual scene image to obtain the target feature information corresponding to the positions of the multiple target parts, and the single-frame inference results are matched to obtain the multiple same objects in every two adjacent scene images, thereby achieving multi-object tracking. Moreover, even if the current scene contains multiple objects, inference is performed on the scene image as a whole, so the duration of the overall multi-object tracking procedure is independent of the number of objects in the scene image. Tracking time therefore does not increase from running single-object tracking inference one object at a time as the number of objects grows. This greatly saves computing resources, shortens multi-object tracking time, and effectively improves the detection efficiency of multi-object tracking.

It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory and do not limit the present invention.

The drawings herein are incorporated in and constitute a part of this specification. They illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the invention. The drawings, in order, are as follows.
A flowchart of an object tracking method according to one exemplary embodiment of the present invention.
A flowchart of another object tracking method according to one exemplary embodiment of the present invention.
A flowchart of another object tracking method according to one exemplary embodiment of the present invention.
A flowchart of another object tracking method according to one exemplary embodiment of the present invention.
A flowchart of another object tracking method according to one exemplary embodiment of the present invention.
A schematic structural diagram of a feature detection model according to one exemplary embodiment of the present invention.
A schematic diagram of an inference procedure for multi-object tracking according to one exemplary embodiment of the present invention.
A flowchart of another object tracking method according to one exemplary embodiment of the present invention.
A schematic diagram of a training scenario for a feature detection model according to one exemplary embodiment of the present invention.
A flowchart of another object tracking method according to one exemplary embodiment of the present invention.
A block diagram of an object tracking apparatus according to one exemplary embodiment of the present invention.
A schematic structural diagram of an object tracking apparatus according to one exemplary embodiment of the present invention.

Exemplary embodiments are described in detail here, with examples shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present invention as detailed in the appended claims.

The terms used in the present invention are for the purpose of describing particular embodiments only and are not intended to limit the invention. The singular forms "a", "said", and "the" used in the present invention and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, and so on are used in the present invention to describe various kinds of information, the information is not limited by these terms. The terms are used only to distinguish pieces of information of the same type from one another. For example, without departing from the scope of the present invention, first information may be called second information and, similarly, second information may be called first information. Depending on the context, the word "if" as used herein may, for example, be interpreted as "when", "at the time of", or "in response to a particular circumstance".

Embodiments of the present invention provide a multi-object tracking scheme that is applicable, by way of example, to terminal devices in different scenes. The scenes include, but are not limited to, classrooms, locations where surveillance cameras are installed, and other indoor or outdoor scenes in which multiple objects need to be tracked. The terminal device may be any device with a camera, or may be an external imaging device. The terminal device may capture a plurality of scene images successively in the same scene, or may capture a video stream directly and use a plurality of images in the video stream as the plurality of scene images.

Further, the terminal device performs feature extraction processing and target part detection on each of the acquired scene images and, based on the feature information of each scene image and the positions of the plurality of target parts in each scene image, obtains from the feature information of each scene image the target feature information corresponding to each of the positions of the plurality of target parts, thereby identifying the plurality of same objects appearing in the plurality of scene images.

For example, in a classroom, the terminal device may be teaching multimedia equipment with a camera installed in the classroom, including but not limited to a teaching projector or classroom monitoring equipment. The terminal device acquires a plurality of scene images of the classroom and performs feature extraction processing and target part detection on each of them to obtain the feature information of each scene image and the positions of the plurality of target parts in each scene image. By obtaining, from the feature information of each scene image, the target feature information corresponding to each of the positions of the plurality of target parts, the terminal device identifies the plurality of same objects appearing in the plurality of scene images, thereby achieving multi-object tracking. The objects in this scene may include, but are not limited to, teaching subjects such as students. The target parts may include, but are not limited to, human face parts and human body parts.

As another example, one or more surveillance cameras may be installed in a subway or railway station, and a plurality of scene images of the station may be acquired via the surveillance cameras. The objects in this scene may include passengers, suitcases carried by passengers, staff, and so on. With the technical solution of embodiments of the present invention, the plurality of same objects appearing in the plurality of scene images can be identified even in scenes with heavy foot traffic such as subway or railway stations, thereby achieving multi-object tracking.

By way of example, the multi-object tracking scheme of embodiments of the present invention is also applicable to cloud servers in different scenes. The cloud server may be provided with an external camera, which may capture a plurality of scene images successively in the same scene, or may capture a video stream directly and use a plurality of images in the video stream as the plurality of scene images. The captured scene images may be sent to the cloud server via a router or gateway. The cloud server performs feature extraction processing and target part detection on each scene image to obtain the feature information of each scene image and the positions of the plurality of target parts in each scene image, obtains from the feature information of each scene image the target feature information corresponding to each of the positions of the plurality of target parts, and then identifies the plurality of same objects appearing in the plurality of scene images.

For example, an external camera is installed in a classroom, captures a plurality of scene images in the classroom, and sends them to the cloud server via a router or gateway, and the cloud server executes the object tracking method described above.

In embodiments of the present invention, after the terminal device or cloud server identifies the plurality of same objects appearing in the plurality of scene images, it may mark the same object with the same recognition box and output the marked scene images. For example, in two adjacent output scene images, object 1 in the scene is marked with a red recognition box, object 2 with a green recognition box, object 3 with a blue recognition box, and so on, so that the same objects in the current scene are shown more clearly. Alternatively, same and different objects may be distinguished by the object identifiers associated with the recognition boxes. For example, if one output scene image contains three recognition boxes with object identifiers 1, 2, and 3, and the adjacent scene image contains two recognition boxes with object identifiers 1 and 3, it can be determined that the recognition boxes with object identifier 1 in the two scene images correspond to the same object, that the recognition boxes with object identifier 3 also correspond to the same object, and that the recognition boxes with object identifiers 1 and 3 correspond to different objects.
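Carrying object identifiers from one scene image to the adjacent one can be sketched as follows. The `matches` mapping (index in the earlier image to index in the later image), the fresh-identifier policy for unmatched boxes, and the three-color palette are all illustrative assumptions rather than details given in the text.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical palette: identifier 1 is red, 2 is green, 3 is blue.
PALETTE = ["red", "green", "blue"]


def propagate_ids(
    prev_ids: List[int],
    matches: Dict[int, int],
    num_current: int,
    next_id: int,
) -> Tuple[List[int], int]:
    """Assign identifiers to the recognition boxes of the later image.

    Matched boxes inherit the identifier of their counterpart in the
    earlier image; unmatched boxes receive fresh identifiers.
    """
    current: List[Optional[int]] = [None] * num_current
    for i, j in matches.items():
        current[j] = prev_ids[i]
    result: List[int] = []
    for assigned in current:
        if assigned is None:
            assigned = next_id
            next_id += 1
        result.append(assigned)
    return result, next_id


def color_for(object_id: int) -> str:
    """Pick a box color so the same identifier always gets the same color."""
    return PALETTE[(object_id - 1) % len(PALETTE)]
```

With `prev_ids = [1, 2, 3]` and `matches = {0: 0, 2: 1}`, the two boxes of the later image receive identifiers 1 and 3, which mirrors the three-box and two-box example in the preceding paragraph.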

In addition, the terminal device or cloud server may determine the motion trajectory, within a predetermined time period, of at least one of the plurality of same objects and analyze whether that motion trajectory matches a target motion trajectory.

For example, when the current scene is a classroom and the objects include teaching subjects, the target motion trajectory may include, but is not limited to, at least one motion trajectory specified for the teaching subjects in a teaching task, such as moving from the current position to another position designated by the teacher (the other position may be the podium, the blackboard, or the position of another classmate). Alternatively, the target motion trajectory may include remaining at the same position. Based on the motion trajectories of the teaching subjects, the teacher can conduct teaching activities more effectively.
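The two classroom trajectories mentioned above (moving to a designated position, and remaining at the same position) can be checked with a simple sketch. The text does not define what it means for a trajectory to match, so the distance-based criteria, the `radius` tolerance, and the function names below are assumptions; a trajectory is taken to be a time-ordered list of `(x, y)` positions of one tracked object.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def reaches_target(trajectory: List[Point], target: Point, radius: float) -> bool:
    """True if the last observed position lies within radius of target."""
    if not trajectory:
        return False
    x, y = trajectory[-1]
    return math.hypot(x - target[0], y - target[1]) <= radius


def stays_in_place(trajectory: List[Point], radius: float) -> bool:
    """True if every position stays within radius of the first position."""
    if not trajectory:
        return False
    x0, y0 = trajectory[0]
    return all(math.hypot(x - x0, y - y0) <= radius for x, y in trajectory)
```

For instance, `reaches_target([(0, 0), (3, 4)], (3, 5), 1.5)` holds because the final position is one unit away from the designated position, while `stays_in_place([(0, 0), (10, 0)], 2)` does not.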

As another example, when the current scene is a subway or railway station equipped with surveillance cameras and the objects include, but are not limited to, passengers, the target motion trajectory may include, but is not limited to, a designated dangerous or improper motion trajectory, such as moving from the platform to the position of the rails, or moving over or under a ticket gate. Based on the passengers' motion trajectories, staff can manage the station more effectively and prevent dangerous behavior or fare evasion.

The above merely illustrates scenes to which the present invention is applicable; other indoor settings or scenes in which action type recognition must be performed quickly also fall within the protection scope of the present invention.

FIG. 1 shows an object tracking method according to an exemplary embodiment, which includes the following steps.

In step 101, a plurality of scene images corresponding to the same scene are acquired.

In embodiments of the present invention, a plurality of scene images of the same scene may be captured in succession, or a video stream may be captured and multiple images from the video stream used as the plurality of scene images. The scene of the present invention includes, but is not limited to, any scene in which multi-object tracking is needed, such as a classroom or a location where surveillance cameras are installed.

In step 102, feature extraction and target part detection are performed on each of the plurality of scene images to obtain the feature information of each scene image and the positions of a plurality of target parts in each scene image.

In embodiments of the present invention, performing feature extraction on each scene image means extracting feature information from each scene image. The feature information may include, but is not limited to, color features, texture features, and shape features. Color features are global features that describe the surface color attributes of the object corresponding to the image. Texture features are also global features, describing the surface texture attributes of the object corresponding to the image. Shape features have two kinds of representation: contour features and region features. Contour features of an image mainly concern the outer boundary of the object, while region features relate to the shape of the image region.

In embodiments of the present invention, one target part may correspond to one object, but this is not limiting; multiple target parts may also correspond to one object. Target parts may include, but are not limited to, face parts and/or body parts. A body part may be the person's entire body or a designated part of the body, such as a hand or a foot. The position of a target part may be represented at least by the center position of the recognition frame of that target part. For example, when the target parts include a face part, the position of the target part may be represented by the center position of the face recognition frame. The recognition frame of a target part may be implemented, for example, as the bounding rectangle of the target part.
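As a minimal illustration of representing a target part by the center of its recognition frame, the sketch below computes the center of an axis-aligned bounding rectangle. The `Box` type, its field names, and the pixel-coordinate convention are assumptions made for this example and do not come from the specification.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned recognition frame in pixel coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center (x, y) of the recognition frame."""
    return ((box.x_min + box.x_max) / 2.0, (box.y_min + box.y_max) / 2.0)


# Example: a face recognition frame spanning x in [40, 100] and y in [20, 80].
face_box = Box(x_min=40, y_min=20, x_max=100, y_max=80)
print(box_center(face_box))  # (70.0, 50.0)
```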

In step 103, from the feature information of each scene image, target feature information corresponding to each of the positions of the plurality of target parts is obtained.

In embodiments of the present invention, each scene image contains a plurality of target parts. Based on the obtained feature information of each scene image, feature extraction is performed on the pixels of the regions containing the target parts, and the target feature information corresponding to each of the positions of the plurality of target parts is determined. As an example, the target feature information corresponding to each of the pixels contained in the region of each target part may be obtained from the feature information of each scene image by convolution or similar processing.

In step 104, a plurality of identical objects appearing in the plurality of scene images are identified based on the obtained target feature information corresponding to each of the positions of the plurality of target parts. Each scene image includes some or all of the plurality of identical objects.

In the above embodiment, target feature information corresponding to the positions of a plurality of target parts is obtained in each scene image, and by matching this target feature information across the plurality of scene images, the plurality of identical objects appearing in the plurality of scene images can be identified.

In the above embodiment, after the objects in each pair of adjacent scene images have been identified, there is no need to perform single-object tracking inference, for each object in the earlier scene image, among the objects contained in the later scene image. Instead, single-frame inference is performed on each individual scene image to obtain the target feature information corresponding to the positions of the plurality of target parts, and the single-frame inference results of each pair of adjacent scene images are matched to obtain the identical objects in those two images, thereby achieving multi-object tracking. Even if the current scene contains multiple objects, inference is performed on the scene image as a whole, so the duration of the entire multi-object tracking procedure is independent of the number of objects contained in the scene image. Tracking time therefore does not grow with the number of objects, as it would if single-object tracking inference were performed one object at a time. This greatly saves computing resources, shortens multi-object tracking time, and effectively improves the detection efficiency of multi-object tracking.

In some optional embodiments, as shown in FIG. 2, step 102 may include the following steps.

In step 102-1, a first feature map of each of the plurality of scene images is extracted.

In embodiments of the present invention, the image features of each scene image may be extracted through a pre-trained neural network model to obtain the first feature map. The neural network model may be, but is not limited to, a model such as the Visual Geometry Group Network (VGG Net).

In step 102-2, target part detection is performed on the first feature map of each scene image to obtain the positions of the plurality of target parts in each scene image, and feature extraction is performed on the first feature map of each scene image to obtain a multi-dimensional second feature map.

In embodiments of the present invention, the target parts may include face parts and/or body parts. Face parts and/or body parts may be detected in the first feature map of each scene image through a Region Proposal Network (RPN), to determine the face region corresponding to a face part and/or the body region corresponding to a body part. The face region may be marked with a face recognition frame, and the body region with a body recognition frame. As an example, the center position of the face recognition frame may be taken as the position of the face part. Similarly, the center position of the body recognition frame may be taken as the position of the body part.

Furthermore, feature extraction may be performed on the first feature map of each scene image, extracting the multiple kinds of feature information contained in the first feature map through different channels, thereby obtaining a multi-dimensional second feature map. As an example, the size of the second feature map may be the same as that of the first feature map, and the dimension value of the second feature map is the predetermined number of channels corresponding to each scene image.

Correspondingly, step 103 may include the following.

In the multi-dimensional second feature map, a target feature vector corresponding to each of the positions of the plurality of target parts is obtained.

In embodiments of the present invention, the target feature information is used to represent the feature information corresponding to each of the pixels in each of the regions of the plurality of target parts contained in the second feature map of any dimension. The target parts may include face parts and/or body parts.

Within the regions of the plurality of target parts contained in the second feature map of any dimension, the feature information corresponding to any single pixel can constitute a one-dimensional feature vector. To facilitate the subsequent similarity calculation, one or more of these feature vectors may be selected to represent the feature information of the region of the target part (that is, the target feature information). In embodiments of the present invention, the feature vector corresponding to the pixel at the position of the target part may be selected and used as the target feature vector corresponding to the position of the target part in the second feature map of that dimension. The position of the target part may include the center position of the face recognition frame and/or the center position of the body recognition frame.

Furthermore, to improve the accuracy of the subsequent matching of target parts, for the second feature map of at least one dimension of the multi-dimensional second feature map, the feature information corresponding to the pixels at the positions of the plurality of target parts may be obtained, yielding the target feature vectors corresponding to each of those positions. As an example, a target feature vector corresponding to each of the positions of the plurality of target parts can be obtained for the second feature map of every dimension, so that the dimension value of the target feature vector equals that of the second feature map. For example, if the dimension value of the second feature map is C, the dimension value of the target feature vector is also C.

In the above embodiment, the whole procedure performed in sequence on the entire scene image (feature extraction, target part detection, and determination of the target feature vector corresponding to each of the positions of the plurality of target parts) is a single-frame inference on a single scene image, and is therefore independent of how many objects the image contains. Subsequently, the target feature vectors corresponding to the object positions in each pair of adjacent scene images are matched, so there is no need to perform single-object tracking inference separately for each object. Even when a scene image contains many objects, the matching procedure can be completed in a single pass. The object tracking method of the present invention is thus independent of the number of objects in the scene image, and tracking time does not increase as the number of objects increases. This greatly saves computing resources, shortens multi-object tracking time, and effectively improves the detection efficiency of multi-object tracking.

In some optional embodiments, as shown in FIG. 3, step 104 may include the following steps.

In step 104-1, the similarity between the target parts in each pair of adjacent scene images is obtained using the plurality of target feature information corresponding to each of the two adjacent scene images among the plurality of scene images.

In embodiments of the present invention, the plurality of target feature information corresponding to the plurality of target parts has already been determined from the feature information of each scene image. Similarity calculation can be performed using the plurality of target feature information corresponding to each of the two adjacent scene images, to obtain the similarity between the target parts in each pair of adjacent scene images.

In step 104-2, the plurality of identical objects appearing in the different scene images are identified based on the similarity between the target parts in each pair of adjacent scene images.

In embodiments of the present invention, in each pair of adjacent scene images, the objects to which the target parts with the highest similarity belong may be regarded as the same object appearing in different scene images.

In the above embodiment, the plurality of identical objects appearing in different scene images can be identified based on the similarity between the target parts in each pair of adjacent scene images. Multi-object tracking is thereby achieved, the tracking procedure is independent of the number of objects, and practicability is high.
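To illustrate how pairwise matches between adjacent scene images could be chained into per-object tracks, the following is a minimal sketch. The function name `link_tracks`, the input layout (a count of detected target parts per image plus one index-to-index mapping per adjacent pair), and the policy of giving every unmatched target part a new identifier are all assumptions for illustration; the specification does not prescribe this bookkeeping.

```python
from typing import Dict, List, Sequence


def link_tracks(counts: Sequence[int],
                pair_matches: Sequence[Dict[int, int]]) -> List[List[int]]:
    """Assign a track identifier to every target part in every scene image.

    counts[t] is the number of target parts detected in image t.
    pair_matches[t] maps an index in image t to its matched index in image t + 1.
    Returns track_ids, where track_ids[t][k] is the identifier of part k in image t.
    """
    if len(pair_matches) != max(len(counts) - 1, 0):
        raise ValueError("need exactly one match mapping per adjacent image pair")
    track_ids: List[List[int]] = []
    next_id = 0
    for t, count in enumerate(counts):
        ids = [-1] * count
        if t > 0:
            # Propagate identifiers from the previous image along the matches.
            for prev_idx, cur_idx in pair_matches[t - 1].items():
                ids[cur_idx] = track_ids[t - 1][prev_idx]
        # Any part without a match starts a new track (assumed policy).
        for k in range(count):
            if ids[k] == -1:
                ids[k] = next_id
                next_id += 1
        track_ids.append(ids)
    return track_ids


# Three images with 2, 3 and 2 detected target parts.
tracks = link_tracks([2, 3, 2], [{0: 1, 1: 0}, {0: 0, 2: 1}])
print(tracks)  # [[0, 1], [1, 0, 2], [1, 2]]
```

Reading the identifiers image by image yields, for each object, the sequence of positions that forms its motion trajectory.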

In some optional embodiments, each pair of adjacent scene images consists of a first scene image T_0 and a second scene image T_1.

As shown in FIG. 4, step 104-1 above may include the following steps.

In step 104-11, the similarity between each of the N target feature vectors in the first scene image and the M target feature vectors in the second scene image is determined.

In embodiments of the present invention, the target feature information is used to represent the feature information corresponding to each of the pixels in each of the regions of the plurality of target parts contained in the second feature map of any dimension. The target parts may include face parts and/or body parts.

Based on the target feature information, within the regions of the plurality of target parts contained in the second feature map of any dimension, the feature information corresponding to any single pixel can constitute a one-dimensional feature vector. To facilitate the subsequent similarity calculation, one or more of these feature vectors may be selected to represent the feature information of the region of the target part. In embodiments of the present invention, the feature vector corresponding to the pixel at the position of the target part may be selected and used as the target feature vector corresponding to the position of the target part in the second feature map of that dimension. The position of the target part may include the center position of the face recognition frame and/or the center position of the body recognition frame.

In the procedure for determining similarity, the similarity between each of the N target feature vectors in the first scene image and the M target feature vectors in the second scene image, for each pair of adjacent scene images, may be determined. N and M are positive integers greater than or equal to 2. That is, the similarity between each of the plurality of target feature vectors in the first scene image and the plurality of target feature vectors in the second scene image is determined.

In one possible implementation, the cosine similarity between target feature vectors may be determined when determining similarity. The similarity between any target feature vector in the first scene image and any target feature vector in the second scene image is evaluated by calculating the cosine of the angle between them.
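The cosine of the angle between two target feature vectors u and v is (u · v) / (|u| |v|). The following is a minimal sketch in plain Python; the function name and the convention of returning 0.0 when either vector is all zeros are assumptions for this example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Parallel vectors score close to 1, orthogonal vectors score 0.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```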

In step 104-12, an N×M similarity matrix is obtained based on the similarities between each of the N target feature vectors in the first scene image and the M target feature vectors in the second scene image.

In embodiments of the present invention, the value of any entry (dimension) of the similarity matrix represents the similarity between a first target part in the first scene image and a second target part in the second scene image. N and M may or may not be equal.

In the above embodiment, an N×M similarity matrix is obtained by determining the similarity between each of the N target feature vectors in the first scene image and the M target feature vectors in the second scene image, so that the similarity between any first target part in the first scene image and any second target part in the second scene image can be represented by the similarity matrix. This is easy to implement and highly practicable.
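Steps 104-11 and 104-12 can be sketched as building an N×M matrix whose entry (i, j) is the cosine similarity between the i-th target feature vector of the first scene image and the j-th target feature vector of the second. This is a minimal plain-Python sketch; the function names and the toy vectors are illustrative assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 for a zero vector (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(first: Sequence[Vector],
                      second: Sequence[Vector]) -> List[List[float]]:
    """Return the N x M matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in second] for u in first]


first = [[1.0, 0.0], [0.0, 1.0]]                # N = 2 vectors from the first image
second = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # M = 3 vectors from the second image
A = similarity_matrix(first, second)
for row in A:
    print([round(value, 3) for value in row])
```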

In some optional embodiments, a bipartite graph algorithm may be employed for step 104-2. Under the condition that a spatial distance constraint is satisfied, the plurality of identical objects appearing in the different scene images are identified based on the similarity between the target parts in each pair of adjacent scene images.

The bipartite graph algorithm refers to the following: in a bipartite graph whose left vertices are X and whose right vertices are Y, each edge X_i Y_j between a left vertex and a right vertex is given a weight w_ij, and a matching that maximizes the sum of the w_ij is sought. In embodiments of the present invention, X_i corresponds to one of the N target feature vectors in the first scene image, Y_j corresponds to one of the M target feature vectors in the second scene image, and the weight w_ij corresponds to the similarity. In the present invention, the N target feature vectors need to be matched with the second target feature vectors such that the similarity is maximized, so that the identical objects in the current pair of adjacent scene images can finally be identified.
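The maximum-weight bipartite matching described above can be illustrated with a brute-force search that tries every one-to-one assignment and keeps the one with the largest total weight. This is only a sketch for small N and M and is not claimed to be the algorithm used in the specification; a practical system would more likely use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment` with `maximize=True`). The function name and the example weights are assumptions.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def max_weight_matching(weights: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return pairs (i, j) of a one-to-one matching that maximizes the sum of weights[i][j].

    Exhaustive search, so only suitable for small matrices.
    """
    n = len(weights)
    m = len(weights[0]) if n else 0
    if n == 0 or m == 0:
        return []
    if n <= m:
        # Assign a distinct column to every row.
        best = max(permutations(range(m), n),
                   key=lambda cols: sum(weights[i][j] for i, j in enumerate(cols)))
        return [(i, j) for i, j in enumerate(best)]
    # More rows than columns: assign a distinct row to every column.
    best = max(permutations(range(n), m),
               key=lambda rows: sum(weights[i][j] for j, i in enumerate(rows)))
    return sorted((i, j) for j, i in enumerate(best))


W = [[0.90, 0.80, 0.10],
     [0.85, 0.20, 0.30]]
print(max_weight_matching(W))  # [(0, 1), (1, 0)]
```

In this example the globally best matching pairs X_0 with Y_1 even though w_00 is the largest entry in row 0, because doing so lets X_1 take Y_0 and gives a larger total.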

In embodiments of the present invention, the condition of satisfying the spatial distance constraint includes that the dimension of the similarities between the N target feature vectors and the M target feature vectors does not exceed N×M.

In one possible implementation, to further improve the accuracy of multi-object tracking, it is necessary to ensure not only that the similarity is the maximum but also that this maximum similarity exceeds a predetermined threshold.

As shown in FIG. 5, step 104-2 may include the following steps.

In step 104-21, based on the similarity matrix, the maximum similarity is determined from the similarities between a first target feature vector among the N target feature vectors and each of the M target feature vectors.

In embodiments of the present invention, the first target feature vector is any one of the N target feature vectors determined in the first scene image. The similarities between this first target feature vector and each target feature vector in the second scene image may be obtained from the similarity matrix, and the maximum similarity may be determined from among them.

Suppose the similarity matrix is an N×M matrix A whose first row is (a₁₁, a₁₂, a₁₃), that is, the similarities between the first target feature vector and the M second target feature vectors are a₁₁, a₁₂ and a₁₃, respectively. The maximum among them (assumed here to be a₁₁) can then be identified.

In step 104-22, if the maximum similarity is greater than a predetermined threshold, the second target feature vector corresponding to the maximum similarity is determined among the M target feature vectors.

In embodiments of the present invention, the second target feature vector is the target feature vector, among the M target feature vectors contained in the second scene image, that corresponds to the maximum similarity.

To further ensure the accuracy of multi-object tracking, it is necessary to ensure that the maximum similarity is greater than the predetermined threshold.

In step 104-23, the object to which the first target part corresponding to the first target feature vector in the first scene image belongs, and the object to which the second target part corresponding to the second target feature vector in the second scene image belongs, are regarded as the same object.

In embodiments of the present invention, only when the maximum similarity is greater than the predetermined threshold are the object to which the first target part corresponding to the first target feature vector in the first scene image belongs, and the object to which the second target part corresponding to the second target feature vector in the second scene image belongs, regarded as the same object.

If the maximum similarity is less than or equal to the predetermined threshold, the object to which the first target part corresponding to the first target feature vector in the first scene image belongs may be considered to have no identical object in the second scene image.

Steps 104-21 to 104-23 above are repeated, the number of repetitions being the number N of target feature vectors contained in the first scene image, so that all identical objects appearing in both the first scene image and the second scene image can finally be identified.
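Steps 104-21 to 104-23, repeated once per row of the similarity matrix, can be sketched as follows. The function name, the dictionary return type, and the example threshold of 0.5 are assumptions for illustration; the specification does not give a concrete threshold value.

```python
from typing import Dict, Sequence


def match_by_row_maximum(similarity: Sequence[Sequence[float]],
                         threshold: float) -> Dict[int, int]:
    """Map each first-image index i to the second-image index j of its best match.

    A row is matched only if its maximum similarity is strictly greater than
    the threshold; otherwise the object is treated as absent from the second image.
    """
    matches: Dict[int, int] = {}
    for i, row in enumerate(similarity):      # repeated N times, once per row
        if not row:
            continue
        # Step 104-21: find the maximum similarity in row i.
        j = max(range(len(row)), key=row.__getitem__)
        # Step 104-22: accept it only if it exceeds the threshold.
        if row[j] > threshold:
            # Step 104-23: parts i and j belong to the same object.
            matches[i] = j
    return matches


A = [[0.92, 0.30, 0.11],
     [0.25, 0.40, 0.35]]
print(match_by_row_maximum(A, threshold=0.5))  # {0: 0}
```

Note that taking each row's maximum independently could, in principle, map two rows to the same column; a one-to-one assignment such as the bipartite matching described earlier avoids that.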

In the above embodiment, according to the similarity matrix, the two objects whose target parts are most similar in each pair of adjacent scene images may be regarded as the same object. Multi-object tracking is thereby achieved, and practicability is high.

In some optional embodiments, after the plurality of scene images are acquired, at least two of them may be input to a pre-trained feature detection model. The feature detection model performs feature extraction and target part detection on each of the scene images to obtain the feature information of each scene image and the positions of the plurality of target parts in each scene image, and, based on the positions of the plurality of target parts in each scene image, obtains from the feature information of each scene image the plurality of target feature information corresponding to the plurality of target parts.

The structure of the feature detection model is shown in FIG. 6. A plurality of scene images are input to the feature detection model, which first performs feature extraction on each scene image through a backbone network to obtain the first feature map of each scene image.

Furthermore, through the part detection branch of the feature detection model, target part detection is performed on the first feature map of each scene image to obtain the positions of the plurality of target parts in each scene image; and through the feature extraction branch of the feature detection model, feature extraction is performed on the first feature map of each scene image to obtain a multi-dimensional second feature map. The objects may include persons, and the target parts may include face parts and/or body parts. The feature extraction branch may be formed by at least one convolutional layer connected in series. The size of the second feature map is the same as that of the first feature map, so the positions of the plurality of target parts are the same in the second feature map of every dimension. The dimension value of the second feature map equals the predetermined number of channels corresponding to each scene image.

Furthermore, a plurality of target feature vectors corresponding to the positions of the plurality of target parts may be obtained in the multi-dimensional second feature map. The position of a target part may be represented by the center position of the face recognition frame and/or the center position of the body recognition frame. The dimension value of the target feature vector is the same as that of the second feature map. Suppose the center coordinates of a face recognition frame are (x, y), and the second feature map obtained by the feature extraction branch has the same size as the first feature map, both H×W, where H and W are the length and width of the image respectively; the dimension value of the second feature map is C, where C is the predetermined number of channels corresponding to each scene image. Since the feature corresponding to the center position (x, y) of the face recognition frame can be obtained in every channel, the dimension value of the target feature vector is C.
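The lookup described above amounts to reading, in each of the C channels, the value at the center position (x, y), which yields a C-dimensional target feature vector. The sketch below uses nested Python lists indexed as [channel][y][x]; that layout, the function name, and the assumption that (x, y) is already an integer coordinate on the H×W feature map are illustrative choices, not details from the specification.

```python
from typing import List, Sequence, Tuple

# Second feature map indexed as feature_map[channel][y][x] (assumed layout).
FeatureMap = Sequence[Sequence[Sequence[float]]]


def target_feature_vector(feature_map: FeatureMap,
                          center: Tuple[int, int]) -> List[float]:
    """Collect the value at (x, y) from every channel into one C-dimensional vector."""
    x, y = center
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("center position lies outside the feature map")
    return [channel[y][x] for channel in feature_map]


# Toy second feature map with C = 3 channels and H x W = 2 x 2.
feature_map = [[[c * 10 + y * 2 + x for x in range(2)] for y in range(2)]
               for c in range(3)]
print(target_feature_vector(feature_map, center=(1, 0)))  # [1, 11, 21]
```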

本発明の実施例では、前記多次元の第２特徴マップにおいて前記複数の目標部位の位置に対応する複数の目標特徴ベクトルを抽出した後、第１シーン画像におけるＮ個の目標特徴ベクトルのそれぞれと第２シーン画像におけるＭ個の目標特徴ベクトルとの類似度を特定することにより、類似度行列を取得し、当該類似度行列に基づいて、前記異なるシーン画像に現れた複数の同じオブジェクトを特定してもよい。特定方式は、上記ステップ１０４－２の方式と同じであるため、ここで繰り返し説明しない。 In an embodiment of the present invention, after extracting a plurality of target feature vectors corresponding to the positions of the plurality of target parts in the multidimensional second feature map, each of the N target feature vectors in the first scene image and obtaining a similarity matrix by identifying similarities to the M target feature vectors in a second scene image, and identifying a plurality of identical objects appearing in the different scene images based on the similarity matrix; may The identification method is the same as the method in step 104-2 above, and will not be described repeatedly here.

As shown in FIG. 7, the first scene image T_0 and the second scene image T_1 may each be input into the feature detection model to obtain N target feature vectors and M target feature vectors, respectively. Further, a bipartite graph algorithm may be used to match the extracted features of the target parts, subject to a spatial distance constraint, thereby identifying the same objects appearing in T_0 and T_1.
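The text does not say which bipartite graph algorithm is used or how the spatial distance constraint is defined. The sketch below is therefore only one possible reading: it treats a pair as admissible when its similarity reaches `min_similarity` and the two center positions are at most `max_distance` apart, and then finds a matching with augmenting paths (Kuhn's algorithm). That algorithm maximizes the number of matched pairs, not the total similarity. All names and thresholds are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Set, Tuple

Point = Tuple[float, float]


def match_with_distance_limit(
    similarity: Sequence[Sequence[float]],
    first_centers: Sequence[Point],
    second_centers: Sequence[Point],
    min_similarity: float,
    max_distance: float,
) -> Dict[int, int]:
    """Map indices in the first image to indices in the second image."""
    n, m = len(first_centers), len(second_centers)
    # Keep only pairs that satisfy both the similarity and the distance limit.
    candidates: List[List[int]] = []
    for i in range(n):
        row = [
            j for j in range(m)
            if similarity[i][j] >= min_similarity
            and math.dist(first_centers[i], second_centers[j]) <= max_distance
        ]
        row.sort(key=lambda j: similarity[i][j], reverse=True)
        candidates.append(row)

    owner: List[Optional[int]] = [None] * m  # owner[j] = i currently matched to j

    def try_assign(i: int, visited: Set[int]) -> bool:
        for j in candidates[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current owner can move elsewhere.
            if owner[j] is None or try_assign(owner[j], visited):
                owner[j] = i
                return True
        return False

    for i in range(n):
        try_assign(i, set())
    return {owner[j]: j for j in range(m) if owner[j] is not None}


similarity = [[0.9, 0.95, 0.3], [0.2, 0.8, 0.1]]
first_centers = [(10, 10), (100, 100)]
second_centers = [(12, 10), (98, 101), (11, 12)]
matches = match_with_distance_limit(
    similarity, first_centers, second_centers, min_similarity=0.5, max_distance=20
)
```

In this toy input the pair (0, 1) has the highest similarity but is rejected because the two centers are far apart, which is the effect the spatial constraint is meant to have.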

In the above embodiments, single-frame estimation is performed on each scene image, so multi-object tracking can be achieved quickly regardless of how many objects each scene image contains, which effectively improves the detection efficiency of multi-object tracking.

In some optional embodiments, as shown in FIG. 8, the method may further include the following steps.

In step 100-1, a plurality of sample scene images corresponding to the same scene are input into an initial neural network model, and sample feature vectors respectively corresponding to the positions of a plurality of target parts in each sample scene image, as output by the initial neural network model, are obtained.

In the embodiments of the present invention, a plurality of existing sample images corresponding to the same scene are used as the input of the initial neural network model, and identical objects and different objects are marked in advance in the plurality of sample images by means of recognition frames and/or corresponding object identifiers.

In the embodiments of the present invention, the initial neural network model may likewise have the structure shown in FIG. 6, including a backbone network, a part detection branch and a feature extraction branch. When the input includes a plurality of sample scene images, sample feature vectors respectively corresponding to the positions of the plurality of target parts in each sample scene image may be obtained.

In step 100-2, based on the object identifiers respectively corresponding to the plurality of target parts marked in each sample scene image, a first similarity is determined between the sample feature vectors corresponding to the positions of target parts having the same object identifier in every two adjacent sample scene images, and/or a second similarity is determined between the sample feature vectors corresponding to the positions of target parts having different object identifiers.

In the embodiments of the present invention, based on the sample feature vectors output by the initial neural network model for the positions of the plurality of target parts in each sample scene image, the first similarity between the sample feature vectors corresponding to the positions of target parts having the same object identifier in every two adjacent sample scene images may be determined, and/or the second similarity between the sample feature vectors corresponding to the positions of target parts having different object identifiers in every two adjacent sample scene images may be determined.

The first similarity and the second similarity may be obtained based on the cosine similarity between sample feature vectors.
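Step 100-2 can be sketched as follows, assuming each sample scene image is represented as a list of `(object_id, feature_vector)` pairs taken at the marked target part positions. This representation and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Labelled = Tuple[str, Sequence[float]]  # (object identifier, sample feature vector)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def labelled_similarities(
    frame_a: Sequence[Labelled], frame_b: Sequence[Labelled]
) -> Tuple[List[float], List[float]]:
    """Split pairwise similarities of two adjacent frames by identifier agreement."""
    first: List[float] = []   # pairs with the same object identifier
    second: List[float] = []  # pairs with different object identifiers
    for id_a, vec_a in frame_a:
        for id_b, vec_b in frame_b:
            value = cosine_similarity(vec_a, vec_b)
            (first if id_a == id_b else second).append(value)
    return first, second


frame_a = [("s1", [1.0, 0.0]), ("s2", [0.0, 1.0])]
frame_b = [("s1", [2.0, 0.0]), ("s2", [1.0, 1.0])]
first, second = labelled_similarities(frame_a, frame_b)
```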

In step 100-3, based on the object identifiers respectively corresponding to the plurality of target parts marked in each sample scene image, supervised training is performed on the initial neural network model according to at least one of the first similarity and the second similarity to obtain the feature detection model.

In the embodiments of the present invention, as shown in FIG. 9, the loss function may be determined so as to raise the first similarity and lower the second similarity. Based on the object identifiers respectively corresponding to the plurality of target parts in every two adjacent sample scene images, the network parameters of the predetermined model are adjusted according to the determined loss function, and the feature detection model is obtained once supervised training is complete.

In the above embodiments, supervised training is performed on the initial neural network model based on the object identifiers respectively corresponding to the plurality of target parts marked in each sample scene image to obtain the feature detection model, which improves the detection performance and generalization performance of the feature detection model.

In some optional embodiments, for step 100-3, the difference between a first similarity reference value and the first similarity may be used as a first loss function. The first similarity reference value is the reference similarity between sample feature vectors corresponding to marked target parts having the same object identifier in the two sample scene images. For example, the first similarity reference value is a cosine similarity between sample feature vectors and may be 1.

The feature detection model is obtained by adjusting the network parameters of the initial neural network model until the first loss function is minimized or a predetermined number of training iterations is reached.

Alternatively, the difference between a second similarity reference value and the second similarity may be used as a second loss function. The second similarity reference value is the reference similarity between sample feature vectors corresponding to marked target parts having different object identifiers in the two sample scene images. For example, the second similarity reference value is a cosine similarity between sample feature vectors and may be 0.

Similarly, the feature detection model is obtained by adjusting the network parameters of the initial neural network model until the second loss function is minimized or a predetermined number of training iterations is reached.

Alternatively, both the first loss function and the second loss function may be used as the loss functions of the initial neural network model, and the feature detection model may be obtained by adjusting the network parameters of the initial neural network model until the two loss functions are minimized or a predetermined number of training iterations is reached.
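The two loss terms can be sketched numerically as below. The text only says that each loss is the difference between a reference value and the corresponding similarity. Averaging the absolute differences over all pairs and summing the two terms with equal weight are assumptions of this sketch, and the function names are hypothetical.

```python
from typing import Sequence


def mean_abs_difference(values: Sequence[float], reference: float) -> float:
    """Average absolute gap between each similarity and the reference value."""
    if not values:
        return 0.0
    return sum(abs(reference - v) for v in values) / len(values)


def first_loss(first_similarities: Sequence[float], reference: float = 1.0) -> float:
    # Same-identifier pairs should approach the reference value 1.
    return mean_abs_difference(first_similarities, reference)


def second_loss(second_similarities: Sequence[float], reference: float = 0.0) -> float:
    # Different-identifier pairs should approach the reference value 0.
    return mean_abs_difference(second_similarities, reference)


first_similarities = [1.0, 0.5]    # pairs sharing an object identifier
second_similarities = [0.0, 0.25]  # pairs with different object identifiers
total_loss = first_loss(first_similarities) + second_loss(second_similarities)
```

Lowering `total_loss` raises the first similarity and lowers the second, which is the training direction the embodiment describes.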

In some optional embodiments, as shown in FIG. 10, the method may further include the following step.

In step 105, it is determined whether the motion trajectory, within a predetermined time period, of at least one of the plurality of identical objects appearing in the plurality of scene images matches a target motion trajectory.

In the embodiments of the present invention, the plurality of scene images correspond to a classroom scene, the objects include teaching subjects, and the target motion trajectory includes at least one motion trajectory designated for the teaching subjects in a teaching task. The at least one motion trajectory designated for the teaching subjects in the teaching task includes, but is not limited to, walking from the current position to another position designated by the teacher. The other position may be the podium, the blackboard or the position of another classmate. Alternatively, the target motion trajectory may also include remaining at the current position without moving.

For example, in a classroom, a plurality of scene images may be captured in succession using teaching multimedia equipment with a camera deployed in the classroom (including, but not limited to, a teaching projector, classroom monitoring equipment, and the like). The motion trajectory of at least one teaching subject included in the classroom scene images is determined. The teaching subject may be a student.

Further, within a set time period, for example one class period taught by the teacher, it may be determined whether the motion trajectory of each teaching subject, for example each student, matches at least one motion trajectory designated for the teaching subject in the teaching task. For example, it is determined whether the student has moved from the current position to the front of the blackboard or to the position of another classmate as instructed by the teacher, or whether the student has stayed at the same position without any movement, for example remaining seated at his or her own seat throughout the class. The result may be displayed via the teaching multimedia equipment so that the teacher can carry out the teaching task more effectively.
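The text does not define how a trajectory is compared with a target trajectory. As one possible illustration, the sketch below represents a trajectory as a list of (x, y) positions collected over the time period. It treats "staying in place" as never leaving a small radius around the first position, and "walking to a designated position" as ending within that radius of the destination. The tolerance and all names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def stayed_in_place(trajectory: Sequence[Point], tolerance: float) -> bool:
    """True if every position stays within `tolerance` of the first one."""
    start = trajectory[0]
    return all(math.dist(start, p) <= tolerance for p in trajectory)


def reached_position(
    trajectory: Sequence[Point], destination: Point, tolerance: float
) -> bool:
    """True if the last position lies within `tolerance` of the destination."""
    return math.dist(trajectory[-1], destination) <= tolerance


def matches_target(
    trajectory: Sequence[Point], destinations: Sequence[Point], tolerance: float
) -> bool:
    """Accept either remaining seated or ending at a designated position."""
    return stayed_in_place(trajectory, tolerance) or any(
        reached_position(trajectory, d, tolerance) for d in destinations
    )


blackboard = (40.0, 10.0)
seated = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
walker = [(5.0, 5.0), (20.0, 10.0), (40.0, 12.0)]
wanderer = [(5.0, 5.0), (30.0, 30.0)]
results = [matches_target(t, [blackboard], tolerance=3.0) for t in (seated, walker, wanderer)]
```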

Corresponding to the above method embodiments, the present invention further provides apparatus embodiments.

FIG. 11 is a block diagram of an object tracking apparatus according to an exemplary embodiment of the present invention. The apparatus includes: an acquisition module 210 configured to acquire a plurality of scene images corresponding to the same scene; a processing module 220 configured to perform feature extraction processing and target part detection on each of the plurality of scene images to obtain feature information of each scene image and the positions of a plurality of target parts in each scene image; a feature information identification module 230 configured to obtain, from the feature information of each scene image, target feature information respectively corresponding to the positions of the plurality of target parts; and an object identification module 240 configured to identify, based on the obtained target feature information respectively corresponding to the positions of the plurality of target parts, a plurality of identical objects appearing in the plurality of scene images. Each scene image includes some or all of the plurality of identical objects.
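The data flow between the four modules can be pictured with the following structural sketch. It is not the apparatus itself: each module is modeled as a plain callable, the stand-in lambdas carry no real detection logic, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Position = Tuple[int, int]
Vector = List[float]


@dataclass
class ObjectTrackingPipeline:
    acquire: Callable[[], Sequence[Any]]                   # role of acquisition module 210
    process: Callable[[Any], Tuple[Any, List[Position]]]   # role of processing module 220
    select: Callable[[Any, List[Position]], List[Vector]]  # role of module 230
    identify: Callable[[List[List[Vector]]], Any]          # role of module 240

    def run(self) -> Any:
        vectors_per_image: List[List[Vector]] = []
        for image in self.acquire():
            # Feature information and target part positions for one scene image.
            feature_info, positions = self.process(image)
            # Target feature information at those positions.
            vectors_per_image.append(self.select(feature_info, positions))
        # Associate identical objects across the scene images.
        return self.identify(vectors_per_image)


# Stand-in callables that only demonstrate the wiring.
pipeline = ObjectTrackingPipeline(
    acquire=lambda: ["frame_0", "frame_1"],
    process=lambda image: ({"source": image}, [(0, 0), (1, 1)]),
    select=lambda info, positions: [[float(x), float(y)] for x, y in positions],
    identify=lambda per_image: [len(vectors) for vectors in per_image],
)
counts = pipeline.run()
```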

In some optional embodiments, the processing module includes: a first processing sub-module configured to extract a first feature map of each of the plurality of scene images; and a second processing sub-module configured to perform target part detection on the first feature map of each scene image to obtain the positions of the plurality of target parts in each scene image, and to perform feature extraction processing on the first feature map of each scene image to obtain a multidimensional second feature map. The feature information identification module includes a feature vector identification sub-module configured to obtain, from the multidimensional second feature map, a plurality of target feature vectors corresponding to the positions of the plurality of target parts.

In some optional embodiments, the object identification module includes: a similarity determination sub-module configured to obtain the similarity between the target parts in every two adjacent scene images of the plurality of scene images by using the plurality of pieces of target feature information respectively corresponding to the two adjacent scene images; and an object identification sub-module configured to identify, based on the similarity between the target parts in every two adjacent scene images, the plurality of identical objects appearing in the different scene images.

In some optional embodiments, the two adjacent scene images are a first scene image and a second scene image, and the similarity determination sub-module is configured to: determine the similarity between each of the N target feature vectors in the first scene image and each of the M target feature vectors in the second scene image; and obtain an N×M-dimensional similarity matrix based on these similarities. N and M are positive integers greater than or equal to 2, and the value at any position in the similarity matrix represents the similarity between a first target part in the first scene image and a second target part in the second scene image.

In some optional embodiments, the object identification sub-module is configured to: determine, based on the similarity matrix and for each first target feature vector among the N target feature vectors, the maximum similarity among its similarities to the M target feature vectors; when the maximum similarity is greater than a predetermined threshold, determine the second target feature vector, among the M target feature vectors, that corresponds to the maximum similarity; and treat the object to which the first target part corresponding to the first target feature vector in the first scene image belongs and the object to which the second target part corresponding to the second target feature vector in the second scene image belongs as the same object.
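The row-wise rule described above (take the maximum similarity in each row and accept it only if it exceeds the threshold) can be sketched as follows. The function name and the threshold value are hypothetical.

```python
from typing import Dict, Sequence


def match_by_max_similarity(
    matrix: Sequence[Sequence[float]], threshold: float
) -> Dict[int, int]:
    """Map each row index to its best column if that similarity exceeds the threshold."""
    matches: Dict[int, int] = {}
    for i, row in enumerate(matrix):
        best_j = max(range(len(row)), key=row.__getitem__)
        if row[best_j] > threshold:
            matches[i] = best_j
    return matches


matrix = [
    [0.92, 0.10, 0.30],  # first target part 0 vs. three second target parts
    [0.20, 0.40, 0.35],  # first target part 1: no similarity exceeds the threshold
]
matches = match_by_max_similarity(matrix, threshold=0.5)
```

Row 1 stays unmatched, which corresponds to an object that has no counterpart in the second scene image. Note that this rule alone does not prevent two rows from selecting the same column. The text does not say how such a conflict is resolved.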

In some optional embodiments, the processing module includes: a third processing sub-module configured to extract a first feature map of each of the plurality of scene images via the backbone network of a feature detection model; and a fourth processing sub-module configured to perform target part detection on the first feature map of each scene image via the part detection branch of the feature detection model to obtain the positions of the plurality of target parts in each scene image, and to perform feature extraction processing on the first feature map of each scene image via the feature extraction branch of the feature detection model to obtain a multidimensional second feature map.

In some optional embodiments, the apparatus further includes: a feature vector identification module configured to input a plurality of sample scene images corresponding to the same scene into a predetermined model and to obtain a plurality of feature vectors, output by the predetermined model, corresponding to the positions of a plurality of target parts in each sample scene image; a similarity determination module configured to determine, based on the object identifiers respectively corresponding to the plurality of target parts marked in every two adjacent sample scene images, a first similarity between the sample feature vectors corresponding to the positions of target parts having the same object identifier in the two adjacent sample scene images, and/or a second similarity between the sample feature vectors corresponding to the positions of target parts having different object identifiers in the two adjacent sample scene images; and a training module configured to perform supervised training on the predetermined model according to at least one of the second similarity and the first similarity, based on the object identifiers respectively corresponding to the plurality of target parts marked in the two adjacent sample scene images, to obtain the feature detection model.

In some embodiments, the following operations are performed: using the difference between a first similarity reference value and the first similarity as a first loss function; using the difference between a second similarity reference value and the second similarity as a second loss function; and training the initial neural network model based on at least one of the first loss function and the second loss function to obtain the feature detection model. The first similarity reference value is the reference similarity between sample feature vectors corresponding to marked target parts having the same object identifier in every two adjacent sample scene images, and the second similarity reference value is the reference similarity between sample feature vectors corresponding to marked target parts having different object identifiers in every two adjacent sample scene images.

In some optional embodiments, the apparatus further includes a motion trajectory determination module configured to determine whether the motion trajectory, within a predetermined time period, of at least one of the plurality of identical objects appearing in the plurality of scene images matches a target motion trajectory.

Because the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments. A person of ordinary skill in the art can understand and implement the embodiments without inventive effort.

The embodiments of the present invention further provide a computer-readable storage medium. The storage medium stores a computer program, and the computer program is used to execute the object tracking method according to any one of the above.

In some optional embodiments, the embodiments of the present invention provide a computer program product. The computer program product includes computer-readable code, and when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the object tracking method according to any one of the above embodiments.

In some optional embodiments, the embodiments of the present invention further provide another computer program product. The computer program product stores computer-readable instructions, and when the instructions are executed, a computer performs the operations of the object tracking method according to any one of the above embodiments.

The above computer program product may be implemented in hardware, software or a combination thereof. In one preferred embodiment, the computer program product is embodied as a computer storage medium; in another preferred embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).

In some optional embodiments, the embodiments of the present invention provide a computer program. When the computer program is executed, a computer performs the operations of the object tracking method according to any one of the above embodiments.

The embodiments of the present invention further provide an object tracking apparatus. The object tracking apparatus includes a processor and a memory for storing instructions executable by the processor, and the processor is configured to implement any one of the above object tracking methods by invoking the executable instructions stored in the memory.

FIG. 12 is a schematic diagram of the hardware structure of an object tracking apparatus according to an embodiment of the present invention. The object tracking apparatus 310 includes a processor 311 and may further include an input device 312, an output device 313 and a memory 314. The input device 312, the output device 313, the memory 314 and the processor 311 are connected to one another via a bus.

The memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM). The memory stores the relevant instructions and data.

The input device inputs data and/or signals, and the output device outputs data and/or signals. The output device and the input device may be independent devices or a single integrated device.

The processor may be one or more processors and may include, for example, one or more central processing units (CPUs). When the processor is a single CPU, the CPU may be a single-core CPU or a multi-core CPU.

The memory stores the program code and data of the network device.

The processor invokes the program code and data in the memory to execute the steps in the above method embodiments. For details, reference may be made to the description of the method embodiments, which is not repeated here.

It can be understood that FIG. 12 shows only a simplified design of an object tracking apparatus. In practical applications, the object tracking apparatus may further include other necessary elements, including but not limited to any number of input/output devices, processors, controllers, memories, and the like. All object tracking apparatuses capable of implementing the embodiments of the present invention fall within the scope of protection of the present invention.

Other embodiments of the present invention will readily occur to those skilled in the art upon considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present invention being indicated by the claims.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

The present application claims priority to Chinese patent application No. 202010352365.6, filed on April 28, 2020 and entitled "Object tracking method and apparatus, storage medium", the entire contents of which are incorporated herein by reference.

Claims

1. An object tracking method, comprising:
acquiring a plurality of scene images corresponding to the same scene;
performing feature extraction processing and target part detection on each scene image of the plurality of scene images to obtain feature information of each scene image and positions of a plurality of target parts in each scene image;
obtaining, from the feature information of each scene image, target feature information respectively corresponding to the positions of the plurality of target parts; and
identifying a plurality of identical objects appearing in the plurality of scene images based on the obtained target feature information respectively corresponding to the positions of the plurality of target parts,
wherein each scene image includes some or all of the plurality of identical objects,
wherein performing feature extraction processing and target part detection on each scene image of the plurality of scene images to obtain the feature information of each scene image and the positions of the plurality of target parts in each scene image includes:
extracting a first feature map of each scene image of the plurality of scene images; and
performing target part detection on the first feature map of each scene image to obtain the positions of the plurality of target parts in each scene image, and performing feature extraction processing on the first feature map of each scene image to obtain a multidimensional second feature map, and
wherein obtaining, from the feature information of each scene image, the target feature information respectively corresponding to the positions of the plurality of target parts includes:
obtaining, from the multidimensional second feature map, target feature vectors respectively corresponding to the positions of the plurality of target parts.

2. The object tracking method according to claim 1, wherein identifying the plurality of identical objects appearing in the plurality of scene images based on the obtained target feature information respectively corresponding to the positions of the plurality of target parts includes:
obtaining a similarity between the target parts in every two adjacent scene images of the plurality of scene images by using the plurality of pieces of target feature information respectively corresponding to the two adjacent scene images; and
identifying the plurality of identical objects appearing in different scene images based on the similarity between the target parts in the two adjacent scene images.

The object tracking method according to claim 2, wherein the two adjacent scene images are a first scene image and a second scene image, and
the step of obtaining the similarity between the target parts in the two adjacent scene images, using the pieces of target feature information respectively corresponding to the two adjacent scene images of the plurality of scene images, comprises:
determining a similarity between each of N target feature vectors in the first scene image and each of M target feature vectors in the second scene image; and
obtaining an N×M-dimensional similarity matrix based on the similarity between each of the N target feature vectors in the first scene image and each of the M target feature vectors in the second scene image,
where N and M are positive integers of 2 or more, and each entry of the similarity matrix represents the similarity between a first target part in the first scene image and a second target part in the second scene image.
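
The construction of the N×M similarity matrix can be sketched as follows. The claim does not name a similarity measure, so cosine similarity is used here only as an assumption; the function names are hypothetical.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(
    first_vectors: List[List[float]], second_vectors: List[List[float]]
) -> List[List[float]]:
    """Entry [i][j] compares target i of the first image with target j of the second."""
    return [
        [cosine_similarity(u, v) for v in second_vectors]
        for u in first_vectors
    ]


# N = 2 target feature vectors in the first scene image,
# M = 3 target feature vectors in the second scene image (toy values).
first_vectors = [[1.0, 0.0], [0.0, 1.0]]
second_vectors = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
matrix = similarity_matrix(first_vectors, second_vectors)
```

Here `matrix` has N rows and M columns, so row i holds the similarities between the i-th target part of the first scene image and every target part of the second scene image.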

The object tracking method according to claim 3, wherein the step of identifying the plurality of identical objects appearing in the different scene images based on the similarity between the target parts in the two adjacent scene images comprises:
identifying, based on the similarity matrix, a maximum similarity value among the similarities between a first target feature vector among the N target feature vectors and the M target feature vectors;
when the maximum similarity value is greater than a predetermined threshold, identifying, among the M target feature vectors, a second target feature vector corresponding to the maximum similarity value; and
treating the object to which the first target part corresponding to the first target feature vector in the first scene image belongs and the object to which the second target part corresponding to the second target feature vector in the second scene image belongs as the same object.
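
The row-wise matching described in this claim can be sketched as below. The threshold value 0.5 and the toy similarity values are assumptions chosen for illustration, and `match_targets` is a hypothetical name. The sketch follows the claim literally and does not resolve the case where two rows select the same column, which the claim does not address.

```python
from typing import Dict, List


def match_targets(matrix: List[List[float]], threshold: float) -> Dict[int, int]:
    """Map each first-image target index to its best second-image target index.

    A pair is kept only when the row maximum is strictly greater than the threshold.
    """
    matches: Dict[int, int] = {}
    for i, row in enumerate(matrix):
        if not row:
            continue  # no candidates in the second scene image
        # Index of the maximum similarity in row i.
        best_j = max(range(len(row)), key=lambda j: row[j])
        if row[best_j] > threshold:
            matches[i] = best_j
    return matches


# Toy 3x3 similarity matrix between two adjacent scene images.
matrix = [
    [0.10, 0.92, 0.30],
    [0.85, 0.20, 0.40],
    [0.35, 0.25, 0.45],
]
matches = match_targets(matrix, threshold=0.5)
```

In this toy input, targets 0 and 1 of the first scene image are matched to targets 1 and 0 of the second scene image, while target 2 has no match because its maximum similarity (0.45) does not exceed the threshold.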

The object tracking method according to any one of the preceding claims, wherein the step of performing feature extraction processing and target part detection on each scene image of the plurality of scene images to acquire feature information of each scene image and positions of a plurality of target parts in each scene image comprises:
extracting the first feature map of each scene image of the plurality of scene images via a backbone network of a feature detection model; and
performing target part detection on the first feature map of each scene image via a part detection branch of the feature detection model to obtain the positions of the plurality of target parts in each scene image, and performing feature extraction processing on the first feature map of each scene image via a feature extraction branch of the feature detection model to obtain the multi-dimensional second feature map.

The object tracking method according to claim 5, further comprising:
inputting a plurality of sample scene images corresponding to the same scene into an initial neural network model, and obtaining sample feature vectors, output by the initial neural network model, corresponding to the positions of a plurality of target parts in each sample scene image;
determining, based on object identifiers respectively corresponding to the plurality of marked target parts in each sample scene image, a first similarity between the sample feature vectors corresponding to the positions of target parts having the same object identifier in two adjacent sample scene images, and/or a second similarity between the sample feature vectors corresponding to the positions of target parts having different object identifiers; and
performing supervised training on the initial neural network model based on at least one of the first similarity and the second similarity, using the object identifiers respectively corresponding to the plurality of marked target parts in each sample scene image, to obtain the feature detection model.

The object tracking method according to claim 6, wherein the step of performing supervised training on the initial neural network model based on at least one of the first similarity and the second similarity, using the object identifiers respectively corresponding to the plurality of marked target parts in each sample scene image, to obtain the feature detection model comprises:
setting a difference between a first similarity reference value and the first similarity as a first loss function;
setting a difference between a second similarity reference value and the second similarity as a second loss function; and
training the initial neural network model based on at least one of the first loss function and the second loss function to obtain the feature detection model,
where the first similarity reference value is a similarity reference value between the sample feature vectors corresponding to marked target parts having the same object identifier in the two adjacent sample scene images, and the second similarity reference value is a similarity reference value between the sample feature vectors corresponding to marked target parts having different object identifiers in the two adjacent sample scene images.
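
The two loss terms can be sketched as follows. The claim only says each loss is a difference between a reference value and a measured similarity; the reference values 1.0 (same identifier) and 0.0 (different identifiers), the use of the absolute difference, and the averaging over pairs are all assumptions made for this illustration, and `similarity_losses` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def similarity_losses(
    similarities: List[List[float]],
    ids_a: Sequence[str],
    ids_b: Sequence[str],
    same_ref: float = 1.0,   # assumed first similarity reference value
    diff_ref: float = 0.0,   # assumed second similarity reference value
) -> Tuple[float, float]:
    """Return (first_loss, second_loss) for two adjacent sample scene images.

    similarities[i][j] compares marked target i of image A with target j of image B.
    """
    first_terms, second_terms = [], []
    for i, id_a in enumerate(ids_a):
        for j, id_b in enumerate(ids_b):
            s = similarities[i][j]
            if id_a == id_b:
                first_terms.append(abs(same_ref - s))   # same object identifier
            else:
                second_terms.append(abs(diff_ref - s))  # different identifiers
    first_loss = sum(first_terms) / len(first_terms) if first_terms else 0.0
    second_loss = sum(second_terms) / len(second_terms) if second_terms else 0.0
    return first_loss, second_loss


# Toy similarities between two marked targets in each of two adjacent images.
similarities = [[0.9, 0.2], [0.1, 0.7]]
ids_a = ["s1", "s2"]
ids_b = ["s1", "s2"]
first_loss, second_loss = similarity_losses(similarities, ids_a, ids_b)
```

Under these assumptions, minimizing the first loss pushes the similarity of same-identifier pairs toward the first reference value, and minimizing the second loss pushes the similarity of different-identifier pairs toward the second reference value. In an actual training setup these quantities would be computed on differentiable tensors rather than Python floats.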

The object tracking method according to any one of claims 1 to 7, further comprising determining whether a motion trajectory, within a predetermined time period, of at least one of the plurality of same objects appearing in the plurality of scene images matches a desired motion trajectory.

The object tracking method according to claim 8, wherein the plurality of scene images correspond to a classroom scene, the object includes a teaching subject, and the desired motion trajectory includes at least one motion trajectory specified for the teaching subject in a teaching task.

An object tracking device, comprising:
an acquisition module for acquiring a plurality of scene images corresponding to the same scene;
a processing module for performing feature extraction processing and target part detection on each scene image of the plurality of scene images to acquire feature information of each scene image and positions of a plurality of target parts in each scene image;
a feature information identification module for acquiring, from the feature information of each scene image, target feature information corresponding to each of the positions of the plurality of target parts; and
an object identification module for identifying a plurality of identical objects appearing in the plurality of scene images based on the acquired target feature information corresponding to each of the positions of the plurality of target parts,
wherein each scene image includes some or all of the plurality of identical objects,
the processing module comprises:
a first processing sub-module for extracting a first feature map of each scene image of the plurality of scene images; and
a second processing sub-module for performing target part detection on the first feature map of each scene image to obtain the positions of the plurality of target parts in each scene image, and performing feature extraction processing on the first feature map of each scene image to obtain a multi-dimensional second feature map, and
the feature information identification module comprises a feature vector identification sub-module for obtaining, from the multi-dimensional second feature map, a target feature vector corresponding to each of the positions of the plurality of target parts.

A computer-readable storage medium storing a computer program, wherein the computer program is used to execute the object tracking method according to any one of claims 1 to 9.

An object tracking device, comprising:
a processor; and
a memory for storing instructions executable by the processor,
wherein the processor is configured to implement the object tracking method according to any one of claims 1 to 9 by invoking the executable instructions stored in the memory.

A computer program which, when executed by a processor, implements the object tracking method according to any one of claims 1 to 9.