WO2021130951A1 - Object tracking device, object tracking method, and recording medium - Google Patents
Object tracking device, object tracking method, and recording medium
- Publication number
- WO2021130951A1 (PCT/JP2019/051088)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- position information
- image
- information
- tracking device
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Definitions
- The present invention relates to the technical field of an object tracking device, an object tracking method, and a recording medium for tracking an object appearing in a plurality of images corresponding to time-series data.
- An object tracking device that tracks an object reflected in an image by using a plurality of images as time-series data taken by a camera or the like is known.
- An example of the object tracking device is described in Patent Documents 1 to 3.
- An example of an algorithm for tracking an object is described in Non-Patent Documents 1 and 2.
- Non-Patent Document 3 can also be mentioned.
- The conventional object tracking device performs, as separate operations independent of each other, an object detection operation for detecting an object appearing in an image and an object matching operation for matching the object O_{t-τ} appearing in the image IM_{t-τ} at time t-τ (where τ denotes a reference period) against the object O_t appearing in the image IM_t at time t. For this reason, the conventional object tracking device needs to perform preprocessing and the like in order to perform the object matching operation after the object detection operation has been performed. As a result, in the conventional object tracking device, the processing cost for tracking the object may be relatively high.
- An object of the present invention is to provide an object tracking device, an object tracking method, and a recording medium capable of solving the above-mentioned technical problems.
- One aspect of the object tracking device comprises: a first generation means for generating, based on first position information regarding the position of an object in a first image taken at a first time and second position information regarding the position of an object in a second image taken at a second time different from the first time, a first feature vector indicating the feature amount of the first position information and a second feature vector indicating the feature amount of the second position information; and a second generation means for generating information obtained by arithmetic processing using the first and second feature vectors as correspondence information indicating a correspondence relationship between the object in the first image and the object in the second image.
- One aspect of the object tracking method comprises: generating, based on first position information regarding the position of an object in a first image taken at a first time and second position information regarding the position of an object in a second image taken at a second time different from the first time, a first feature vector indicating the feature amount of the first position information and a second feature vector indicating the feature amount of the second position information; and generating information obtained by arithmetic processing using the first and second feature vectors as correspondence information indicating the correspondence relationship between the object in the first image and the object in the second image.
- One aspect of the recording medium is a non-transitory recording medium on which a computer program that causes a computer to execute an object tracking method is recorded, wherein the object tracking method includes: generating, based on first position information regarding the position of an object in a first image taken at a first time and second position information regarding the position of an object in a second image taken at a second time different from the first time, a first feature vector indicating the feature amount of the first position information and a second feature vector indicating the feature amount of the second position information; and generating information obtained by arithmetic processing using the first and second feature vectors as correspondence information indicating the correspondence relationship between the object in the first image and the object in the second image.
- The correspondence information can be generated by arithmetic processing using the first and second feature vectors. As will be described in detail later, the object can therefore be tracked at a relatively low processing cost.
- FIG. 1 is a block diagram showing a configuration of an object tracking device of the present embodiment.
- FIG. 2 is a block diagram showing a configuration of a logical functional block realized in an object tracking device for performing an object matching operation and a refinement operation.
- FIG. 3 is a plan view conceptually showing the object position information detected by the object detection operation.
- FIG. 4 is a flowchart showing the flow of the object matching operation.
- FIG. 5 is a plan view conceptually showing the relationship between the feature vector and the similarity matrix.
- FIG. 6 is a flowchart showing the flow of the refinement operation.
- FIG. 7 is a block diagram showing a configuration of an object tracking device of the first modification.
- FIG. 8 is a data structure diagram showing the data structure of the learning DB.
- FIG. 9 is a block diagram showing a configuration of an object tracking device of the second modification.
- FIG. 10 is a plan view showing how the similarity matrix is normalized using the softmax function.
- The object tracking device 1 performs an object tracking operation for tracking at least one object O appearing in each image IM when a plurality of images IM corresponding to time-series data are input.
- The object tracking operation includes, for example, an object detection operation for detecting an object O appearing in the image IM.
- The object tracking operation includes, for example, an object matching operation for matching at least one object O_{t-τ} appearing in the image IM_{t-τ} acquired (for example, taken) at time t-τ (where τ denotes a reference period) against at least one object O_t appearing in the image IM_t acquired at time t.
- The object tracking operation may include, for example, a refinement operation for correcting the detection result of the object O_t appearing in the image IM_t (that is, the result of the object detection operation) using the result of the object matching operation.
- the expression “X and / or Y” is used as an expression that includes both the expression “X and Y” and the expression “X or Y”.
- FIG. 1 is a block diagram showing a configuration of the object tracking device 1 of the present embodiment.
- FIG. 2 is a block diagram showing a configuration of a logical functional block realized in the object tracking device 1 for performing an object matching operation and a refinement operation.
- the object tracking device 1 includes an arithmetic unit 2 and a storage device 3. Further, the object tracking device 1 may include an input device 4 and an output device 5. However, the object tracking device 1 does not have to include at least one of the input device 4 and the output device 5.
- the arithmetic unit 2, the storage device 3, the input device 4, and the output device 5 are connected via the data bus 6.
- The arithmetic unit 2 includes, for example, at least one of a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit).
- the arithmetic unit 2 reads a computer program.
- the arithmetic unit 2 may read the computer program stored in the storage device 3.
- the arithmetic unit 2 may read a computer program stored in a recording medium that is readable by a computer and is not temporary, using a recording medium reading device (not shown).
- The arithmetic unit 2 may acquire (that is, download or read) a computer program from a device (not shown) arranged outside the object tracking device 1 via a communication device (not shown).
- the arithmetic unit 2 executes the read computer program.
- a logical functional block for executing the operation to be performed by the object tracking device 1 (specifically, the above-mentioned object tracking operation) is realized in the arithmetic unit 2. That is, the arithmetic unit 2 can function as a controller for realizing a logical functional block for executing the object tracking operation.
- FIG. 1 shows an example of a logical functional block realized in the arithmetic unit 2 to execute the object tracking operation.
- an object detection unit 21, an object matching unit 22, and a refinement unit 23 are realized as logical functional blocks in the arithmetic unit 2.
- the object detection unit 21 performs an object detection operation.
- the object matching unit 22 performs an object matching operation.
- The object matching unit 22 includes a feature map conversion unit 221, a feature vector conversion unit 222, a feature map conversion unit 223, a feature vector conversion unit 224, and a matrix calculation unit 225.
- The refinement unit 23 performs a refinement operation.
- In order to perform the refinement operation, the refinement unit 23 includes a matrix calculation unit 231, a feature vector conversion unit 232, a feature map conversion unit 233, a residual processing unit 234, a feature map conversion unit 235, a feature map conversion unit 236, and a feature vector conversion unit 237. The object detection operation, the object matching operation, and the refinement operation will be described in detail later.
- the storage device 3 can store desired data.
- the storage device 3 may temporarily store the computer program executed by the arithmetic unit 2.
- the storage device 3 may temporarily store data temporarily used by the arithmetic unit 2 while the arithmetic unit 2 is executing a computer program.
- the storage device 3 may store data that the object tracking device 1 stores for a long period of time.
- the storage device 3 may store an image DB (DataBase) 31 for storing a plurality of image IMs as time-series data taken by a camera (not shown).
- The storage device 3 may store an object detection DB 32 for storing object detection information indicating the result of the object detection operation (that is, information regarding the detection result of the object O appearing in the image IM). Further, the storage device 3 may store an object matching DB 33 for storing object matching information indicating the result of the object matching operation (that is, information regarding the result of matching the object O_{t-τ} appearing in the image IM_{t-τ} against the object O_t appearing in the image IM_t; typically, information indicating the correspondence between the object O_{t-τ} and the object O_t).
- The storage device 3 may include at least one of a RAM (Random Access Memory), a ROM (Read Only Memory), a hard disk device, a magneto-optical disk device, an SSD (Solid State Drive), and a disk array device. That is, the storage device 3 may include a non-transitory recording medium.
- the input device 4 is a device that receives input of information to the object tracking device 1 from the outside of the object tracking device 1.
- a plurality of image IMs as time-series data taken by a camera (not shown) are input to the input device 4.
- the plurality of image IMs input to the input device 4 are stored in the image DB 31 stored in the storage device 3.
- the output device 5 is a device that outputs information to the outside of the object tracking device 1.
- the output device 5 may output information regarding the result of the object tracking operation.
- the output device 5 may output information regarding the result of the object tracking operation as an image.
- the output device 5 may include a display device for displaying an image.
- the output device 5 may output information regarding the result of the object tracking operation as data.
- the output device 5 may include a data output device that outputs data.
- the object tracking operation includes an object detection operation, an object matching operation, and a refinement operation. Therefore, in the following, the object tracking operation will be described in order of the object detection operation, the object matching operation, and the refinement operation.
- the object detection unit 21 reads out the image IM stored in the image DB 31, and performs an object detection operation on the read image IM.
- the object detection unit 21 may detect the object O reflected in the image IM by using the existing method for detecting the object reflected in the image.
- The object detection unit 21 preferably uses an object detection operation that can acquire information on the position of the object O in the image IM (hereinafter referred to as "object position information PI") by detecting the object O appearing in the image IM.
- The object position information PI acquired by the object detection unit 21 is stored in the object detection DB 32 as object detection information indicating the result of the object detection operation by the object detection unit 21.
- the object detection unit 21 detects the object O by using the method (so-called CenterNet) described in Non-Patent Document 3 described above.
- the object detection unit 21 generates a heat map (so-called score map) indicating the center position (Key Point) KP of the object O in the image IM as the object position information PI. More specifically, the object detection unit 21 generates a heat map showing the center position KP of the object O in the image IM for each class of the object O.
- The information indicating the center position KP of the object O in the image IM is map information having a size of H × W × K, where H is the number of pixels in the vertical direction of the image IM, W is the number of pixels in the horizontal direction of the image IM, and K is the number of classes.
- The object detection unit 21 may generate, as the object position information PI, information indicating the size of the detection frame (bounding box) BB of the object O as a score map.
- The information indicating the size of the detection frame BB of the object O may be substantially regarded as information indicating the size of the object O.
- The information indicating the size of the detection frame BB of the object O is, for example, map information having a size of H × W × 2.
- The map information indicating the size of the detection frame BB may also be referred to as a position map because it is a map related to the position.
- The object detection unit 21 may generate, as the object position information PI, information indicating the correction amount (local offset) of the detection frame BB of the object O as a score map.
- The information indicating the correction amount of the detection frame BB of the object O is map information having a size of H × W × 2.
- The map information indicating the correction amount of the detection frame BB may also be referred to as a position map because it is a map related to the position.
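The three maps above can be pictured as arrays. The following is a minimal sketch, assuming NumPy, a heat map of size H × W × K (inferred from the H × W × (K + 4) total stated elsewhere in this document), and a simple channel-wise concatenation; the function name `build_position_info` and the toy sizes are hypothetical and do not come from the patent.

```python
import numpy as np

def build_position_info(heatmap, size_map, offset_map):
    """Stack the three maps into one H x W x (K + 4) array (assumed layout)."""
    h, w, _ = heatmap.shape
    assert size_map.shape == (h, w, 2)
    assert offset_map.shape == (h, w, 2)
    return np.concatenate([heatmap, size_map, offset_map], axis=-1)

H, W, K = 4, 6, 3  # hypothetical toy sizes
heatmap = np.zeros((H, W, K), dtype=np.float32)     # center position KP per class
size_map = np.zeros((H, W, 2), dtype=np.float32)    # detection frame BB size
offset_map = np.zeros((H, W, 2), dtype=np.float32)  # local offset of the frame

# One object of class 1 centered at row 2, column 3.
heatmap[2, 3, 1] = 1.0
size_map[2, 3] = (3.0, 2.0)
offset_map[2, 3] = (0.25, -0.1)

pi = build_position_info(heatmap, size_map, offset_map)
print(pi.shape)
```

The channel order (heat map, then size, then offset) is only one possible convention; any fixed order would give the same overall size.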
- FIG. 3 is a plan view conceptually showing the object position information PI detected by the object detection operation.
- FIG. 3 shows an example in which four objects O (specifically, object O # 1, object O # 2, object O # 3, and object O # 4) are reflected in the image IM.
- In this case, the object detection unit 21 generates, as the object position information PI, information indicating the center position KP of each of the four objects O, information indicating the size of the detection frame BB of each of the four objects O, and information indicating the correction amount of the detection frame BB of each of the four objects O.
- the object detection unit 21 may perform an object detection operation by using an arithmetic model that outputs an object position information PI when an image IM is input.
- An example of such an arithmetic model is one using a neural network (for example, a CNN: Convolutional Neural Network).
- the calculation model may be trained using the training data as described later. That is, the parameters of the calculation model may be optimized to output the appropriate object position information PI.
- the object detection unit 21 may perform the object detection operation by using a method different from the method described in Non-Patent Document 3 (so-called CenterNet). Examples of other methods include a method called Faster R-CNN described in Non-Patent Document 4 and a method called SSD described in Non-Patent Document 5.
- The object matching unit 22 reads out the object position information PI stored in the object detection DB 32, and performs an object matching operation using the read object position information PI.
- In the following, an object matching operation for matching the object O_{t-τ} appearing in the image IM_{t-τ} taken at time t-τ against the object O_t appearing in the image IM_t taken at time t will be described.
- FIG. 4 is a flowchart showing the flow of the object matching operation.
- the object matching unit 22 generates a feature vector CV from the object position information PI (steps S221 to S226).
- the object matching unit 22 generates a feature map CM from the object position information PI in order to generate a feature vector CV from the object position information PI, and then generates a feature vector CV from the feature map CM.
- the feature map CM is a feature map showing the feature amount of the object position information PI for each arbitrary channel.
- For example, a feature map CM having a size of H × W × C is generated from the object position information PI, which is map information having a size of H × W × (K + 4) (where K is the number of classes registered in the object detection DB 32 or the object matching DB 33), and then a feature vector CV having a size of HW × C is generated from the feature map CM having a size of H × W × C.
- the object matching unit 22 may directly generate the feature vector CV from the object position information PI without generating the feature map CM.
- The feature map conversion unit 221, which is a specific example of the "first generation means", acquires (that is, reads) from the object detection DB 32 the object position information PI_{t-τ} regarding the object O_{t-τ} appearing in the image IM_{t-τ} taken at time t-τ (step S221).
- The object position information PI_{t-τ} is a specific example of the first position information. For example, as shown in FIG. 3, when four objects O_{t-τ} appear in the image IM_{t-τ}, the feature map conversion unit 221 acquires from the object detection DB 32 the object position information PI_{t-τ} regarding the four objects O_{t-τ}.
- After that, the feature map conversion unit 221 generates a feature map CM_{t-τ} from the object position information PI_{t-τ} acquired in step S221 (step S222). The feature map conversion unit 221 generates a feature map CM_{t-τ} having a size of H × W × C from the object position information PI_{t-τ}, which is map information having a size of H × W × (K + 4).
- The feature map conversion unit 221 may generate the feature map CM_{t-τ} by using an arithmetic model that outputs the feature map CM when the object position information PI is input.
- An example of such an arithmetic model is one using a neural network (for example, a CNN: Convolutional Neural Network).
- The arithmetic model may be trained using training data as described later. That is, the parameters of the arithmetic model may be optimized to output an appropriate feature map CM (in particular, a feature map CM suitable for generating the similarity matrix AM described later).
- The feature vector conversion unit 222, which is a specific example of the "first generation means", generates the feature vector CV_{t-τ} from the feature map CM_{t-τ} generated in step S222 (step S223).
- The feature vector conversion unit 222 generates a feature vector CV_{t-τ} having a size of HW × C from the feature map CM_{t-τ} having a size of H × W × C.
- The feature vector CV_{t-τ} is a specific example of the first feature vector.
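Steps S222 and S223 amount to a change of shape from H × W × (K + 4) to H × W × C and then to HW × C. The sketch below, assuming NumPy, illustrates only those shapes: the per-pixel linear projection with ReLU is a hypothetical stand-in for the trained CNN (the patent does not specify the network), and the row-major flattening order is likewise an assumption.

```python
import numpy as np

def to_feature_map(pi, weight, bias):
    """Hypothetical stand-in for the CNN: a per-pixel linear projection
    (equivalent to a 1x1 convolution) followed by ReLU.
    Maps H x W x (K + 4) to H x W x C."""
    return np.maximum(pi @ weight + bias, 0.0)

def to_feature_vector(cm):
    """Flatten the two spatial axes in row-major order: H x W x C -> HW x C."""
    h, w, c = cm.shape
    return cm.reshape(h * w, c)

rng = np.random.default_rng(0)
H, W, K, C = 4, 6, 3, 8                   # hypothetical toy sizes
pi = rng.random((H, W, K + 4))            # object position information PI
weight = rng.standard_normal((K + 4, C))  # untrained, illustrative parameters
bias = np.zeros(C)

cm = to_feature_map(pi, weight, bias)     # feature map CM
cv = to_feature_vector(cm)                # feature vector CV
print(cm.shape, cv.shape)
```

With row-major flattening, row `r * W + c` of the feature vector holds the C channel values of pixel (r, c), which is what lets each row be associated with an image position later on.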
- The feature map conversion unit 223, which is a specific example of the "first generation means", acquires from the object detection DB 32 the object position information PI_t regarding the object O_t appearing in the image IM_t taken at time t (step S224).
- The object position information PI_t is a specific example of the second position information.
- The feature map conversion unit 223 generates a feature map CM_t from the object position information PI_t acquired in step S224 (step S225).
- The feature map conversion unit 223 generates a feature map CM_t having a size of H × W × C from the object position information PI_t, which is map information having a size of H × W × (K + 4).
- The contents of the processes in steps S224 to S225 may be the same as those in steps S221 to S222 described above. Therefore, like the feature map conversion unit 221, the feature map conversion unit 223 may generate the feature map CM_t by using an arithmetic model that outputs the feature map CM when the object position information PI is input.
- The feature vector conversion unit 224, which is a specific example of the "first generation means", generates the feature vector CV_t from the feature map CM_t generated in step S225 (step S226).
- The feature vector conversion unit 224 generates a feature vector CV_t having a size of HW × C from the feature map CM_t having a size of H × W × C.
- The content of the process in step S226 may be the same as the content of the process in step S223 described above.
- The feature vector CV_t is a specific example of the second feature vector.
- The matrix calculation unit 225, which is a specific example of the "second generation means", generates a similarity matrix (affinity matrix) AM using the feature vector CV_{t-τ} generated in step S223 and the feature vector CV_t generated in step S226 (step S227).
- That is, the matrix calculation unit 225 generates information obtained by arithmetic processing using the feature vector CV_{t-τ} and the feature vector CV_t as the similarity matrix AM.
- For example, the matrix calculation unit 225 may generate, as the similarity matrix AM, information obtained by calculating the matrix product of the feature vector CV_{t-τ} and the feature vector CV_t (that is, the matrix product itself).
- The matrix product referred to here may typically be a tensor product (in other words, a direct product).
- The matrix product may be a Kronecker product.
- The size of the similarity matrix AM is HW × HW.
- The similarity matrix AM is information indicating the correspondence between the object O_{t-τ} and the object O_t.
- For example, the similarity matrix AM is information indicating that (1) the first object O_{t-τ} among the plurality of objects O_{t-τ} corresponds to the first object O_t among the plurality of objects O_t (that is, both are the same object), (2) the second object O_{t-τ} among the plurality of objects O_{t-τ} corresponds to the second object O_t among the plurality of objects O_t, ..., and (N) the N-th object O_{t-τ} among the plurality of objects O_{t-τ} corresponds to the N-th object O_t among the plurality of objects O_t. Because the similarity matrix AM indicates the correspondence between the object O_{t-τ} and the object O_t, it may be referred to as correspondence information.
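One reading of the matrix product that is consistent with the stated HW × HW size is the product of the HW × C feature vector at time t-τ with the transpose of the HW × C feature vector at time t. The sketch below, assuming NumPy, uses that reading; the transpose, the function name `similarity_matrix`, and the hand-made one-hot features are assumptions for illustration, not the patent's trained features.

```python
import numpy as np

def similarity_matrix(cv_prev, cv_curr):
    """AM[i, j] = dot(cv_prev[i], cv_curr[j]); the result has shape HW x HW."""
    return cv_prev @ cv_curr.T

H, W, C = 2, 3, 4  # hypothetical toy sizes, so HW = 6
cv_prev = np.zeros((H * W, C))  # feature vector at time t - tau
cv_curr = np.zeros((H * W, C))  # feature vector at time t

# Object A: encoded on channel 0, at pixel index 1 at t - tau and index 4 at t.
cv_prev[1, 0] = 1.0
cv_curr[4, 0] = 1.0
# Object B: encoded on channel 2, at pixel index 3 at t - tau and index 5 at t.
cv_prev[3, 2] = 1.0
cv_curr[5, 2] = 1.0

am = similarity_matrix(cv_prev, cv_curr)
print(am.shape)
print(np.argwhere(am != 0))
```

In this toy case only the elements at (1, 4) and (3, 5) are non-zero, that is, the positions where the same object's components from the two times meet.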
- The similarity matrix AM can be regarded as a matrix whose vertical axis corresponds to the vector components of the feature vector CV_{t-τ} and whose horizontal axis corresponds to the vector components of the feature vector CV_t. Therefore, the size of the vertical axis of the similarity matrix AM is HW (that is, the size of the feature vector CV_{t-τ}, which corresponds to the size (number of pixels) of the image IM_{t-τ} taken at time t-τ).
- Similarly, the size of the horizontal axis of the similarity matrix AM is HW (that is, the size of the feature vector CV_t, which corresponds to the size (number of pixels) of the image IM_t taken at time t).
- In other words, the similarity matrix AM can be regarded as a matrix whose vertical axis corresponds to the detection result of the object O_{t-τ} appearing in the image IM_{t-τ} at time t-τ (that is, the detected position of the object O_{t-τ}) and whose horizontal axis corresponds to the detection result of the object O_t appearing in the image IM_t at time t (that is, the detected position of the object O_t).
- At the position where the vector component corresponding to a certain object O_{t-τ} on the vertical axis intersects the vector component corresponding to the same object O_t on the horizontal axis, the element of the similarity matrix AM reacts (typically, has a non-zero value).
- In other words, the element of the similarity matrix AM reacts at the position where the detection result of the object O_{t-τ} on the vertical axis and the detection result of the object O_t on the horizontal axis intersect.
- The similarity matrix AM is typically a matrix in which the value of the element at the position where the vector component corresponding to a certain object O_{t-τ} included in the feature vector CV_{t-τ} intersects the vector component corresponding to the same object O_t included in the feature vector CV_t is the value obtained by multiplying both vector components (that is, a non-zero value), while the values of the other elements are 0.
- For example, the element of the similarity matrix AM reacts at the position where the detection result of the object O#k appearing in the image IM_{t-τ} and the detection result of the object O#k appearing in the image IM_t intersect.
- Conversely, when the element of the similarity matrix AM does not react (typically, is 0), it is presumed that the object O_{t-τ} that appeared in the image IM_{t-τ} does not appear in the image IM_t (for example, it has moved out of the shooting angle of view of the camera).
- The similarity matrix AM can thus be used as information indicating the correspondence between the object O_{t-τ} and the object O_t. That is, the similarity matrix AM can be used as information indicating the result of matching the object O_{t-τ} appearing in the image IM_{t-τ} against the object O_t appearing in the image IM_t.
- The similarity matrix AM can be used as information for tracking, in the image IM_t, the position of the object O_{t-τ} that appeared in the image IM_{t-τ}.
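To make the tracking use concrete, the following sketch reads positions back out of a similarity matrix. It assumes NumPy, row-major pixel indexing (index = row × W + column), and a simple per-row argmax with a threshold; the function name `match_positions` and this decoding rule are hypothetical choices for illustration and are not described in the patent.

```python
import numpy as np

def match_positions(am, width, threshold=0.0):
    """Return {(row, col) at t - tau: (row, col) at t} for every row of AM
    whose strongest element exceeds the threshold (i.e. the element reacts)."""
    matches = {}
    for i in range(am.shape[0]):
        j = int(np.argmax(am[i]))
        if am[i, j] > threshold:
            matches[divmod(i, width)] = divmod(j, width)
    return matches

H, W = 2, 3  # hypothetical toy sizes
am = np.zeros((H * W, H * W))
am[1, 4] = 1.0  # pixel (0, 1) at t - tau corresponds to pixel (1, 1) at t
am[3, 3] = 0.8  # pixel (1, 0) at t - tau stays at pixel (1, 0) at t
# Row 5 is left at zero: no element reacts, so nothing is reported for it.

matches = match_positions(am, W)
print(matches)
```

Rows with no reacting element produce no entry, which mirrors the case where an object from the earlier image is presumed to have left the field of view.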
- the information indicating the similarity matrix AM generated by the matrix calculation unit 225 is stored in the object matching DB 33 as the object matching information indicating the result of the object matching operation by the object matching unit 22.
- The matrix calculation unit 225 may generate, based on the similarity matrix AM, another type of information indicating the correspondence between the object O_{t-τ} and the object O_t, and this other type of information may be stored in the object matching DB 33 as the object matching information.
- The refinement unit 23 reads the object position information PI stored in the object detection DB 32, acquires the similarity matrix AM from the object matching unit 22, and corrects the read object position information PI using the acquired similarity matrix AM.
- In the following, a refinement operation for correcting the object position information PI_t using the similarity matrix AM generated based on the object position information PI_{t-τ} and PI_t will be described.
- FIG. 6 is a flowchart showing the flow of the refinement operation.
- the feature map conversion unit 236 acquires (that is, reads out) the object position information PI t stored in the object detection DB 32 (step S231). Then, feature map conversion unit 236, from the object position information PI t acquired in step S231, generates a feature map CM 't (step S232).
- the feature map conversion unit 223 included in the object matching unit 22 also generates the feature map CM t from the object position information PI t.
- the feature map conversion unit 223 generates the feature map CM t for the purpose of generating the similarity matrix AM.
- feature map conversion unit 236, the purpose of correcting the object position information PI t use affinity matrix AM generated in the object matching operation (i.e., purpose of the refined behavior), a characteristic map CM t ' It is generating. Therefore, the respective learning by the learning operation described later (i.e., which is updated) feature map conversion unit 223 and 236, wherein the map transformation section 223, the feature map was better suited to generate an affinity matrix AM CM t while generating a feature map converter 236 is different in that it generates a feature map CM t 'that is more suitable for correcting the object position information PI t.
- the refinement unit 23 includes a feature map conversion unit 236 separately from the feature map conversion unit 223 included in the object collation unit 22. It should be noted that such feature map conversion units 223 and 236 are constructed as a result of the learning operation described later.
- Through the learning operation described later, the feature map conversion unit 223 is trained to generate a feature map CM t that is more suitable for generating the similarity matrix AM, while the feature map conversion unit 236 is trained to generate a feature map CM' t that is more suitable for correcting the object position information PI t.
- The feature map conversion unit 236 may generate the feature map CM' t by using a calculation model that outputs a feature map CM when object position information PI is input.
- An example of such a calculation model is a calculation model using a neural network (for example, a CNN: Convolutional Neural Network).
- The calculation model may be trained using the training data as described later.
- That is, the parameters of the calculation model may be optimized so that the model outputs a feature map CM' t that is appropriate (in particular, appropriate for correcting the object position information PI t).
- The feature vector conversion unit 237 then generates a feature vector CV' t from the feature map CM' t generated in step S232 (step S233).
- the matrix calculation unit 231 acquires the similarity matrix AM generated by the object matching unit 22 from the object matching unit 22 (step S234).
- Alternatively, the matrix calculation unit 231 may acquire the similarity matrix AM generated by the object matching unit 22 from the object matching DB 33 (step S234).
- The matrix calculation unit 231 generates a feature vector CV_res by using the feature vector CV' t generated in step S233 and the similarity matrix AM acquired in step S234 (step S235). Specifically, the matrix calculation unit 231 generates, as the feature vector CV_res, information obtained by arithmetic processing using the feature vector CV' t and the similarity matrix AM.
- For example, the matrix calculation unit 231 may generate, as the feature vector CV_res, information obtained by calculating the matrix product of the feature vector CV' t and the similarity matrix AM.
- After that, the feature vector conversion unit 232 generates a feature map CM_res having the same size as the feature map CM from the feature vector CV_res generated in step S235 (step S236). That is, the feature vector conversion unit 232 generates a feature map CM_res having a size of H × W × C from the feature vector CV_res having an arbitrary size. For example, the feature vector conversion unit 232 may generate the feature map CM_res by converting the feature vector CV_res into the feature map CM_res.
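- The data flow of steps S233 to S236 can be sketched in NumPy as follows. This is only an illustrative sketch under assumptions that the description does not state: the feature vector CV' t is taken to be the feature map CM' t flattened to shape (H·W, C), the similarity matrix AM is taken to have shape (H·W, H·W) and to be multiplied from the left, and the function name `refine_features` is hypothetical.

```python
import numpy as np

def refine_features(cm_prime: np.ndarray, am: np.ndarray) -> np.ndarray:
    """Return CM_res (H, W, C) from CM'_t (H, W, C) and AM (H*W, H*W)."""
    h, w, c = cm_prime.shape
    # Step S233 (assumed form): flatten the feature map into a (H*W, C) feature vector.
    cv_prime = cm_prime.reshape(h * w, c)
    # Step S235: weight the feature vector with the similarity matrix (matrix product).
    cv_res = am @ cv_prime
    # Step S236: restore the spatial layout H x W x C.
    return cv_res.reshape(h, w, c)

H, W, C = 4, 5, 3
rng = np.random.default_rng(0)
cm_prime = rng.normal(size=(H, W, C))
# With an identity similarity matrix every position attends only to itself,
# so the output equals the input.
cm_res = refine_features(cm_prime, np.eye(H * W))
print(cm_res.shape, np.allclose(cm_res, cm_prime))
```

In this sketch only the reshaping and the matrix product are shown; the trained conversion units 236, 237 and 232 of the embodiment may do more than a plain reshape.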
- The feature map conversion unit 233 then generates, from the feature map CM_res generated in step S236, object position information PI t_res having the same size as the object position information PI t (step S237). That is, the feature map conversion unit 233 generates, from the feature map CM_res generated in step S236, object position information PI t_res that is map information having a size of H × W × (K + 4).
- For example, the feature map conversion unit 233 may generate the object position information PI t_res by converting the feature map CM_res into the object position information PI t_res.
- The feature map conversion unit 233 may generate the object position information PI t_res by using a calculation model that outputs object position information PI when a feature map CM is input.
- An example of such a calculation model is a calculation model using a neural network (for example, a CNN: Convolutional Neural Network).
- The calculation model may be trained using the training data as described later. That is, the parameters of the calculation model may be optimized to output appropriate object position information PI t_res.
- The process of step S237 may be regarded as substantially equivalent to the process of generating the object position information PI t_res by using an attention mechanism (Attention Mechanism) that uses the similarity matrix AM as a weight. That is, the refinement unit 23 may be regarded as forming at least a part of an attention mechanism that includes the matrix calculation unit 231, the feature vector conversion unit 232, and the feature map conversion unit 233.
- The object position information PI t_res may be used as the refined object position information PI t.
- In other words, the process of step S237 may be regarded as substantially equivalent to the process of correcting (in other words, updating, adjusting or improving) the object position information PI t by using the attention mechanism that uses the similarity matrix AM as a weight.
- However, in the object position information PI t_res generated in step S237, information contained in the original object position information PI t (that is, the object position information PI t that has not undergone the refinement operation) may have been lost.
- This is because the similarity matrix AM, which indicates the portion to which the attention mechanism should pay attention (in the present embodiment, the detection position of the object O), is used as a weight, so that the portion of the object detection information other than the information regarding the detection position of the object O may be lost from the object position information PI t_res. Therefore, in the present embodiment, the refinement unit 23 may further perform a process for suppressing the loss of the information included in the original object position information PI t.
- The residual processing unit 234 may generate object position information PI t_ref by adding the object position information PI t_res generated in step S237 to the original object position information PI t (step S238).
- The object position information PI t_ref has the same size as the object position information PI t. Therefore, the residual processing unit 234 generates object position information PI t_ref that is map information having a size of H × W × (K + 4).
- Specifically, the residual processing unit 234 (i) adds the map information indicating the center position KP of the object O t included in the object position information PI t_res and the map information indicating the center position KP of the object O t included in the original object position information PI t, (ii) adds the map information indicating the size of the detection frame BB of the object O t included in the object position information PI t_res and the map information indicating the size of the detection frame BB of the object O t included in the original object position information PI t, and (iii) adds the map information indicating the correction amount of the detection frame BB included in the object position information PI t_res and the map information indicating the correction amount of the detection frame BB included in the original object position information PI t.
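- The residual addition of step S238 can be illustrated with the following NumPy sketch. The channel layout is an assumption made only for this example (the first K channels hold the center position KP, the next two the size of the detection frame BB, and the last two the correction amount of the detection frame BB), and the function names `split_channels` and `residual_add` are hypothetical.

```python
import numpy as np

def split_channels(pi: np.ndarray, k: int):
    """Split (H, W, K + 4) map information into its three assumed parts."""
    center = pi[..., :k]           # center position KP, K channels (assumed)
    size = pi[..., k:k + 2]        # size of detection frame BB (assumed 2 channels)
    correction = pi[..., k + 2:]   # correction amount of BB (assumed 2 channels)
    return center, size, correction

def residual_add(pi_t: np.ndarray, pi_res: np.ndarray) -> np.ndarray:
    """Step S238: PI_t_ref = PI_t + PI_t_res, element by element."""
    if pi_t.shape != pi_res.shape:
        raise ValueError("PI_t and PI_t_res must have the same size")
    return pi_t + pi_res

H, W, K = 4, 4, 3
rng = np.random.default_rng(0)
pi_t = rng.random((H, W, K + 4))
pi_res = rng.random((H, W, K + 4))
pi_ref = residual_add(pi_t, pi_res)

# Adding the whole maps is the same as adding parts (i), (ii) and (iii) separately.
for whole, a, b in zip(split_channels(pi_ref, K),
                       split_channels(pi_t, K),
                       split_channels(pi_res, K)):
    assert np.allclose(whole, a + b)
print(pi_ref.shape)  # (4, 4, 7)
```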
- The process of step S238 may be regarded as substantially equivalent to the process of generating the object position information PI t_ref by using a residual attention mechanism (Residual Attention Mechanism) that includes the residual processing unit 234. That is, the refinement unit 23 may be regarded as forming at least a part of a residual attention mechanism that includes the matrix calculation unit 231, the feature vector conversion unit 232, the feature map conversion unit 233, and the residual processing unit 234.
- the object position information PI t _ref is the refined object position information PI t , and also includes the information included in the original object position information PI t. In this case, the object position information PI t _ref may be used as a refined object position information PI t.
- the refinement unit 23 does not have to perform the process (process of step S238) for suppressing the loss of the information contained in the original object position information PI t. In this case, the refinement unit 23 does not have to include the residual processing unit 234.
- The feature map conversion unit 235 may generate, from the object position information PI t_ref, object position information PI t_ref' having the same size as the object position information PI t_ref, namely H × W × (K + 4) (step S239).
- The object position information PI t_ref' has the same size as the object position information PI t. Therefore, the feature map conversion unit 235 generates the object position information PI t_ref', which is map information having a size of H × W × (K + 4).
- The feature map conversion unit 235 may generate the object position information PI t_ref' by applying a convolution process (for example, a process performed by a convolution layer constituting a neural network) to the object position information PI t_ref.
- That is, the feature map conversion unit 235 may convert the object position information PI t_ref into the object position information PI t_ref' by inputting the object position information PI t_ref into a convolution layer constituting the neural network.
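- A size-preserving convolution of the kind described for step S239 can be sketched as follows. The description does not specify the kernel size or the number of layers, so this example assumes a single 1 × 1 convolution (a per-position mixing of channels); the function name `conv1x1` and the random weights, which stand in for learned parameters, are hypothetical.

```python
import numpy as np

def conv1x1(pi_ref: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Apply a 1x1 convolution: mix channels independently at every position.

    pi_ref: (H, W, C), weight: (C, C), bias: (C,). The output keeps shape (H, W, C).
    """
    return np.einsum("hwc,cd->hwd", pi_ref, weight) + bias

H, W, K = 4, 4, 3
channels = K + 4
rng = np.random.default_rng(0)
pi_ref = rng.random((H, W, channels))
weight = rng.normal(scale=0.1, size=(channels, channels))  # stand-in for learned parameters
bias = np.zeros(channels)

pi_ref_prime = conv1x1(pi_ref, weight, bias)
print(pi_ref_prime.shape == pi_ref.shape)  # True
```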
- the object position information PI t _ref' may be used as the refined object position information PI t.
- the refinement unit 23 does not have to include the feature map conversion unit 235.
- When the object position information PI t_res, the object position information PI t_ref, or the object position information PI t_ref' is used as the refined object position information PI t, the object position information PI t stored in the object detection DB 32 may be replaced with the object position information PI t_res, the object position information PI t_ref, or the object position information PI t_ref'.
- In this case, the object matching unit 22 can use the refined object position information PI t, in place of the object position information PI t before refinement, to perform the object matching operation for matching the object O t reflected in the image IM t taken at time t with the object O t+τ reflected in the image IM t+τ taken at time t+τ. Furthermore, the object position information PI t+τ regarding the object O t+τ is refined by using the result of the object matching operation for matching the object O t with the object O t+τ.
- Likewise, the object matching unit 22 can use the refined object position information PI t+τ, in place of the object position information PI t+τ before refinement, to perform the object matching operation for matching the object O t+τ reflected in the image IM t+τ taken at time t+τ with the object O t+2τ reflected in the image IM t+2τ taken at time t+2τ.
- In this way, the refined object position information PI is input to the object matching unit 22 in a chained manner. Therefore, it is expected that the accuracy of the matching of the object O will be improved and the processing cost required for the object tracking operation will be reduced as compared with the case where the object matching operation is performed using the object position information PI before refinement.
- The object tracking device 1 of the present embodiment can generate the similarity matrix AM from the object position information PI t-τ and PI t, and can refine the object position information PI t by using the similarity matrix AM. That is, the object tracking device 1 can perform the object matching operation without performing preprocessing or the like on the object position information PI t-τ and PI t, which are the outputs of the object detection unit 21. In other words, the object tracking device 1 can use the object position information PI t-τ and PI t, which are the outputs of the object detection unit 21, as they are as inputs of the object matching unit 22 that performs the object matching operation.
- Similarly, the object tracking device 1 can perform the refinement operation without performing preprocessing or the like on the similarity matrix AM, which is the output of the object matching unit 22.
- That is, the object tracking device 1 can use the similarity matrix AM, which is the output of the object matching unit 22, and the object position information PI t, which is the output of the object detection unit 21, as they are as inputs of the refinement unit 23 that performs the refinement operation. Therefore, unlike the object tracking device of the comparative example, in which the object detection operation, the object matching operation, and the refinement operation are performed as three operations independent of one another, the object tracking device 1 does not need to perform preprocessing or the like for each of the object detection operation, the object matching operation, and the refinement operation. As a result, the object tracking device 1 can track the object O at a relatively low processing cost.
- It can be said that the object tracking device 1 focuses on the fact that the object tracking operation and the operation performed by a general attention mechanism are substantially similar to each other, and performs the refinement operation by using the information generated in the object matching operation. Specifically, in the object tracking operation, as described above, a process of detecting the object O, a process of collating the object O, and a process of refining the detection result of the object O are performed. On the other hand, in a general attention mechanism, a process of extracting the feature of the object O, a process of calculating the weight, and a process of refining the extraction result of the feature of the object O are performed.
- The object tracking device 1 uses the process of calculating the weight in the attention mechanism also as the process of collating the object O in the object tracking operation.
- In other words, the object tracking device 1 uses the process of collating the object O in the object tracking operation also as the process of calculating the weight in the attention mechanism. Therefore, it can be said that the object tracking device 1 realizes the object detection operation, the object matching operation, and the refinement operation by using the attention mechanism.
- Specifically, it can be said that the object tracking device 1 performs the object tracking operation by using an attention mechanism in which the object position information PI t-τ is used as a query, the object position information PI t is used as a key and a value, and the similarity matrix AM is used as a weight.
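- The query, key, value and weight roles described above can be illustrated with a minimal NumPy sketch of a generic attention computation. It is not the embodiment's network: the features are assumed to be matrices with one row per position, the weight is taken to be the plain product of query and transposed key without normalization, and the function name `attention` is hypothetical.

```python
import numpy as np

def attention(query: np.ndarray, key: np.ndarray, value: np.ndarray):
    """Return (AM, output): AM = Q K^T is used as the weight on the values."""
    am = query @ key.T       # similarity between every query row and every key row
    return am, am @ value    # weighted combination of the value rows

# Toy features: three positions with one-hot features. The content seen at
# time t - tau appears at permuted positions relative to time t.
cv_t = np.eye(3)             # key and value (derived from PI t in the text)
cv_prev = cv_t[[2, 0, 1]]    # query (derived from PI t-tau in the text)

am, out = attention(cv_prev, cv_t, cv_t)
print(am)   # each row marks which position at time t matches the query row
```

Because the toy features are one-hot, the similarity matrix in this example is a permutation matrix, which makes the correspondence between positions at the two times easy to read off.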
- The object tracking device 1 can perform an object tracking operation including an object detection operation, an object matching operation, and a refinement operation as a series of operations. That is, the object tracking device 1 can perform the object tracking operation by using a single network structure (a so-called end-to-end single-stage network structure) that performs the object detection operation, the object matching operation, and the refinement operation. Since the object tracking operation (particularly, the refinement operation) can be performed using the attention mechanism in this way, the object tracking device 1 can refine the object position information PI more appropriately than in the case where the object tracking operation is performed without using the attention mechanism (that is, the case where the network structure for performing the object detection operation, the network structure for performing the object matching operation, and the network structure for performing the refinement operation are used separately and independently).
- Suppose that N t-τ objects are reflected in the image taken at time t-τ and N t objects are reflected in the image taken at time t. In this case, the object tracking device of the comparative example needs to individually collate each of the N t-τ objects with each of the N t objects. That is, the object tracking device of the comparative example needs to repeat the operation of collating two objects N t-τ × N t times. Therefore, there is a technical problem that the processing cost for tracking the object may be high.
- On the other hand, the object tracking device 1 of the present embodiment can use the feature vectors CV t-τ and CV t to perform the object matching operation for collating the object O t-τ reflected in the image IM t-τ taken at time t-τ with the object O t reflected in the image IM t taken at time t. Therefore, the object tracking device 1 can track the object O at a relatively low processing cost.
- Specifically, even when N t-τ objects O t-τ (where N t-τ is an integer of 1 or more) are reflected in the image IM t-τ and N t objects O t (where N t is an integer of 1 or more) are reflected in the image IM t, the object tracking device 1 does not have to individually collate each of the N t-τ objects O t-τ with each of the N t objects O t. That is, the object tracking device 1 does not have to repeat the operation of collating two objects O t-τ and O t N t-τ × N t times.
- In other words, by performing the process of generating the similarity matrix AM using the feature vectors CV t-τ and CV t only once, the object tracking device 1 can complete the collation of each of the N t-τ objects O t-τ with each of the N t objects O t. Therefore, the object O can be tracked at a relatively low processing cost as compared with the object tracking device of the comparative example.
- In particular, in the object tracking device of the comparative example, the processing cost increases sharply (in proportion to N t-τ × N t) as the number of objects O reflected in the image IM increases, whereas in the object tracking device 1 of the present embodiment, the processing cost is less dependent on the number of objects O reflected in the image IM. Therefore, as the number of objects O reflected in the image IM increases, the effect of reducing the processing cost by the object tracking device 1 increases.
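- The difference in how the cost scales can be illustrated with the following sketch. It assumes, for illustration only, that the feature vectors CV t-τ and CV t have shape (H·W, C) and that the similarity matrix AM is their matrix product; the function names are hypothetical. The number of collations in the comparative example grows with the number of objects, while the size of the single matrix product depends only on H, W and C.

```python
import numpy as np

def pairwise_collation_count(n_prev: int, n_cur: int) -> int:
    """Comparative example: one collation per (previous, current) object pair."""
    return n_prev * n_cur

def similarity_matrix(cv_prev: np.ndarray, cv_cur: np.ndarray) -> np.ndarray:
    """Sketch of the embodiment: one matrix product yields the whole matrix AM."""
    return cv_prev @ cv_cur.T

H, W, C = 8, 8, 16
rng = np.random.default_rng(0)
cv_prev = rng.normal(size=(H * W, C))   # stand-in for CV t-tau
cv_cur = rng.normal(size=(H * W, C))    # stand-in for CV t
am = similarity_matrix(cv_prev, cv_cur)

for n in (2, 10, 50):
    # The collation count grows with n; the shape of AM stays (H*W, H*W).
    print(n, pairwise_collation_count(n, n), am.shape)
```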
- FIG. 7 is a block diagram showing the configuration of the object tracking device 1a of the first modification.
- The object tracking device 1a of the first modification differs from the object tracking device 1 in that it further includes, as a logical functional block realized in the arithmetic unit 2, a learning unit 24a that performs the learning operation of the object detection unit 21, the object matching unit 22, and the refinement unit 23. Further, the object tracking device 1a differs from the object tracking device 1 in that the storage device 3 stores a learning DB 34a for storing learning data 341a for the learning operation. Other features of the object tracking device 1a may be the same as other features of the object tracking device 1. The learning operation is typically performed in advance before the object tracking device 1a actually performs the object tracking operation. However, the learning operation may be performed at a desired timing after the object tracking device 1a starts the object tracking operation.
- The learning data 341a includes, for example, an image IM acquired (for example, captured) at a certain time, as shown in FIG. Further, the learning data 341a includes the object position information PI_label indicating the correct label of the position of the object O reflected in the image IM. In the example shown in FIG., the learning DB 34a stores a plurality of learning data 341a, including learning data 341a that includes the image IM s acquired at time s and the object position information PI s_label regarding the position of the object O s reflected in the image IM s, and learning data 341a that includes the image IM s+τ acquired at time s+τ and the object position information PI s+τ_label regarding the position of the object O s+τ reflected in the image IM s+τ.
- the learning data 341a may include a plurality of object position information PI_labels indicating correct labels for the positions of the plurality of objects O, respectively.
- The learning unit 24a inputs the images IM s and IM s+τ included in the learning data 341a to the object detection unit 21 in order to perform the learning operation.
- As a result, the object detection unit 21 outputs the object position information PI s regarding the position of the object O s and the object position information PI s+τ regarding the position of the object O s+τ.
- The object matching unit 22 generates the similarity matrix AM by using the object position information PI s and the object position information PI s+τ.
- The refinement unit 23 refines the object position information PI s+τ by using the similarity matrix AM.
- The learning unit 24a updates, based on at least a loss function L1 relating to the refined object position information PI s+τ, the learnable calculation model that is used by the object detection unit 21 to perform the object detection operation, by the object matching unit 22 to perform the object matching operation, and/or by the refinement unit 23 to perform the refinement operation. That is, the learning unit 24a updates the calculation model that defines the operation content of at least one of the object detection unit 21, the object matching unit 22, and the refinement unit 23 based on the loss function L1.
- For example, the learning unit 24a may update the calculation model so that the loss function L1 becomes smaller (typically, is minimized).
- the object detection unit 21, the object matching unit 22, and the refinement unit 23 can be realized by a single network structure (that is, a single calculation model). Therefore, when the object detection unit 21, the object matching unit 22, and the refinement unit 23 are realized by a single network structure (that is, a single calculation model), the learning unit 24a performs the single calculation. You may update the model.
- An example of such a calculation model is a calculation model using a neural network (for example, a CNN: Convolutional Neural Network).
- the operation of updating the calculation model may include the operation of updating, determining, or adjusting the parameters of the calculation model.
- the parameters of the calculation model may include at least one of the weights between the nodes included in the neural network, the bias given by each node, and the connection path between the nodes.
- The learning unit 24a repeats the learning operation by sequentially inputting the plurality of images IM included in the plurality of learning data 341a stored in the learning DB 34a into the object detection unit 21 in the order of the times corresponding to the images IM. That is, the learning unit 24a inputs the images IM s and IM s+τ to the object detection unit 21, and updates the calculation model based on the loss function L1 regarding the refined object position information PI s+τ. After that, the learning unit 24a inputs the images IM s+τ and IM s+2τ to the object detection unit 21, and updates the calculation model based on the loss function L1 regarding the refined object position information PI s+2τ. After that, the learning unit 24a repeats the same operation. As a result, the calculation model that defines the operation content of at least one of the object detection unit 21, the object matching unit 22, and the refinement unit 23 is appropriately updated (that is, learned).
- The learning unit 24a may change the time interval between the two times corresponding to the two images IM input to the object detection unit 21.
- In the above description, the learning unit 24a inputs two images IM (for example, images IM s and IM s+τ) acquired at two times separated by a time interval of τ into the object detection unit 21.
- Alternatively, the learning unit 24a may input, into the object detection unit 21, two images IM (for example, images IM s and IM s+mτ) acquired at two times separated by a time interval of m × τ (where m is a coefficient that the learning unit 24a can change, for example, an integer such as 1, 2, 3, ...).
- For example, the learning unit 24a may input the images IM s and IM s+mτ to the object detection unit 21 and update the calculation model based on the loss function L1 regarding the refined object position information PI s+mτ. After that, the learning unit 24a may input the images IM s+mτ and IM s+2mτ into the object detection unit 21 and update the calculation model based on the loss function L1 regarding the refined object position information PI s+2mτ. After that, the learning unit 24a may repeat the same operation. In this case, the amount of movement of the object O between the two images IM input to the object detection unit 21 changes according to the coefficient m.
- The coefficient m may be determined, for example, by a random number for each learning operation, that is, each time the learning unit 24a inputs two images IM (for example, images IM s and IM s+τ) to the object detection unit 21.
- As a result, the calculation model that defines the operation content of at least one of the object detection unit 21, the object matching unit 22, and the refinement unit 23 is updated so that objects moving at various moving speeds can be tracked.
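- One way to pick the image pairs with a randomly chosen coefficient m is sketched below. Frame indices are expressed in units of τ, the second image of each pair is reused as the first image of the next pair, and a fresh m is drawn for every pair; these choices, the upper bound `max_m`, and the function name `training_pairs` are assumptions made for illustration, not details given in the description.

```python
import random

def training_pairs(num_frames: int, max_m: int, seed: int = 0):
    """Yield (first, second, m) frame-index triples with second = first + m.

    Indices count multiples of tau, so the pair corresponds to the images
    IM s and IM s+m*tau. The coefficient m is drawn uniformly from 1..max_m.
    """
    rng = random.Random(seed)
    s = 0
    while True:
        m = rng.randint(1, max_m)
        if s + m >= num_frames:
            break              # not enough frames left for this interval
        yield s, s + m, m
        s += m                 # chain: the second image becomes the next first image

for first, second, m in training_pairs(num_frames=12, max_m=3, seed=1):
    print(f"input IM[{first}] and IM[{second}] (m = {m})")
```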
- The loss function L2 may be, for example, a loss function that becomes smaller as the error between the object position information PI s output by the object detection unit 21 and the object position information PI s_label as the correct label becomes smaller.
- The loss function L3 may be, for example, a loss function that becomes smaller as the error between the object position information PI s+τ output by the object detection unit 21 and the object position information PI s+τ_label as the correct label becomes smaller.
- The learning unit 24a may update the calculation model so that the sum of the loss functions L1 to L3 becomes smaller (typically, is minimized).
- The learning unit 24a may weight the loss functions L1 to L3 and update the calculation model based on the weighted loss functions L1 to L3. That is, the learning unit 24a may weight the loss functions L1 to L3 using weighting coefficients (denoted here as α1, α2 and α3), respectively, and update the calculation model based on the loss function specified by the formula α1 × L1 + α2 × L2 + α3 × L3.
- Each of the loss functions L1 and L3 is a loss function related to the object position information PI s+τ, whereas the loss function L2 is a loss function related to the object position information PI s.
- Consequently, the contribution of the object position information PI s+τ to the update of the calculation model (that is, the contribution of the loss functions L1 and L3) can differ from the contribution of the object position information PI s to the update of the calculation model (that is, the contribution of the loss function L2).
- The contribution of the object position information PI s to the update of the calculation model is preferably the same as the contribution of the object position information PI s+τ to the update of the calculation model.
- The learning unit 24a may therefore perform the weighting process so that the weight on the sum of the loss function L1 and the loss function L3 and the weight on the loss function L2 are the same.
- For example, the learning unit 24a may weight the loss functions L1 to L3 using the formula 0.5 × (L1 + L3) + 0.5 × L2 and update the calculation model based on the loss function specified by that formula.
- Such a weighting process is particularly useful when the similarity matrix AM is normalized by using the softmax function in the third modification described later. The reason is as follows.
- In the initial period of the learning operation (for example, the period during which most of the elements of the similarity matrix AM become zero due to the normalization process), the learning unit 24a may perform the weighting process so that the weight for the sum of the loss function L1 and the loss function L3 and the weight for the loss function L2 are the same. As a result, the learning effect of the learning unit 24a is less likely to be lost even in the initial period of the learning operation (for example, the period in which most of the elements of the similarity matrix AM are reduced to zero by the normalization process).
- As another example, the learning unit 24a may weight the loss functions L1 to L3 using the formula 0.25 × L1 + 0.25 × L3 + 0.5 × L2 and update the calculation model based on the loss function specified by that formula.
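- The weighting formulas above can be checked with a few lines of Python. The function name `weighted_loss` and the parameter names `a1`, `a2` and `a3` (standing for the three weighting coefficients) are hypothetical, and the loss values are arbitrary numbers chosen only to make the arithmetic easy to follow.

```python
def weighted_loss(l1: float, l2: float, l3: float,
                  a1: float, a2: float, a3: float) -> float:
    """Return a1 * L1 + a2 * L2 + a3 * L3."""
    return a1 * l1 + a2 * l2 + a3 * l3

# Arbitrary example values for the three loss functions.
L1, L2, L3 = 2.0, 4.0, 6.0

# 0.5 x (L1 + L3) + 0.5 x L2: the sum (L1 + L3) and L2 get the same weight.
same_weight_on_sum = weighted_loss(L1, L2, L3, a1=0.5, a2=0.5, a3=0.5)

# 0.25 x L1 + 0.25 x L3 + 0.5 x L2: the weights of L1 and L3 add up to the weight of L2.
split_weight = weighted_loss(L1, L2, L3, a1=0.25, a2=0.5, a3=0.25)

print(same_weight_on_sum, split_weight)  # 6.0 4.0
```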
- the object tracking device 1 performs an object tracking operation including an object detection operation.
- the object tracking operation does not have to include the object detection operation. That is, the object tracking device 1 does not have to perform the object detection operation.
- The storage device 3 does not have to store the image DB 31 for storing the image IM used for performing the object detection operation and the object detection DB 32 for storing the object detection information indicating the result of the object detection operation.
- the object tracking device 1b (particularly, the object matching unit 22) may perform the object matching operation by using the object detection information indicating the result of the object detection operation performed by a device different from the object tracking device 1b.
- the object tracking device 1 performs an object tracking operation including a refinement operation.
- the object tracking operation does not have to include the refinement operation. That is, the object tracking device 1 does not have to perform the refinement operation.
- As shown in FIG. 9, which shows the configuration of the object tracking device 1b that is the second modification of the object tracking device 1, the object tracking device 1b does not have to include the refinement unit 23.
- the refinement operation may be performed by a device different from the object tracking device 1b.
- The object tracking device 1b (particularly, the object matching unit 22) may output the object matching information indicating the result of the object matching operation to a device that performs the refinement operation and that is different from the object tracking device 1b.
- The matrix calculation unit 225 may normalize the similarity matrix AM obtained by arithmetic processing using the feature vector CV t-τ and the feature vector CV t. For example, the matrix calculation unit 225 may normalize the similarity matrix AM by normalizing the matrix product of the feature vector CV t-τ and the feature vector CV t.
- The matrix calculation unit 225 may perform arbitrary normalization processing on the similarity matrix AM.
- For example, the matrix calculation unit 225 may perform a normalization process using a sigmoid function on the similarity matrix AM.
- In this case, each element of the similarity matrix AM is normalized using the sigmoid function.
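- Element-wise sigmoid normalization can be sketched as follows. The function name `normalize_sigmoid` and the sample matrix are hypothetical; the sketch only shows that every element is mapped independently into the range between 0 and 1.

```python
import numpy as np

def normalize_sigmoid(am: np.ndarray) -> np.ndarray:
    """Apply the sigmoid function 1 / (1 + exp(-x)) to every element of AM."""
    return 1.0 / (1.0 + np.exp(-am))

am = np.array([[0.0, 2.0],
               [-2.0, 5.0]])
am_norm = normalize_sigmoid(am)
print(am_norm)
```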
- The matrix calculation unit 225 may perform a normalization process using the softmax function on the similarity matrix AM.
- For example, the matrix calculation unit 225 may perform the normalization process using the softmax function on each of a row vector component composed of the plurality of elements of each row of the similarity matrix AM and a column vector component composed of the plurality of elements of each column of the similarity matrix AM.
- In this case, the matrix calculation unit 225 may perform the normalization process using the softmax function on the column vector component so that the sum of the plurality of elements constituting the column vector component is 1, and perform the normalization process using the softmax function on the row vector component so that the sum of the plurality of elements constituting the row vector component is 1. After that, the matrix containing the elements obtained by multiplying the column vector component and the row vector component after the normalization process becomes the normalized similarity matrix AM.
- As an example, the normalization process using the softmax function may be performed on each of the column vector components, each of which is composed of the plurality of elements in one column.
- For example, the matrix calculation unit 225 may perform the normalization process using the softmax function on the column vector components so that the sum of the plurality of elements constituting the column vector component corresponding to the object O t-τ is 1, and perform the normalization process using the softmax function on the row vector components so that the sum of the plurality of elements constituting the row vector component corresponding to the object O t is 1. After that, the matrix containing the elements obtained by multiplying the column vector component and the row vector component after the normalization process becomes the normalized similarity matrix AM.
- the object position information PI generated by the object detection unit 21 by performing the object detection operation includes information indicating the center position KP of the object O, information indicating the size of the detection frame BB of the object O, and information indicating the correction amount of the detection frame BB.
- the object position information PI does not have to include at least one of the information indicating the size of the detection frame BB of the object O and the information indicating the correction amount of the detection frame BB.
- the object position information PI may include information regarding the position of a portion different from the center of the object O.
- the object matching unit 22 generates the feature vector CV by using the object position information PI indicating the result of the object detection operation.
- the object matching unit 22 may use an intermediate output of the arithmetic model, in addition to the object position information PI corresponding to the final output of the arithmetic model, to generate the feature vector CV.
- the object matching unit 22 may use the output of an intermediate layer of the neural network used as the arithmetic model, in addition to the object position information PI corresponding to the output of the output layer of that neural network, to generate the feature map CM. The feature vector CV may then be generated from the generated feature map CM.
- the matrix calculation unit 225 generates information obtained by the calculation process for calculating the matrix product of the feature vector CV_{t-τ} and the feature vector CV_t as the similarity matrix AM.
- the matrix calculation unit 225 may generate information obtained by arbitrary arithmetic processing using the feature vector CV_{t-τ} and the feature vector CV_t as the similarity matrix AM.
- the matrix calculation unit 225 may generate information obtained by arithmetic processing for calculating the matrix sum of the feature vector CV_{t-τ} and the feature vector CV_t as the similarity matrix AM.
- the matrix calculation unit 225 may generate the similarity matrix AM by using an arbitrary calculation model that outputs the similarity matrix AM when the feature vector CV_{t-τ} and the feature vector CV_t are input. One example of such a calculation model is an arithmetic model using a neural network (for example, a CNN: Convolutional Neural Network).
- as long as the arithmetic processing can generate a similarity matrix AM indicating the correspondence between the object O_{t-τ} and the object O_t, the matrix calculation unit 225 may perform any arithmetic processing to generate the similarity matrix AM.
- the object matching unit 22 generates a similarity matrix AM having a size of HW × HW from the object position information PI, which is map information having a size of H × W × (K + 4).
- the object matching unit 22 may generate a similarity matrix AM having a size smaller than HW × HW from the object position information PI having a size of H × W × (K + 4). That is, the object matching unit 22 may generate a downscaled similarity matrix AM.
- the feature map conversion units 221 and 223 of the object matching unit 22 may generate, from the object position information PI_{t-τ} and PI_t having a size of H × W × (K + 4), the feature maps CM_{t-τ} and CM_t, respectively, each having a size smaller than H × W × C.
- in this case, the matrix calculation unit 225 of the object matching unit 22 can generate a similarity matrix AM having a size smaller than HW × HW from the feature maps CM having a size smaller than H × W × C.
- in the feature map conversion units 221 and 223 of the object matching unit 22, the stride amount (that is, the amount of movement) of the kernel (that is, the convolution filter) used in the convolution layer that performs the convolution process for generating the feature maps CM_{t-τ} and CM_t may be adjusted.
- [Appendix 2] The object tracking device according to Appendix 1, wherein the arithmetic processing includes a process of calculating a matrix product of the first feature vector and the second feature vector.
- [Appendix 3] The object tracking device according to Appendix 1 or 2, wherein the correspondence information indicates a correspondence relationship between an object in the first image and an object in the second image using a matrix.
- the second generation means normalizes the matrix by normalizing each of the vector component of one row of the matrix and the vector component of one column of the matrix with a softmax function. The object tracking device according to any one item.
- the correspondence information includes a row vector component corresponding to one object in either one of the first and second images and a column vector corresponding to the one object in any one of the first and second images. Using a matrix in which the elements react at the positions where the components intersect, the correspondence between the object in the first image and the object in the second image is shown.
- the object tracking device according to any one of Appendices 1 to 4, wherein the second generation means normalizes the matrix by normalizing each row vector component and each column vector component with a softmax function.
- the object tracking device according to Appendix 6, wherein the correction means corrects the second position information by using an attention mechanism that uses the correspondence information as a weight.
- the correspondence information indicates the correspondence between the object in the first image and the object in the second image using a matrix.
- the second position information includes a position map showing information about the position of an object in the second image.
- the object tracking device according to Appendix 7, wherein the attention mechanism corrects the position map, which is the second position information, by performing a process of calculating the matrix product of the position map and the correspondence information.
- the object tracking device according to Appendix 8, wherein the attention mechanism corrects the position map, which is the second position information, by performing a process of adding, to the position map, the correction map obtained by calculating the matrix product of the position map and the correspondence information.
- the object tracking device according to any one of Appendices 1 to 9, wherein the first generation means acquires, from the calculation model that outputs the first and second position information respectively when the first and second images are input, the first and second position information and intermediate output information corresponding to an intermediate output of the calculation model, and calculates the first and second feature vectors based on the first and second position information and the intermediate output information.
- the object tracking device according to any one of Appendices 1 to 10, further including a learning means for updating a learning model that defines the operation content of at least one of the first to third generation means and the correction means based on a first loss function related to position information.
- the learning means updates the learning model based on the first loss function, a second loss function related to the first position information generated by the third generation means when learning data is input to the third generation means, and a third loss function related to the second position information generated by the third generation means when the learning data is input to the third generation means.
- the object tracking device according to Appendix 11, wherein the learning means weights the first to third loss functions so that the weights of the first and third loss functions as a whole are equal to the weight of the second loss function, and generates the learning model based on the weighted first to third loss functions.
- [Appendix 14] An object tracking method including: generating, based on first position information about the position of an object in a first image taken at a first time and second position information about the position of an object in a second image taken at a second time different from the first time, a first feature vector showing the feature amount of the first position information and a second feature vector showing the feature amount of the second position information; and generating information obtained by arithmetic processing using the first and second feature vectors as correspondence information indicating a correspondence relationship between an object in the first image and an object in the second image.
- a recording medium on which a computer program that causes a computer to perform an object tracking method is recorded, wherein the object tracking method includes: generating, based on first position information about the position of an object in a first image taken at a first time and second position information about the position of an object in a second image taken at a second time different from the first time, a first feature vector showing the feature amount of the first position information and a second feature vector showing the feature amount of the second position information; and generating information obtained by arithmetic processing using the first and second feature vectors as correspondence information indicating a correspondence relationship between an object in the first image and an object in the second image.
- a computer program that causes a computer to perform an object tracking method, wherein the object tracking method includes: generating, based on first position information about the position of an object in a first image taken at a first time and second position information about the position of an object in a second image taken at a second time different from the first time, a first feature vector showing the feature amount of the first position information and a second feature vector showing the feature amount of the second position information; and generating information obtained by arithmetic processing using the first and second feature vectors as correspondence information indicating a correspondence relationship between an object in the first image and an object in the second image.
- the present invention can be appropriately modified within the scope of the claims and within a scope not contrary to the gist or idea of the invention that can be read from the entire specification, and an object tracking device, an object tracking method, and a recording medium accompanied by such a modification are also included in the technical idea of the present invention.
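The similarity matrix AM that the matrix calculation unit 225 derives from the feature vectors, together with the softmax normalization of its row and column vector components, can be illustrated with a minimal NumPy sketch. Several points are assumptions made only for illustration and are not stated in the specification: each feature vector is taken to be an array of shape (HW, C) obtained by flattening an H × W × C feature map, "multiplying the column vector components and the row vector components" is read as an element-wise product of the two normalized matrices, and the function names `softmax` and `similarity_matrix` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Softmax along one axis, shifted by the maximum for numerical stability."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def similarity_matrix(cv_prev, cv_curr):
    """Build a normalized (HW, HW) similarity matrix from two (HW, C) feature arrays.

    Assumption: rows index positions at time t - tau, columns index positions at time t.
    """
    am = cv_prev @ cv_curr.T          # matrix product, shape (HW, HW)
    row_norm = softmax(am, axis=1)    # every row vector component sums to 1
    col_norm = softmax(am, axis=0)    # every column vector component sums to 1
    return row_norm * col_norm        # element-wise product of the two normalizations


rng = np.random.default_rng(0)
H, W, C = 4, 5, 8                      # toy sizes chosen for the example
cv_prev = rng.standard_normal((H * W, C))
cv_curr = rng.standard_normal((H * W, C))
am = similarity_matrix(cv_prev, cv_curr)
print(am.shape)  # (20, 20)
```

An element of the result is large only when it dominates both its row and its column, which is one way to read the statement that the element at the intersection of the corresponding row and column responds.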
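The correction of the position map by the attention mechanism described in Appendices 7 to 9 (a correction map obtained as a matrix product with the correspondence information, added back to the position map) might look roughly like the following sketch. The layout is an assumption: the position map is taken to have shape H × W × (K + 4) and is flattened to (HW, K + 4), the correspondence information is an (HW, HW) matrix whose rows index positions of the position map being corrected, and `refine_position_map` is a hypothetical name.

```python
import numpy as np


def refine_position_map(position_map, am):
    """Add an attention-weighted correction map to a position map.

    position_map: array of shape (H, W, D), with D = K + 4 in the text.
    am: (H*W, H*W) correspondence matrix used as attention weights (assumed orientation).
    """
    h, w, d = position_map.shape
    flat = position_map.reshape(h * w, d)   # one row per map position
    correction = am @ flat                  # matrix product gives the correction map
    return (flat + correction).reshape(h, w, d)


# Toy example: K = 1 class channel plus 4 box channels on a 2 x 3 grid.
pm = np.arange(2 * 3 * 5, dtype=float).reshape(2, 3, 5)
unchanged = refine_position_map(pm, np.zeros((6, 6)))   # zero weights: no correction
doubled = refine_position_map(pm, np.eye(6))            # identity weights: map added to itself
print(np.allclose(unchanged, pm), np.allclose(doubled, 2 * pm))  # True True
```

The additive form keeps the original detection result intact when the correspondence weights are zero, so the attention term only contributes where the two images agree.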
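The effect of adjusting the kernel stride on the size of the similarity matrix AM can be checked with simple arithmetic. The sketch below uses the standard convolution output-size formula; the kernel size, padding, and the 128 × 128 map size are illustrative assumptions, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def similarity_matrix_shape(h, w, kernel, stride, padding=0):
    """Shape of the similarity matrix built from a feature map produced with the given stride."""
    h_out = conv_output_size(h, kernel, stride, padding)
    w_out = conv_output_size(w, kernel, stride, padding)
    n = h_out * w_out                 # number of positions in the feature map
    return (n, n)


print(similarity_matrix_shape(128, 128, kernel=3, stride=1, padding=1))  # (16384, 16384)
print(similarity_matrix_shape(128, 128, kernel=3, stride=2, padding=1))  # (4096, 4096)
```

Under these assumptions, doubling the stride halves each spatial dimension of the feature map and therefore shrinks the similarity matrix by a factor of 16 in element count.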
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2019/051088 WO2021130951A1 (ja) | 2019-12-26 | 2019-12-26 | Object tracking device, object tracking method, and recording medium |
| US17/788,444 US12243299B2 (en) | 2019-12-26 | 2019-12-26 | Object tracking apparatus, object tracking method and recording medium |
| JP2021566677A JP7310927B2 (ja) | 2019-12-26 | 2019-12-26 | Object tracking device, object tracking method, and recording medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2019/051088 WO2021130951A1 (ja) | 2019-12-26 | 2019-12-26 | Object tracking device, object tracking method, and recording medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021130951A1 (ja) | 2021-07-01 |
Family
ID=76573764
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2019/051088 Ceased WO2021130951A1 (ja) | Object tracking device, object tracking method, and recording medium | 2019-12-26 | 2019-12-26 |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US12243299B2 |
| JP (1) | JP7310927B2 |
| WO (1) | WO2021130951A1 |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025186903A1 (ja) * | 2024-03-05 | 2025-09-12 | 日本電気株式会社 | Information processing device, information processing method, and recording medium |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022014327A1 (ja) * | 2020-07-14 | 2022-01-20 | ソニーセミコンダクタソリューションズ株式会社 | Information processing device, information processing method, and program |
| US11972348B2 (en) * | 2020-10-30 | 2024-04-30 | Apple Inc. | Texture unit circuit in neural network processor |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018030048A1 (ja) * | 2016-08-08 | 2018-02-15 | パナソニックIpマネジメント株式会社 | Object tracking method, object tracking device, and program |
| WO2019008951A1 (ja) * | 2017-07-06 | 2019-01-10 | 株式会社デンソー | Convolutional neural network |
| JP2019536154A (ja) * | 2016-11-15 | 2019-12-12 | マジック リープ, インコーポレイテッドMagic Leap,Inc. | Deep machine learning system for cuboid detection |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP5166102B2 (ja) | 2008-04-22 | 2013-03-21 | 株式会社東芝 | Image processing device and method thereof |
| JP2012181710A (ja) | 2011-03-02 | 2012-09-20 | Fujifilm Corp | Object tracking device, method, and program |
| JP6488647B2 (ja) | 2014-09-26 | 2019-03-27 | 日本電気株式会社 | Object tracking device, object tracking system, object tracking method, display control device, object detection device, program, and recording medium |
| US9858496B2 (en) | 2016-01-20 | 2018-01-02 | Microsoft Technology Licensing, Llc | Object detection and classification in images |
| JP6832504B2 (ja) * | 2016-08-08 | 2021-02-24 | パナソニックIpマネジメント株式会社 | Object tracking method, object tracking device, and program |
| JP6946649B2 (ja) | 2017-01-31 | 2021-10-06 | ソニーグループ株式会社 | Electronic device, information processing method, and program |
| US12406380B2 (en) * | 2020-03-12 | 2025-09-02 | Nec Corporation | Image processing apparatus, image processing system, image processing method, and non-transitory computer-readable medium storing image processing program therein |
| JP7491383B2 (ja) * | 2020-08-19 | 2024-05-28 | 日本電気株式会社 | Reference state determination device, reference state determination method, and program |
2019
- 2019-12-26 US US17/788,444 patent/US12243299B2/en active Active
- 2019-12-26 WO PCT/JP2019/051088 patent/WO2021130951A1/ja not_active Ceased
- 2019-12-26 JP JP2021566677A patent/JP7310927B2/ja active Active
Non-Patent Citations (1)
| Title |
|---|
| FEICHTENHOFER ET AL.: "Detect to Track and Track to Detect", PROCEEDINGS OF 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2017), 29 October 2017 (2017-10-29), pages 3057 - 3065, XP033283173, ISBN: 978-1-5386-1032-9, DOI: 10.1109/ICCV.2017.330 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JP7310927B2 (ja) | 2023-07-19 |
| JPWO2021130951A1 | 2021-07-01 |
| US12243299B2 (en) | 2025-03-04 |
| US20230031931A1 (en) | 2023-02-02 |
Similar Documents
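| Publication | Publication Date | Title |
|---|---|---|
| US12499363B2 (en) | | Learning to generate synthetic datasets for training neural networks |
| US12406488B2 (en) | | Neural network model training method, image processing method, and apparatus |
| KR102635987B1 (ko) | | Method, apparatus, device, and storage medium for training an image semantic segmentation network |
| CN107945204B (zh) | | Pixel-level portrait matting method based on a generative adversarial network |
| US11593611B2 (en) | | Neural network cooperation |
| US11954755B2 (en) | | Image processing device and operation method thereof |
| CN107766823B (zh) | | Method for detecting abnormal behavior in video based on key-region feature learning |
| CN111104925A (zh) | | Image processing method and apparatus, storage medium, and electronic device |
| CN113537462B (zh) | | Data processing method, neural network quantization method, and related apparatus |
| CN112434618A (zh) | | Video object detection method based on a sparse foreground prior, storage medium, and device |
| WO2021130951A1 (ja) | | Object tracking device, object tracking method, and recording medium |
| CN114495006A (zh) | | Method and apparatus for detecting abandoned objects, and storage medium |
| WO2022038660A1 (ja) | | Information processing device, information processing method, and information processing program |
| CN111461862A (zh) | | Method and apparatus for determining target features for business data |
| CN111382619B (zh) | | Generation of a picture recommendation model, and picture recommendation method, apparatus, device, and medium |
| CN118152897A (zh) | | Training method, apparatus, device, and storage medium for a sensitive information recognition model |
| Wang et al. | | Bird-Count: a multi-modality benchmark and system for bird population counting in the wild |
| CN116091905B (zh) | | Domain-adaptive impact crater detection method, system, apparatus, and storage medium |
| US11688175B2 (en) | | Methods and systems for the automated quality assurance of annotated images |
| KR20230071719A (ko) | | Method for training a neural network for image processing, and electronic device |
| CN113658218B (zh) | | Dual-template dense Siamese network tracking method, apparatus, and storage medium |
| CN118351138B (zh) | | Generic-category multi-object tracking method incorporating a prototype learning mechanism |
| Tomar | | A critical evaluation of activation functions for autoencoder neural networks |
| CN116071719B (zh) | | Lane line semantic segmentation method and apparatus based on dynamic model correction |
| CN117708688A (zh) | | Noisy-label correction method and system assisted by self-distillation |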
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19957428; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021566677; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19957428; Country of ref document: EP; Kind code of ref document: A1 |
| | WWG | Wipo information: grant in national office | Ref document number: 17788444; Country of ref document: US |