CN113205072A - Object association method and device and electronic equipment

Object association method and device and electronic equipment

Info

Publication number
CN113205072A
CN113205072A (application CN202110592975.8A)
Authority
CN
China
Prior art keywords
target
similarity
tracking
information
target object
Prior art date
Legal status
Pending
Application number
CN202110592975.8A
Other languages
Chinese (zh)
Inventor
邬紫阳
罗兵华
刘晓东
杨涛
宋荣
Current Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Original Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Goldway Intelligent Transportation System Co Ltd
Priority to CN202110592975.8A
Publication of CN113205072A
Priority to CN202210576171.3A (published as CN114742112A)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The application provides an object association method and device and an electronic device. The object association method comprises: acquiring an image frame sequence comprising a first image and a second image, wherein the acquisition time of the second image is earlier than that of the first image; acquiring target position information and target feature information of a target object identified from the first image; acquiring historical position information, historical feature information and historical tracking information of a tracking object identified from the second image, and predicting the predicted position information of the tracking object in the first image based on the historical position information of each tracking object; and inputting the target position information of the target object, the predicted position information of the tracking object, the historical tracking information, the target feature information of the target object and the historical feature information of the tracking object into a trained similarity prediction model to obtain whether the target object and the tracking object are the same target, thereby improving the accuracy of establishing the association relation between the target object and the tracking object.

Description

Object association method and device and electronic equipment
Technical Field
The present application relates to the field of image processing, and in particular, to an object association method and apparatus, and an electronic device.
Background
The image-based target object tracking technology refers to: associating at least one target object identified from the current video frame with a tracking object identified from a previous video frame, and determining the motion track of the target object based on the position information of the target object and of the associated tracking object, thereby realizing tracking.
Specifically, after the related information of at least one target object is identified from the first video frame, the identified target object may be used as a tracking object, and the tracking object and its related information may be recorded. When a target object is identified from a subsequent video frame, the target object can be matched with each tracked object; if the target object matches any tracked object, the target object and that tracked object are determined to be the same object, and an association relation between the two is established. Then, based on the position information of the target object and of the tracked object associated with it, the motion trail of the target object is determined, thereby realizing the tracking of the target object.
Therefore, how to establish the association relation between the target object and the tracked object is the key to realizing tracking of the target object, and accurately establishing this association relation has become a problem to be solved urgently.
Disclosure of Invention
In view of this, the present application provides an object association method, an object association device, and an electronic device, which are used to improve accuracy of establishing an association relationship between a target object and a tracked object.
Specifically, the method is realized through the following technical scheme:
according to a first aspect of the present application, there is provided an object association method, the method comprising:
acquiring an image frame sequence comprising a first image and a second image; wherein the acquisition time of the second image is earlier than the acquisition time of the first image;
acquiring target position information and target characteristic information of a target object identified from a first image;
acquiring historical position information, historical characteristic information and historical tracking information of the tracking object identified from the second image, and predicting the predicted position information of the tracking object in the first image based on the historical position information of each tracking object;
inputting the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, the target characteristic information of the target object and the historical characteristic information of the tracked object into a trained similarity prediction model to obtain whether the target object and the tracked object are the same target.
Optionally, the number of the target objects is at least one, and the number of the tracking objects is at least one;
inputting the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, the target feature information of the target object and the historical feature information of the tracked object into a trained similarity prediction model to obtain whether the target object and the tracked object are the same target, wherein the method comprises the following steps:
inputting target position information of the target object, predicted position information of the tracked object, historical tracking information, target characteristic information of the target object and historical characteristic information of the tracked object into a trained similarity prediction model;
the similarity prediction model determines first feature information used for representing the similarity of the predicted positions between each target object and each tracked object based on the target position information of each target object and the predicted position information of each tracked object, determines second feature information used for representing the similarity of the features between each target object and each tracked object based on the target feature information of each target object and the historical feature information of each tracked object, and convolves the historical tracking information of each tracked object to obtain an attention probability mask;
the similarity prediction model fuses first characteristic information and second characteristic information between each target object and each tracked object, and performs mask operation on a fusion result and the attention probability mask to obtain the similarity between each target object and each tracked object;
and the similarity prediction model determines and outputs whether each target object and each tracking object are the same target or not based on the trained similarity threshold and the similarity between each target object and each tracking object.
Optionally, inputting the target feature information of the target object and the historical feature information of the tracked object into a trained similarity prediction model, including:
clustering the historical characteristic information of the tracked object to obtain a clustering result, wherein the clustering result comprises: at least one characteristic category and at least one clustering cluster corresponding to the characteristic category respectively; the historical characteristic information in each cluster is matched with the characteristic category corresponding to the cluster;
and splicing the target characteristic information of the target object with the clustering center of at least one clustering cluster, and inputting a splicing result to the trained similarity prediction model.
Optionally, the number of the target objects is m, the number of the tracking objects is n, and the number of the clustering clusters is k;
the stitching result is represented by a tensor of (m × n) × k × 2dims; wherein m × n represents the number of matching pairs of the target objects and the tracked objects, k represents the number of clustering clusters, and 2dims represents the number of dimensions of the feature information;
the determining, based on the target feature information of each target object and the historical feature information of the respective tracked objects, second feature information representing a feature similarity between each target object and the respective tracked objects includes:
performing convolution operation on the stitching result on the dimension corresponding to 2dims to obtain a first tensor, wherein the first tensor is represented by (m × n) × k × p, and p represents the number of dimensions of the convolution operation result;
exchanging elements on the dimensionality corresponding to k and the dimensionality corresponding to p in the first tensor to obtain a second tensor, wherein the second tensor is expressed by (m × n) × p × k;
and performing convolution operation on the second tensor on the dimensionality corresponding to the k, and determining the obtained result as second feature information used for expressing feature similarity between each target object and each tracked object.
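For illustration only, this two-stage convolution can be sketched as follows, assuming PyTorch-style 1-D convolutions; the values of m, n, k, dims and p, the kernel sizes, and reducing over the k dimension with a full-length kernel are illustrative assumptions rather than values fixed by this disclosure.

```python
import torch
import torch.nn as nn

m, n, k, dims, p = 2, 3, 3, 64, 16            # illustrative sizes (assumptions)
stitch = torch.randn(m * n, k, 2 * dims)      # stitching result: (m*n) x k x 2dims

# convolution on the dimension corresponding to 2dims -> first tensor, (m*n) x k x p
conv_feat = nn.Conv1d(in_channels=2 * dims, out_channels=p, kernel_size=1)
first = conv_feat(stitch.permute(0, 2, 1)).permute(0, 2, 1)   # (m*n, k, p)

# exchange the dimensions corresponding to k and p -> second tensor, (m*n) x p x k
second = first.permute(0, 2, 1)                               # (m*n, p, k)

# convolution on the dimension corresponding to k -> second feature information
conv_cluster = nn.Conv1d(in_channels=p, out_channels=p, kernel_size=k)
second_feature_info = conv_cluster(second).squeeze(-1)        # (m*n, p)
print(second_feature_info.shape)                              # torch.Size([6, 16])
```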
Optionally, the similarity prediction model represents the similarity between each target object and each tracked object through a similarity matrix, and each element in the similarity matrix represents the similarity between one target object and one tracked object;
determining and outputting whether each target object and each tracking object are the same target or not based on the trained similarity threshold and the similarity between each target object and each tracking object, including:
expanding the similarity matrix by adopting the similarity threshold value so that the expanded similarity matrix comprises the similarity threshold value;
setting a weight for each row of elements based on the values of the elements in each row of the expanded similarity matrix, and setting a weight for each column of elements based on the values of the elements in each column of the expanded similarity matrix;
performing minimum flow calculation on the expanded similarity matrix based on the weight set for each row element and each column element to obtain a matching value of each target object and each tracked object;
and obtaining and outputting whether each target object and each tracking object are the same target or not based on the matching value of each target object and each tracking object.
Optionally, the number of the target objects is m, and the number of the tracking objects is n; the size of the similarity matrix is m x n;
the expanding the similarity matrix by adopting the similarity threshold value comprises the following steps:
expanding the m x n similarity matrix to obtain an M x N similarity matrix; wherein M represents the maximum number of detectable target objects and N represents the maximum number of trackable tracking objects;
and expanding the M x N similarity matrix by adopting a similarity threshold value to obtain a 2M x 2N similarity matrix.
Optionally, the obtaining whether each target object and each tracked object are the same target based on the matching value of each target object and each tracked object includes:
for each target object, if the matching value of the target object and any one tracked object is a first preset value, determining that the target object and any one tracked object are the same target;
and if the matching values of the target object with all the tracked objects are the second preset value, determining that the target object is not the same target as any of the tracked objects.
Optionally, the method further includes:
for each target object, if the target object and all the tracked objects are not the same target, determining the target object as a new object;
and for each tracking object, if the tracking object and all the target objects are not the same target, determining that the tracking object disappears in the first image.
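As an illustration of the matching described above, the sketch below pads a similarity matrix with the similarity threshold and solves a maximum-similarity assignment, then reads out matched pairs, new objects and disappeared tracked objects. It is a simplified stand-in only: the disclosure's 2M x 2N expansion and minimum-flow computation are not reproduced literally, the Hungarian assignment from SciPy is used in their place, and the preset values and the threshold are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(sim, threshold):
    """Simplified stand-in: each target either matches a tracked object or falls
    back to a 'threshold' column, and the assignment maximizing total similarity
    is solved with the Hungarian algorithm."""
    m, n = sim.shape
    padded = np.hstack([sim, np.full((m, m), threshold)])   # (m, n + m)
    rows, cols = linear_sum_assignment(-padded)             # maximize total similarity
    match = np.zeros((m, n), dtype=int)     # 0: second preset value, not the same target
    for r, c in zip(rows, cols):
        if c < n:
            match[r, c] = 1                 # 1: first preset value, same target
    new_targets = [i for i in range(m) if match[i].sum() == 0]       # new objects
    lost_tracks = [j for j in range(n) if match[:, j].sum() == 0]    # disappeared tracks
    return match, new_targets, lost_tracks

sim = np.array([[0.9, 0.2, 0.1],
                [0.1, 0.3, 0.05]])
print(associate(sim, threshold=0.5))
# match = [[1, 0, 0], [0, 0, 0]], new targets: [1], disappeared tracks: [1, 2]
```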
Optionally, the similarity threshold is a model parameter of the similarity prediction model, and the similarity threshold is obtained by training with the similarity prediction model.
According to a second aspect of the present application, there is provided an object associating apparatus, the apparatus comprising:
an acquisition unit configured to acquire an image frame sequence including a first image and a second image; wherein the acquisition time of the second image is earlier than the acquisition time of the first image; acquiring target position information and target characteristic information of a target object identified from a first image; acquiring historical position information, historical characteristic information and historical tracking information of the tracking object identified from the second image;
a prediction unit configured to predict predicted position information of each tracking object in the first image based on the historical position information of the tracking object;
and the output unit is used for inputting the target position information of the target object, the predicted position information and the historical tracking information of the tracking object, the target characteristic information of the target object and the historical characteristic information of the tracking object into a trained similarity prediction model to obtain whether the target object and the tracking object are the same target or not.
Optionally, the number of the target objects is at least one, and the number of the tracking objects is at least one;
the output unit is configured to, when the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, and the target feature information of the target object and the historical feature information of the tracked object are input to the trained similarity prediction model to obtain whether the target object and the tracked object are the same target, input the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, and the target feature information of the target object and the historical feature information of the tracked object to the trained similarity prediction model; the similarity prediction model determines first feature information used for representing the similarity of the predicted positions between each target object and each tracked object based on the target position information of each target object and the predicted position information of each tracked object, determines second feature information used for representing the similarity of the features between each target object and each tracked object based on the target feature information of each target object and the historical feature information of each tracked object, and convolves the historical tracking information of each tracked object to obtain an attention probability mask; the similarity prediction model fuses the first feature information and the second feature information between each target object and each tracked object, and performs a mask operation on the fusion result and the attention probability mask to obtain the similarity between each target object and each tracked object; and the similarity prediction model determines and outputs whether each target object and each tracked object are the same target based on the trained similarity threshold and the similarity between each target object and each tracked object.
Optionally, the output unit is configured to, when the target feature information of the target object and the historical feature information of the tracked object are input to a trained similarity prediction model, perform clustering on the historical feature information of the tracked object to obtain a clustering result, where the clustering result includes: at least one characteristic category and at least one clustering cluster corresponding to the characteristic category respectively; the historical characteristic information in each cluster is matched with the characteristic category corresponding to the cluster; and splicing the target characteristic information of the target object with the clustering center of at least one clustering cluster, and inputting a splicing result to the trained similarity prediction model.
Optionally, the number of the target objects is m, the number of the tracking objects is n, and the number of the clustering clusters is k;
the input feature information is represented by a tensor of (m × n) × k × 2dims; wherein m × n represents the number of matching pairs of the target objects and the tracked objects, k represents the number of clustering clusters, and 2dims represents the number of dimensions of the feature information;
the output unit is configured to, when determining the second feature information used for representing the feature similarity between each target object and each tracked object based on the target feature information of each target object and the historical feature information of each tracked object, perform a convolution operation on the stitching result on the dimension corresponding to 2dims to obtain a first tensor, wherein the first tensor is represented by (m × n) × k × p, and p represents the number of dimensions of the convolution operation result; exchange elements on the dimension corresponding to k and the dimension corresponding to p in the first tensor to obtain a second tensor, wherein the second tensor is represented by (m × n) × p × k; and perform a convolution operation on the second tensor on the dimension corresponding to k, and determine the obtained result as the second feature information used for representing the feature similarity between each target object and each tracked object.
Optionally, the similarity prediction model represents the similarity between each target object and each tracked object through a similarity matrix, the size of the similarity matrix is m × n, and each element in the similarity matrix represents the similarity between one target object and one tracked object;
the output unit is configured to, when determining and outputting whether each target object and each tracked object are the same target based on the trained similarity threshold and the similarity between each target object and each tracked object, expand the similarity matrix by adopting the similarity threshold, so that the expanded similarity matrix comprises the similarity threshold; set a weight for each row of elements based on the values of the elements in each row of the expanded similarity matrix, and set a weight for each column of elements based on the values of the elements in each column of the expanded similarity matrix; perform minimum flow calculation on the expanded similarity matrix based on the weights set for each row of elements and each column of elements to obtain a matching value of each target object and each tracked object; and obtain and output the association relation between each target object and each tracked object based on the matching value of each target object and each tracked object.
Optionally, the number of the target objects is m, and the number of the tracking objects is n; the size of the similarity matrix is m x n;
the output unit is configured to, when the similarity matrix is expanded by adopting the similarity threshold, expand the m × n similarity matrix to obtain an M × N similarity matrix, wherein M represents the maximum number of detectable target objects and N represents the maximum number of trackable tracking objects; and expand the M × N similarity matrix by adopting the similarity threshold to obtain a 2M × 2N similarity matrix.
Optionally, the output unit is configured to, when obtaining whether each target object and each tracked object are the same target based on a matching value of each target object and each tracked object, determine that the target object and any tracked object are the same target if the matching value of the target object and that tracked object is a first preset value; and determine, if the matching values of the target object with all the tracked objects are the second preset value, that the target object is not the same target as any of the tracked objects.
Optionally, the output unit is further configured to determine, for each target object, that the target object is a new object if the target object is not the same as all the tracked objects;
and for each tracking object, if the tracking object and all the target objects are not the same target, determining that the tracking object disappears in the first image.
Optionally, the similarity threshold is a model parameter of the similarity prediction model, and the similarity threshold is obtained by training with the similarity prediction model.
According to a third aspect of the present application, there is provided an electronic device comprising a readable storage medium and a processor;
wherein the readable storage medium is configured to store machine executable instructions;
the processor is configured to read the machine executable instructions on the readable storage medium and execute the instructions to implement the object association method.
According to a fourth aspect of the present application, there is provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the above object association method.
According to a fifth aspect of the present application, there is provided a computer program, which is stored on a computer-readable storage medium and causes a processor to implement the above-mentioned object association method when the computer program is executed by the processor.
As can be seen from the above description, in one aspect, the electronic device may identify target position information and target feature information of the target object from the first image. The electronic device identifies historical position information, historical feature information and historical tracking information of the tracking object from a second image whose acquisition time is before the first image, and predicts predicted position information of each tracking object in the first image based on the historical position information of the tracking object. The electronic device can input the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, the target feature information of the target object and the historical feature information of the tracked object into the trained similarity prediction model to obtain whether the target object and the tracked object are the same target (or whether the target object and the tracked object are related), thereby establishing the association relation between the target object and the tracked object.
On the other hand, in the present application, whether the target object and the tracking object are the same target is predicted by a supervised similarity prediction model based on the related information extracted for the target object and the historical tracking object, rather than by a manually set similarity threshold and a manually designed metric function. The present application therefore provides a more accurate way of determining whether the target object and the tracking object are the same target.
Drawings
FIG. 1 is a flow chart illustrating a method of object association in accordance with an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a clustering method according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a target feature information and cluster center stitching according to an exemplary embodiment of the present application;
FIG. 4 is a diagram illustrating similarity prediction performed by a similarity prediction model according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an expansion of a similarity matrix according to an exemplary embodiment of the present application;
FIG. 6 is a diagram illustrating a minimum flow calculation according to an exemplary embodiment of the present application;
FIG. 7 is a diagram illustrating a hardware configuration of an electronic device according to an exemplary embodiment of the present application;
fig. 8 is a block diagram of an object association apparatus according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
Image-based object tracking techniques refer to: associating at least one target object identified from the current video frame with a tracking object identified from a previous video frame, and determining the motion track of the target object based on the position information of the target object and of the associated tracking object, thereby realizing tracking. The association described here means matching a target object with each tracking object, and associating the target object with a tracking object if the two are the same target.
For example, the electronic device may identify 3 target objects from the first video frame, and use the 3 target objects as tracking objects. When the electronic device receives the second video frame, assuming that 2 target objects are identified from the second video frame, in order to track the objects, the electronic device needs to match the 2 target objects with 3 tracked objects. And for each target object, if the target object is successfully matched with any tracking object, determining that the target object and the tracking object are the same target, establishing a corresponding relation between the target object and the tracking object, and determining the motion track of the object based on the position information of the tracking object and the position information of the target object, thereby realizing the tracking of the object.
Therefore, the establishment of the association relationship between the target object in the first image and the tracking object in the historical frame is an essential link for realizing object tracking.
In view of the above, the present application is directed to an object association method, in which, on one hand, an electronic device can identify target position information and target feature information of a target object from a first image. The electronic device identifies historical position information, historical feature information and historical tracking information of the tracking object from a second image whose acquisition time is before the first image, and predicts predicted position information of each tracking object in the first image based on the historical position information of the tracking object. The electronic device can input the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, the target feature information of the target object and the historical feature information of the tracked object into the trained similarity prediction model to obtain whether the target object and the tracked object are the same target (or whether the target object and the tracked object are related), thereby establishing the association relation between the target object and the tracked object.
On the other hand, in the method, the similarity between the target object and the tracking object is predicted by a supervised similarity prediction model based on the relevant information extracted for the target object and the historical tracking object, and whether the target object and the tracking object are the same target is then determined according to this similarity and a similarity threshold obtained by training together with the similarity prediction model, rather than by manually setting a similarity threshold and manually designing a metric function to calculate the similarity between the target object and the tracking object. The method for determining whether the target object and the tracking object are the same target is therefore more accurate.
Before introducing the object association method provided by the present application, concepts related to the present application are introduced.
1. Object
In this application, an object refers to a thing that needs to be tracked; for example, the object may be a person, a vehicle, an animal, etc. The object is described here only by way of example and is not particularly limited.
In the present application, for convenience of description, an object recognized from the first image may be referred to as a target object, and an object recognized from a history frame before the first image may be referred to as a tracking object.
2. Location information of an object
The position information of the object can indicate the position of the object in the video frame. The position information may be represented by position information of a rectangular frame enclosing the object, for example, coordinates of a vertex at the upper left corner of the rectangular frame and width and height of the rectangular frame, or coordinates of other vertices of the rectangular frame and width and height of the rectangular frame, or coordinates of respective vertices of the rectangular frame.
3. Characteristic information of an object
The feature information of the object may be information such as an apparent feature indicating the appearance of the object.
For example, when the object is a person, the apparent feature may be a human face feature, and when the object is a vehicle, the apparent feature may be a vehicle feature (such as a license plate feature, a vehicle color, a brand, and the like) of the vehicle. Here, the feature information of the object is merely exemplified and not particularly limited.
4. Tracking information of an object
In the application, the same object can be identified from a plurality of video frames, and the electronic device can form the tracking information based on the related information of the same object in the plurality of video frames.
The tracking information can show the information of the same tracked object in different video frames, thereby revealing the internal association, across different video frames, of the same tracked object, which helps improve the accuracy of the subsequent similarity calculation.
Specifically, the tracking information refers to information related to tracking of the tracking object, such as a track length of the tracking object, a number of lost frames of the tracking object, an overlapping situation of the tracking object and other tracking objects, and the like.
The track length of a tracked object refers to the total number of frames from the establishment of the tracked object to the video frame in which the tracked object was last associated with an identified object. For example, if there are currently 10 video frames, the tracked object 1 is associated with an identified object from frame 1 to frame 4, and the tracked object 1 is not associated with any object identified from frame 5 to frame 10, the track length of the tracked object 1 is 4 frames.
The number of lost frames of a tracked object refers to the number of consecutive video frames, immediately preceding the current frame, in which the tracked object has not been associated with any identified object. For example, if the tracked object 1 is associated with an object in frames 1 to 3 and is not associated with any object identified in frame 4, the number of lost frames of the tracked object 1 is 0 in frames 1 to 4. Assuming that the tracked object 1 is also not associated with any object identified in frame 5, the number of lost frames of the tracked object 1 at frame 5 is 1. Assuming that the tracked object 1 is associated with an object from frame 6 to frame 10, the number of lost frames of the tracked object 1 at frame 6 is 2, and in frames 7 to 10 the number of lost frames of the tracked object 1 is 0.
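For illustration, the track-length and lost-frame bookkeeping in the two examples above can be sketched as follows; the class and field names are hypothetical and not part of this disclosure.

```python
class Track:
    """Hypothetical per-object bookkeeping (names are illustrative only)."""
    def __init__(self, start_frame):
        self.start_frame = start_frame
        self.last_hit_frame = start_frame     # last frame with an associated object

    def track_length(self):
        # frames from establishment up to the frame of the last associated object
        return self.last_hit_frame - self.start_frame + 1

    def lost_frames(self, current_frame):
        # consecutive unassociated frames immediately before the current frame
        return max(0, current_frame - self.last_hit_frame - 1)

    def mark_associated(self, frame):
        self.last_hit_frame = frame

# reproducing the example above: tracked object 1 is associated in frames 1-3 and 6-10
trk = Track(start_frame=1)
for f in range(1, 11):
    print(f, trk.lost_frames(f))      # 0 for frames 1-4, 1 at frame 5, 2 at frame 6, then 0
    if f <= 3 or f >= 6:
        trk.mark_associated(f)
print(trk.track_length())             # 10
```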
The overlapping condition of a tracked object with other tracked objects refers to: because objects in the acquired images may occlude one another, the rectangular frame of a tracked object may overlap the rectangular frames of other tracked objects. The overlapping condition of a tracked object with other tracked objects can therefore be represented by the overlap between their rectangular frames, such as the overlapping area.
Here, the tracking information is only exemplified, and in practical applications, all information that may be related to tracking of the tracked object may be referred to as tracking information, and the tracking information is not specifically limited herein.
After the above concepts are introduced, the object association method provided in the present application will be described in detail below.
Referring to fig. 1, fig. 1 is a flowchart illustrating an object association method according to an exemplary embodiment of the present application, where the method is applicable to an electronic device, and the electronic device may be a server, a data center, a computer, and the like, and the electronic device is only illustrated by way of example and is not particularly limited.
The object association method of the present application is suitable for determining whether a target object and a tracking object are the same object, determining whether a target object and each of a plurality of tracking objects are the same object, and determining whether each of a plurality of target objects and each of a plurality of tracking objects are the same object, where a scenario in which the method is applied is not specifically limited.
The object association method may include the steps shown below.
Step 101: the electronic equipment acquires an image frame sequence containing a first image and a second image; wherein the acquisition time of the second image is earlier than the acquisition time of the first image.
In an alternative implementation, the front-end capture device sends the currently captured image to the electronic device. The electronic device may use the currently captured image as the first image, and select, from the images previously captured by the front-end capture device, a second image whose capture time is earlier than that of the first image, to form an image frame sequence.
In another alternative implementation, the electronic device records all images captured by the front-end capture device. The electronic device can take an image designated by the user as the first image and form an image frame sequence together with a second image whose acquisition time is earlier than that of the first image.
The manner in which the image frame sequence is acquired is merely exemplary and is not particularly limited.
Step 102: the electronic device acquires target position information and target feature information identifying a target object from the first image.
In an optional implementation manner, the front-end acquisition device performs object identification on each acquired image, obtains position information and feature information of an object, and sends the position information and the feature information to the electronic device. The electronic equipment can directly acquire the target position information and the target characteristic information of the target object in the first image, which are identified from the first image by the front-end acquisition equipment.
In another alternative implementation, the electronic device may identify the first image, and identify target position information and target feature information of the target object in the first image from the first image.
Here, it is only exemplified how the electronic apparatus acquires the target position information and the target feature information for identifying the target object from the first image, and is not particularly limited.
Step 103: the electronic device acquires historical position information, historical feature information and historical tracking information identifying the tracking object from the second image, and predicts the predicted position information of each tracking object in the first image based on the historical position information of the tracking object.
Step 103 is explained in detail from step 1031 to step 1032.
Step 1031: the electronic device acquires historical position information, historical feature information and historical tracking information identifying the n tracking objects from the second image.
The following first describes how the historical position information, historical feature information and historical tracking information of a tracking object are recorded, and then describes the implementation of step 1031.
1) Recording of the historical position information, historical feature information and historical tracking information of a tracking object
In the application, when the electronic device receives each video frame, in addition to acquiring the position information and the feature information of the object identified from the video frame, the electronic device needs to associate the identified object with the object identified in the previous video frame and generate the tracking information of the object based on the association relationship. Wherein the information of the tracked object comprises: position information, feature information, and tracking information of a tracking object identified from the video frame.
For example, after receiving the first video frame, the electronic device recognizes the position information and feature information of 3 objects, i.e., object 1, object 2, and object 3, and records information related to tracking, such as the overlapping condition of the 3 objects.
After receiving the second video frame, the electronic device, assuming that two objects are identified, and assuming that it is determined that the first object of the two objects matches with the object 1 and the second object matches with the object 2, may record the position information and the feature information of the object 1 and the object 2 in the second video frame, respectively, and the overlapping condition of the object 1 and the object 2 in the second video frame.
Then, by the two video frames, the electronic device can determine that the trajectory of the object 1 is composed of the position of the object 1 in the first video frame and the position in the second video frame, the trajectory length is 2 frames, the number of lost frames is 0 frames, and the overlap condition can be represented by the overlap condition in the two video frames. Then, the electronic apparatus may form tracking information of the object 1 based on the track length, the number of lost frames, the overlap condition, and the like of the object 1, which are information related to tracking.
Similarly, the electronic device may determine that the trajectory of the object 2 is composed of the position of the object 2 in the first video frame and the position in the second video frame, the length of the trajectory is 2 frames, the number of lost frames is 0 frames, and the overlap condition may be represented by the overlap condition in the two video frames. Then, the electronic apparatus may form tracking information of the object 2 based on the track length of the object 2, the number of lost frames, the overlap condition, and the like, which are information related to tracking.
The electronic device may determine that the trajectory of the object 3 is formed by the position of the object 3 in the first video frame, the length of the trajectory is 1 frame, the number of lost frames is 0 frame, and the overlap condition may be represented by the overlap condition in the first video frame. Then, the electronic apparatus may form tracking information of the object 3 based on information related to tracking such as a track length, a number of lost frames, an overlap condition, and the like of the object 3.
Thus, the electronic device records position information, feature information, and tracking information of each tracking object identified from the received video frame.
2) Implementation manner of step 1031
In implementing step 1031, the electronic device may acquire, from the recorded position information, feature information and tracking information of each tracking object, the historical position information, historical feature information and historical tracking information of the tracking object identified from the second image.
Still taking the above example as an example, assuming that the third video frame is the first image, the electronic device may determine, from the information of the recorded tracking objects, the historical position information, the historical feature information, and the historical tracking information of the 3 tracking objects identified from the previous two video frames.
Specifically, the electronic device may obtain the historical position information, historical feature information and historical tracking information of the object 1 in the first video frame and the second video frame, respectively. The electronic device can acquire the historical position information, historical feature information and historical tracking information of the object 2 in the first video frame and the second video frame, respectively. The electronic device may obtain the historical position information, historical feature information and historical tracking information of the object 3 in the first video frame.
Step 1032: the electronic device predicts the predicted position information of each tracked object in the first image based on the historical position information of the tracked object.
In implementation, for each tracked object, the electronic device may input historical position information of the tracked object in at least one second image to the trained position prediction model to output predicted position information of the tracked object in the first image by the position prediction model.
For example, assume that the historical frames are video frame 1-video frame 10, and the first image is video frame 11. Assume that video frame 1-video frame 10 each contain object 1.
The electronic device may input historical position information of the object 1 in the video frames 1-10, respectively, to the position prediction model to output predicted position information of the object 1 in the video frame 11 by the position prediction model.
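The disclosure leaves the position prediction model unspecified; purely as a placeholder, the sketch below extrapolates the last two recorded boxes at constant velocity to illustrate the input and output of this step.

```python
import numpy as np

def predict_position(history):
    """Placeholder for the trained position prediction model: constant-velocity
    extrapolation over the last two recorded boxes, each box being (x, y, w, h)."""
    history = np.asarray(history, dtype=float)
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])

# object 1 observed in video frames 1-10; predict its box in video frame 11
boxes = [(100 + 5 * i, 200, 40, 80) for i in range(10)]
print(predict_position(boxes))        # [150. 200.  40.  80.]
```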
Step 104: and the electronic equipment inputs the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, the target characteristic information of the target object and the historical characteristic information of the tracked object into a trained similarity prediction model to obtain whether the target object and the tracked object are the same target.
The number of the target objects is at least one, and the number of the tracking objects is at least one.
The "obtaining whether the target object and the tracking object are the same target" may include the following cases:
obtaining whether a target object and a tracking object are the same target or not;
obtaining whether a target object and each tracking object in a plurality of tracking objects are the same target or not;
and obtaining whether each target object in the plurality of target objects and each tracking object in the plurality of tracking objects are the same target.
Optionally, the number of the target objects is m, m is an integer greater than or equal to 1, the number of the tracking objects is n, and n is an integer greater than or equal to 1.
Step 104 is explained in detail below through step 1041 to step 1044.
Step 1041: the electronic device inputs the target position information of the m target objects and the predicted position information of the n tracking objects into the trained similarity prediction model.
In implementation, in order to simplify the calculation amount of the similarity prediction model and to enable the input data format to be adapted to the similarity prediction model, the electronic device may first pre-process the target position information of the m target objects and the predicted position information of the n tracking objects, and input the pre-processed target position information of the m target objects and the pre-processed predicted position information of the n tracking objects to the trained similarity prediction model.
1) Preprocessing of target location information for m target objects
It is assumed that the present application uses the position information of the rectangular frame in which the target object is located to represent the target position information of the target object.
It is assumed that the position information of the rectangular frame may include: the vertex coordinates (x1, y1) of the upper left corner of the rectangular frame, the width w1 of the rectangular frame, and the height h1 of the rectangular frame. The rectangular frame position information can then be expressed as [x1, y1, w1, h1].
For each target object, the electronic device can normalize the target position information [x1, y1, w1, h1] of the target object, where the normalization formulas are as follows:
x1g=x1/W;
y1g=y1/H;
w1g=w1/W;
h1g=h1/H;
wherein, W is the width of a frame of image, and H is the height of a frame of image.
The normalized target position information of the target object may be represented as [x1g, y1g, w1g, h1g].
2) Preprocessing of predicted location information for n tracked objects
It is assumed that the present application uses the position information of the rectangular frame in which the tracking object is located to represent the predicted position information of the tracking object.
It is assumed that the position information of the rectangular frame may include: the vertex coordinates (x2, y2) of the upper left corner of the rectangular frame, the width w2 of the rectangular frame, and the height h2 of the rectangular frame. The predicted position information of the tracked object may then be represented as [x2, y2, w2, h2].

For each tracked object, the electronic device may normalize the predicted position information [x2, y2, w2, h2] of the tracked object, where the normalization formulas are as follows:
x2g=x2/W;
y2g=y2/H;
w2g=w2/W;
h2g=h2/H;
wherein, W is the width of a frame of image, and H is the height of a frame of image.
The normalized predicted position information of the tracked object may be represented as [x2g, y2g, w2g, h2g].
Then, the electronic device may combine the normalized target position information of each target object and the predicted position information of each tracking object into a matrix of (m × n) × 8, and then input the matrix into the trained similarity prediction model.
Wherein m × n represents the number of matching pairs of the target objects and the tracked objects. For example, assume that there are two target objects (i.e., m is 2): target object 1 and target object 2. There are 3 tracked objects (i.e., n is 3): tracked object 1, tracked object 2 and tracked object 3. The matching pairs formed by the target objects and the tracked objects then include 6 pairs (i.e., m × n): [target object 1, tracked object 1], [target object 1, tracked object 2], [target object 1, tracked object 3], [target object 2, tracked object 1], [target object 2, tracked object 2], and [target object 2, tracked object 3].
"8" in the above (m × n) × 8 indicates the number of dimensions of the position information in which the target position information is represented by [ x1g,y1g,w1g,h1g]These 4 dimensions represent the predicted position information represented by x2g,y2g,w2g,h2g]These 4 dimensions represent, so here the position information is represented by 8 dimensions。
Step 1042: target characteristic information of m target objects and historical characteristic information of n tracking objects of the electronic equipment are input into the trained similarity prediction model.
Step 1042 is explained in detail below through steps A1 to A3.
Step A1: the electronic equipment clusters the historical characteristic information of the tracked object to obtain a clustering result, wherein the clustering result comprises: at least one characteristic category and at least one clustering cluster corresponding to the characteristic category respectively; and the historical characteristic information in each cluster is matched with the characteristic category corresponding to the cluster.
In implementation, the feature information of one object typically includes a plurality of categories. For example, when the object is a person, the captured images of the walking person may show the front, the back or the side of the person. Clustering the historical feature information by category allows the subsequent similarity prediction model to calculate the feature similarity between the target object and the tracked object more accurately.
Specifically, the electronic device may cluster the historical feature information of the n tracked objects to obtain a clustering result, where the clustering result includes: k characteristic categories and k clustering clusters corresponding to the k characteristic categories respectively; and the historical characteristic information in each cluster is matched with the characteristic category corresponding to the cluster.
For example, as shown in fig. 2, a triangle represents a cluster center, and circles surrounding the triangle constitute a cluster. Suppose that the historical feature information of n tracking objects is clustered to obtain 3 cluster clusters, namely cluster 1, cluster 2 and cluster 3. The three cluster categories are respectively front, back and side features, so that cluster 1 contains front features of some tracked objects, cluster 2 contains back features of some tracked objects, and cluster 3 contains side features of some tracked objects.
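As an illustration of step A1, the sketch below clusters stored historical feature vectors into k = 3 clusters. The disclosure does not name a clustering algorithm, so k-means is used here purely as a stand-in, and the number and dimension of the feature vectors are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
historical_features = rng.normal(size=(50, 64))    # 50 stored feature vectors, 64-dim (assumed)

k = 3                                              # e.g. front, back and side feature categories
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(historical_features)
cluster_centers = kmeans.cluster_centers_          # k x 64 cluster centers, used for stitching in step A2
cluster_labels = kmeans.labels_                    # the cluster each historical feature vector belongs to
print(cluster_centers.shape, np.bincount(cluster_labels))
```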
Step A2: the electronic equipment can splice the target characteristic information of the m target objects with the clustering centers of the k clustering clusters, and input the splicing result to the trained similarity prediction model.
During splicing, the electronic equipment can splice the target characteristic information of the m target objects with the k clustering clusters respectively, and the splicing result is used as input characteristic information.
For example, as shown in fig. 3, assuming that k is 3, the electronic device may splice target feature information of m target objects with the cluster 1 to obtain a splicing result 1, splice target feature information of m target objects with the cluster 2 to obtain a splicing result 2, splice target feature information of m target objects with the cluster 3 to obtain a splicing result 3, and then combine the splicing result 1, the splicing result 2, and the splicing result 3 into input feature information.
Wherein the input feature information may be represented by a tensor of (m × n) × k × 2 dims.
Wherein m × n represents the number of matching pairs of the target objects and the tracked objects; see the description above, which is not repeated here.
k represents the number of cluster clusters.
2dims represents the number of dimensions of the feature information (including the target feature information and the historical feature information).
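A minimal sketch of the stitching in step A2 is given below. The disclosure does not state whether the k cluster centers are shared across all tracked objects or computed per tracked object; per-track cluster centers are assumed here, and the sizes are illustrative.

```python
import numpy as np

def build_feature_input(target_feats, track_cluster_centers):
    """target_feats: (m, dims) target feature vectors; track_cluster_centers:
    (n, k, dims) cluster centers of each tracked object's historical features
    (per-track centers are an assumption). Returns the (m*n) x k x 2dims tensor."""
    m, dims = target_feats.shape
    n, k, _ = track_cluster_centers.shape
    out = np.zeros((m * n, k, 2 * dims))
    for i in range(m):                     # target objects
        for j in range(n):                 # tracked objects
            for c in range(k):             # cluster centers
                out[i * n + j, c] = np.concatenate(
                    [target_feats[i], track_cluster_centers[j, c]])
    return out

feats = np.random.rand(2, 64)              # m = 2 targets, 64-dim features (assumed)
centers = np.random.rand(3, 3, 64)         # n = 3 tracks, k = 3 clusters each
print(build_feature_input(feats, centers).shape)   # (6, 3, 128)
```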
In addition, in order to reduce the calculation amount of the similarity prediction model, before clustering, the electronic device may perform normalization processing on the target feature information of the m target objects and the historical feature information of the n tracking objects.
For example, the target feature information and the historical feature information are generally represented by a multi-dimensional feature vector (e.g., a 64-dimensional feature vector or a 128-dimensional feature vector), and the electronic device may normalize the feature vector F representing the target feature information and the historical feature information in length, where the normalization processing formula is as follows:
F_norm = F / ||F||

wherein F_norm represents the result of normalizing the feature vector F in length, and ||F|| represents the length (L2 norm) of the feature vector F.
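A minimal sketch of this length normalization, assuming an L2 norm and NumPy; the small eps guard is an added assumption to avoid division by zero.

```python
# Length (L2) normalization of a feature vector F, matching F_norm = F / ||F||.
import numpy as np

def normalize_length(f: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    return f / (np.linalg.norm(f) + eps)  # eps guards against a zero vector
```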
Step A2: the electronic device may input the input feature information to the trained similarity prediction model.
In implementation, the electronic device may input the stitching results of the tensor representation of (m × n) × k × 2dims to the trained similarity prediction model.
Step 1043: the electronic device inputs historical tracking information for the n tracked objects into the trained similarity prediction model.
In order to simplify the calculation amount of the similarity prediction model, the electronic device may perform preprocessing, such as normalization processing, on the historical tracking information of the n tracked objects.
Specifically, the historical tracking information may include dimensions such as the track length, the number of lost frames, and the overlapping rate of the tracked object, so the electronic device may normalize each dimension in the historical tracking information.
For example, the electronic device may normalize the track length l of the tracked object by the following formula:
l_1g = l / Lmax

wherein l_1g represents the normalized track length, and Lmax represents the maximum allowed track length.
For example, the electronic device may normalize the number of lost frames t of the tracking object by the following formula:
t_1g = t / Tmax

wherein t_1g represents the normalized number of lost frames, and Tmax represents the maximum allowed number of lost frames.
The electronic device may then compose the normalized quantities into a matrix of (m × n) × q and input the matrix into the trained similarity prediction model.
Wherein, m × n represents the matching logarithm of the target object and the tracked object, which is specifically referred to the above description and is not described herein again.
q represents the number of dimensions of the tracking information. The track length, the number of lost frames, and the overlapping rate included in the tracking information are each one dimension of the tracking information. If the tracking information includes the 3 dimensions of track length, number of lost frames, and overlapping rate, then q is 3.
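The normalization and packing of the tracking information might be sketched as follows, assuming NumPy, q = 3 dimensions (track length, number of lost frames, overlapping rate), and illustrative values for Lmax and Tmax; clipping to [0, 1] is an added assumption.

```python
# A sketch of normalizing the tracking information and packing it into an
# (m*n) x q matrix; Lmax/Tmax values and the clipping are assumptions.
import numpy as np

def normalize_tracking_info(track_len, lost_frames, overlap, m, L_max=100.0, T_max=30.0):
    """track_len, lost_frames, overlap: length-n sequences for the n tracked objects."""
    track_len = np.asarray(track_len, dtype=float)
    lost_frames = np.asarray(lost_frames, dtype=float)
    overlap = np.asarray(overlap, dtype=float)
    feats = np.stack([
        np.clip(track_len / L_max, 0.0, 1.0),    # l_1g = l / Lmax
        np.clip(lost_frames / T_max, 0.0, 1.0),  # t_1g = t / Tmax
        overlap,                                 # overlapping rate is already in [0, 1]
    ], axis=1)                                   # (n, q) with q = 3
    # Tile so that each of the m*n (target, track) pairs carries its track's information.
    return np.tile(feats, (m, 1))                # (m*n, q)
```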
Step 1044: the similarity prediction model can obtain whether each target object and each tracking object are the same target or not through the following steps.
The similarity prediction model refers to a model trained with supervised learning. For example, the similarity prediction model may be a neural network; the similarity prediction model is only described here by way of example and is not specifically limited.
As shown in fig. 4, the similarity prediction model may determine the position similarity between each target object and each tracked object based on the target position information of each target object and the predicted position information of each tracked object. The similarity prediction model can also determine the feature similarity between each target object and each tracked object based on the input feature information, and convolve the historical tracking information of each tracked object to obtain the attention probability mask.
Then, the similarity prediction model can fuse the position similarity and the feature similarity between each target object and each tracked object, perform mask operation on the fusion result and the attention probability mask to obtain the similarity between each target object and each tracked object, and determine a similarity threshold value based on the model parameters of the similarity prediction model.
Step 1044 is explained in detail by step B1 through step B6.
Step B1: the similarity prediction model may determine first feature information representing a degree of positional similarity between each target object and each tracked object based on the target position information of each target object and the predicted position information of each tracked object.
In implementation, as described above, the electronic device may represent the target location information of each target object and the predicted location information of each tracked object with a matrix of (m × n) × 8, and input the matrix of (m × n) × 8 to the similarity prediction model.
The similarity prediction model may perform a fully-connected computation at least once, for example 3 times, on the matrix of (m × n) × 8 to obtain a position similarity feature matrix of (m × n) × p; the position similarity feature matrix of (m × n) × p is the first feature information and may represent the position similarity between each target object and each tracked object.
Where p represents the number of output elements of the fully-connected layer, or in other words, the number of dimensions of the fully-connected result.
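A sketch of this position-similarity branch, assuming PyTorch; the choice of three fully-connected layers with ReLU activations and the width p = 32 are illustrative assumptions consistent with the (m × n) × 8 input and (m × n) × p output described above.

```python
# Position-similarity branch: (m*n, 8) pair positions -> (m*n, p) features.
import torch
import torch.nn as nn

class PositionBranch(nn.Module):
    def __init__(self, in_dim: int = 8, p: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, p), nn.ReLU(),
            nn.Linear(p, p), nn.ReLU(),
            nn.Linear(p, p),
        )

    def forward(self, pos_pairs: torch.Tensor) -> torch.Tensor:
        # pos_pairs: (m*n, 8) -> position-similarity features of shape (m*n, p)
        return self.net(pos_pairs)
```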
Step B2: The similarity prediction model may also determine second feature information indicating the feature similarity between each target object and each tracked object based on the target feature information of each target object and the historical feature information of each tracked object.
As can be seen from the above description, the target feature information of each target object and the historical feature information of each tracked object are represented by the concatenation result, so that the similarity prediction model receives the concatenation result for representing the target feature information of each target object and the historical feature information of each tracked object.
As indicated above, the stitching result may be represented by a tensor of (m × n) × k × 2 dims; wherein m × n represents the matching logarithm of the target object and the tracking object, k represents the number of the cluster clusters, and 2dims represents the dimension number of the characteristic information.
The similarity prediction model may first perform a convolution operation on the splicing result in the dimension corresponding to 2dims to obtain a first tensor, where the first tensor is represented by (m × n) × k × p and p represents the number of dimensions of the convolution operation result. Then, the similarity prediction model may exchange the elements in the dimension corresponding to k and the dimension corresponding to p in the first tensor to obtain a second tensor, where the second tensor is represented by (m × n) × p × k. Then, the similarity prediction model may perform a convolution operation on the second tensor in the dimension corresponding to k to obtain a tensor of (m × n) × p × 1; since the third dimension of this tensor is 1, the obtained tensor of (m × n) × p is in fact an apparent similarity feature matrix of (m × n) × p. This apparent similarity feature matrix of (m × n) × p is the second feature information and may represent the feature similarity between each target object and each tracked object.
It should be noted that the similarity prediction model performs a convolution operation on the input feature information in the dimension corresponding to 2dims, so that the similarity calculation between the target feature information and the k clustering centers can be realized. The similarity prediction model then exchanges the elements in the dimension corresponding to k and the dimension corresponding to p in the first tensor to obtain the second tensor, and performs a convolution operation on the second tensor in the dimension corresponding to k, so that the fusion of the similarity results between the target feature information and the k clustering centers can be realized. In this way, the method performs similarity calculation between the target feature information and the clustering centers of different categories and superimposes the similarity results, so that the calculated feature similarity is more accurate.
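The two convolution operations could be realized as in the following sketch, assuming PyTorch and 1 × 1 Conv1d layers; the exchange of the k and p dimensions is expressed here as a transpose so that k becomes the channel axis of the second convolution. The layer types are assumptions consistent with the stated tensor shapes.

```python
# Feature-similarity branch: (m*n, k, 2d) -> (m*n, p) per the shapes above.
import torch
import torch.nn as nn

class FeatureBranch(nn.Module):
    def __init__(self, feat_dim: int = 64, k: int = 3, p: int = 32):
        super().__init__()
        # Maps the 2d concatenated features of each (pair, cluster) entry to p values.
        self.conv_feat = nn.Conv1d(2 * feat_dim, p, kernel_size=1)
        # Fuses the k per-cluster similarity results into a single value.
        self.conv_cluster = nn.Conv1d(k, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (m*n, k, 2d) -> (m*n, 2d, k) so the 2d axis acts as channels.
        t1 = self.conv_feat(x.transpose(1, 2))      # (m*n, p, k)
        # Treating k as channels for the second convolution plays the role of the
        # patent's exchange of the k and p dimensions before the second convolution.
        t2 = self.conv_cluster(t1.transpose(1, 2))  # (m*n, 1, p)
        return t2.squeeze(1)                        # (m*n, p)
```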
Step B3: and the similarity prediction model convolves the historical tracking information of each tracked object to obtain an attention probability mask.
As can be seen from the above description, the historical tracking information may be represented by a matrix of (m × n) × q.
The similarity prediction model may perform a convolution operation (e.g., a 1 × 1 convolution operation) on the matrix of (m × n) × q after receiving it. The similarity prediction model may then map the convolution result to [0,1], resulting in the attention probability mask.
In mapping, the similarity prediction model may map the convolution result to [0,1] through a sigmoid function (a kind of function), and here, the mapping manner is only exemplarily illustrated and is not specifically limited.
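A sketch of the attention-probability-mask branch, assuming PyTorch; a single linear layer stands in for the 1 × 1 convolution on the (m × n) × q matrix, which is an equivalent formulation rather than the patent's exact layer.

```python
# Attention mask branch: (m*n, q) tracking information -> mask values in [0, 1].
import torch
import torch.nn as nn

class AttentionMask(nn.Module):
    def __init__(self, q: int = 3):
        super().__init__()
        self.proj = nn.Linear(q, 1)   # equivalent to a 1x1 convolution over the q values

    def forward(self, track_info: torch.Tensor) -> torch.Tensor:
        # track_info: (m*n, q) -> attention probability mask of shape (m*n, 1)
        return torch.sigmoid(self.proj(track_info))
```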
Step B4: the similarity prediction model can fuse the position similarity and the feature similarity between each target object and each tracked object, and perform mask operation on the fusion result and the attention probability mask to obtain the similarity between each target object and each tracked object.
The similarity prediction model can splice the position similarity and the feature similarity between each target object and each tracked object. And then, fusing the spliced position similarity and the feature similarity by the similarity prediction model in a full connection mode at least once.
The similarity prediction model may perform a mask operation on the fusion result and the attention probability mask (for example, the fusion result may be multiplied by the attention probability mask), perform a fully-connected operation on the mask operation result, and map the fully-connected operation result to the [0,1] interval, so as to obtain the similarity between each target object and each tracked object.
The similarity between each target object and each tracking object can be represented by a similarity matrix of m × n, the size of the similarity matrix is m × n, and each element in the similarity matrix represents the similarity between one target object and one tracking object.
For example, in the m × n similarity matrix shown in fig. 4, m is 4 and n is 4, which indicates that there are 4 tracked objects and 4 target objects in this example.
The value of the element in the first row and the first column of the matrix is 0.11, which means that the similarity between the first target object and the first tracking object is 0.11.
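Putting the fusion and mask operation together, a sketch of this head might look as follows, assuming PyTorch; the layer sizes and the single fully-connected fusion layer are illustrative assumptions.

```python
# Fusion head: concatenate the two similarity features, fuse them, apply the
# attention mask, and map the result to a [0, 1] similarity matrix of shape (m, n).
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, p: int = 32):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * p, p), nn.ReLU())
        self.score = nn.Linear(p, 1)

    def forward(self, pos_feat, app_feat, attn_mask, m, n):
        # pos_feat, app_feat: (m*n, p); attn_mask: (m*n, 1)
        fused = self.fuse(torch.cat([pos_feat, app_feat], dim=-1))  # (m*n, p)
        masked = fused * attn_mask                                  # mask operation
        sim = torch.sigmoid(self.score(masked))                     # (m*n, 1), in [0, 1]
        return sim.view(m, n)                                       # similarity matrix
```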
Step B5: and the similarity prediction model acquires a trained similarity threshold value.
In the application, the similarity threshold is used as a learnable parameter of the similarity prediction model to construct the similarity prediction model, so that the similarity threshold can be adjusted in the training process of the similarity prediction model.
For example, during training, the similarity prediction model obtains a similarity matrix and a similarity threshold value based on an input sample, then the similarity prediction model uses the similarity threshold value to perform extended filling on the similarity matrix to obtain a correlation result between each detection object and a tracking object, calculates an error between the obtained correlation result and an actual correlation result, and reversely transfers the error to the similarity prediction model, so that the similarity prediction model adjusts a parameter of the model based on the error. The specific training process can be referred to as described later, and is only briefly described here.
When obtaining the similarity threshold, the similarity prediction model may determine a model parameter representing the similarity threshold from the model parameters of the model, and use the determined model parameter as the similarity threshold.

Of course, in practical applications, the similarity threshold may also be represented by a plurality of parameters of the similarity prediction model, and the similarity prediction model may obtain the similarity threshold by performing an operation on the plurality of parameters according to a preset rule.
Here, the determination method of the similarity threshold is only exemplified and not specifically limited.
It should be noted that: as can be seen from the above description, on one hand, the similarity prediction model takes into account a plurality of dimensions such as position similarity and feature similarity when calculating the similarity between each target object and each tracked object, so that the calculated similarity is more accurate. In addition, compared with a mode of obtaining the similarity by weighting and calculating the similarity of multiple dimensions (such as position similarity, feature similarity and the like), the method and the device adopt a supervised similarity prediction model (such as a neural network) to fuse the similarity of the multiple dimensions to obtain the final similarity, so that the calculation of the final similarity is not limited to linear calculation, and the calculation result is more accurate.
On the other hand, when the similarity is calculated, the similarity prediction model also considers tracking information used for representing the incidence relation of the same object in different frames, and the tracking information is used as a mask to perform mask operation with the fused position similarity and feature similarity, so that the similarity obtained by the prediction model is more accurate.
In a third aspect, the similarity threshold is not a manually set threshold, but is used as a parameter of the similarity prediction model, so that the similarity threshold is continuously adjusted along with the training of the similarity prediction model, and the similarity threshold becomes a learnable similarity threshold, which is beneficial to subsequently determining the association relationship between the target object and the tracked object.
Step B6: the electronic equipment determines and outputs whether each target object and each tracking object are the same target or not based on the trained similarity threshold and the similarity between each target object and each tracking object.
When implemented, the similarity prediction model outputs a similarity matrix representing the similarity between each target object and each tracked object. The size of the similarity matrix is m x n, and each element in the similarity matrix represents the similarity of one target object and one tracking object.
The electronic device may expand the similarity matrix using the similarity threshold value, so that the expanded similarity matrix includes the similarity threshold value, set a weight for each row of elements based on the values of the elements in that row of the expanded similarity matrix, and set a weight for each column of elements based on the values of the elements in that column of the expanded similarity matrix. Then, the electronic device may perform minimum flow calculation on the expanded similarity matrix based on the weights set for each row element and each column element to obtain a matching value of each target object and each tracked object, and obtain an association relationship between each target object and each tracked object based on the matching value of each target object and each tracked object.
Step B6 is described in detail below by steps B61 through B64.
Step B61: and expanding the similarity matrix by adopting a similarity threshold value, so that the expanded similarity matrix comprises the similarity threshold value.
In implementation, the electronic device may extend the m × n similarity matrix to an M × N similarity matrix using preset values for representing invalid bits. Here M denotes a preset maximum detectable number of target objects and N denotes a preset maximum trackable number of tracking objects.

For example, as shown in fig. 5, a in fig. 5 represents the m × n similarity matrix, and the electronic device may set the value of each element from the m-th row to the M-th row and from the n-th column to the N-th column to 0, thereby forming an M × N similarity matrix, which is shown as b in fig. 5.

The original m × n similarity matrix is called the effective region of the M × N matrix and participates in the subsequent minimum flow calculation, while the region filled with 0 in the M × N matrix is called the invalid region and does not participate in the subsequent minimum flow calculation.

The m × n similarity matrix is filled and expanded into an M × N similarity matrix with preset values representing invalid bits mainly in order to make the length and width of the similarity matrix corresponding to each frame consistent.
After obtaining the M × N similarity matrix, the electronic device may expand the M × N similarity matrix using a similarity threshold value to obtain a 2M × 2N similarity matrix.
For example, as shown in fig. 5, when expanding, the electronic device may perform a pad operation (a filling operation) below and to the right of the M × N similarity matrix (shown as b in fig. 5), where the value used for padding is the similarity threshold. In other words, the electronic device sets the value of each element from the M-th row to the 2M-th row and from the N-th column to the 2N-th column as the similarity threshold, thereby forming a similarity matrix of 2M × 2N (shown as c in fig. 5).
The region filled with the similarity threshold in the 2M × 2N matrix is called a pad region, and participates in the subsequent minimum flow calculation.
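The two-stage expansion could be sketched as follows, assuming NumPy; M and N are the preset maxima, and theta stands for the learnable similarity threshold value.

```python
# Two-stage expansion: m x n -> M x N (zeros in the invalid region) -> 2M x 2N
# (similarity threshold theta in the pad region).
import numpy as np

def expand_similarity(sim: np.ndarray, M: int, N: int, theta: float) -> np.ndarray:
    m, n = sim.shape
    padded = np.zeros((M, N), dtype=sim.dtype)                  # invalid region filled with 0
    padded[:m, :n] = sim                                        # effective region
    expanded = np.full((2 * M, 2 * N), theta, dtype=sim.dtype)  # pad region filled with theta
    expanded[:M, :N] = padded
    return expanded                                             # 2M x 2N similarity matrix
```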
Step B62: the electronic device sets a weight for each row of elements based on the values of the elements in each row of the expanded similarity matrix, and sets a weight for each column of elements based on the values of the elements in each row of the expanded similarity matrix.
When setting the weights, in order to enable the valid regions and the pad regions in 2M × 2N to participate in the subsequent minimum flow calculation and enable the invalid regions not to participate in the minimum flow calculation, the electronic device may set the weights of the rows and columns of the valid regions and the pad regions to 1 and set the weights of the rows and columns of the invalid regions to 0.
In implementation, for each row of elements in the 2M × 2N matrix, if the row contains an element of the m × n similarity matrix, the weight of the row is set to 1; if the row contains no element of the m × n similarity matrix but contains 0 and the similarity threshold, the weight of the row is set to 0; and if the row contains only the similarity threshold, the weight of the row is set to 1.

For each column of elements in the 2M × 2N matrix, if the column contains an element of the m × n similarity matrix, the weight of the column is set to 1; if the column contains no element of the m × n similarity matrix but contains 0 and the similarity threshold, the weight of the column is set to 0; and if the column contains only the similarity threshold, the weight of the column is set to 1.
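The weight rule above might be implemented as in the following sketch, assuming NumPy; rows and columns that contain elements of the original m × n matrix, or only the similarity threshold, receive weight 1, and the purely invalid ones receive weight 0.

```python
# Row/column weights for the 2M x 2N expanded matrix, following the rule above.
import numpy as np

def row_col_weights(m, n, M, N):
    row_w = np.zeros(2 * M)
    col_w = np.zeros(2 * N)
    row_w[:m] = 1          # rows containing elements of the m x n matrix
    row_w[M:] = 1          # rows containing only the similarity threshold
    col_w[:n] = 1          # columns containing elements of the m x n matrix
    col_w[N:] = 1          # columns containing only the similarity threshold
    return row_w, col_w
```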
Step B63: and the electronic equipment performs minimum flow calculation on the expanded similarity matrix based on the weight set for each row element and each column element to obtain the matching value of each target object and each tracked object.
The minimum flow calculation may be a Hungarian calculation or an EMD (Earth Mover's Distance) calculation; the minimum flow algorithm is only exemplarily described here and is not specifically limited.
As shown in fig. 6, in implementation, the electronic device may perform minimum flow calculation on the expanded 2M × 2N similarity matrix based on the weights set for each row element and each column element, to obtain a matching value matrix shown in fig. 6, where each element of the matching value matrix may represent a matching value of one tracking object and one target object.
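A sketch of the minimum flow step using the Hungarian algorithm from SciPy (scipy.optimize.linear_sum_assignment); the patent names Hungarian and EMD calculations but does not prescribe a library, so the SciPy call, the maximize direction, and the exclusion of zero-weight rows and columns by simple selection are all assumptions.

```python
# Hungarian matching on the expanded similarity matrix, restricted to the rows
# and columns whose weight is 1; the result is a binary matching matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(expanded: np.ndarray, row_w: np.ndarray, col_w: np.ndarray) -> np.ndarray:
    rows = np.where(row_w == 1)[0]
    cols = np.where(col_w == 1)[0]
    sub = expanded[np.ix_(rows, cols)]            # only weighted rows/columns take part
    r_idx, c_idx = linear_sum_assignment(sub, maximize=True)
    matching = np.zeros_like(expanded, dtype=int)
    matching[rows[r_idx], cols[c_idx]] = 1        # 1 marks a matched (target, track) pair
    return matching
```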
It should be noted that, in the existing method, after the similarity matrix of each target object and each tracked object is obtained, minimum flow solution is performed first to obtain the matching value of each target object and each tracked object. Then, for the matching value of each target object and each tracked object, comparing the matching value with a preset threshold, if the matching value is greater than the preset threshold, determining that the target object is associated with the tracked object, and if the matching value is less than the preset threshold, determining that the target object is not associated with the tracked object.
This is disadvantageous in that: because the similarity threshold is not considered in the process of solving the minimum flow in the existing method, the final correlation result is not very accurate.
According to the method, a similarity matrix is expanded and filled by adopting a similarity threshold, the weight values of each row and each column are set, and the minimum flow calculation is carried out on the expanded and filled similarity matrix based on the weight values. The matching result is made more accurate since the learnable similarity threshold is considered in the minimum flow calculation.
Step B64: The electronic device obtains, based on the matching value of each target object and each tracked object output by the similarity prediction model, whether each target object and each tracked object are the same target (namely, whether each target object is associated with each tracked object).
During implementation, for each target object, if the matching value of the target object and any tracked object is the first preset value, it is determined that the target object and that tracked object are the same target, namely, the target object is associated with that tracked object;

if the matching values of the target object and all the tracked objects are the second preset value, the target object and all the tracked objects are not the same target, namely, the target object is not associated with any tracked object.
The first preset value may be represented by 1, and the second preset value may be represented by 0, where the first preset value and the second preset value are only exemplarily illustrated and are not specifically limited.
For example, as described above, the matching value of each target object and each tracking object may be represented by the matching value matrix in fig. 6.
Taking the matching value matrix in fig. 6 as an example, since the element in the 1 st row and the 3 rd column is the first preset value (i.e. 1), it is determined that the first tracking object is associated with the third target object.
Since the element in row 2 and column 4 is 1, it is determined that the 2 nd tracked object and the 4 th target object are the same target, i.e. the 2 nd tracked object is associated with the 4 th target object.
Since the element in row 3 and column 2 is 1, it is determined that the 3 rd tracked object and the 2 nd target object are the same target, i.e., the 3 rd tracked object is associated with the 2 nd target object.
Since the element in row 4 and column 1 is 1, it is determined that the 4 th tracked object and the 1 st target object are the same target, i.e., the 4 th tracked object is associated with the 1 st target object.
Since the 5 th column of effective area elements are all 0 and the pad area element is 1, it is determined that the 5 th target object and all the tracked objects are not the same target, that is, the 5 th target object is not associated with all the tracked objects.
In addition, in the application, the electronic device may determine, in addition to the association relationship between each target object and the tracking object, a new appearing object in the first image and a disappearing tracking object in the first image.
Specifically, for each target object, if the target object is not associated with all tracking objects, the electronic device may determine that the target object is a new appearing object. For each tracked object, if the tracked object is not associated with all target objects, the electronic device may determine that the tracked object disappears in the first image.
For example, the matching value matrix in fig. 6 is taken as an example.
Since the 5 th column of effective area elements are all 0, and the 5 th column of pad area elements are 1, the electronic device determines that the 5 th target object and all the tracked objects are not the same object, and further determines that the 5 th target object is a new object.
Since the effective area elements in the 5 th row are all 0, and the pad area element in the 5 th row is 1, the electronic device determines that the 5 th tracked object and all the target objects are not the same target, and further determines that the 5 th tracked object disappears from the first image.
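The read-out of associations, newly appearing objects, and disappeared tracked objects might be sketched as follows, assuming NumPy; the orientation follows the earlier sketches (rows are target objects, columns are tracked objects), which is the transpose of the fig. 6 illustration above.

```python
# Reading associations, new objects, and disappeared tracks out of the binary
# matching matrix; only the top-left m x n block corresponds to real pairs.
import numpy as np

def read_associations(matching: np.ndarray, m: int, n: int):
    valid = matching[:m, :n]                  # rows: target objects, cols: tracked objects
    pairs = list(zip(*np.nonzero(valid)))     # (target_idx, track_idx) same-target pairs
    new_targets = [i for i in range(m) if valid[i, :].sum() == 0]   # unmatched targets
    lost_tracks = [j for j in range(n) if valid[:, j].sum() == 0]   # tracks gone from the frame
    return pairs, new_targets, lost_tracks
```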
Further, in the present application, the electronic device may train the similarity prediction model as follows. Wherein the similarity threshold is a learnable parameter in the similarity prediction model.
Prior to training, the present application provides sample-label pairs.
Wherein, the sample includes: the position information and the feature information of at least one target object identified in the N-th frame, the feature information of at least one tracking object identified from the video frames before the N-th frame, the predicted position information of each tracking object in the N-th frame, and the tracking information of each tracking object.
The label is a matching value of each target object and each tracking object.
The electronic device may input the sample and the tag to the similarity prediction model, and the similarity prediction model may determine, in the manner described above, first feature information indicating a degree of position similarity between each target object and each tracked object based on the target position information of each target object and the predicted position information of each tracked object, and determine, based on the target feature information of each target object and the historical feature information of each tracked object, second feature information indicating a degree of feature similarity between each target object and each tracked object, and convolve the historical tracking information of each tracked object to obtain the attention probability mask. And the similarity prediction model fuses the position similarity and the feature similarity between each target object and each tracked object, and performs mask operation on the fusion result and the attention probability mask to obtain a similarity matrix for expressing the similarity between each target object and each tracked object.
Then, the similarity prediction model may call a model parameter representing a similarity threshold and the obtained similarity matrix to obtain a matching value between each target object and each tracked object.
Specifically, the similarity prediction model may expand the similarity matrix by using a similarity threshold value as a similarity prediction model parameter, set row and column weights for the expanded similarity matrix, perform minimum flow calculation on the expanded similarity matrix by using the set weights, and calculate a matching value between each target object and each tracked object.
Then, the electronic device can calculate cross entropy loss between the matching value of each target object and each tracking object obtained by the similarity prediction model and the label, and the cross entropy loss is used as an error and is transmitted back to the similarity prediction model, so that the similarity prediction model adjusts the model parameters thereof to achieve the purpose of training.
In the training process, since the similarity threshold is a learnable parameter of the similarity prediction model, the similarity prediction model also adjusts the similarity threshold when adjusting the model parameter of the similarity prediction model. In other words, the similarity threshold is learned continuously with the model training, so the similarity threshold is a learnable similarity threshold.
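A heavily simplified training sketch, assuming PyTorch; it only illustrates how a similarity threshold can live inside the model as a learnable parameter and be updated through the loss. The soft comparison with a fixed sharpness of 10 is an added assumption, and the full expansion/minimum-flow pipeline the patent describes is not reproduced here.

```python
# Minimal illustration of a learnable similarity threshold trained with a
# cross-entropy loss; not the patent's full matching pipeline.
import torch
import torch.nn as nn

class ThresholdedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(0.5))  # learnable similarity threshold

    def forward(self, sim_matrix: torch.Tensor) -> torch.Tensor:
        # Soft comparison against the threshold keeps the step differentiable.
        return torch.sigmoid((sim_matrix - self.threshold) * 10.0)

model = ThresholdedModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sim, label = torch.rand(4, 5), torch.randint(0, 2, (4, 5)).float()
loss = nn.functional.binary_cross_entropy(model(sim), label)  # error against the label
loss.backward()
opt.step()  # the threshold is adjusted together with the other model parameters
```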
As can be seen from the above description, in the first aspect, in the present application, the similarity between the target object and the tracked object is predicted by a supervised similarity prediction model based on the related information extracted for the target object and the historical tracked object, and whether the target object and the tracked object are the same target is then determined according to that similarity and a similarity threshold obtained by training together with the similarity prediction model. Since the similarity threshold is not set manually and the similarity is not calculated with a manually set metric function, determining whether the target object and the tracked object are the same target is more accurate.
In the second aspect, when the similarity between each target object and each tracked object is calculated, the similarity prediction model takes multiple dimensions such as position similarity and feature similarity into consideration when calculating the similarity between each target object and each tracked object, so that the calculated similarity is more accurate. In addition, compared with a mode of obtaining the similarity by weighting and calculating the similarity of multiple dimensions (such as position similarity, feature similarity and the like), the method and the device adopt a supervised similarity prediction model (such as a neural network) to fuse the similarity of the multiple dimensions to obtain the final similarity, so that the calculation of the final similarity is not limited to linear calculation, and the calculation result is more accurate.
In addition, when the similarity is calculated, the similarity prediction model also considers tracking information used for representing the incidence relation of the same object in different frames, and the tracking information is used as a mask to perform mask operation with the fused position similarity and feature similarity, so that the similarity obtained by the prediction model is more accurate.
In addition, the similarity threshold is not a manually set threshold, but is used as a parameter of the similarity prediction model, so that the similarity threshold can be continuously adjusted along with the training of the similarity prediction model, and the similarity threshold becomes a learnable similarity threshold, which is beneficial to subsequently determining the association relationship between the target object and the tracking object.
In the third aspect, when determining the incidence relation between each target object and each tracked object based on the similarity matrix and the similarity threshold, the similarity matrix is expanded and filled by adopting the similarity threshold, the weight values of each row and each column are set, and the minimum flow calculation is performed on the expanded and filled similarity matrix based on the weights. The matching result is made more accurate since the learnable similarity threshold is considered in the minimum flow calculation.
Referring to fig. 7, fig. 7 is a hardware structure diagram of an electronic device according to an exemplary embodiment of the present application.
The electronic device includes: a communication interface 701, a processor 702, a machine-readable storage medium 703, and a bus 704; the communication interface 701, the processor 702, and the machine-readable storage medium 703 are in communication with one another via a bus 704. The processor 702 may perform the object association methods described above by reading and executing machine-executable instructions in the machine-readable storage medium 703 corresponding to the object association control logic.
Referring to fig. 8, fig. 8 is a block diagram illustrating an object association apparatus according to an exemplary embodiment of the present application, where the apparatus is applicable to an electronic device and may include the following units.
An acquisition unit 801 configured to acquire an image frame sequence including a first image and a second image; wherein the acquisition time of the second image is earlier than the acquisition time of the first image; acquiring target position information and target characteristic information of a target object identified from a first image; acquiring historical position information, historical characteristic information and historical tracking information of the tracking object identified from the second image;
a prediction unit 802 that predicts predicted position information of each tracking object in the first image based on the historical position information of the tracking object;
an output unit 803, configured to input the target position information of the target object, the predicted position information of the tracked object, historical tracking information, and the target feature information of the target object and the historical feature information of the tracked object into a trained similarity prediction model, so as to obtain whether the target object and the tracked object are the same target.
Optionally, the number of the target objects is at least one, and the number of the tracking objects is at least one;
the output unit 803 is configured to input, to the trained similarity prediction model, the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, the target feature information of the target object, and the historical feature information of the tracked object, when inputting the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, and the target feature information of the target object to the trained similarity prediction model to obtain whether the target object and the tracked object are the same target; the similarity prediction model determines first feature information used for representing the similarity of the predicted positions between each target object and each tracked object based on the target position information of each target object and the predicted position information of each tracked object, determines second feature information used for representing the similarity of the features between each target object and each tracked object based on the target feature information of each target object and the historical feature information of each tracked object, and convolves the historical tracking information of each tracked object to obtain an attention probability mask; the similarity prediction model fuses first characteristic information and second characteristic information between each target object and each tracked object, and performs mask operation on a fusion result and the attention probability mask to obtain the similarity between each target object and each tracked object; and the similarity prediction model determines and outputs whether each target object and each tracking object are the same target or not based on the trained similarity threshold and the similarity between each target object and each tracking object.
Optionally, the output unit 803 is configured to, when the target feature information of the target object and the historical feature information of the tracked object are input to a trained similarity prediction model, perform clustering on the historical feature information of the tracked object to obtain a clustering result, where the clustering result includes: at least one characteristic category and at least one clustering cluster corresponding to the characteristic category respectively; the historical characteristic information in each cluster is matched with the characteristic category corresponding to the cluster; and splicing the target characteristic information of the target object with the clustering center of at least one clustering cluster, and inputting a splicing result to the trained similarity prediction model.
Optionally, the number of the target objects is m, the number of the tracking objects is n, and the number of the clustering clusters is k;
the input feature information is represented by a tensor of (m × n) × k × 2 dims; wherein m × n represents the matching logarithm of the target object and the tracking object, k represents the number of clustering clusters, and 2dims represents the dimension number of the characteristic information;
the output unit 803, when determining second feature information used for representing feature similarity between each target object and each tracked object based on the target feature information of each target object and the historical feature information of each tracked object, is configured to perform convolution operation on the concatenation result in a dimension corresponding to 2dims to obtain a first scalar, where the first scalar is represented by (m × n) × k × p, and p represents the dimension number of the convolution operation result; exchanging elements on the dimensionality corresponding to k and the dimensionality corresponding to p in the first tensor to obtain a second tensor, wherein the second tensor is expressed by (m × n) × p × k; and performing convolution operation on the second tensor on the dimensionality corresponding to the k, and determining the obtained result as second feature information used for expressing feature similarity between each target object and each tracked object.
Optionally, the similarity prediction model represents the similarity between each target object and each tracked object through a similarity matrix, the size of the similarity matrix is m × n, and each element in the similarity matrix represents the similarity between one target object and one tracked object;
the output unit 803, when determining and outputting whether each target object and each tracked object are the same target based on the trained similarity threshold and the similarity between each target object and each tracked object, is configured to extend the similarity matrix by using the similarity threshold, so that the extended similarity matrix includes the similarity threshold; setting a weight for each row of elements based on the values of the elements in each row of the expanded similarity matrix, and setting a weight for each column of elements based on the values of the elements in each column of the expanded similarity matrix; performing minimum flow calculation on the expanded similarity matrix based on the weight set for each row element and each column element to obtain a matching value of each target object and each tracked object; and obtaining and outputting the incidence relation between each target object and each tracking object based on the matching value of each target object and each tracking object.
Optionally, the number of the target objects is m, and the number of the tracking objects is n; the size of the similarity matrix is m x n;
the output unit 803 is configured to, when expanding the similarity matrix by using the similarity threshold, expand the m × n similarity matrix to obtain an M × N similarity matrix, wherein M represents the maximum number of detectable target objects and N represents the maximum number of trackable tracking objects; and expand the M × N similarity matrix by adopting the similarity threshold value to obtain a 2M × 2N similarity matrix.
Optionally, the output unit 803, when obtaining whether each target object and each tracked object are the same target based on the matching value of each target object and each tracked object, is configured to determine, for each target object, that the target object and any tracked object are the same target if the matching value of the target object and that tracked object is the first preset value; and to determine that the target object and all the tracked objects are not the same target if the matching values of the target object and all the tracked objects are the second preset value.
Optionally, the output unit 803 is further configured to, for each target object, determine that the target object is a newly appearing object if the target object and all the tracked objects are not the same target;
and for each tracking object, if the tracking object and all the target objects are not the same target, determining that the tracking object disappears in the first image.
Optionally, the similarity threshold is a model parameter of the similarity prediction model, and the similarity threshold is obtained by training with the similarity prediction model.
In addition, the present application also provides a computer-readable storage medium, in which a computer program is stored, and the computer program is executed by a processor to implement the object association method.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
A computer-readable storage medium as referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: volatile memory, non-volatile memory, or similar storage media. In particular, the computer-readable storage medium may be a RAM (Random Access Memory), a flash memory, a storage drive (e.g., a hard disk drive), a solid state disk, any type of storage disk (e.g., a compact disk, a DVD, etc.), or similar storage medium, or a combination thereof.
Further, the present application also provides a computer program, which is stored in a computer-readable storage medium and causes a processor to implement the above-described object association method when the processor executes the computer program.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (11)

1. An object association method, characterized in that the method comprises:
acquiring an image frame sequence comprising a first image and a second image; wherein the acquisition time of the second image is earlier than the acquisition time of the first image;
acquiring target position information and target characteristic information of a target object identified from a first image;
acquiring historical position information, historical characteristic information and historical tracking information of the tracking object identified from the second image, and predicting the predicted position information of the tracking object in the first image based on the historical position information of each tracking object;
inputting the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, the target characteristic information of the target object and the historical characteristic information of the tracked object into a trained similarity prediction model to obtain whether the target object and the tracked object are the same target.
2. The method according to claim 1, wherein the number of the target objects is at least one, and the number of the tracking objects is at least one;
inputting the target position information of the target object, the predicted position information of the tracked object, the historical tracking information, the target feature information of the target object and the historical feature information of the tracked object into a trained similarity prediction model to obtain whether the target object and the tracked object are the same target, wherein the method comprises the following steps:
inputting target position information of the target object, predicted position information of the tracked object, historical tracking information, target characteristic information of the target object and historical characteristic information of the tracked object into a trained similarity prediction model;
the similarity prediction model determines first feature information used for representing the similarity of the predicted positions between each target object and each tracked object based on the target position information of each target object and the predicted position information of each tracked object, determines second feature information used for representing the similarity of the features between each target object and each tracked object based on the target feature information of each target object and the historical feature information of each tracked object, and convolves the historical tracking information of each tracked object to obtain an attention probability mask;
the similarity prediction model fuses first characteristic information and second characteristic information between each target object and each tracked object, and performs mask operation on a fusion result and the attention probability mask to obtain the similarity between each target object and each tracked object;
and the similarity prediction model determines and outputs whether each target object and each tracking object are the same target or not based on the trained similarity threshold and the similarity between each target object and each tracking object.
3. The method of claim 2, wherein inputting the target feature information of the target object and the historical feature information of the tracked object to a trained similarity prediction model comprises:
clustering the historical characteristic information of the tracked object to obtain a clustering result, wherein the clustering result comprises: at least one characteristic category and at least one clustering cluster corresponding to the characteristic category respectively; the historical characteristic information in each cluster is matched with the characteristic category corresponding to the cluster;
and splicing the target characteristic information of the target object with the clustering center of at least one clustering cluster, and inputting a splicing result to the trained similarity prediction model.
4. The method according to claim 3, wherein the number of the target objects is m, the number of the tracking objects is n, and the number of the cluster clusters is k;
the stitching result is represented by a tensor of (m × n) × k × 2 dims; wherein m × n represents the matching logarithm of the target object and the tracking object, k represents the number of clustering clusters, and 2dims represents the dimension number of the characteristic information;
the determining, based on the target feature information of each target object and the historical feature information of the respective tracked objects, second feature information representing a feature similarity between each target object and the respective tracked objects includes:
performing convolution operation on the splicing result on a dimensionality corresponding to 2dims to obtain a first scalar, wherein the first scalar is represented by (m x n) k x p, and p represents the dimensionality number of the convolution operation result;
exchanging elements on the dimensionality corresponding to k and the dimensionality corresponding to p in the first tensor to obtain a second tensor, wherein the second tensor is expressed by (m × n) × p × k;
and performing convolution operation on the second tensor on the dimensionality corresponding to the k, and determining the obtained result as second feature information used for expressing feature similarity between each target object and each tracked object.
5. The method according to claim 2, wherein the similarity prediction model represents the similarity between each target object and each tracking object by a similarity matrix, each element in the similarity matrix representing the similarity between one target object and one tracking object;
determining and outputting whether each target object and each tracking object are the same target or not based on the trained similarity threshold and the similarity between each target object and each tracking object, including:
expanding the similarity matrix by adopting the similarity threshold value so that the expanded similarity matrix comprises the similarity threshold value;
setting a weight for each row of elements based on the values of the elements in each row of the expanded similarity matrix, and setting a weight for each column of elements based on the values of the elements in each column of the expanded similarity matrix;
performing minimum flow calculation on the expanded similarity matrix based on the weight set for each row element and each column element to obtain a matching value of each target object and each tracked object;
and obtaining and outputting whether each target object and each tracking object are the same target or not based on the matching value of each target object and each tracking object.
6. The method according to claim 5, wherein the number of the target objects is m, and the number of the tracking objects is n; the size of the similarity matrix is m x n;
the expanding the similarity matrix by adopting the similarity threshold value comprises the following steps:
expanding the m x n similarity matrix to obtain an M x N similarity matrix; wherein M represents the maximum number of detectable target objects and N represents the maximum number of trackable tracking objects;
and expanding the M x N similarity matrix by adopting a similarity threshold value to obtain a 2M x 2N similarity matrix.
7. The method according to claim 5, wherein the obtaining whether each target object and each tracking object are the same target based on the matching value of each target object and each tracking object comprises:
for each target object, if the matching value of the target object and any one tracked object is a first preset value, determining that the target object and any one tracked object are the same target;
and if the matching values of the target object and all the tracked objects are the second preset value, the target object and all the tracked objects are not the same target.
8. The method of claim 7, further comprising:
for each target object, if the target object and all the tracked objects are not the same target, determining the target object as a new object;
and for each tracking object, if the tracking object and all the target objects are not the same target, determining that the tracking object disappears in the first image.
9. The method of claim 1, wherein the similarity threshold is a model parameter of the similarity prediction model, and the similarity threshold is trained with the similarity prediction model.
10. An object association apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire an image frame sequence including a first image and a second image; wherein the acquisition time of the second image is earlier than the acquisition time of the first image; acquiring target position information and target characteristic information of a target object identified from a first image; acquiring historical position information, historical characteristic information and historical tracking information of the tracking object identified from the second image;
a prediction unit configured to predict predicted position information of each tracking object in the first image based on the historical position information of the tracking object;
and the output unit is used for inputting the target position information of the target object, the predicted position information and the historical tracking information of the tracking object, the target characteristic information of the target object and the historical characteristic information of the tracking object into a trained similarity prediction model to obtain whether the target object and the tracking object are the same target or not.
11. An electronic device, comprising a readable storage medium and a processor;
wherein the readable storage medium is configured to store machine executable instructions;
the processor configured to read the machine executable instructions on the readable storage medium and execute the instructions to implement the steps of the method of any one of claims 1-9.
CN202110592975.8A 2021-05-28 2021-05-28 Object association method and device and electronic equipment Pending CN113205072A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110592975.8A CN113205072A (en) 2021-05-28 2021-05-28 Object association method and device and electronic equipment
CN202210576171.3A CN114742112A (en) 2021-05-28 2022-05-24 Object association method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110592975.8A CN113205072A (en) 2021-05-28 2021-05-28 Object association method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113205072A true CN113205072A (en) 2021-08-03

Family

ID=77023527

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110592975.8A Pending CN113205072A (en) 2021-05-28 2021-05-28 Object association method and device and electronic equipment
CN202210576171.3A Pending CN114742112A (en) 2021-05-28 2022-05-24 Object association method and device and electronic equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210576171.3A Pending CN114742112A (en) 2021-05-28 2022-05-24 Object association method and device and electronic equipment

Country Status (1)

Country Link
CN (2) CN113205072A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023093086A1 (en) * 2021-11-26 2023-06-01 上海商汤智能科技有限公司 Target tracking method and apparatus, training method and apparatus for model related thereto, and device, medium and computer program product
CN116258984A (en) * 2023-05-11 2023-06-13 中航信移动科技有限公司 Object recognition system

Also Published As

Publication number Publication date
CN114742112A (en) 2022-07-12

Similar Documents

Publication Publication Date Title
CN109740419B (en) Attention-LSTM network-based video behavior identification method
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
CN111709409B (en) Face living body detection method, device, equipment and medium
CN111310731B (en) Video recommendation method, device, equipment and storage medium based on artificial intelligence
CN109858424A (en) Crowd density statistical method, device, electronic equipment and storage medium
US11048948B2 (en) System and method for counting objects
JP2022526513A (en) Video frame information labeling methods, appliances, equipment and computer programs
CN114742112A (en) Object association method and device and electronic equipment
Kim et al. Deep stereo confidence prediction for depth estimation
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN111402294A (en) Target tracking method, target tracking device, computer-readable storage medium and computer equipment
CN111052128B (en) Descriptor learning method for detecting and locating objects in video
CN110827320B (en) Target tracking method and device based on time sequence prediction
CN112365523A (en) Target tracking method and device based on anchor-free twin network key point detection
CN111881731A (en) Behavior recognition method, system, device and medium based on human skeleton
Munir et al. LDNet: End-to-end lane marking detection approach using a dynamic vision sensor
US11354923B2 (en) Human body recognition method and apparatus, and storage medium
CN117425916A (en) Occlusion aware multi-object tracking
CN111027555A (en) License plate recognition method and device and electronic equipment
US11302114B2 (en) Parameter training method for a convolutional neural network and method for detecting items of interest visible in an image and for associating items of interest visible in an image
CN113793251A (en) Pose determination method and device, electronic equipment and readable storage medium
CN115527083B (en) Image annotation method and device and electronic equipment
Wang et al. Non-local attention association scheme for online multi-object tracking
CN114067371B (en) Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN114612545A (en) Image analysis method and training method, device, equipment and medium of related model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210803

WD01 Invention patent application deemed withdrawn after publication