WO2023093086A1

WO2023093086A1 - Target tracking method and apparatus, training method and apparatus for model related thereto, and device, medium and computer program product

Info

Publication number: WO2023093086A1
Application number: PCT/CN2022/106523
Authority: WO
Inventors: 章国锋; 鲍虎军; 叶伟才; 兰馨悦
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-11-26
Filing date: 2022-07-19
Publication date: 2023-06-01
Also published as: CN114155278A

Abstract

Disclosed in the embodiments of the present disclosure are a target tracking method and apparatus, a training method and apparatus for a model related thereto, and a device, a medium and a computer program product. The target tracking method comprises: respectively performing target segmentation on a first image and a second image, so as to obtain a first mask image of a first object in the first image and a second mask image of a second object in the second image; performing object matching in terms of a feature dimension on the basis of the first mask image and the second mask image, so as to obtain first matching information, and performing object matching in terms of a spatial dimension on the basis of the first mask image and the second mask image, so as to obtain second matching information; and fusing the first matching information and the second matching information, so as to obtain tracking information, wherein the tracking information comprises information regarding whether the first object and the second object are the same object.

Description

Object tracking and related model training method, device, equipment, medium, computer program product

Cross References to Related Applications

The embodiment of the present disclosure is based on the Chinese patent application with the application number 202111424075.9, the application date is November 26, 2021, and the application name is "Target Tracking and Related Model Training Method and Related Devices, Equipment, and Media", and requires the Chinese The priority of the patent application, the entire content of the Chinese patent application is hereby incorporated by reference into this disclosure.

technical field

The present disclosure relates to but is not limited to the technical field of image processing, and in particular relates to a method, device, device, medium, and computer program product for training a target tracking and related model.

Background technique

Object tracking technology is widely used in many application scenarios. Taking video panoptic segmentation (Video Panoptic Segmentation, VPS) as an example, it is not only required to generate consistent panoramic segmentation between frames, but also to achieve inter-frame tracking for all pixels, so as to improve the realization effect of many technologies such as automatic driving, video surveillance, and video editing. .

At present, the existing target tracking methods still face many problems in terms of tracking accuracy, such as tracking loss, which seriously affects the implementation effect of target tracking when it is applied to the above-mentioned technologies such as automatic driving, video surveillance, and video editing. In view of this, how to improve the target tracking accuracy has become an urgent problem to be solved.

Contents of the invention

Embodiments of the present disclosure provide a target tracking and related model training method, device, device, medium, and computer program product.

The first aspect of the embodiments of the present disclosure provides an object tracking method, including: separately performing object segmentation on the first image and the second image, and obtaining the first mask image of the first object in the first image and the first mask image of the first object in the second image. The second mask image of the two objects; performing object matching in the feature dimension based on the first mask image and the second mask image to obtain the first matching information, and based on the first mask image and the second mask image in the spatial dimension performing object matching to obtain second matching information; fusing the first matching information and the second matching information to obtain tracking information; wherein the tracking information includes whether the first object and the second object are the same object.

In the above solution, target segmentation is performed on the first image and the second image respectively to obtain the first mask image of the first object in the first image and the second mask image of the second object in the second image, and based on the first Object matching is performed on the mask image and the second mask image in the feature dimension to obtain the first matching information, and object matching is performed on the spatial dimension based on the first mask image and the second mask image to obtain the second matching information, where Based on this, the first matching information and the second matching information are fused together to obtain tracking information, and the tracking information includes whether the first object and the object are the same object, that is, in the process of target tracking, on the one hand, in the feature dimension, between images Object matching can help ensure the tracking effect of large-sized objects. On the other hand, object matching between images in the spatial dimension can help ensure the tracking effect of small-sized objects, and based on this, the two matching The matching information obtained by this method is used to obtain tracking information, so it can take into account both large-size objects and small-size objects, which is conducive to improving the accuracy of target tracking.

The second aspect of the present disclosure provides a method for training a target tracking model, including: obtaining a first sample mask image of a first sample object in a first sample image, and a first sample mask image of a second sample object in a second sample image. Two sample mask images and sample tracking information; wherein, the sample tracking information includes whether the first sample object and the second sample object are actually the same object; the first matching network based on the target tracking model combines the first sample mask image and the second sample object Object matching is performed on the second sample mask image in the feature dimension to obtain the first predicted matching information, and the first sample mask image and the second sample mask image are object-matched in the spatial dimension based on the second matching network of the target tracking model. matching to obtain the second predicted matching information; using the information fusion network of the target tracking model to fuse the first predicted matching information and the second predicted matching information to obtain the predicted tracking information; wherein the predicted tracking information includes the first sample object and the second sample object Whether the object is predicted to be the same object; based on the difference between the sample tracking information and the predicted tracking information, the network parameters of the object tracking model are adjusted.

In the above scheme, on the one hand, object matching between images in the feature dimension can help ensure the tracking effect of large-sized objects; on the other hand, object matching between images in the spatial dimension can help ensure the tracking effect of The tracking effect of small-sized objects is based on the fusion of the matching information obtained by the two matching methods to obtain tracking information, so it can take into account both large-sized objects and small-sized objects, which is conducive to improving the accuracy of the target tracking model.

The third aspect of the embodiments of the present disclosure provides a target tracking device, including: a target segmentation part, an object matching part and an information fusion part, and the target segmentation part is configured to perform target segmentation on the first image and the second image respectively, to obtain The first mask image of the first object in the first image and the second mask image of the second object in the second image; the object matching part is configured to be based on the first mask image and the second mask image in the feature dimension Perform object matching to obtain first matching information, and perform object matching in the spatial dimension based on the first mask image and the second mask image to obtain second matching information; the information fusion part is configured to fuse the first matching information and the second mask image Second, matching information to obtain tracking information; wherein, the tracking information includes whether the first object and the second object are the same object.

The fourth aspect of an embodiment of the present disclosure provides a training device for a target tracking model, including: a sample acquisition part, a sample matching part, a sample fusion part, and a parameter adjustment part, and the sample acquisition part is configured to obtain the first sample image The first sample mask image of the first sample object, the second sample mask image of the second sample object in the second sample image, and sample tracking information; wherein, the sample tracking information includes the first sample object and the second sample Whether the object is actually the same object; the sample matching part is configured to match the first sample mask image and the second sample mask image in the feature dimension based on the first matching network of the target tracking model to obtain the first predicted match Information, and based on the second matching network of the target tracking model, the first sample mask image and the second sample mask image are matched in the spatial dimension to obtain the second predicted matching information; the sample fusion part is configured to use the target The information fusion network of the tracking model fuses the first predicted matching information and the second predicted matching information to obtain predicted tracking information; wherein, the predicted tracking information includes whether the first sample object and the second sample object are predicted to be the same object; the parameter adjustment part, configured to adjust the network parameters of the object tracking model based on the difference between the sample tracking information and the predicted tracking information.

A fifth aspect of an embodiment of the present disclosure provides an electronic device, including a memory and a processor coupled to each other, and the processor is configured to execute program instructions stored in the memory to implement the target tracking method in the first aspect above, or Realize the training method of the target tracking model in the second aspect above.

The sixth aspect of the embodiments of the present disclosure provides a computer-readable storage medium, on which program instructions are stored. When the program instructions are executed by a processor, the target tracking method in the above-mentioned first aspect is realized, or the object tracking method in the above-mentioned second aspect is realized. A method for training object tracking models.

The seventh aspect of the embodiments of the present disclosure provides a computer program product, the computer program product includes a computer program or an instruction, and when the computer program or instruction is run on an electronic device, the electronic device executes the above-mentioned first The target tracking method in one aspect, or the target tracking model training method in the second aspect above.

Description of drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that need to be used in the embodiments of the present disclosure will be described below.

FIG. 1 is a schematic flowchart of a target tracking method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic framework diagram of a target tracking model provided by an embodiment of the present disclosure;

FIG. 3 is a process schematic diagram of an information fusion process provided by an embodiment of the present disclosure;

Fig. 4A is a schematic diagram of a panorama segmented image provided by an embodiment of the present disclosure;

FIG. 4B is another schematic diagram of a panorama segmented image provided by an embodiment of the present disclosure;

FIG. 5 is a schematic flow chart of object matching in the feature dimension provided by an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a process of object matching in the feature dimension provided by an embodiment of the present disclosure;

FIG. 7 is a schematic flow chart of object matching in spatial dimensions provided by an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of a process of object matching in the spatial dimension provided by an embodiment of the present disclosure;

FIG. 9 is a schematic flowchart of a target tracking method provided by an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of time consistency constraints provided by an embodiment of the present disclosure;

FIG. 11 is a schematic flowchart of a method for training a target tracking model provided by an embodiment of the present disclosure;

Fig. 12 is a schematic framework diagram of a target tracking device provided by an embodiment of the present disclosure;

FIG. 13 is a schematic frame diagram of a training device for a target tracking model provided by an embodiment of the present disclosure;

Fig. 14 is a schematic frame diagram of an electronic device provided by an embodiment of the present disclosure;

Fig. 15 is a schematic diagram of a computer-readable storage medium provided by an embodiment of the present disclosure.

The accompanying drawings here are incorporated into the description and constitute a part of the present description. These drawings show embodiments consistent with the present disclosure, and are used together with the description to explain the technical solution of the present disclosure.

Detailed ways

The solutions of the embodiments of the present disclosure will be described in detail below in conjunction with the accompanying drawings.

In the following description, for purposes of illustration rather than limitation, specific details, such as specific system architectures, interfaces, techniques, are set forth in order to provide a thorough understanding of the present disclosure.

The terms "system" and "network" are often used interchangeably herein. The term "and/or" in this article is just an association relationship describing associated objects, which means that there can be three relationships, for example, A and/or B can mean: A exists alone, A and B exist simultaneously, and there exists alone B these three situations. In addition, the character "/" in this article generally indicates that the contextual objects are an "or" relationship. In addition, "many" herein means two or more than two.

Please refer to FIG. 1 . FIG. 1 is a schematic flowchart of a target tracking method provided by an embodiment of the present disclosure. Specifically, the following steps may be included:

Step S11: Carry out object segmentation on the first image and the second image respectively to obtain a first mask image of the first object in the first image and a second mask image of the second object in the second image.

In an implementation scenario, the first image and the second image can be two consecutive frames of images in the captured video data; or, the first image and the second image can also be separated by several frames of images in the video data, which is not done here limited. It should be noted that, the first image may be obtained by shooting before the second image. For the convenience of description, the first image can be marked as t-δ, and the second image can be marked as t, where, when the first image and the second image are two adjacent frames of images, δ is 1, and in the first In the case that the image and the second image are separated by one frame of image, δ is 2, and so on, and examples are not given here.

In an implementation scenario, in the actual application process, the first image and the second image can be captured by electronic devices integrated with cameras such as smartphones and autonomous driving devices, and the frame rate and movement rate of the camera can be combined , to determine the number of image frames between the first image and the second image. Exemplarily, the faster the moving rate, the greater the change between adjacent images, and the fewer the number of image frames at intervals. On the contrary, the slower the moving rate, the smaller the change between adjacent images, and the fewer the number of image frames at intervals. It can be more; or, the larger the frame rate, the smaller the change between adjacent images, and the more the number of image frames at intervals can be. On the contrary, the smaller the frame rate, the greater the change between adjacent images, and the interval of image frames The number can be less.

In an implementation scenario, the first object in the first image may not be limited to one, for example, it may include one first object, two first objects, three first objects, etc., which is not limited here; similarly, the first The second object in the second image may not be limited to one, for example, may include one second object, two second objects, three second objects, etc., which is not limited here. In addition, the aforementioned objects may include, but are not limited to: objects such as pedestrians, vehicles, and street signs. It should be noted that in the embodiments of the present disclosure, multiple objects of the same type cannot be counted as the same object, that is, even if multiple objects are objects of the same type, they need to be counted as multiple objects. For example, the image may contain two pedestrians, such as pedestrian A and pedestrian B, then pedestrian A and pedestrian B may be counted as two objects, or the image may contain three vehicles, such as vehicle A, vehicle B and vehicle C, then vehicle A, vehicle B and vehicle C can be counted as three objects, and so on, no more examples here.

In an implementation scenario, the first object and the second object are foreground objects in the first image and the second image respectively, such as the aforementioned pedestrians, vehicles, street signs, and the like. In addition, the image may also contain background objects, such as but not limited to: roads, sky, buildings, etc. In order to realize video panorama segmentation, after object segmentation is performed on the first image and the second image respectively, the mask image of the first background object in the first image and the mask image of the second background object in the second image can also be obtained, In order to subsequently mark each foreground object and background object on the image in combination with the mask image and tracking information. For example, pixel regions belonging to the same object (eg, the same foreground object and the same background object) in different images may be marked with the same color and the like. For example, the pixel area of pedestrian A may be marked red in the first image, and the pixel area of pedestrian A may also be marked red in the second image. Other situations can be deduced by analogy, and no more examples will be given here.

In one implementation scenario, each first mask image has the same size as the first image, and similarly, each second mask image has the same size as the second image. Further, for the first mask image of each first object, the pixel value of the pixel contained in it indicates the possibility that the pixel corresponding to the pixel position in the first image belongs to the first object, for example Conversely, the greater the possibility, the greater the pixel value; conversely, the smaller the possibility, the smaller the pixel value; similarly, for the second mask image of each second object, the pixels contained in it The pixel value of represents the possibility that the pixel corresponding to the pixel position in the second image belongs to the second object. For example, the greater the possibility, the greater the pixel value; otherwise, the smaller the possibility, the pixel value Also smaller.

In an implementation scenario, the corresponding meaning of the above positions may specifically have the same pixel coordinates. For example, the pixel point at pixel coordinate (i, j) in the first mask image corresponds to the pixel point at pixel coordinate (i, j) in the first image; or, the pixel point at pixel coordinate (i, j) in the second mask image The pixel at the coordinate (m, n) corresponds to the pixel at the pixel coordinate (m, n) in the second image.

In one implementation scenario, for the first mask image of each first object, if the pixel value of the pixel contained in it is higher than the preset threshold, it can be considered that the position of the pixel in the first image is The corresponding pixel belongs to the first object. Similarly, for the second mask image of each second object, if the pixel value of the pixel contained in it is higher than the preset threshold, it can be considered as the second mask image. The pixel corresponding to the pixel position in the image belongs to the second object. It should be noted that the preset threshold can be set according to the actual situation. For example, when the pixel value has been normalized to a range of 0 to 1, the preset threshold can be set to 0.5, 0.6, etc., which is not limited here.

In an implementation scenario, as mentioned above, when the pixel value is higher than the preset threshold, it can be considered that the pixel belongs to the object. On this basis, the pixel value can be further reset to the first value (such as 1 ), on the contrary, if the pixel value is not higher than the preset threshold, it can be considered that the pixel does not belong to the object, on this basis, the pixel value can be further reset to a second value (eg, 0). Exemplarily, for the first mask image of each first object, it may be checked whether the pixel value of the pixel contained therein is higher than a preset threshold, and if so, the pixel value may be reset to the first value, otherwise can be reset to the second value to update the first mask image of each first object; similarly, for the second mask image of each second object, it can be checked whether the pixel value of the pixel contained in it is If it is higher than the preset threshold, then the pixel value can be reset to the first value, otherwise it can be reset to the second value, so as to update the second mask image of each second object.

In an implementation scenario, in order to improve the efficiency of target segmentation, a target tracking model may be pre-trained, please refer to FIG. 2 , which is a schematic diagram of a frame of the target tracking model. As shown in Figure 2, the target tracking model may include a target segmentation network, and then the first image and the second image may be respectively input into the target segmentation network to obtain a first mask image of each first object and a second mask image of each second object. film image. Specifically, several sample images can be collected in advance, and the sample mask images of each sample object in the sample image can be obtained, and then the target segmentation network can be used to segment the sample image to obtain the predicted mask image of each sample object, so that based on the same The difference between the sample mask image and the predicted mask image of the object adjusts the network parameters of the object segmentation network.

In an implementation scenario, for example, loss functions such as dice segmentation loss and position loss can be used to measure the difference between the sample mask image and the predicted mask image belonging to the same object, and the loss value of the target segmentation network can be obtained , and use optimization methods such as gradient descent to adjust the network parameters of the target segmentation network. For the specific measurement process of the difference, you can refer to the technical details of loss functions such as dice segmentation loss and position loss. For the specific adjustment process of parameters, you can refer to the technical details of optimization methods such as gradient descent.

In an implementation scenario, in order to obtain mask images of foreground objects such as the first object and the second object, the target segmentation network may include but not limited to instance segmentation networks such as Mask R-CNN, PointRend, Instance-sensitive FCN, etc., in This does not limit the network structure of the target segmentation network.

In an implementation scenario, in order to simultaneously acquire the mask images of foreground objects such as the first object and the second object and the mask images of background objects such as the aforementioned roads, sky, buildings, etc., the target segmentation network may include but not limited to such as For panoptic segmentation networks such as PanopticFCN, the network structure of the target segmentation network is not limited here.

Step S12: Perform object matching in the feature dimension based on the first mask image and the second mask image to obtain the first matching information, and perform object matching in the spatial dimension based on the first mask image and the second mask image to obtain the second mask image Two matching information.

In an implementation scenario, for object matching in the feature dimension, the first feature representations of each first object can be extracted based on the first mask images of each first object, and the second feature representations of each second object can be extracted respectively. The mask image is extracted to obtain the second feature representation of each second object, and on this basis, the feature similarity between each first object and each second object is obtained by using the first feature representation and the second feature representation, And based on the feature similarity between each first object and each second object, first matching information is obtained. For the process of feature extraction and feature matching, please refer to the following related disclosed embodiments. The above method only needs to perform feature extraction on the mask image of each object, and then measure the feature similarity, which can reduce the difference between images in the feature dimension. The complexity of object matching between objects is beneficial to improve the tracking speed.

In an implementation scenario, different from the above-mentioned method of performing feature extraction and feature matching in stages, in order to improve the efficiency of object matching in the feature dimension, as an optional implementation in the actual application process, pre-training An object tracking model, and the object tracking model includes a first matching network. Specifically, the first matching network may include several feature extraction layers (such as convolutional layers, fully connected layers, etc.) and a multi-layer perceptron, the first mask image of each first object and the second mask image of each second object After the membrane image is preprocessed, it can be input into the first matching network for processing. For the related process of preprocessing, please refer to the relevant description in the following disclosed embodiments. For the convenience of description, the above-mentioned first objects and each second object can be collectively referred to as N objects, and the first mask images of the above-mentioned first objects and the second mask images of each second object can be collectively referred to as N masks image. In this process, after the above N mask images are processed by several feature extraction layers, N feature representations can be obtained, and further processed by the multi-layer perceptron, and an N*N matrix is output, and each row of the matrix represents N One of the objects, each column of the matrix represents one of the N objects, and the element in the i-th row and column j in the matrix represents the distance between the i-th object in the N objects and the j-th object in the N objects matching degree, the matching degree between each first object and each second object can be extracted from the matrix to obtain the first matching information. Of course, in the actual application process, in order to reduce the weight of the model as much as possible for training and deployment, you can choose the above-mentioned method of performing feature extraction and feature matching in stages, and in order to improve efficiency, the first feature representation and the second The second feature representation can be performed by the first matching network. At this time, the first matching network can only include a small number of network layers such as convolutional layers and fully connected layers, thereby greatly reducing the amount of parameters. For details, please refer to the following related disclosed embodiments. I won't go into details here.

In one implementation scenario, for object matching in the spatial dimension, the second image can be used to perform optical flow prediction on the first image to obtain the optical flow image of the first image, and based on the optical flow image, the first The mask image is shifted pixel by pixel to obtain the predicted mask image of the first object at the shooting moment of the second image, and based on the difference between the predicted mask image of each first object and the second mask image of each second object The degree of coincidence between them is used to obtain the second matching information. For the specific process of the above optical flow prediction, pixel offset and coincidence degree measurement, please refer to the relevant description in the following disclosed embodiments. The above method, on the one hand, can realize object matching based on pixel-level matching, which is conducive to greatly improving the tracking effect, especially for small-sized objects. On the other hand, after pixel-by-pixel offset based on the optical flow image, only the image coincidence degree The matching information can be obtained, and the complexity of object matching between images in the spatial dimension can also be reduced, which is conducive to improving the tracking speed.

In an implementation scenario, different from the above method, as an optional implementation method in the actual application process, the difference between the first mask image of each first object and the second mask image of each second object may also be obtained first. It should be noted that, after the first optimal displacement vector is shifted pixel by pixel, the first mask image has the maximum overlap with the second mask image, and each first mask image is recorded A first optimal displacement vector and a corresponding maximum overlap between the first mask image of the object and the second mask images of each second object. At the same time, the second optimal displacement vector between the first image and the second image can be obtained. Similarly, after the first image is shifted pixel by pixel by the second optimal displacement vector, it has the maximum Coincidence degree. On this basis, the vector similarity between each first optimal displacement vector and the second optimal displacement vector can be measured. It should be noted that the closer the first optimal displacement vector is to the second optimal displacement vector, the The larger the vector similarity is, on the contrary, the farther the first optimal displacement vector is from the second optimal displacement vector, the smaller the vector similarity is. Based on this, for each first object and each second object, the corresponding vector similarity and maximum coincidence degree can be weighted to obtain the matching degree between the two, that is, the second matching information can be obtained.

Step S13: Fusing the first matching information and the second matching information to obtain tracking information.

In an implementation scenario, as mentioned above, the first matching information may include the matching degree between each first object and each second object, which may be referred to as the first matching degree for the convenience of distinction; similarly, the second The matching information may include the matching degree between each first object and each second object, which may be referred to as a second matching degree for the convenience of distinction. On this basis, the first matching degree in the first matching information and the second matching degree in the second matching information can be weighted respectively by using the first preset weight and the second preset weight to obtain the first weighted matching information and the second matching information. Two weighted matching information, wherein the first weighted matching information includes a first weighted matching degree between the first object and the second object, and the second weighted matching information includes a second weighted matching degree between the first object and the second object. Based on this, the first weighted matching information and the second weighted matching information may be fused to obtain final matching information, and the final matching information includes a final matching degree between the first object and the second object. That is to say, during the fusion process, the preset weights can be directly used to carry out weighted fusion on the matching degree.

In an implementation scenario, in order to improve the fusion accuracy, different from the aforementioned method, adaptive weighting can be performed on the first matching degree in the first matching information to obtain the first weighted matching information, and the second matching degree in the second matching information Based on this, the first weighted matching information and the second weighted matching information are fused to obtain the final matching information, and the tracking information is obtained by analyzing based on the final matching information. In the above method, in the fusion process of matching information, by performing adaptive weighting on the first matching information and the second matching information respectively, the importance of the two can be adaptively measured according to the actual situation, and then the fusion can be performed on this basis , which is conducive to greatly improving the tracking accuracy.

In an implementation scenario, as mentioned above, in order to improve the efficiency of target tracking, a target tracking model can be pre-trained to process the first image and the second image through the target tracking model to obtain tracking information, and the target tracking model can include Information Fusion Network. Please refer to FIG. 3 in conjunction with FIG. 3 , which is a process schematic diagram of the information fusion process. As shown in Figure 3, the information fusion network may further include a first weighting subnetwork and a second weighting subnetwork, the first weighting subnetwork is used to adaptively weight the first matching information, and the second weighting subnetwork is used to weight the first matching information Two matching information is adaptively weighted. Specifically, in order to reduce the weight of the target tracking model as much as possible, the first weighted subnetwork may include but not limited to a 1*1 convolutional layer, and the second weighted subnetwork may include but not limited to a 1*1 convolutional layer. In the above method, the importance of the feature dimension and the spatial dimension to target tracking can be learned according to the actual situation through the neural network, which is conducive to improving the efficiency and accuracy of adaptive weighting.

In an implementation scenario, as shown in FIG. 2 and FIG. 3 , both the first matching information and the second matching information may be represented by a matrix. Taking M first objects in the first image and N second objects in the second image as an example, both the first matching information and the second matching information can be represented by an M*N matrix, and for the first matching In terms of information, the i-th row and j-th column element in the matrix represent the first matching degree between the i-th first object and the j-th second object, while for the second matching information, the i-th row and j-th element in the matrix Elements in column j represent the second matching degree between the i-th first object and the j-th second object. On this basis, the first weighted matching information obtained after the first matching information is adaptively weighted can also be represented by an M*N matrix, and the second weighted matching information obtained after the second matching information is adaptively weighted It can also be represented by a matrix of M*N, and the meaning represented by each element in the matrix can refer to the above related description.

In an implementation scenario, in the process of fusing the first weighted matching information and the second weighted matching information, the element in row i and column j in the matrix representing the first weighted matching information can be combined with the element representing the second weighted matching information The elements in row i and column j in the matrix are added directly to obtain a matrix representing the final matching information. That is to say, for each group of first objects and second objects, the first weighted matching degree and the second weighted matching degree can be directly added to obtain the final matching degree. Exemplarily, the first image may contain two first objects, the first object A and the first object B, and the second image may contain two second objects, the second object A and the second object B, then the final matching The information can be represented by a 2*2 matrix. The first row of the matrix represents the final matching degree between the first object A and the second object A and the second object B respectively, and the second row of the matrix represents the first object B respectively. The final matching degree between the second object A and the second object B. The first column of the matrix represents the final matching degree between the second object A and the first object A and the first object B respectively. The second column of the matrix Represents the final matching degree between the second object B and the first object A and the first object B respectively.

In an implementation scenario, it should be noted that the tracking information may specifically include whether the first object and the second object are the same object. On this basis, each pair of first objects and each second object can be combined as the current object group, and based on at least one of the first reference information and the second reference information of the current object group, determine Whether the current first object and the current second object are the same object, and the current first object is the first object in the current object group, and the current second object is the second object in the current object group. The first reference information includes: current The final matching degrees between the first object and each second object, the second reference information includes: the final matching degrees between the current second object and each first object. As mentioned above, the final matching degree can also be represented by a matrix, then the first reference information can include all the elements of the matrix row representing the first current object in the matrix, and similarly, the second reference information can include the elements in the matrix representing the second current object All elements of the matrix columns of the object. The above method can avoid omission as much as possible, which is beneficial to improve the tracking accuracy. On the other hand, combining at least one of the first reference information and the second reference information in the determination process is also beneficial to improve the determination accuracy.

Here, in the case of only combining the first reference information, the final matching degree between the current first object and the current second object can be used as the matching degree to be analyzed, and in response to the matching degree to be analyzed as the The maximum value of , to determine that the current first object and the current second object are the same object. Taking the aforementioned final matching information represented by a 2*2 matrix as an example, in the case where the current first object is the first object A and the current second object is the second object A, if the element in the first row and the first column in the matrix is the maximum value in the first row of the matrix, it can be determined that the first object A and the second object A are the same object. Other situations can be deduced by analogy, and no more examples will be given here. In the above method, the determination operation can be completed only by searching for the maximum value in the first reference information, which is beneficial to reduce the determination complexity and increase the determination speed.

Here, in the case of only combining the second reference information, the final matching degree between the current first object and the current second object can be used as the matching degree to be analyzed, and in response to the matching degree to be analyzed as the The maximum value of , to determine that the current first object and the current second object are the same object. Taking the aforementioned final matching information represented by a 2*2 matrix as an example, in the case where the current first object is the first object A and the current second object is the second object A, if the element in the first row and the first column in the matrix is the maximum value in the first column of the matrix, it can be determined that the first object A and the second object A are the same object. Other situations can be deduced by analogy, and no more examples will be given here. In the above manner, the determination operation can be completed only by searching for the maximum value in the second reference information, which is beneficial to reduce the determination complexity and increase the determination speed.

Here, in the case of combining the first reference information and the second reference information at the same time, the final matching degree between the current first object and the current second object can be used as the matching degree to be analyzed, and in response to the matching degree to be analyzed is The maximum value in the first reference information and the second reference information determines that the current first object and the current second object are the same object. Taking the aforementioned final matching information represented by a 2*2 matrix as an example, in the case where the current first object is the first object A and the current second object is the second object A, if the element in the first row and the first column in the matrix If they are both the maximum value in the first row of the matrix and the maximum value in the first column, it can be determined that the first object A and the second object A are the same object. Other situations can be deduced by analogy, and no more examples will be given here. In the above manner, the determination operation is completed by searching for the maximum value in the first reference information and the second reference information at the same time, and collaborative verification can be realized on the basis of the first reference information and the second reference information, so as to realize a pair of One matching constraint is beneficial to reduce the complexity of determination and improve the accuracy of determination.

In addition, it should be noted that in order to further improve the accuracy and robustness of target tracking, in the above process, if it is determined that the matching degree to be analyzed is the maximum value, it can be further detected whether the matching degree to be assigned is higher than the preset threshold, and if Then it can be determined that the current first object and the current second object are the same object; otherwise, the current first object and the current second object can be considered not to be the same object.

In one implementation scenario, when the requirements for tracking accuracy are relatively loose, it is also possible to extract the first feature representation of each first object and the second feature of each second object during the object matching process in the feature dimension. After representation, analysis is performed directly based on these feature representations to obtain tracking information. Specifically, for each first object, the probability that it and each second object are predicted to be the same object can be obtained based on the feature similarity between its first feature representation and the second feature representation of each second object value, and based on each probability value, a second object that is the same object as the first object is obtained. In the above manner, the tracking information is analyzed and obtained directly based on the feature similarity between the first feature representation of the first object and the second feature representation of the second object, which is beneficial to reduce tracking complexity.

In an implementation scenario, the feature similarity between the first feature representation and the second feature representations of each second object can be normalized to obtain the prediction that the first object and each second object are the same object probability value. Still taking the example that the first image contains M first objects and the second image contains N second objects, when performing object matching on the i-th first object among the M first objects, the first object can be The first feature representation of the object is denoted as M(i), and the second feature representation of the jth second object can be denoted as N(j). Taking the normalization operation through softmax as an example, the i-th first The probability values that the object and each second object are predicted to be the same object can be expressed as:

In the above formula (1), x∈t indicates that it belongs to each second object in the second image, and the superscript T indicates transposition.

In an implementation scenario, each second object is marked with a serial number value, for example, the first second object can be marked with the serial number value "1", the second second object can be marked with the serial number value "2", and so on , no more examples here. On this basis, the expected value can be obtained based on the serial number value of the second object and the probability value corresponding to the second object, and the value after rounding the expected value is used as the target serial number value, and the second object to which the target serial number value belongs , considered to be the same object as the first object. For the convenience of expression, the target serial number value can be recorded as

Then the target serial number value can be expressed as:

In the above formula (2), t−δ→t means that the first object in the first image t−δ is matched to the second object in the second image t. It should be noted that the rounding up operation is not shown in the above formula (2). In the actual application process, since the expected value may be a decimal, in order to determine the value of the target serial number, the rounding up operation can be directly performed on the expected value.

In an implementation scenario, as mentioned above, after the tracking information is obtained, pixel regions belonging to the same object (eg, the same foreground object and the same background object) in different images may be marked with the same color. Please refer to FIG. 4A and FIG. 4B in conjunction. FIG. 4A and FIG. 4B are two schematic diagrams of panorama segmented images respectively. As shown in Figure 4A and Figure 4B, Figure 4A represents the panorama segmentation image corresponding to the first image in Figure 2, and Figure 4B represents the panorama segmentation image corresponding to the second image in Figure 2, and the two images in Figure 4A and Figure 4B correspond to Pixel areas of the same object can be represented with the same gray scale.

In the above solution, target segmentation is performed on the first image and the second image respectively to obtain the first mask image of the first object in the first image and the second mask image of the second object in the second image, and based on the first mask Object matching is performed on the film image and the second mask image in the feature dimension to obtain the first matching information, and object matching is performed on the spatial dimension based on the first mask image and the second mask image to obtain the second matching information. On the other hand, the first matching information and the second matching information are fused to obtain the tracking information, and the tracking information includes whether the first object and the object are the same object, that is, in the process of target tracking, on the one hand, between images in the feature dimension Object matching can help ensure the tracking effect of large-sized objects. On the other hand, object matching between images in the spatial dimension can help ensure the tracking effect of small-sized objects, and based on this, the two matching methods are combined The obtained matching information is obtained as tracking information, so both large-size objects and small-size objects can be taken into consideration, which is beneficial to improving target tracking accuracy.

Please refer to Figure 5. Figure 5 is a schematic flow chart of object matching in the feature dimension, which may include the following steps:

Step S51: Based on the first mask image of each first object, extract the first feature representation of each first object, and respectively based on the second mask image of each second object, extract the first feature representation of each second object. Two features are represented.

Here, the object boundary can be determined based on the pixel values of each pixel in the mask image, and the object boundary is the boundary of the object to which the mask image belongs, and a region image is cut out from the mask image along the object boundary, and based on the region image Feature extraction, to obtain the feature representation of the belonging object, and in the case of the mask image being the first mask image, the belonging object is the first object, the feature representation is the first feature representation, and the mask image is the second mask image In the case of , the belonging object is the second object, and the feature representation is the second feature representation. The above method can eliminate the interference of pixels irrelevant to the object to which the mask image belongs during the feature extraction process, which is conducive to improving the accuracy of feature representation.

In an implementation scenario, as described in the aforementioned disclosed embodiments, for the mask image of each object, the pixel points belonging to the object have a pixel value higher than a preset threshold (eg, 0.5, 0.6, etc.), Or its pixel value is directly set to the first value (eg, 1), then the pixel whose pixel value is higher than the preset threshold (or the pixel value is the first value) can be used as the target pixel, and the surrounding target pixel A rectangular box that acts as its object bounds.

In an implementation scenario, please refer to FIG. 6 in conjunction with FIG. 6 , which is a schematic diagram of an embodiment of the process of object matching in the feature dimension. As shown in Figure 6, still taking the example that the first image contains M first objects and the second image contains N second objects, the size of the first mask image can be expressed as M*H*W, and the first mask image The size of the second mask image can be expressed as N*H*W. It should be noted that H is the height of the mask image, and W is the width of the mask image. After the above cropping, it can be further adjusted to a preset size (eg, 256*512) through an interpolation algorithm such as bilinear interpolation, and the blank area is filled with 0 to obtain an area image.

In an implementation scenario, as mentioned above, in order to improve the efficiency of target tracking, a target tracking model can be pre-trained, and the target tracking model includes a first matching network, and the first matching network can specifically include a first extraction sub-network and a second The sub-network is extracted, and the first extraction sub-network is used to extract the first feature representation, and the second extraction sub-network is used to extract the second feature representation. In order to further lighten the network model as much as possible, both the first extraction sub-network and the second extraction sub-network can include several fully connected layers (Fully Connection layer, FC), as shown in Figure 6, can include two layers of fully connected layers ( That is, 2*FC in Figure 6), the first feature representation of 1024 dimensions and the second feature representation of 1024 dimensions are obtained. It should be noted that in the actual application process, it is not limited to the network structure of the first extraction sub-network and the second extraction sub-network, which can be set according to the actual situation, such as convolutional layers, etc., which are not limited here .

Step S52: Obtain the feature similarity between each first object and each second object by using the first feature representation and the second feature representation.

Here, for any first object and any second object, the first feature representation of the first object may be multiplied by the second feature representation of the second object to obtain the feature similarity between the two. Taking the first feature representation as a 1024-dimensional feature vector and the second feature representation as a 1024-dimensional feature vector as an example, the elements at the corresponding positions of the two can be multiplied and accumulated to obtain the feature similarity.

Step S53: Obtain first matching information based on the feature similarity between each first object and each second object.

Here, after the feature similarity is obtained, a normalization operation may be performed on the calculated feature similarity to obtain the first matching degree. After obtaining the first matching degrees between any first object and any second object, these first matching degrees can be regarded as the first matching information. In addition, please refer to FIG. 6, still taking the example that the first image contains M first objects and the second image contains N second objects, the first matching information can finally be expressed as an M*N matrix, and the matrix The element in row i and column j in represents the first matching degree between the i first object and the j second object.

In the above solution, the first feature representation of each first object is extracted based on the first mask image of each first object, and the first feature representation of each second object is extracted based on the second mask image of each second object. Two feature representations, based on which the first feature representation and the second feature representation are used to obtain the feature similarity between each first object and each second object, and based on the feature between each first object and each second object Similarity, to obtain the first matching information, that is, in the process of object matching between images in the feature dimension, it is only necessary to perform feature extraction on the mask image of each object, and then measure the feature similarity, which can reduce the feature dimension On the complexity of object matching between images, it is beneficial to improve the tracking speed.

Please refer to FIG. 7. FIG. 7 is a schematic flow chart of object matching in the spatial dimension, which may include the following steps:

Step S71: Using the second image to perform optical flow prediction on the first image to obtain an optical flow image of the first image.

In an implementation scenario, please refer to FIG. 8 , which is a schematic diagram of a process of object matching in a spatial dimension. As shown in Figure 8, the optical flow image can be a two-channel image, wherein one channel image includes the offset value of each pixel point in the first image in the horizontal direction, and the other channel image includes the offset value of each pixel point in the first image in the vertical direction offset value. It should be noted that, under the condition that the optical flow prediction is accurate, the pixel in the first image can be shifted according to the offset value in the horizontal direction and the vertical direction respectively, and a pixel position can be obtained, and the pixel in the second image is located at The pixel of the position is theoretically still itself. Exemplarily, after the topmost pixel of the first object A in the first image is offset according to the offset value in the horizontal direction and the vertical direction, a pixel position can be obtained, and according to the pixel position found in the second image The pixel point of is still the topmost pixel point of the first object A. Other situations can be deduced by analogy, and no more examples will be given here.

In an implementation scenario, as mentioned above, in order to improve the efficiency of target tracking, a target tracking model can be pre-trained, and the target tracking model can include an optical flow prediction network, and the optical flow prediction network can include but not limited to RAFT (Recurrent All -Pairs Field Transforms for Optical Flow), etc., the network structure of the optical flow prediction network is not limited here. On this basis, the first image and the second image can be input into the optical flow prediction network to obtain an optical flow image. It should be noted that for the working principle of the optical flow prediction network, you can refer to the technical details of the optical flow prediction network such as RAFT.

Step S72: Based on the optical flow image, perform a pixel-by-pixel shift on the first mask image of the first object to obtain a predicted mask image of the first object at the shooting moment of the second image.

Here, the optical flow image and the first mask image can be multiplied pixel by pixel to obtain the offset value of the pixel in the first mask image, and the first pixel coordinate of the pixel in the first mask image can be compared with the offset The shift value is added to obtain the second pixel coordinate of the pixel point at the shooting moment of the second image (that is, the predicted pixel coordinate at the above shooting moment), and based on the second pixel coordinate of the pixel point in the first mask image, the predicted mask image. In the above method, in the process of pixel-by-pixel offset, only simple operations such as pixel multiplication and addition are required, so the complexity of pixel offset can be greatly reduced, which is conducive to further improving the tracking speed.

In an implementation scenario, the pixel value of each pixel in the first mask image may be multiplied by the pixel value of a pixel at a corresponding position in the optical flow image to obtain an offset value of the pixel in the first mask image. For meanings of corresponding positions, reference may be made to relevant descriptions in the aforementioned disclosed embodiments. Please refer to the mask image example in Figure 8. Each grid in the mask image represents each pixel. For the convenience of description, the pixel value of the grid filled with grayscale in the first mask image is 1, then the first The mask image can be expressed as a matrix as:

In addition, the pixel values of each pixel in the optical flow image of the horizontal channel can be 0, and the pixel values of each pixel in the optical flow image of the vertical channel can be 1, then the above-mentioned first mask image and After multiplying the optical flow images of the channels in the lateral direction, the offset value of each pixel in the first mask image in the lateral direction can be obtained:

Similarly, after the above first mask image is multiplied by the optical flow image of the longitudinal channel, the offset value of each pixel in the first mask image in the longitudinal direction can be obtained:

Therefore, by combining the above matrix (4) and matrix (5), the offset values of each pixel in the first mask image in the horizontal direction and the vertical direction can be obtained, plus the offset value of each pixel in the first mask image The first pixel coordinate of the pixel point at the time of shooting can be obtained. Exemplarily, for the pixel at the first pixel coordinate (1,1) in the first mask image, since its offset values in the horizontal and vertical directions are both 0, it is The pixel coordinate is still (1,1); or, for the first pixel coordinate (1,2) in the first mask image as an example, since its offset value in the horizontal direction is 0 and the offset value in the vertical direction is 1, so its second pixel coordinate at the time of shooting is (1,3), and other pixels can be deduced in the same way, and no more examples are given here. After the pixel shift operation is performed on all pixels in the first mask image, a predicted mask image as shown in the example of the mask image in FIG. 8 can be obtained.

Step S73: Obtain second matching information based on the degree of overlap between the predicted mask image of each first object and the second mask image of each second object.

Here, the dice coefficient can be used to calculate the degree of overlap between the predicted mask image of the first object and the second mask image of the second object, and use the degree of overlap as the second value between the first object and the second object. The matching degree can be regarded as obtaining the second matching information after obtaining the second matching degree between any first object and any second object.

In an implementation scenario, as an example for ease of description, the total number of pixels in the predicted mask image can be recorded as N, then the total number of pixels in the second mask image can also be recorded as N, and the i-th pixel in the predicted mask image The pixel value of the pixel point can be recorded as p _i , and the pixel value of the i-th pixel point in the second mask image can be recorded as g _i , then the coincidence degree between the predicted mask image and the second mask image can be expressed as :

In the above formula (6), sim _pos represents the coincidence degree. Taking the predicted mask image and the second mask image shown in Figure 8 as an example, the coincidence degree between the two is calculated as 3/8 by the above formula (6) , which is the Intersection over Union (IoU) between two mask images. Other situations can be deduced by analogy, and no more examples will be given here.

In an implementation scenario, as described in the foregoing disclosed embodiments, the second matching information may also be represented by a matrix. Please refer to FIG. 8. Taking M first objects in the first image and N second objects in the second image as an example, the second matching information can be represented by an M*N matrix, and the i-th object in the matrix The element in the jth column of the row represents the second matching degree between the i-th first object and the j-th second object.

In the above solution, the second image is used to predict the optical flow of the first image to obtain the optical flow image of the first image, and based on the optical flow image, the first mask image of the first object is shifted pixel by pixel to obtain the first The second matching information is obtained based on the predicted mask image of the object at the shooting moment of the second image and the degree of overlap between the predicted mask image of each first object and the second mask image of each second object, namely In the process of object matching between images in the spatial dimension, on the one hand, object matching can be realized based on pixel-level matching, which is conducive to greatly improving the tracking effect, especially for small-sized objects. After the pixel offset, it is only necessary to measure the image coincidence to obtain the matching information, which can also reduce the complexity of object matching between images in the spatial dimension, which is conducive to improving the tracking speed.

Please refer to FIG. 9. FIG. 9 is a schematic flowchart of a target tracking method provided by an embodiment of the present disclosure, which may include the following steps:

Step S91: Carry out object segmentation on the first image and the second image respectively to obtain a first mask image of the first object in the first image and a second mask image of the second object in the second image.

Here, reference may be made to relevant descriptions in the aforementioned disclosed embodiments.

Step S92: Perform object matching in the feature dimension based on the first mask image and the second mask image to obtain the first matching information, and perform object matching in the spatial dimension based on the first mask image and the second mask image to obtain the second Two matching information.

Step S93: Fusing the first matching information and the second matching information to obtain tracking information.

In the embodiments of the present disclosure, the tracking information includes whether the first object and the second object are the same object, and reference may be made to relevant descriptions in the aforementioned embodiments of the disclosure.

Step S94: in response to the tracking information meeting the preset condition, using the tracking information as the first tracking information, and acquiring a third image.

In the embodiment of the present disclosure, the third image, the first image and the second image are successively captured respectively. For example, the third image can be recorded as t-δ, the first image can be recorded as t, and the second image It can be recorded as t+δ.

In an implementation scenario, the preset condition may include: a target object exists in the second image. It should be noted that the target object is not the same object as any first object. At this time, the target object may be a new object that appears in the second image, or it may be blocked in the first image and appear in the second image. The middle occlusion disappears, so that no matching can be obtained when matching the first object in the second image, so further verification can be performed through the following verification process. The above method can greatly alleviate the impact of object disappearance and occlusion on tracking accuracy through timing consistency verification, which is conducive to improving tracking accuracy.

In one implementation scenario, different from the aforementioned method of setting the preset condition as the existence of the target object in the second image, the subsequent verification is triggered only when the second object that is not successfully matched appears in the second image, as another example in the actual application process. In a possible implementation manner, the preset condition may also be set to be empty, that is, no additional condition is set for triggering the verification, and subsequent verification is triggered as long as tracking information is obtained.

Step S95: Perform target tracking based on the third image and the second image to obtain second tracking information.

In the embodiment of the present disclosure, the second tracking information includes whether the second object and the third object in the third image are the same object, and for the process of object tracking, please refer to any of the foregoing object tracking method embodiments.

Step S96: Perform consistency verification based on the first tracking information and the second tracking information, and obtain a verification result.

Here, the same object in different images may have the same object identifier, and the target object may be analyzed based on the second tracking information to obtain an analysis result, and in response to the analysis result including that the target object and the reference object are the same object, the object identifier of the reference object is used as The object identifier of the target object, and the reference object is one of the third objects, that is, if there is an unmatched target object in the second image, if a third object is successfully matched in the third image, Then the third object can be regarded as a reference object, and the object identifier of the reference object is used as the object identifier of the target object, that is, the target object and the reference object are determined to be the same object; in addition, in response to the analysis result, the target object and the first object can also be included Any third object in the three images is not the same object, then a new object identifier can be marked for the target object, that is to say, if there is an unmatched target object in the second image, if in the third image If no third object that is the same object as the target object can be matched, then the target object can be considered as newly appearing in the second image, so a new object identifier can be marked for it. The above method can deal with the complex situation of reappearing after the disappearance of the object due to object occlusion, object deformation and other reasons through timing consistency verification, and verify according to the actual situation, which is conducive to improving the tracking effect of target tracking in complex situations.

In an implementation scenario, the above verification operation can be used to constrain the tracking consistency between multiple frames of images. Here, you can use

represents a differentiable operation

Among them, s and t represent the time step, and the above-mentioned differentiable operation

An object p in image x _s used to measure time step s (ie

) and an object p in the image x _t of time step t (ie

) between similarities. As mentioned above, in the actual application process, differentiable operations can be implemented from image t-δ to image t, and from image t to image t+δ

From this the timing consistency can be established as follows:

In an implementation scenario, please refer to FIG. 10 , which is a schematic diagram of an embodiment of a time consistency constraint. As shown in Figure 10, due to occlusion, the car in the dotted frame in the first image t is occluded by pedestrians, and is mistakenly segmented as a pedestrian, resulting in the loss of its true segmentation. Therefore, when tracking is performed based on the third image t−δ and the first image t, or the first image t and the second image t+δ, the object tracking of the car will fail. In this case, this limitation can be solved by conducting based on the relationship between the third image t-δ and the second image t+δ. Since the car in the second image t+δ is not successfully matched in the first image t, the matching information can be obtained by tracking the second image t+δ and the third image t-δ, that is, each of the cars in the second image t+δ The matching degree between the object and each object in the third image t-δ, on this basis, if the matching degree between the car in the second image t+δ and an object in the third image t-δ is higher than the preset If the threshold is set, it can be considered that the car in the second image t+δ and the object in the third image t-δ are the same object, and the car in the second image t+δ is marked with the object in the third image t-δ The object identifier, on the other hand, can be marked with a new object identifier for the car in the second image t+δ.

In the above solution, after obtaining the tracking information, further responding to the tracking information meeting the preset condition, using the tracking information as the first tracking information, and acquiring the third image, and the third image, the first image and the second image are respectively taken successively obtained, and based on the third image and the second image for target tracking, the second tracking information is obtained, and the second tracking information includes whether the second object and the third object in the third image are the same object, on this basis, Then, the consistency check is performed based on the first tracking information and the second tracking information, and the check result is obtained, so the inconsistency in the timing of the target tracking can be greatly reduced, which is conducive to further improving the tracking accuracy.

Please refer to FIG. 11. FIG. 11 is a schematic flowchart of a training method for a target tracking model in an embodiment of the present disclosure, which may include the following steps:

Step S111: Acquiring a first sample mask image of a first sample object in a first sample image, a second sample mask image of a second sample object in a second sample image, and sample tracking information.

In the embodiment of the present disclosure, the sample tracking information includes whether the first sample object and the second sample object are actually the same object. For example, when the first sample object and the second sample object are actually the same object, it can Mark as the first value (for example, 1), otherwise, if the first sample object and the second sample object are not actually the same object, it may be marked as the second value (for example, 0). In addition, regarding the meanings of the first sample mask image and the second sample mask image, reference may be made to the related descriptions about the first mask image and the second mask image in the aforementioned disclosed embodiments.

In an implementation scenario, as in the aforementioned disclosed embodiments, in order to improve the efficiency of acquiring mask images, the target tracking model may include a target segmentation network, and its network structure may refer to the relevant descriptions in the aforementioned disclosed embodiments. On this basis, the target segmentation model can be used to perform target segmentation on the first sample image and the second sample image respectively to obtain the first sample mask image and the second sample mask image. Here, reference may be made to relevant descriptions about object segmentation in the aforementioned disclosed embodiments.

In an implementation scenario, before the overall training of the target tracking network, the target segmentation network can be trained to converge. For the training process of the target segmentation network, you can refer to the technical details of segmentation networks such as Mask R-CNN, PointRend, and Instance-sensitive FCN. .

Step S112: The first matching network based on the target tracking model performs object matching on the first sample mask image and the second sample mask image in the feature dimension to obtain the first predicted matching information, and performs the second matching based on the target tracking model The network performs object matching on the first sample mask image and the second sample mask image in the spatial dimension to obtain second predicted matching information.

Here, reference may be made to related descriptions about object matching in the feature dimension and related descriptions about object matching in the space dimension in the aforementioned disclosed embodiments.

In an implementation scenario, before the overall training of the target tracking model, the first matching network may be trained to convergence, that is, the first matching network has completed training before the overall training of the target tracking model. It should be noted that, in this case, the aforementioned target segmentation network has been trained before training the first matching network.

In an implementation scenario, during the training process of the first matching network, feature extraction can be performed on the first sample mask image of the first sample object based on the first extraction sub-network of the first matching network to obtain the first sample The first sample feature representation of the object, and performing feature extraction on the second sample mask image of the second sample object based on the second extraction sub-network of the first matching network, to obtain the second sample feature representation of the second sample object, On this basis, for each first sample object, based on the feature similarity between the first sample feature representation of the first sample object and each second sample feature representation, the first sample object and each The second sample object is predicted to be the predicted probability value of the same object, and based on the expected value of each predicted probability value, the predicted matching object of the first sample object and the relationship between the predicted matching object and the actual matching object of the first sample object are obtained The difference between the sub-losses corresponding to the first sample object is obtained, and the predicted matching object is the second sample object predicted to be the same object as the first sample object, and the actual matching object is actually the same object as the first sample object The second sample object, the actual matching object is determined based on the sample tracking information, so as to count the sub-loss corresponding to each first sample object, obtain the total loss value of the first matching network, and then adjust the first matching network based on the total loss value network parameters. The above method, on the one hand, trains the first matching network in the overall training target tracking model, which is beneficial to improve the training efficiency of the target tracking model; Computing a loss enables the first matching network to learn feature representations during training via differentiable matching.

In an implementation scenario, for the feature extraction process, reference may be made to relevant descriptions in the aforementioned disclosed embodiments.

In an implementation scenario, the feature similarity can be normalized to obtain a predicted probability value, and the normalization operation can be implemented through softmax. Further, based on the serial number value of the second sample object and the predicted probability value corresponding to the second sample object, the expected value can be obtained, and the value after rounding up the expected value can be used as the target serial number value, and the second Sample object, as the predicted matching object of the first sample object. In addition, for the calculation process of the predicted probability value and the determination process of the predicted matching object, please refer to the "based on the feature similarity between its first feature representation and the second feature representation of each second object, obtain the probability values predicted to be the same object as each second object", and related descriptions about "obtain a second object that is the same object as the first object based on each probability value".

In an implementation scenario, a loss function such as cross-entropy can be used to calculate the sub-loss. Exemplarily, the sub-loss can be expressed as:

In the above formula (8), y is used to mark whether the predicted matching object is the same as the actual matching object of the first sample object. In the same case, y can be set to 1, and in the different case, y can be set to is 0, and p represents the predicted probability value corresponding to the aforementioned predicted matching object. Further, taking the first sample image containing M first sample objects as an example, the sub-losses corresponding to these M first sample objects can be averaged to obtain the total loss of the first matching network

In an implementation scenario, after calculating the total loss of the first matching network, optimization methods such as gradient descent can be used to adjust the network parameters of the first matching network. For the adjustment process, please refer to the technical details of optimization methods such as gradient descent .

In an implementation scenario, the second matching network may include an optical flow prediction network, configured to use the second sample image to perform optical flow prediction on the first sample image to obtain a sample optical flow image of the first sample image, and the second The sample matching information is obtained based on the sample optical flow image, and reference may be made to related descriptions about the optical flow image and the second matching information in the aforementioned disclosed embodiments.

Step S113: Using the information fusion network of the target tracking model to fuse the first predicted matching information and the second predicted matching information to obtain predicted tracking information.

In the embodiments of the present disclosure, the prediction tracking information includes whether the first sample object and the second sample object are predicted to be the same object, and the process of information fusion can refer to the relevant descriptions in the aforementioned disclosed embodiments.

Step S114: Based on the difference between the sample tracking information and the predicted tracking information, adjust the network parameters of the target tracking model.

Here, loss functions such as cross-entropy can be used to process the difference between sample tracking information and predicted tracking information to obtain the total loss of the target tracking model, and then adjust the network parameters of the target tracking model based on optimization methods such as gradient descent. It should be noted that for the calculation process of loss, you can refer to the technical details of loss functions such as cross entropy, and for the adjustment process of parameters, you can refer to the technical details of optimization methods such as gradient descent. In addition, as mentioned above, before the overall training of the target tracking model, the above-mentioned target segmentation network, the first matching network, and the second matching network have all been trained and converged, so the aforementioned target can be fixed during the process of adjusting the network parameters of the target tracking model. The network parameters of the split network, the first matching network, and the second matching network only adjust the network parameters of the information fusion network. Of course, the network parameters of each network can also be adjusted at the same time, which is not limited here.

The above solution, on the one hand, performs object matching between images in the feature dimension, which can help to ensure the tracking effect of large-sized objects; The tracking effect of large-size objects, and based on this, the matching information obtained by fusing the two matching methods to obtain tracking information, so it can take into account both large-size objects and small-size objects, which is conducive to improving the accuracy of the target tracking model.

Please refer to FIG. 12 . FIG. 12 is a schematic frame diagram of the target tracking device 120 . The target tracking device 120 includes: a target segmentation part 121, an object matching part 122 and an information fusion part 123. The target segmentation part 121 is configured to perform target segmentation on the first image and the second image respectively to obtain the first object in the first image The first mask image of and the second mask image of the second object in the second image; the object matching part 122 is configured to perform object matching in the feature dimension based on the first mask image and the second mask image, and obtain the first a matching information, and perform object matching in the spatial dimension based on the first mask image and the second mask image to obtain the second matching information; the information fusion part 123 is configured to fuse the first matching information and the second matching information to obtain Tracking information; wherein, the tracking information includes whether the first object and the second object are the same object.

The above solution, on the one hand, performs object matching between images in the feature dimension, which can help to ensure the tracking effect of large-sized objects; Based on the tracking effect of large-size objects, and based on the fusion of the matching information obtained by the two matching methods to obtain tracking information, it can take into account both large-size objects and small-size objects, which is conducive to improving the accuracy of target tracking.

In some disclosed embodiments, the object matching part 122 includes a feature extraction subsection configured to extract the first feature representations of each first object based on the first mask image of each first object, and respectively based on the first mask images of each first object The second mask image of the two objects is extracted to obtain the second feature representation of each second object; the object matching part 122 includes a similarity measure subsection configured to use the first feature representation and the second feature representation to obtain each first object Feature similarity with each second object; the object matching part 122 includes a first matching subsection configured to obtain first matching information based on feature similarity between each first object and each second object.

Therefore, in the process of object matching between images in the feature dimension, it is only necessary to perform feature extraction on the mask image of each object, and then measure the feature similarity, which can reduce the cost of object matching between images in the feature dimension. Complexity, which is conducive to improving the tracking speed.

In some disclosed embodiments, the feature extraction subsection includes a boundary determination section configured to determine the boundary of the object based on the pixel values of each pixel in the mask image; wherein, the boundary of the object is the boundary of the object to which the mask image belongs; the feature extraction The sub-part includes an image cropping part, which is configured to cut out a region image from the mask image along the object boundary; the feature extraction sub-part includes a representation extraction part, which is configured to perform feature extraction based on the region image, and obtain a feature representation of the object to which it belongs; wherein , when the mask image is the first mask image, the belonging object is the first object, and the feature representation is the first feature representation; when the mask image is the second mask image, the belonging object is the second object , the feature representation is the second feature representation.

Therefore, in the process of feature extraction, the interference of pixels irrelevant to the object to which the mask image belongs can be eliminated, which is conducive to improving the accuracy of feature representation.

In some disclosed embodiments, the object matching part 122 includes an optical flow prediction subsection configured to use the second image to perform optical flow prediction on the first image to obtain an optical flow image of the first image; the object matching part 122 includes a pixel offset The shifting part is configured to shift the first mask image of the first object pixel by pixel based on the optical flow image, so as to obtain the predicted mask image of the first object at the shooting moment of the second image; the object matching part 122 includes The second matching subsection is configured to obtain second matching information based on the degree of overlap between the predicted mask image of each first object and the second mask image of each second object.

Therefore, in the process of object matching between images in the spatial dimension, on the one hand, object matching can be achieved based on pixel-level matching, which is conducive to greatly improving the tracking effect, especially for small-sized objects. On the other hand, based on optical flow After the image is shifted pixel by pixel, it only needs to measure the image coincidence to obtain the matching information, and it can also reduce the complexity of object matching between images in the spatial dimension, which is conducive to improving the tracking speed.

In some disclosed embodiments, the pixel offset subsection includes a pixel multiplication section configured to multiply the optical flow image and the first mask image pixel by pixel to obtain the offset value of the pixel in the first mask image The pixel offset subsection includes a pixel addition section configured to add the first pixel coordinates of the pixel in the first mask image to the offset value to obtain the second pixel coordinates of the pixel at the shooting moment; the pixel offset The shifting part includes an image acquisition part configured to obtain a predicted mask image based on the second pixel coordinates of the pixels in the first mask image.

Therefore, in the process of pixel-by-pixel offset, only simple operations such as pixel multiplication and addition are required, so the complexity of pixel offset can be greatly reduced, which is conducive to further improving the tracking speed.

In some disclosed embodiments, the first matching information includes a first matching degree between the first object and the second object, the second matching information includes a second matching degree between the first object and the second object, and the information fusion part 123 includes a weighting subpart configured to adaptively weight the first matching degree in the first matching information to obtain the first weighted matching information, and perform adaptive weighting to the second matching degree in the second matching information to obtain the second matching information Two weighted matching information; wherein, the first weighted matching information includes a first weighted matching degree between the first object and the second object, and the second weighted matching information includes a second weighted matching degree between the first object and the second object ; The information fusion part 123 includes a fusion subsection configured to fuse the first weighted matching information and the second weighted matching information to obtain final matching information; wherein, the final matching information includes the final matching information between the first object and the second object Matching degree; the information fusion part 123 includes an analysis sub-part configured to analyze based on the final matching information to obtain tracking information.

Therefore, in the fusion process of matching information, by adaptively weighting the first matching information and the second matching information, the importance of the two can be adaptively measured according to the actual situation, and then fusion is performed on this basis. It is beneficial to greatly improve the tracking accuracy.

In some disclosed embodiments, the tracking information is obtained by using a target tracking model to detect the first image and the second image, the target tracking model includes an information fusion network, and the information fusion network includes a first weighted sub-network and a second weighted sub-network , the first weighting subnetwork is used to adaptively weight the first matching degree, and the second weighting subnetwork is used to adaptively weight the second matching degree.

Therefore, the neural network can be used to learn the importance of the feature dimension and the spatial dimension to target tracking according to the actual situation, which is conducive to improving the efficiency and accuracy of adaptive weighting.

In some disclosed embodiments, the analysis subsection includes a combination section configured to combine pairs of each first object and each second object as a current object group respectively; the analysis subsection includes a determination section configured to be based on the current At least one of the first reference information and the second reference information of the object group determines whether the current first object and the current second object are the same object; wherein, the current first object is the first object in the current object group, The current second object is the second object in the current object group, the first reference information includes: the final matching degree between the current first object and each second object, and the second reference information includes: the current second object and each second object respectively The final degree of matching between the first objects.

Therefore, on the one hand, it can be determined whether the two objects in each object group are the same object, so as to avoid omission as much as possible, which is beneficial to improve the tracking accuracy; At least one of the information is also beneficial to improve the accuracy of the determination.

In some disclosed embodiments, the analysis sub-part includes a selection part configured to use the final matching degree between the current first object and the current second object as the matching degree to be analyzed; the determining part is also configured to perform any of the following: or: in response to the matching degree to be analyzed is the maximum value in the first reference information, determine that the current first object and the current second object are the same object; in response to the matching degree to be analyzed is the maximum value in the second reference information, determine that the current The first object and the current second object are the same object; in response to the matching degree to be analyzed being the maximum value of the first reference information and the second reference information, it is determined that the current first object and the current second object are the same object.

Therefore, on the one hand, the first two determination methods only need to search for the maximum value in the first reference information or the second reference information to complete the determination operation, which is beneficial to reduce the determination complexity and improve the determination speed; on the other hand, through the final A determination method searches the maximum value of the first reference information and the second reference information at the same time to complete the determination operation, and can realize collaborative verification on the basis of the first reference information and the second reference information, which is beneficial to reduce the complexity of determination, and Improve determination accuracy.

In some disclosed embodiments, the object tracking device 120 further includes a condition response part configured to use the tracking information as the first tracking information and acquire a third image in response to the tracking information meeting a preset condition; wherein, the third image, The first image and the second image are successively photographed respectively; the target tracking device 120 also includes a repeating tracking part configured to perform target tracking based on the third image and the second image to obtain second tracking information, wherein the second tracking The information includes whether the second object and the third object in the third image are the same object; the target tracking device 120 also includes an information checking part configured to perform a consistency check based on the first tracking information and the second tracking information, and obtain Check result.

Therefore, the timing inconsistency in target tracking can be greatly reduced, which is conducive to further improving the tracking accuracy.

In some disclosed embodiments, the preset condition includes: a target object exists in the second image; wherein, the target object is not the same object as any first object.

Therefore, the preset condition is set to the fact that there is no target object in the second image, and the target object is not the same object as any first object, so it can pass the timing consistency check, greatly reducing the impact of object disappearance, occlusion, etc. The impact of accuracy is beneficial to improve tracking accuracy.

In some disclosed embodiments, the same object in different images has the same object identifier, and the information checking part includes an information analysis subsection configured to analyze the target object based on the second tracking information to obtain an analysis result; the information checking part includes The first response subpart is configured to use the object identifier of the reference object as the object identifier of the target object in response to the fact that the analysis result includes that the target object and the reference object are the same object; wherein, the reference object is one of the third objects; information verification The section includes a second response subsection configured to mark the target object with a new object identification in response to the analysis result including that the target object is not the same object as any third object in the third image.

Therefore, it is possible to deal with the complex situation that the object reappears after disappearing due to object occlusion, object deformation and other reasons through timing consistency verification, and verify it according to the actual situation, which is conducive to improving the tracking effect of target tracking in complex situations.

Please refer to FIG. 13 . FIG. 13 is a schematic frame diagram of an object tracking model training device 130 provided by an embodiment of the present disclosure. The training device 130 of the target tracking model includes: a sample acquisition part 131, a sample matching part 132, a sample fusion part 133 and a parameter adjustment part 134, the sample acquisition part 131 is configured to obtain the first sample object in the first sample image The first sample mask image, the second sample mask image of the second sample object in the second sample image, and sample tracking information; wherein the sample tracking information includes whether the first sample object and the second sample object are actually the same object The sample matching part 132 is configured to perform object matching on the feature dimension of the first sample mask image and the second sample mask image based on the first matching network of the target tracking model to obtain the first predicted matching information, and based on the target The second matching network of the tracking model performs object matching on the first sample mask image and the second sample mask image in the spatial dimension to obtain the second predicted matching information; the sample fusion part 133 is configured to use the information of the target tracking model The fusion network fuses the first predicted matching information and the second predicted matching information to obtain predicted tracking information; wherein, the predicted tracking information includes whether the first sample object and the second sample object are predicted to be the same object; the parameter adjustment part 134 is configured as Based on the difference between the sample tracking information and the predicted tracking information, the network parameters of the object tracking model are adjusted.

In some disclosed embodiments, the first matching network has completed training before the overall training of the target tracking model, and the training device 130 of the target tracking model further includes a sample feature extraction part configured as a first extraction sub-network based on the first matching network Feature extraction is performed on the first sample mask image of the first sample object to obtain the first sample feature representation of the first sample object, and based on the second extraction sub-network of the first matching network for the second sample object The second sample mask image is subjected to feature extraction to obtain the second sample feature representation of the second sample object; the training device 130 of the target tracking model also includes a sub-loss calculation part configured to, for each first sample object, based on the first The feature similarity between the first sample feature representation of the sample object and each second sample feature representation is obtained to obtain the predicted probability value that the first sample object and each second sample object are predicted to be the same object, and based on each The expected value of the predicted probability value, the predicted matching object of the first sample object is obtained, and the sub-loss corresponding to the first sample object is obtained based on the difference between the predicted matching object and the actual matching object of the first sample object; wherein, The predicted matching object is the second sample object that is predicted to be the same object as the first sample object, the actual matching object is the second sample object that is actually the same object as the first sample object, and the actual matching object is determined based on the sample tracking information The training device 130 of the target tracking model also includes a total loss calculation part, which is configured to count the corresponding sub-losses of each first sample object to obtain the total loss value of the first matching network; the training device 130 of the target tracking model also includes a network The optimization part is configured to adjust network parameters of the first matching network based on the total loss value.

Therefore, on the one hand, the first matching network is trained first in the overall training target tracking model, which is beneficial to improve the training efficiency of the target tracking model; loss, which enables the first matching network to learn feature representations during training through differentiable matching.

In some disclosed embodiments, the sub-loss calculation section includes a normalization subsection; or, the sub-loss calculation section includes an expectation calculation subsection, a sequence number determination subsection, and an object prediction subsection; or, the sub-loss calculation section includes a normalization subsection subsection, expectation calculation subsection, serial number determination subsection and object prediction subsection; the normalization subsection is configured to normalize the feature similarity to obtain the predicted probability value; the sub loss calculation section includes the expectation calculation subsection , is configured to obtain the expected value based on the serial number value of the second sample object and the predicted probability value corresponding to the second sample object; the sub-loss calculation part includes a serial number determination sub-section, which is configured to take the value after rounding the expected value as the target The serial number value; the sub-loss calculation part includes an object prediction subsection configured to use the second sample object to which the target serial number value belongs as the predicted matching object of the first sample object.

Therefore, by normalizing the feature similarity to obtain the predicted probability value, it can help reduce the complexity of obtaining the predicted probability value, and based on the serial number value of the second sample object and the predicted probability value corresponding to the second sample object, get Expected value, and the value after rounding up the expected value is used as the target serial number value, and then the second sample object to which the target serial number value belongs is used as the predicted matching object of the first sample object, which can be simple through mathematical expectation and rounding up The operation determines the predicted matching object, which is beneficial to greatly reduce the complexity of determining the predicted matching object.

In some disclosed embodiments, the target tracking model further includes a target segmentation network; or, the second matching network includes an optical flow prediction network; or, the target tracking model further includes a target segmentation network, and the second matching network includes an optical flow prediction network; The first sample mask image and the second sample mask image are obtained by using the target segmentation network to perform target segmentation on the first sample image and the second sample image respectively, and the target segmentation network has been completed before training the first matching network Training; the optical flow prediction network is used to perform optical flow prediction on the first sample image by using the second sample image to obtain a sample optical flow image of the first sample image, and the second sample matching information is obtained based on the sample optical flow image of.

Therefore, the target tracking model also includes a target segmentation network. The first sample mask image and the second sample mask image are obtained by using the target segmentation network to perform target segmentation on the first sample image and the second sample image respectively, and the target The segmentation network has been trained before training the first matching network, so by training the target segmentation network in stages, the target tracking model can be trained step by step, which is conducive to improving the training efficiency and effect; while the second matching network includes the optical flow prediction network , which is used to predict the optical flow of the first sample image by using the second sample image to obtain the sample optical flow image of the first sample image, and the matching information of the second sample is obtained based on the sample optical flow image, which is beneficial to improve the optical flow Accuracy and efficiency of stream forecasting.

In the embodiments of the present disclosure and other embodiments, a "part" may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a unit, a module or a non-modular one.

Please refer to FIG. 14 . FIG. 14 is a schematic frame diagram of an electronic device 140 provided by an embodiment of the present disclosure. The electronic device 140 includes a memory 141 and a processor 142 coupled to each other, and the processor 142 is configured to execute the program instructions stored in the memory 141, so as to realize the steps of any of the above object tracking method embodiments, or to realize any of the above object tracking The steps of the embodiment of the training method of the model. In a specific implementation scenario, the electronic device 140 may include, but is not limited to: a microcomputer and a server. In addition, the electronic device 140 may also include mobile devices such as notebook computers and tablet computers, which are not limited here.

Here, the processor 142 is configured to control itself and the memory 141 to implement the steps of any of the above object tracking method embodiments, or to implement the steps of any of the above object tracking model training method embodiments. The processor 142 may also be called a CPU (Central Processing Unit, central processing unit). The processor 142 may be an integrated circuit chip with signal processing capability. The processor 142 can also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. In addition, the processor 142 may be jointly realized by an integrated circuit chip.

The above solution, on the one hand, performs object matching between images in the feature dimension, which can help to ensure the tracking effect of large-sized objects; tracking effect, and based on the fusion of the matching information obtained by the two matching methods to obtain tracking information, it can take into account both large-size objects and small-size objects, which is conducive to improving the accuracy of target tracking.

Please refer to FIG. 15 . FIG. 15 is a schematic frame diagram of a computer-readable storage medium 150 provided by an embodiment of the present disclosure. The computer-readable storage medium 150 stores program instructions 151 that can be executed by the processor, and the program instructions 151 are used to implement the steps of any of the above-mentioned object tracking method embodiments, or realize the steps of any of the above-mentioned object tracking model training method embodiments. .

An embodiment of the present disclosure also provides a computer program product, the computer program product includes a computer program or an instruction, and when the computer program or instruction is run on an electronic device, the electronic device is made to perform any of the above objectives The steps of the embodiment of the tracking method, or the steps of the embodiment of the training method of any target tracking model mentioned above.

In some embodiments of the present disclosure, a computer program (computer instructions) may take the form of a program, software, software module, script, or code in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages) ) and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

This disclosure relates to the field of augmented reality. By acquiring the image information of the target object in the real environment, and then using various visual correlation algorithms to detect or identify the relevant features, states and attributes of the target object, and thus obtain the image information that matches the specific application. AR effect combining virtual and reality. Exemplarily, the target object may involve faces, limbs, gestures, actions, etc. related to the human body, or markers and markers related to objects, or sand tables, display areas or display items related to venues or places. Vision-related algorithms can involve visual positioning, SLAM, 3D reconstruction, image registration, background segmentation, object key point extraction and tracking, object pose or depth detection, etc. Specific applications can not only involve interactive scenes such as guided tours, navigation, explanations, reconstructions, virtual effect overlays and display related to real scenes or objects, but also special effects processing related to people, such as makeup beautification, body beautification, special effect display, virtual Interactive scenarios such as model display.

The relevant features, states and attributes of the target object can be detected or identified through the convolutional neural network. The above-mentioned convolutional neural network is a network model obtained by performing model training based on a deep learning framework.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed methods and devices may be implemented in other ways. For example, the device implementation described above is only illustrative, for example, the division of parts is only a logical function division, and there may be other division methods in actual implementation, for example, units or components can be combined or integrated into another A system, or some feature, can be ignored, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

A unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit, that is, it may be located in one place, or may also be distributed to network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure is essentially or part of the contribution to the prior art, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) execute all or part of the steps of the methods in various embodiments of the present disclosure. The aforementioned storage medium may be a tangible device capable of holding and storing instructions used by the instruction execution device, and may be a volatile storage medium or a non-volatile storage medium. A computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or flash memory), static random access memory (SRAM), compact disc read only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, mechanically encoded device, such as a printer with instructions stored thereon A hole card or a raised structure in a groove, and any suitable combination of the above. As used herein, computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.

Industrial Applicability

Embodiments of the present disclosure provide a target tracking and related model training method, device, device, medium, and computer program product, wherein the target tracking method includes: performing target segmentation on the first image and the second image respectively to obtain the first A first mask image of the first object in the image and a second mask image of the second object in the second image; performing object matching in feature dimensions based on the first mask image and the second mask image to obtain first matching information , and perform object matching in the spatial dimension based on the first mask image and the second mask image to obtain the second matching information; fuse the first matching information and the second matching information to obtain tracking information; wherein, the tracking information includes the first object Whether it is the same object as the second object. The above solution can improve the target tracking accuracy.

Claims

A target tracking method, comprising:

performing object segmentation on the first image and the second image respectively, to obtain a first mask image of the first object in the first image and a second mask image of the second object in the second image;

Perform object matching in the feature dimension based on the first mask image and the second mask image to obtain first matching information, and perform object matching in the spatial dimension based on the first mask image and the second mask image The object is matched, and the second matching information is obtained;

Fusing the first matching information and the second matching information to obtain tracking information; wherein the tracking information includes whether the first object and the second object are the same object.
The method according to claim 1, wherein said performing object matching in feature dimensions based on said first mask image and said second mask image to obtain first matching information comprises:

Extracting first feature representations of each of the first objects based on the first mask images of each of the first objects, and extracting based on the second mask images of each of the second objects obtaining a second feature representation of each of said second objects;

Obtaining the feature similarity between each of the first objects and each of the second objects by using the first feature representation and the second feature representation;

The first matching information is obtained based on the feature similarity between each of the first objects and each of the second objects.
The method according to claim 2, wherein the step of extracting the first feature representation or the second feature representation comprises:

Determining an object boundary based on the pixel values of each pixel in the mask image; wherein, the object boundary is the boundary of the object to which the mask image belongs;

cropping a region image from the mask image along the object boundary;

performing feature extraction based on the region image to obtain a feature representation of the object to which it belongs;

Wherein, when the mask image is the first mask image, the belonging object is the first object, the feature representation is the first feature representation, and when the mask image is In the case of the second mask image, the belonging object is the second object, and the feature representation is the second feature representation.
The method according to any one of claims 1 to 3, wherein said performing object matching in spatial dimensions based on said first mask image and said second mask image to obtain second matching information includes:

performing optical flow prediction on the first image by using the second image to obtain an optical flow image of the first image;

Based on the optical flow image, performing a pixel-by-pixel shift on the first mask image of the first object to obtain a predicted mask image of the first object at the shooting moment of the second image;

The second matching information is obtained based on a degree of overlap between the predicted mask images of each of the first objects and the second mask images of each of the second objects.
The method according to claim 4, wherein, based on the optical flow image, the first mask image of the first object is shifted pixel by pixel to obtain the first object in the second image Predicted mask images at the time of capture, including:

Multiplying the optical flow image and the first mask image pixel by pixel to obtain the offset value of the pixel in the first mask image;

adding the first pixel coordinate of the pixel in the first mask image to the offset value to obtain the second pixel coordinate of the pixel at the shooting moment;

The predicted mask image is obtained based on the second pixel coordinates of the pixel points in the first mask image.
The method according to any one of claims 1 to 5, wherein the first matching information includes a first matching degree between the first object and the second object, and the second matching information includes the The second matching degree between the first object and the second object, and the fusing of the first matching information and the second matching information to obtain tracking information includes:

performing adaptive weighting on the first matching degree in the first matching information to obtain first weighted matching information, and performing adaptive weighting on the second matching degree in the second matching information to obtain a second weighted Matching information; wherein, the first weighted matching information includes a first weighted matching degree between the first object and the second object, and the second weighted matching information includes the first object and the second object a second weighted degree of matching between the two objects;

Fusing the first weighted matching information and the second weighted matching information to obtain final matching information; wherein the final matching information includes a final matching degree between the first object and the second object;

Analyze based on the final matching information to obtain the tracking information.
The method according to claim 6, wherein the tracking information is obtained by using a target tracking model to detect the first image and the second image, and the target tracking model includes an information fusion network, and the information The fusion network includes a first weighting subnetwork and a second weighting subnetwork, the first weighting subnetwork is used to adaptively weight the first matching degree, and the second weighting subnetwork is used to weight the second weighting subnetwork. The matching degree is adaptively weighted.
The method according to claim 6 or 7, wherein the analyzing based on the final matching information to obtain the tracking information includes:

Combining pairs of each of the first objects and each of the second objects as a current object group;

determining whether the current first object and the current second object are the same object based on at least one of the first reference information and the second reference information of the current object group;

Wherein, the current first object is the first object in the current object group, the current second object is the second object in the current object group, and the first reference information includes: the current first object The final matching degrees between an object and each of the second objects, the second reference information includes: the final matching degrees between the current second object and each of the first objects.
The method according to claim 8, wherein at least one of the first reference information and the second reference information based on the current object group is used to determine whether the current first object and the current second object are the same Before the object, the method also includes:

Taking the final matching degree between the current first object and the current second object as the matching degree to be analyzed;

The determining whether the current first object and the current second object are the same object based on at least one of the first reference information and the second reference information of the current object group includes any of the following:

In response to the matching degree to be analyzed being the maximum value in the first reference information, determining that the current first object and the current second object are the same object;

In response to the matching degree to be analyzed being the maximum value in the second reference information, determining that the current first object and the current second object are the same object;

In response to the matching degree to be analyzed being the maximum value of the first reference information and the second reference information, it is determined that the current first object and the current second object are the same object.
The method according to any one of claims 1 to 9, wherein, after said fusing said first matching information and said second matching information to obtain tracking information, said method further comprises:

In response to the tracking information meeting the preset condition, using the tracking information as the first tracking information, and acquiring a third image; wherein, the third image, the first image, and the second image are successively photographed;

Perform target tracking based on the third image and the second image to obtain second tracking information, wherein the second tracking information includes whether the second object is the same as a third object in the third image object;

Consistency verification is performed based on the first tracking information and the second tracking information to obtain a verification result.
The method according to claim 10, wherein the preset condition comprises: a target object exists in the second image; wherein the target object is not the same object as any one of the first objects.
The method according to claim 11, wherein the same object in different images has the same object identifier, and performing consistency verification based on the first tracking information and the second tracking information to obtain a verification result includes:

Analyzing the target object based on the second tracking information to obtain an analysis result;

In response to the analysis result including that the target object and the reference object are the same object, using the object identifier of the reference object as the object identifier of the target object; wherein the reference object is one of the third objects;

In response to the analysis result including that the target object is not the same object as any of the third objects in the third image, a new object identifier is marked for the target object.
A training method for a target tracking model, comprising:

Acquiring a first sample mask image of a first sample object in a first sample image, a second sample mask image of a second sample object in a second sample image, and sample tracking information; wherein the sample tracking information includes Whether the first sample object and the second sample object are actually the same object;

The first matching network based on the target tracking model performs object matching on the first sample mask image and the second sample mask image in the feature dimension to obtain first predicted matching information, and based on the target tracking The second matching network of the model performs object matching on the first sample mask image and the second sample mask image in the spatial dimension to obtain second predicted matching information;

Using the information fusion network of the target tracking model to fuse the first predicted matching information and the second predicted matching information to obtain predicted tracking information; wherein the predicted tracking information includes the first sample object and the Whether the second sample object is predicted to be the same object;

Adjusting network parameters of the target tracking model based on the difference between the sample tracking information and the predicted tracking information.
The method according to claim 13, wherein the first matching network has completed training before the overall training of the target tracking model, and the training step of the first matching network comprises:

performing feature extraction on the first sample mask image of the first sample object based on the first extraction sub-network of the first matching network to obtain a first sample feature representation of the first sample object, and performing feature extraction on the second sample mask image of the second sample object based on the second extraction sub-network of the first matching network to obtain a second sample feature representation of the second sample object;

For each of the first sample objects, based on the feature similarity between the first sample feature representation of the first sample object and each of the second sample feature representations, the first sample object is obtained Predicting the predicted probability values of the same object as each of the second sample objects, and obtaining the predicted matching object of the first sample object based on the expected value of each of the predicted probability values, and based on the predicted matching object and The difference between the actual matching objects of the first sample object is obtained to obtain the sub-loss corresponding to the first sample object; wherein, the predicted matching object is predicted to be the same object as the first sample object A second sample object, the actual matching object is a second sample object that is actually the same object as the first sample object, and the actual matching object is determined based on the sample tracking information;

Counting the sub-losses corresponding to each of the first sample objects to obtain the total loss value of the first matching network;

Adjusting network parameters of the first matching network based on the total loss value.
The method according to claim 14, wherein the first sample feature representation based on the first sample object is obtained based on the feature similarities between the first sample feature representations and each of the second sample feature representations, and the first The predicted probability values that the sample object and each of the second sample objects are predicted to be the same object include:

Normalizing the feature similarity to obtain the predicted probability value;

And/or, the obtaining the predicted matching object of the first sample object based on the expected value of each of the predicted probability values includes:

The expected value is obtained based on the serial number value of the second sample object and the predicted probability value corresponding to the second sample object; wherein, each of the second sample objects is respectively marked with a serial number value;

Taking the value after rounding up the expected value as the target serial number value;

The second sample object to which the target serial number value belongs is used as the predicted matching object of the first sample object.
The method according to any one of claims 13 to 15, wherein the target tracking model further includes a target segmentation network, and the first sample mask image and the second sample mask image use the target The segmentation network is obtained by performing target segmentation on the first sample image and the second sample image, and the target segmentation network has completed training before training the first matching network;

And/or, the second matching network includes an optical flow prediction network, configured to use the second sample image to perform optical flow prediction on the first sample image to obtain a sample optical flow of the first sample image image, and the second sample matching information is obtained based on the sample optical flow image.
A target tracking device, comprising:

The object segmentation part is configured to perform object segmentation on the first image and the second image respectively, to obtain a first mask image of the first object in the first image and a second mask image of the second object in the second image. film image;

The object matching part is configured to perform object matching in feature dimensions based on the first mask image and the second mask image, obtain first matching information, and based on the first mask image and the second mask image performing object matching on the mask image in a spatial dimension to obtain second matching information;

The information fusion part is configured to fuse the first matching information and the second matching information to obtain tracking information; wherein the tracking information includes whether the first object and the second object are the same object.
A training device for a target tracking model, comprising:

a sample acquisition part configured to acquire a first sample mask image of a first sample object in the first sample image, a second sample mask image of a second sample object in the second sample image, and sample tracking information; wherein , the sample tracking information includes whether the first sample object and the second sample object are actually the same object;

The sample matching part is configured to perform object matching on the feature dimension of the first sample mask image and the second sample mask image based on the first matching network of the target tracking model to obtain first predicted matching information , and performing object matching on the first sample mask image and the second sample mask image in the spatial dimension based on the second matching network of the target tracking model to obtain second predicted matching information;

The sample fusion part is configured to use the information fusion network of the target tracking model to fuse the first predicted matching information and the second predicted matching information to obtain predicted tracking information; wherein the predicted tracking information includes the first whether a sample object and the second sample object are predicted to be the same object;

The parameter adjustment part is configured to adjust the network parameters of the target tracking model based on the difference between the sample tracking information and the predicted tracking information.
An electronic device, comprising a memory and a processor coupled to each other, the processor is configured to execute program instructions stored in the memory, so as to implement the target tracking method according to any one of claims 1 to 12, or Realize the training method of the target tracking model described in any one of claims 13 to 16.
A computer-readable storage medium, on which program instructions are stored, and when the program instructions are executed by a processor, the target tracking method according to any one of claims 1 to 12 is realized, or any one of claims 13 to 16 is realized The training method of the target tracking model.
A computer program product, the computer program product comprising a computer program or an instruction, when the computer program or instruction is run on an electronic device, the electronic device is made to execute any one of claims 1 to 12 The target tracking method, or execute the training method of the target tracking model described in any one of claims 13 to 16.