WO2022222227A1 - Target detection network-based target tracking method and apparatus, device, and medium


Info

Publication number
WO2022222227A1
Authority
WO
WIPO (PCT)
Prior art keywords
detected
image
position information
layer
target
Application number
PCT/CN2021/096757
Other languages
French (fr)
Chinese (zh)
Inventor
赵娅琳
陆进
刘玉宇
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022222227A1 publication Critical patent/WO2022222227A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/277 Analysis of motion involving stochastic approaches, e.g. using Kalman filters

Definitions

  • the present application relates to the technical field of detection models, and in particular, to a target tracking method, apparatus, device and medium based on a target detection network.
  • Multi-target tracking technology is used in many application fields, such as motion correction, security monitoring, unmanned driving, etc.
  • In a security monitoring system, it is a common task to accurately locate and track the target.
  • Embodiments of the present application provide a target tracking method, apparatus, device, and medium based on a target detection network, so as to solve the problem of low accuracy of multi-target tracking.
  • A target tracking method based on a target detection network, comprising:
  • predicting, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determining a first ROI region corresponding to the first predicted position information;
  • performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
  • determining, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
  • A target tracking apparatus based on a target detection network, comprising:
  • a first position information acquisition module, configured to acquire first position information of a target object in a first image to be detected, and second position information of the target object in a second image to be detected, where the second image to be detected refers to an image in a video to be detected that is temporally adjacent to and located after the first image to be detected;
  • a first position information prediction module, configured to predict, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determine a first ROI region corresponding to the first predicted position information;
  • a first ROI region extraction module, configured to perform ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
  • a first position coincidence degree determination module, configured to determine a first minimum cosine distance between the first ROI region and the second ROI region, and simultaneously determine a first position coincidence degree between the second position information and the first predicted position information;
  • a first tracking matching module, configured to determine, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
  • predicting, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determining a first ROI region corresponding to the first predicted position information;
  • performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
  • determining, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
  • One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • predicting, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determining a first ROI region corresponding to the first predicted position information;
  • performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
  • determining, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
  • The above target tracking method, apparatus, device, and medium based on a target detection network acquire first position information of a target object in a first image to be detected and second position information of the target object in a second image to be detected, where the second image to be detected refers to an image in the video to be detected that is temporally adjacent to and located after the first image to be detected; predict, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determine a first ROI region corresponding to the first predicted position information; perform ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region; determine a first minimum cosine distance between the first ROI region and the second ROI region and a first position coincidence degree between the second position information and the first predicted position information; and determine, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
  • In this way, the present application can better use the shallow features in the target detection network as the appearance features for target tracking, and combine these appearance features with the Kalman filter model so that the position information of the target object in the next frame to be detected can be predicted. The calculation is fast, and the accuracy of target tracking is improved.
  • FIG. 1 is a schematic diagram of an application environment of a target tracking method based on a target detection network in an embodiment of the present application
  • FIG. 2 is a flowchart of a target tracking method based on a target detection network in an embodiment of the present application
  • FIG. 3 is a flowchart of a target tracking method based on a target detection network in an embodiment of the present application
  • FIG. 4 is a flowchart of a target tracking method based on a target detection network in an embodiment of the present application
  • FIG. 5 is a schematic block diagram of a target tracking device based on a target detection network according to an embodiment of the present application
  • FIG. 6 is a schematic block diagram of a first location information acquisition module in a target tracking device based on a target detection network according to an embodiment of the present application
  • FIG. 7 is a schematic block diagram of a target detection sub-module in a target tracking device based on a target detection network according to an embodiment of the present application
  • FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
  • the target tracking method based on the target detection network can be applied in the application environment shown in FIG. 1 .
  • the target tracking method based on the target detection network is applied in the target tracking system based on the target detection network.
  • The target tracking system based on the target detection network includes a client and a server, as shown in FIG. 1; the client and the server communicate through a network and are used to solve the problem of low multi-target tracking accuracy.
  • The client, also known as the user side, refers to the program that corresponds to the server and provides local services for the user.
  • Clients can be installed on, but not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • a target tracking method based on a target detection network is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • S10 Acquire first position information of the target object in the first image to be detected, and second position information of the target object in the second image to be detected;
  • The second image to be detected refers to an image in the video to be detected that is temporally adjacent to and located after the first image to be detected.
  • The first image to be detected and the second image to be detected can be any two consecutive frames in the video to be detected (for example, if the first image to be detected is the first frame of the video to be detected, the second image to be detected is its second frame), and the video to be detected can be selected according to the specific application scenario; for example, the first image to be detected and the second image to be detected can be two consecutive frames selected from a surveillance video.
  • The first position information refers to the position where the target object appears in the first image to be detected (exemplarily, the first position information may also include information such as the movement direction and movement speed of the target object); after target detection is performed on the image to be detected, this position can be represented by the set of all image blocks corresponding to the target object. Similarly, the second position information refers to the position where the target object appears in the second image to be detected. Optionally, the target object can be one target individual or multiple target individuals.
  • step S10 includes:
  • S101 Acquire a video to be detected, where the video to be detected includes multiple frames of images to be detected.
  • S102 Record the image to be detected in any frame of the video to be detected as the first image to be detected.
  • The video to be detected contains multiple frames of images to be detected. Since the frame rate of a video to be detected is generally above 25 frames per second, two consecutive frames may be very close and the target object is unlikely to change much between them. Therefore, when performing target detection on the video to be detected, the images to be detected can be acquired at intervals, that is, one image to be detected can be taken from the video every n frames.
  • The acquired images to be detected then form a composite video in chronological order. In the composite video, any image to be detected can be selected as the first image to be detected; at this time, the second image to be detected is the image in the composite video that is adjacent to and after the selected first image to be detected (in the original video to be detected, the second image is not adjacent to the first image, but is the image n frames after it).
  • n may be selected according to the specific application scenario; for example, n may be 4, 5, etc.
  • In this way, the efficiency of target detection on the images to be detected can be improved, and the computational pressure on the computer reduced. Further, if the computational pressure is not a concern, target detection can also be performed on every image to be detected in the video.
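  • As a brief illustration of the interval acquisition described above, the following is a minimal sketch assuming Python with OpenCV; the function name, the interval n, and the video path are illustrative placeholders, not part of the application:

```python
# A minimal sketch of interval frame sampling (every n-th frame), assuming
# OpenCV; sample_frames and n = 5 are illustrative assumptions.
import cv2

def sample_frames(video_path: str, n: int = 5):
    """Yield every n-th frame of the video as an image to be detected."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % n == 0:      # keep one frame out of every n
            yield index, frame
        index += 1
    cap.release()
```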
  • S103 Perform target detection on the first image to be detected through a target detection network to obtain the first position information; meanwhile, perform target detection on the second image to be detected through a target detection network to obtain the second position information .
  • Specifically, after the second image to be detected is determined according to the first image to be detected, the first image to be detected and the second image to be detected are sequentially input into the target detection network; the target detection network then performs target detection on the first image to be detected to obtain the first position information, and performs target detection on the second image to be detected to obtain the second position information.
  • In step S103, target detection is performed on the first image to be detected through a target detection network to obtain the first position information, including:
  • S1031 Input the first image to be detected into a backbone network in the target detection network, so as to perform downsampling processing on the first image to be detected, to obtain a plurality of images to be detected corresponding to the first image to be detected Detect feature layers.
  • the backbone network may adopt a Darknet network, a Resnet network, or the like.
  • the down-sampling process refers to reducing the first image to be detected, that is, the size of the image obtained after down-sampling is smaller than the size of the first image to be detected.
  • Specifically, the first image to be detected is input into the target detection network and down-sampled through the backbone network of the target detection network; for example, the first image to be detected is down-sampled five times through the backbone network to obtain five feature layers of different sizes to be detected.
  • Exemplarily, for the feature layer to be detected obtained after the first down-sampling of the first image to be detected, its size is half the size of the first image to be detected (that is, the length and width are both halved), but its number of channels is twice the number of channels of the first image to be detected.
  • In this embodiment, after the first image to be detected is down-sampled five times through the backbone network, the feature layer obtained by the first down-sampling is discarded, so the total number of feature layers to be detected finally adopted in this embodiment is four.
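  • A minimal sketch of such a down-sampling stage follows, assuming PyTorch; the stride-2 3*3 convolution, the channel counts, and the input size are assumptions for illustration, not the application's actual backbone:

```python
# Each stage halves the spatial size and doubles the channel count,
# matching the description above; the concrete backbone (Darknet, ResNet)
# may differ. Weights are random, so this is a shape sketch only.
import torch
import torch.nn as nn

class DownsampleStage(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels * 2,
                              kernel_size=3, stride=2, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.conv(x))

x = torch.randn(1, 32, 416, 416)                       # illustrative input
stages = [DownsampleStage(32 * 2 ** i) for i in range(5)]
features = []
for stage in stages:
    x = stage(x)
    features.append(x)          # five feature layers of shrinking size
f1, f2, f3, f4 = features[1:]   # the first (largest) layer is discarded
```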
  • S1032 Perform layer processing on each of the feature layers to be detected in sequence to obtain a target feature layer corresponding to the first image to be detected.
  • the feature layers to be detected include a first feature layer, a second feature layer, a third feature layer, and a fourth feature layer; step S1032 includes:
  • Convolution processing is performed on the fourth feature layer, and the fourth feature layer after convolution is up-sampled to obtain a fifth feature layer having the same dimension as the third feature layer.
  • Understandably, the fourth feature layer is the layer obtained by the last of the five down-samplings of the image to be detected, that is, the fourth feature layer has the smallest size. After the first image to be detected is input into the backbone network of the target detection network for down-sampling and the multiple feature layers to be detected corresponding to the first image to be detected are obtained, convolution processing is performed on the fourth feature layer.
  • Specifically, convolution processing is performed on the fourth feature layer through a convolutional network with a 3*3 convolution kernel, so that the fourth feature layer has the same number of channels as the third feature layer (in step S1031 it is pointed out that, compared with the layer before down-sampling, the feature layer obtained after each down-sampling has its size halved and its number of channels doubled). The convolved fourth feature layer is then up-sampled so that it has the same size and the same number of channels as the third feature layer, thereby obtaining the fifth feature layer with the same dimension as the third feature layer.
  • The fifth feature layer and the third feature layer are dimensionally superimposed to obtain a first superimposed layer; convolution processing is performed on the first superimposed layer, and the convolved first superimposed layer is up-sampled to obtain a sixth feature layer with the same dimension as the second feature layer.
  • Specifically, the first superimposed layer is convolved so that its number of channels matches that of the second feature layer, and the convolved first superimposed layer is up-sampled so that it has the same size and the same number of channels as the second feature layer, thereby obtaining the sixth feature layer with the same dimension as the second feature layer.
  • The sixth feature layer and the second feature layer are dimensionally superimposed to obtain a second superimposed layer; convolution processing is performed on the second superimposed layer, and the convolved second superimposed layer is up-sampled to obtain a seventh feature layer with the same dimension as the first feature layer.
  • Specifically, the second superimposed layer is convolved so that its number of channels matches that of the first feature layer, and the convolved second superimposed layer is up-sampled so that it has the same size and the same number of channels as the first feature layer, thereby obtaining the seventh feature layer with the same dimension as the first feature layer.
  • The seventh feature layer and the first feature layer are dimensionally superimposed to obtain a third superimposed layer; convolution processing is performed on the third superimposed layer, and the convolved third superimposed layer is up-sampled to obtain the target feature layer.
  • Specifically, after the seventh feature layer with the same dimension as the first feature layer is superimposed with the first feature layer, convolution processing is performed on the third superimposed layer so that its number of channels is doubled, and the convolved third superimposed layer is up-sampled so that its size is doubled, thereby obtaining the target feature layer.
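  • The layer processing above can be sketched as follows, assuming PyTorch and interpreting "dimensionally superimposed" as channel concatenation (an assumption); the convolution-then-2x-upsample block is schematic (untrained weights, shapes only), not the application's exact network:

```python
# Top-down layer processing: convolve, upsample to the next layer's
# dimension, superimpose (concatenate), and repeat until the target layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

def up_block(x, out_channels):
    """3x3 convolution to adjust channels, then 2x nearest upsampling."""
    conv = nn.Conv2d(x.shape[1], out_channels, kernel_size=3, padding=1)
    return F.interpolate(conv(x), scale_factor=2, mode="nearest")

def layer_processing(f1, f2, f3, f4):
    f5 = up_block(f4, f3.shape[1])        # fifth layer, same dimension as f3
    s1 = torch.cat([f5, f3], dim=1)       # first superimposed layer
    f6 = up_block(s1, f2.shape[1])        # sixth layer, same dimension as f2
    s2 = torch.cat([f6, f2], dim=1)       # second superimposed layer
    f7 = up_block(s2, f1.shape[1])        # seventh layer, same dimension as f1
    s3 = torch.cat([f7, f1], dim=1)       # third superimposed layer
    return up_block(s3, s3.shape[1] * 2)  # target feature layer, channels doubled
```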
  • S1033 Perform location feature extraction on the target feature layer to obtain the first location information.
  • Specifically, location feature extraction is performed on the target feature layer, that is, the pixel frames associated with the target object are extracted from the target feature layer, thereby obtaining the first position information of the target object in the first image to be detected.
  • S20 According to the first position information, use a Kalman filter model to predict the first predicted position information of the target object in the second to-be-detected image, and determine a first ROI area corresponding to the first predicted position information.
  • Understandably, the Kalman filter model is a state estimation model using a Kalman filter; it predicts the position information of the target object in the next frame to be detected (e.g., the second image to be detected) from the position information of the target object in the previous frame (e.g., the first position information in the first image to be detected). Further, the Kalman filter model needs to be trained with the images to be detected of the first k frames of the video to be detected, so that in step S20 it can better predict the position information of the moving target object, thereby improving target tracking accuracy.
  • Specifically, according to the first position information, the Kalman filter model is used to predict the position information of the target object in the second image to be detected, that is, the first predicted position information, and an area associated with the first predicted position information, that is, the first ROI region, is extracted from the second image to be detected.
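  • The prediction step can be sketched as follows, assuming a constant-velocity Kalman filter over a box state [cx, cy, w, h, vx, vy, vw, vh]; this state design is an assumption for illustration, and the application's actual filter may differ:

```python
# Kalman prediction of the next-frame box (the first predicted position
# information); x is the 8-dim state, P its covariance, Q process noise.
import numpy as np

dt = 1.0                          # one frame step
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)        # position components += velocity * dt

def kalman_predict(x: np.ndarray, P: np.ndarray, Q: np.ndarray):
    x_pred = F @ x                # state extrapolation
    P_pred = F @ P @ F.T + Q      # covariance extrapolation
    return x_pred, P_pred

# x_pred[:4] gives the predicted box, which defines the first ROI region
# to extract from the second image to be detected.
```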
  • S40 Determine a first minimum cosine distance between the first ROI region and the second ROI region, and simultaneously determine a first position coincidence degree between the second position information and the first predicted position information.
  • the first minimum cosine distance is used to characterize the feature similarity between the first ROI area and the second ROI area; the first position coincidence is used to characterize the position between the second position information and the first predicted position information. similarity.
  • The value range of the first minimum cosine distance can be 0 to 1; in the convention used here, the larger the first minimum cosine distance, the higher the feature similarity between the first ROI region and the second ROI region. At the same time, the first position coincidence degree between the second position information and the first predicted position information is determined.
  • The value range of the first position coincidence degree can also be 0 to 1; the higher the first position coincidence degree, the greater the degree of coincidence between the second position information and the first predicted position information.
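  • Under the convention above (larger value = more similar), the appearance score between two ROI regions can be sketched as below, assuming each ROI is represented by one or more feature vectors taken from the detection network; classical formulations instead report 1 minus this value as the distance:

```python
# Appearance score between two ROI feature sets; for non-negative
# (post-ReLU) features the score lies in [0, 1], larger = more alike.
import numpy as np

def roi_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    a = np.atleast_2d(feats_a)
    b = np.atleast_2d(feats_b)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return float((a @ b.T).max())   # best cosine match over all pairs
```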
  • In step S40, determining the first position coincidence degree between the second position information and the first predicted position information includes:
  • determining the intersection position information between the second position information and the first predicted position information, and simultaneously determining the union position information between the second position information and the first predicted position information.
  • The intersection position information refers to the position information shared by the second position information and the first predicted position information; the union position information refers to all position information of the second position information and the first predicted position information, that is, including both the shared position information and the uniquely owned position information.
  • the position coincidence degree is determined according to the intersection position information and the union position information.
  • The position coincidence degree can be determined according to the following expression:
  • C = (A∩B) / (A∪B)
  • where C is the position coincidence degree, A is the second position information, B is the first predicted position information, A∪B is the union position information, and A∩B is the intersection position information.
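  • For axis-aligned boxes this expression is the ordinary intersection-over-union; a minimal sketch follows, where the (x1, y1, x2, y2) box layout is an assumption:

```python
# Position coincidence degree C = (A ∩ B) / (A ∪ B) for two boxes.
def coincidence_degree(a, b) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # intersection area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter                     # union area
    return inter / union if union > 0 else 0.0
```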
  • S50 According to the first minimum cosine distance and the first position coincidence degree, determine the first tracking matching result of the target object in the second to-be-detected image by using the Hungarian algorithm.
  • The first tracking matching result may indicate a successful match, that is, that the first predicted position information matches the second position information (for example, both the first predicted position information and the second position information contain the features of the target object); it may also indicate a matching failure, that is, that the first predicted position information does not match the second position information (for example, the first predicted position information contains the features of the target object while the second position information does not, or vice versa).
  • the first minimum cosine distance between the first ROI area and the second ROI area is determined, and at the same time, the first position coincidence between the second position information and the first predicted position information is determined.
  • the first tracking matching result of the target object in the second to-be-detected image is determined by the Hungarian algorithm.
  • Specifically, if the first minimum cosine distance is greater than or equal to a preset cosine threshold (for example, set to 0.9, 0.95, etc.) and the first position coincidence degree is greater than or equal to a preset position coincidence threshold, the first tracking matching result is determined to be a successful matching result; if the first minimum cosine distance is smaller than the preset cosine threshold, and/or the first position coincidence degree is smaller than the preset position coincidence threshold, the first tracking matching result is determined to be a matching failure result.
  • In this way, the matching result can be determined.
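  • The matching step can be sketched as follows, assuming SciPy's Hungarian solver; combining the two scores into one cost and the position threshold value of 0.5 are assumptions for illustration, while the 0.9/0.95-style cosine threshold follows the example above:

```python
# Hungarian assignment between predicted ROIs (rows) and detected ROIs
# (columns), gated by the two thresholds described in the text.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(sim: np.ndarray, iou: np.ndarray,
          sim_thresh: float = 0.9, iou_thresh: float = 0.5):
    cost = -(sim + iou)                     # Hungarian minimises cost
    rows, cols = linear_sum_assignment(cost)
    matches, failures = [], []
    for i, j in zip(rows, cols):
        if sim[i, j] >= sim_thresh and iou[i, j] >= iou_thresh:
            matches.append((i, j))          # successful matching result
        else:
            failures.append((i, j))         # matching failure result
    return matches, failures
```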
  • Optionally, after the first tracking matching result is obtained, the corresponding first image to be detected and second image to be detected can be used as training samples for training the Kalman filter model and improving its prediction accuracy.
  • Optionally, the following loss function can be used to constrain the target detection network:
  • L = L₁ + λ₁L₂ + λ₂L₃
  • where L is the loss function of the target detection network, L₁ is the focal loss function of the target detection network, L₂ is the position loss function, L₃ is the pixel offset loss function, λ₁ is the weight of the position loss function (the value can be 1), and λ₂ is the weight of the pixel offset loss function (the value can be 0.1).
  • L₁ can be characterized by a focal loss expression of the standard form L₁ = −α(1−p_t)^γ·log(p_t), where p_t is the predicted confidence for the true class, α is a balancing weight, and γ is the focusing factor.
  • L₂ can be characterized by the following expression:
  • L₂ = 1 − [ (A∩B)/(A∪B) − (E − A∪B)/E ]
  • where A is the second position information, B is the first predicted position information, A∪B is the union position information, A∩B is the intersection position information, and E is the minimum closure area (the smallest region enclosing both A and B).
  • L₃ refers to the loss on the offset values of the pixels produced during the down-sampling of the first image to be detected in step S1031 (similarly, down-sampling also needs to be performed on the second image to be detected).
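  • A minimal sketch of the combined loss follows, reading L₂ as a GIoU-style position loss (the "minimum closure area" reading is an interpretation) and using λ₁ = 1 and λ₂ = 0.1 as stated above; the focal term L₁ and offset term L₃ are taken as precomputed scalars:

```python
# L = L1 + λ1·L2 + λ2·L3; boxes use an assumed (x1, y1, x2, y2) layout.
def position_loss(a, b) -> float:
    """GIoU-style position loss L2 between predicted and detected boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)          # A ∩ B
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)          # A ∪ B
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    closure = (ex2 - ex1) * (ey2 - ey1)                        # minimum closure area E
    iou = inter / union if union > 0 else 0.0
    giou = iou - (closure - union) / closure if closure > 0 else iou
    return 1.0 - giou

def total_loss(l1_focal: float, l2_position: float, l3_offset: float,
               lambda1: float = 1.0, lambda2: float = 0.1) -> float:
    """Weighted sum L = L1 + λ1·L2 + λ2·L3 with the weights stated above."""
    return l1_focal + lambda1 * l2_position + lambda2 * l3_offset
```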
  • In this way, the shallow features in the target detection network can be better used as the appearance features for target tracking, and these appearance features are then combined with the Kalman filter model so that the position information of the target object in the next frame to be detected can be predicted. The calculation is fast, which improves the accuracy of target tracking.
  • In an embodiment, after step S50, that is, after determining the first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm, the method includes:
  • when the first tracking matching result is a matching failure result, incrementing the total number of matching failures by one.
  • The total number of matching failures refers to the total number of times the first tracking matching result is a matching failure result.
  • the third to-be-detected image refers to an image that is time-adjacent to the second to-be-detected image and located after the second to-be-detected image.
  • the preset failure threshold can be 3 times, 4 times, etc.
  • The preset detection time can be 2 minutes, 5 minutes, etc. It can be understood that the second image to be detected and the third image to be detected can be any two consecutive frames in the video to be detected; for example, the second image to be detected is the second frame of the video to be detected, and the third image to be detected is its third frame.
  • Specifically, when the total number of matching failures within the preset detection time is less than the preset failure threshold, it indicates that the matching failure may be caused by the target object being temporarily occluded, and target tracking continues on the images to be detected of subsequent frames: the third image to be detected is acquired from the video to be detected and input into the target detection network, and target detection is performed on it through the target detection network to obtain the third position information of the target object in the third image to be detected.
  • a Kalman filter model is used to predict the second predicted position information of the target object in the third to-be-detected image, and a third ROI region corresponding to the second predicted position information is determined.
  • Specifically, according to the second position information, the Kalman filter model is used to predict the position information of the target object in the third image to be detected, that is, the second predicted position information, and an area associated with the second predicted position information, that is, the third ROI region, is extracted from the third image to be detected.
  • According to the third position information, ROI region extraction is performed on the third image to be detected to obtain a fourth ROI region.
  • Specifically, an area associated with the third position information is extracted from the third image to be detected to obtain the fourth ROI region.
  • A second minimum cosine distance between the third ROI region and the fourth ROI region is determined, and at the same time a second position coincidence degree between the third position information and the second predicted position information is determined.
  • The value range of the second minimum cosine distance can be 0 to 1; the larger the second minimum cosine distance, the higher the feature similarity between the third ROI region and the fourth ROI region. At the same time, the second position coincidence degree between the third position information and the second predicted position information is determined.
  • The value range of the second position coincidence degree can also be 0 to 1; the higher the second position coincidence degree, the greater the degree of coincidence between the third position information and the second predicted position information.
  • According to the second minimum cosine distance and the second position coincidence degree, the second tracking matching result of the target object in the third image to be detected is determined by the Hungarian algorithm.
  • The second tracking matching result may indicate a successful match, that is, that the second predicted position information matches the third position information (for example, both the second predicted position information and the third position information contain the features of the target object); it may also indicate a matching failure, that is, that the second predicted position information does not match the third position information (for example, the second predicted position information contains the features of the target object while the third position information does not, or vice versa).
  • the second minimum cosine distance between the third ROI area and the fourth ROI area is determined, and at the same time, the second position coincidence between the third position information and the second predicted position information is determined.
  • the second tracking matching result of the target object in the third to-be-detected image is determined by the Hungarian algorithm.
  • Specifically, if the second minimum cosine distance is greater than or equal to a preset cosine threshold (for example, set to 0.9, 0.95, etc.) and the second position coincidence degree is greater than or equal to the preset position coincidence threshold, the second tracking matching result is determined to be a successful matching result; if the second minimum cosine distance is less than the preset cosine threshold, and/or the second position coincidence degree is less than the preset position coincidence threshold, the second tracking matching result is determined to be a matching failure result.
  • In an embodiment, after the second tracking matching result is determined, the method further includes:
  • when the second tracking matching result is a matching failure result, incrementing the total number of matching failures by one;
  • when the total number of matching failures is greater than or equal to the preset failure threshold, deleting the tracking ID associated with the target object and confirming that tracking of the target object has ended.
  • the tracking ID is a unique ID assigned to each target object before the target object is tracked. If the target object contains multiple target individuals, each target individual can be assigned a tracking ID.
  • Specifically, when the total number of matching failures within the preset detection time is greater than or equal to the preset failure threshold (for example, the tracking matching results of three consecutive target trackings are all matching failures), it can be concluded that the matching failures were not caused by the target object being temporarily occluded for a short time, but by the target object leaving the detection area. Therefore, the tracking ID associated with the target object can be deleted, and in subsequent target tracking there is no need to continue tracking the target object; that is, the end of tracking the target object is confirmed, which reduces the computational load on the computer. Further, when the total number of matching failures within the preset detection time is greater than or equal to the preset failure threshold, the total number of matching failures may be cleared.
  • In this way, by judging whether the total number of matching failures within the preset detection time is greater than or equal to the preset failure threshold, it can be further determined whether the target object is temporarily occluded for a short time or has left the detection area, which improves target tracking accuracy.
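  • The failure bookkeeping described above can be sketched as follows; the Track structure, the threshold of 3, and the reset-on-match behaviour are illustrative assumptions drawn from the examples in the text:

```python
# Per-target tracking ID with a matching-failure counter; once the counter
# reaches the preset failure threshold, the tracking ID is deleted.
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    failures: int = 0               # total number of matching failures

FAILURE_THRESHOLD = 3               # e.g. 3 or 4 failures within the window

def update_track(track: Track, matched: bool, active: dict) -> None:
    if matched:
        track.failures = 0          # occlusion was only temporary
    else:
        track.failures += 1
        if track.failures >= FAILURE_THRESHOLD:
            del active[track.track_id]   # target has left the detection area
```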
  • a target tracking apparatus based on a target detection network corresponds to the target tracking method based on the target detection network in the above embodiment.
  • As shown in FIG. 5, the target tracking apparatus based on the target detection network includes a first position information acquisition module 10, a first position information prediction module 20, a first ROI region extraction module 30, a first position coincidence degree determination module 40, and a first tracking matching module 50. The detailed description of each functional module is as follows:
  • The first position information acquisition module 10 is used to acquire first position information of the target object in the first image to be detected, and second position information of the target object in the second image to be detected; the second image to be detected refers to an image in the video to be detected that is temporally adjacent to and located after the first image to be detected.
  • The first position information prediction module 20 is configured to predict, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determine the first ROI region corresponding to the first predicted position information.
  • The first ROI region extraction module 30 is configured to perform ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region.
  • The first position coincidence degree determination module 40 is configured to determine the first minimum cosine distance between the first ROI region and the second ROI region, and simultaneously determine the first position coincidence degree between the second position information and the first predicted position information.
  • The first tracking matching module 50 is configured to determine, according to the first minimum cosine distance and the first position coincidence degree, the first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
  • the first location information acquisition module 10 includes:
  • The to-be-detected video acquisition sub-module 101 is configured to acquire a video to be detected, where the video to be detected includes multiple frames of images to be detected.
  • The to-be-detected image recording sub-module 102 is configured to record any frame of the video to be detected as the first image to be detected.
  • The target detection sub-module 103 is configured to perform target detection on the selected first image to be detected through a target detection network to obtain the first position information, and at the same time perform target detection on the second image to be detected through the target detection network to obtain the second position information.
  • the target detection sub-module 103 includes:
  • The down-sampling processing unit 1031 is configured to input the first image to be detected into the backbone network of the target detection network to perform down-sampling processing on the first image to be detected, and obtain multiple feature layers to be detected corresponding to the first image to be detected.
  • a layer processing unit 1032 configured to sequentially perform layer processing on each of the feature layers to be detected to obtain a target feature layer corresponding to the first image to be detected;
  • the location feature extraction unit 1033 is configured to perform location feature extraction on the target feature layer to obtain the first location information.
  • the feature layers to be detected include a first feature layer, a second feature layer, a third feature layer and a fourth feature layer;
  • the layer processing unit includes:
  • The first layer processing sub-unit is used to perform convolution processing on the fourth feature layer, and up-sample the convolved fourth feature layer to obtain a fifth feature layer with the same dimension as the third feature layer.
  • The second layer processing sub-unit is configured to dimensionally superimpose the fifth feature layer and the third feature layer to obtain a first superimposed layer, perform convolution processing on the first superimposed layer, and up-sample the convolved first superimposed layer to obtain a sixth feature layer with the same dimension as the second feature layer.
  • The third layer processing sub-unit is configured to dimensionally superimpose the sixth feature layer and the second feature layer to obtain a second superimposed layer, perform convolution processing on the second superimposed layer, and up-sample the convolved second superimposed layer to obtain a seventh feature layer with the same dimension as the first feature layer.
  • The fourth layer processing sub-unit is configured to dimensionally superimpose the seventh feature layer and the first feature layer to obtain a third superimposed layer, perform convolution processing on the third superimposed layer, and up-sample the convolved third superimposed layer to obtain the target feature layer.
  • the first position coincidence degree determination module 40 includes:
  • an intersection position information determination sub-module, configured to determine the intersection position information between the second position information and the first predicted position information, and simultaneously determine the union position information between the second position information and the first predicted position information;
  • the position coincidence degree determination submodule is configured to determine the position coincidence degree according to the intersection position information and the union position information.
  • the target tracking device based on the target detection network further includes:
  • The first matching failure total accumulation module is configured to increment the total number of matching failures by one when the first tracking matching result is a matching failure result.
  • The second position information acquisition module is configured to, when the total number of matching failures is less than the preset failure threshold, acquire the third image to be detected in the video to be detected, and third position information of the target object in the third image to be detected; the third image to be detected refers to an image that is temporally adjacent to and located after the second image to be detected.
  • The second position information prediction module is configured to predict, according to the second position information, second predicted position information of the target object in the third image to be detected by using the Kalman filter model, and determine the third ROI region corresponding to the second predicted position information.
  • The second ROI region extraction module is configured to perform ROI region extraction on the third image to be detected according to the third position information to obtain a fourth ROI region.
  • The second position coincidence degree determination module is configured to determine the second minimum cosine distance between the third ROI region and the fourth ROI region, and simultaneously determine the second position coincidence degree between the third position information and the second predicted position information.
  • The second tracking matching module is configured to determine, according to the second minimum cosine distance and the second position coincidence degree, the second tracking matching result of the target object in the third image to be detected by using the Hungarian algorithm.
  • the target tracking device based on the target detection network includes:
  • a second matching failure total accumulation module, configured to increment the total number of matching failures by one when the second tracking matching result is a matching failure result;
  • a tracking end confirmation module configured to delete the tracking ID associated with the target object when the total number of matching failures is greater than or equal to the preset failure threshold, and confirm that the tracking of the target object is ended.
  • Each module in the above-mentioned target tracking device based on target detection network can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8 .
  • the computer device includes a processor, memory, a network interface, and a database connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a readable storage medium, an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the readable storage medium.
  • the database of the computer device is used to store the data used by the target tracking method based on the target detection network in the above embodiment.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer readable instructions when executed by a processor, implement a target tracking method based on a target detection network.
  • the readable storage medium provided by this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer apparatus comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor executing the computer readable instructions Implement the following steps when instructing:
  • the Kalman filter model is used to predict the first predicted position information of the target object in the second to-be-detected image, and the first ROI area corresponding to the first predicted position information is determined;
  • the second position information extract the ROI area on the second to-be-detected image to obtain a second ROI area
  • the first tracking matching result of the target object in the second image to be detected is determined by the Hungarian algorithm.
  • one or more readable storage media are provided that store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to execute follows the steps below:
  • the Kalman filter model is used to predict the first predicted position information of the target object in the second to-be-detected image, and the first ROI area corresponding to the first predicted position information is determined;
  • the second position information extract the ROI area on the second to-be-detected image to obtain a second ROI area
  • the first tracking matching result of the target object in the second image to be detected is determined by the Hungarian algorithm.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

A target detection network-based target tracking method and apparatus, a device, and a medium. The method is performed by means of obtaining first position information of a target object in a first image to undergo detection, and second position information of the target object in a second image to undergo detection (S10); utilizing a Kalman filter model to predict first predicted position information of the target object in the second image to undergo detection according to the first position information, and determining a first ROI region corresponding to the first predicted position information (S20); performing ROI region extraction on the second image to undergo detection according to the second position information, and obtaining a second ROI region (S30); determining a first least cosine distance between the first ROI region and the second ROI region, and simultaneously determining a first positional coincidence degree between the second position information and the first predicted position information (S40); and determining a first tracking matching result by means of the Hungarian algorithm and according to the first least cosine distance and the first positional coincidence degree (S50). The efficiency and accuracy of target tracking are improved.

Description

Target tracking method, apparatus, device, and medium based on a target detection network

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 22, 2021, with application number 202110434628.2 and entitled "Target tracking method, apparatus, device and medium based on target detection network", the entire content of which is incorporated herein by reference.
Technical Field

The present application relates to the technical field of detection models, and in particular to a target tracking method, apparatus, device, and medium based on a target detection network.
Background

Multi-target tracking technology is used in many application fields, such as motion correction, security monitoring, and unmanned driving. In security monitoring systems, accurately locating and tracking targets is a common task.

The inventor realized that, since there are often multiple objects to be tracked at the same time and the appearance similarity between tracked objects is close, the security monitoring technology in the prior art cannot determine a target's identity from appearance features alone, causing detected targets and tracking trajectories to be mismatched and affecting the accuracy of multi-target tracking. Further, during target tracking there may be occlusion and scale changes between tracked objects, and it cannot be determined whether a tracked object has temporarily disappeared due to occlusion or has left the detection area, resulting in low multi-target tracking accuracy.
Summary

Embodiments of the present application provide a target tracking method, apparatus, device, and medium based on a target detection network, so as to solve the problem of low multi-target tracking accuracy.
A target tracking method based on a target detection network, comprising:

acquiring first position information of a target object in a first image to be detected, and second position information of the target object in a second image to be detected, where the second image to be detected refers to an image in a video to be detected that is temporally adjacent to and located after the first image to be detected;

predicting, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determining a first ROI region corresponding to the first predicted position information;

performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;

determining a first minimum cosine distance between the first ROI region and the second ROI region, and simultaneously determining a first position coincidence degree between the second position information and the first predicted position information; and

determining, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
A target tracking apparatus based on a target detection network, comprising:

a first position information acquisition module, configured to acquire first position information of a target object in a first image to be detected, and second position information of the target object in a second image to be detected, where the second image to be detected refers to an image in a video to be detected that is temporally adjacent to and located after the first image to be detected;

a first position information prediction module, configured to predict, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determine a first ROI region corresponding to the first predicted position information;

a first ROI region extraction module, configured to perform ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;

a first position coincidence degree determination module, configured to determine a first minimum cosine distance between the first ROI region and the second ROI region, and simultaneously determine a first position coincidence degree between the second position information and the first predicted position information; and

a first tracking matching module, configured to determine, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:

acquiring first position information of a target object in a first image to be detected, and second position information of the target object in a second image to be detected, wherein the second image to be detected is an image in a video to be detected that is temporally adjacent to the first image to be detected and comes after the first image to be detected;

predicting, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determining a first ROI region corresponding to the first predicted position information;

performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;

determining a first minimum cosine distance between the first ROI region and the second ROI region, and determining a first position coincidence degree between the second position information and the first predicted position information; and

determining, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:

acquiring first position information of a target object in a first image to be detected, and second position information of the target object in a second image to be detected, wherein the second image to be detected is an image in a video to be detected that is temporally adjacent to the first image to be detected and comes after the first image to be detected;

predicting, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determining a first ROI region corresponding to the first predicted position information;

performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;

determining a first minimum cosine distance between the first ROI region and the second ROI region, and determining a first position coincidence degree between the second position information and the first predicted position information; and

determining, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
In the above target detection network-based target tracking method, apparatus, device, and medium, the method acquires first position information of a target object in a first image to be detected and second position information of the target object in a second image to be detected, the second image to be detected being an image in a video to be detected that is temporally adjacent to and comes after the first image to be detected; predicts, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and determines a first ROI region corresponding to the first predicted position information; performs ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region; determines a first minimum cosine distance between the first ROI region and the second ROI region, and determines a first position coincidence degree between the second position information and the first predicted position information; and determines, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.

By introducing the target detection network and the Kalman filter model, the present application can use shallow features of the target detection network as appearance features for target tracking; the Kalman filter model then predicts, from the appearance features determined by the target detection network, the position of the target object in the next image to be detected. The computation is fast, and the accuracy of target tracking is improved.

The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of the drawings

In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of a target detection network-based target tracking method in an embodiment of the present application;

FIG. 2 is a flowchart of a target detection network-based target tracking method in an embodiment of the present application;

FIG. 3 is a flowchart of a target detection network-based target tracking method in an embodiment of the present application;

FIG. 4 is a flowchart of a target detection network-based target tracking method in an embodiment of the present application;

FIG. 5 is a schematic block diagram of a target detection network-based target tracking apparatus in an embodiment of the present application;

FIG. 6 is a schematic block diagram of a first position information acquisition module in a target detection network-based target tracking apparatus in an embodiment of the present application;

FIG. 7 is a schematic block diagram of a target detection sub-module in a target detection network-based target tracking apparatus in an embodiment of the present application;

FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed description of the embodiments

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

The target detection network-based target tracking method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in a target detection network-based target tracking system that includes a client and a server, as shown in FIG. 1; the client communicates with the server over a network, and the system is used to solve the problem of low accuracy in multi-target tracking. The client, also called the user terminal, is a program that corresponds to the server and provides local services to the user; it can be installed on, but is not limited to, personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a target detection network-based target tracking method is provided. Taking the method as applied to the server in FIG. 1 as an example, it includes the following steps:

S10: Acquire first position information of a target object in a first image to be detected, and second position information of the target object in a second image to be detected; the second image to be detected is an image in a video to be detected that is temporally adjacent to the first image to be detected and comes after it.

Understandably, the first and second images to be detected can be any two consecutive frames of the video to be detected (for example, if the first image to be detected is the first frame of the video, the second image to be detected is its second frame). The video to be detected can be chosen according to the specific application scenario; in the field of intelligent surveillance, for instance, the two images can be two consecutive frames selected from a surveillance video. Further, the first position information is the position at which the target object appears in the first image to be detected (illustratively, it may also include the target object's direction and speed of movement); after target detection is performed on the image, this position can be represented by the set of all image blocks corresponding to the target object. Likewise, the second position information is the position at which the target object appears in the second image to be detected. Optionally, the target object may be a single target individual or multiple target individuals.
In an embodiment, as shown in FIG. 3, step S10 includes:

S101: Acquire a video to be detected, the video containing multiple frames of images to be detected.

S102: Record any one frame of the video to be detected as the first image to be detected.

Understandably, the video to be detected contains multiple frames. Since its frame rate is generally above 25 frames per second, two consecutive frames tend to be very similar, and the target object is unlikely to change between them. Therefore, when performing target detection on the frames of the video, the images to be detected can be sampled at intervals: one frame is taken every n frames of the video, and the sampled frames are arranged in chronological order to form a composite video. In this composite video, any frame can be selected as the first image to be detected, and the second image to be detected is then the frame in the composite video that is adjacent to and comes after the selected first image (in the original video, this second image is not the frame immediately following the first image, but the frame n frames after it). Here n can be chosen according to the specific application scenario; illustratively, n can be 4 or 5. Sampling the frames at intervals improves the efficiency of target detection and reduces the computational load on the computer. Further, if computational load is not a concern, target detection can also be performed on every frame of the video.
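Illustratively, the interval sampling described above can be sketched in Python as follows; the OpenCV-based reader and the choice n = 4 are assumptions made for illustration only, not part of the disclosed method:

```python
import cv2

def sample_frames(video_path, n=4):
    """Yield every n-th frame of the video as an image to be detected."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % n == 0:
            # consecutive yielded frames act as the "adjacent" images
            # of the composite video described above
            yield frame
        index += 1
    cap.release()
```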
S103: Perform target detection on the first image to be detected through the target detection network to obtain the first position information, and perform target detection on the second image to be detected through the target detection network to obtain the second position information.

Specifically, after any frame of the video to be detected is recorded as the first image to be detected, the second image to be detected is determined from it, and the two images are input to the target detection network in turn. The network performs target detection on the first image to obtain the first position information, and on the second image to obtain the second position information.
In a specific embodiment, as shown in FIG. 4, performing target detection on the first image to be detected through the target detection network in step S103 to obtain the first position information includes:

S1031: Input the first image to be detected into the backbone network of the target detection network, and downsample the first image to obtain multiple feature layers to be detected corresponding to it.

Optionally, the backbone network may be a Darknet network, a Resnet network, or the like. Downsampling means shrinking the first image to be detected; that is, each downsampled image is smaller than the first image to be detected.

Specifically, after the first and second images to be detected are determined, the first image is input to the target detection network, and the backbone network downsamples it. In this embodiment, the backbone network downsamples the first image five times, yielding feature layers of five different sizes. Illustratively, compared with the first image to be detected, the feature layer obtained after the first downsampling is half its size (both length and width are halved) but has twice its number of channels.

Further, since the feature layer obtained from the first downsampling carries little semantic information, in this embodiment it is discarded after the five downsampling passes, so the total number of feature layers finally used is four.
S1032: Perform layer processing on each of the feature layers to be detected in turn to obtain a target feature layer corresponding to the first image to be detected.

In a specific implementation, the feature layers to be detected include a first feature layer, a second feature layer, a third feature layer, and a fourth feature layer, and step S1032 includes the following four fusion steps (a code sketch of the full sequence is given after these steps):

Convolving the fourth feature layer and upsampling the convolved fourth feature layer to obtain a fifth feature layer with the same dimensions as the third feature layer.

Understandably, the fourth feature layer is produced by the last of the five downsampling passes applied to the image to be detected, so it is the smallest. Specifically, after the first image to be detected is input to the backbone network of the target detection network and downsampled into the multiple feature layers to be detected, the fourth feature layer is convolved, for example through a convolutional network with a 3x3 kernel, so that its channel count matches that of the third feature layer (as noted in step S1031, each downsampling halves the size of the feature layer and doubles its channel count), and the convolved fourth feature layer is then upsampled so that it has the same size and channel count as the third feature layer, yielding a fifth feature layer with the same dimensions as the third feature layer.

Dimensionally superimposing the fifth feature layer and the third feature layer to obtain a first overlay layer, then convolving the first overlay layer and upsampling the convolved first overlay layer to obtain a sixth feature layer with the same dimensions as the second feature layer.

Specifically, after the fifth feature layer is obtained, it is dimensionally superimposed with the third feature layer to form the first overlay layer; the first overlay layer is convolved so that its channel count matches that of the second feature layer, and the convolved first overlay layer is upsampled so that it has the same size and channel count as the second feature layer, yielding a sixth feature layer with the same dimensions as the second feature layer.

Dimensionally superimposing the sixth feature layer and the second feature layer to obtain a second overlay layer, then convolving the second overlay layer and upsampling the convolved second overlay layer to obtain a seventh feature layer with the same dimensions as the first feature layer.

Specifically, after the sixth feature layer is obtained, it is dimensionally superimposed with the second feature layer to form the second overlay layer; the second overlay layer is convolved so that its channel count matches that of the first feature layer, and the convolved second overlay layer is upsampled so that it has the same size and channel count as the first feature layer, yielding a seventh feature layer with the same dimensions as the first feature layer.

Dimensionally superimposing the seventh feature layer and the first feature layer to obtain a third overlay layer, then convolving the third overlay layer and upsampling the convolved third overlay layer to obtain the target feature layer.

Specifically, after the seventh feature layer is obtained, it is dimensionally superimposed with the first feature layer to form the third overlay layer; the third overlay layer is convolved so that its channel count is doubled, and the convolved third overlay layer is upsampled so that its size is doubled, yielding the target feature layer.
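Illustratively, the four fusion steps above can be sketched in PyTorch as follows. The module name, the interpretation of "dimensional superimposition" as channel-wise concatenation, the nearest-neighbour upsampling mode, and the channel counts c1 through c4 are assumptions made for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Top-down fusion of the four backbone feature layers described above."""
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        # 3x3 convolutions that align channel counts before each upsampling
        self.conv4 = nn.Conv2d(c4, c3, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(2 * c3, c2, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(2 * c2, c1, kernel_size=3, padding=1)
        # final convolution doubles the third overlay's channel count
        self.conv1 = nn.Conv2d(2 * c1, 4 * c1, kernel_size=3, padding=1)

    def forward(self, p1, p2, p3, p4):
        # fifth layer: conv + 2x upsample of the fourth layer, matching p3
        p5 = F.interpolate(self.conv4(p4), scale_factor=2, mode="nearest")
        o1 = torch.cat([p5, p3], dim=1)   # first overlay layer
        p6 = F.interpolate(self.conv3(o1), scale_factor=2, mode="nearest")
        o2 = torch.cat([p6, p2], dim=1)   # second overlay layer
        p7 = F.interpolate(self.conv2(o2), scale_factor=2, mode="nearest")
        o3 = torch.cat([p7, p1], dim=1)   # third overlay layer
        # conv doubles channels, upsample doubles size: the target feature layer
        return F.interpolate(self.conv1(o3), scale_factor=2, mode="nearest")
```

In this reading, each convolution aligns the channel count with the next shallower layer before the 2x upsampling, matching the size-and-channel bookkeeping described in step S1031.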
S1033: Perform position feature extraction on the target feature layer to obtain the first position information.

Specifically, after the layer processing yields the target feature layer corresponding to the first image to be detected, position features are extracted from it: the pixel boxes associated with the target object are extracted on the target feature layer, giving the first position information of the target object in the first image to be detected.
S20: According to the first position information, predict first predicted position information of the target object in the second image to be detected by using the Kalman filter model, and determine a first ROI region corresponding to the first predicted position information.

Understandably, the Kalman filter model is a state estimation model based on Kalman filtering: given the target object's position information in the previous image to be detected (such as the first position information in the first image to be detected), it predicts the target object's position information in the next image to be detected (such as the second image to be detected). Further, the Kalman filter model needs to be trained on the first k frames of the video to be detected so that, when run in step S20, it can predict the target object's position information well, improving the accuracy of target tracking.

Specifically, after the first position information of the target object in the first image to be detected is acquired, the Kalman filter model predicts from it the position of the target object in the second image to be detected, that is, the first predicted position information; the region associated with this first predicted position information is then extracted from the second image, giving the first ROI region.
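Illustratively, the prediction step of a Kalman filter over a bounding-box state can be sketched as follows; the constant-velocity state layout and the noise settings are assumptions for illustration, not the trained model of this embodiment:

```python
import numpy as np

class BoxKalmanPredictor:
    """Minimal constant-velocity Kalman prediction over a bounding-box state.

    State: [cx, cy, w, h, vx, vy, vw, vh] (box centre, size, and velocities).
    """
    def __init__(self, dt=1.0):
        self.F = np.eye(8)               # state transition matrix
        self.F[:4, 4:] = dt * np.eye(4)  # position/size advance by velocity
        self.P = np.eye(8)               # state covariance
        self.Q = 0.01 * np.eye(8)        # process noise (assumed value)

    def predict(self, x):
        """Propagate the state one frame ahead; return the predicted box."""
        x_pred = self.F @ x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return x_pred[:4]                # predicted [cx, cy, w, h]
```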
S30: Perform ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region.

Specifically, after the second position information of the target object in the second image to be detected is acquired, the region associated with the second position information is extracted from the second image, giving the second ROI region.
S40: Determine a first minimum cosine distance between the first ROI region and the second ROI region, and determine a first position coincidence degree between the second position information and the first predicted position information.

Understandably, the first minimum cosine distance characterizes the feature similarity between the first ROI region and the second ROI region, and the first position coincidence degree characterizes the positional similarity between the second position information and the first predicted position information.

Specifically, after the second ROI region is obtained, the first minimum cosine distance between the first and second ROI regions is determined; its value can range from 0 to 1, and in the convention used here a larger first minimum cosine distance indicates a higher degree of feature similarity between the two ROI regions. At the same time, the first position coincidence degree between the second position information and the first predicted position information is determined; its value can also range from 0 to 1, and a higher first position coincidence degree indicates a greater degree of correlation between the second position information and the first predicted position information.
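Illustratively, the minimum cosine distance between the appearance features of two ROI regions can be computed as below; the (n, d) feature-matrix layout is an assumption, and the comment maps the classical distance onto this document's larger-means-more-similar convention:

```python
import numpy as np

def min_cosine_distance(feats_a, feats_b):
    """Smallest pairwise cosine distance between two ROI feature sets.

    feats_a: (n, d) features from the first ROI region;
    feats_b: (m, d) features from the second ROI region.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    distances = 1.0 - a @ b.T  # cosine distance = 1 - cosine similarity
    d_min = distances.min()
    # In this document's convention (larger = more similar, range 0 to 1),
    # the reported score corresponds to 1.0 - d_min.
    return d_min
```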
In an embodiment, determining the first position coincidence degree between the second position information and the first predicted position information in step S40 includes:

Determining intersection position information between the second position information and the first predicted position information, and determining union position information between the second position information and the first predicted position information.

Understandably, the intersection position information is the position information shared by the second position information and the first predicted position information; the union position information is all of their position information, that is, both the shared position information and the position information held by each alone.

The position coincidence degree is determined from the intersection position information and the union position information.

Specifically, the position coincidence degree can be determined by the following expression:
C = \frac{|A \cap B|}{|A \cup B|}
where C is the position coincidence degree; A is the second position information; B is the first predicted position information; |A ∪ B| is the union position information; and |A ∩ B| is the intersection position information.
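Illustratively, this expression is the intersection-over-union of two regions; a minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format (an illustrative choice):

```python
def position_coincidence(box_a, box_b):
    """Coincidence degree C = |A∩B| / |A∪B| for corner-format boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```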
S50: According to the first minimum cosine distance and the first position coincidence degree, determine a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.

Understandably, the first tracking matching result may indicate a matching success, meaning that the first predicted position information matches the second position information (for example, both contain the features of the target object); or it may indicate a matching failure, meaning that the first predicted position information does not match the second position information (for example, the first predicted position information contains the features of the target object while the second position information does not, or the first predicted position information does not contain the features of the target object while the second position information does).

Specifically, after the first minimum cosine distance and the first position coincidence degree are determined, they are used together as the tracking cost, and the Hungarian algorithm determines the first tracking matching result of the target object in the second image to be detected. Illustratively, the first minimum cosine distance is compared with a preset cosine threshold (set to, e.g., 0.9 or 0.95); if the first minimum cosine distance is greater than or equal to the preset cosine threshold, the first position coincidence degree is compared with a preset position coincidence threshold (set to, e.g., 0.9 or 0.95), and if the first position coincidence degree is greater than or equal to that threshold, the first tracking matching result is determined to be a matching success. If the first minimum cosine distance is less than the preset cosine threshold, and/or the first position coincidence degree is less than the preset position coincidence threshold, the first tracking matching result is determined to be a matching failure.
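Illustratively, the matching step can be sketched with SciPy's implementation of the Hungarian algorithm; the equal weighting of the two cues and the function and parameter names are assumptions, while the thresholds and the scores-in-[0, 1], larger-means-more-similar convention follow the description above:

```python
from scipy.optimize import linear_sum_assignment

def match_tracks(cos_score, coincidence, cos_thresh=0.9, iou_thresh=0.9):
    """Hungarian matching over predicted tracks (rows) and detections (cols).

    cos_score and coincidence are (num_tracks, num_detections) matrices of
    scores in [0, 1]; both are used together as the tracking cost.
    """
    cost = (1.0 - cos_score) + (1.0 - coincidence)  # lower cost = better match
    rows, cols = linear_sum_assignment(cost)
    successes, failures = [], []
    for r, c in zip(rows, cols):
        # gate each assignment with the preset thresholds described above
        if cos_score[r, c] >= cos_thresh and coincidence[r, c] >= iou_thresh:
            successes.append((r, c))   # matching success result
        else:
            failures.append((r, c))    # matching failure result
    return successes, failures
```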
Further, after the Hungarian algorithm determines the first tracking matching result of the target object in the second image to be detected, the result, whether a matching success or a matching failure, can be used together with the corresponding first and second images to be detected as a training sample for the Kalman filter model, improving the model's prediction accuracy.
Further, to improve the accuracy of the target detection network, the following loss function can be used to constrain it:
L = L_1 + \lambda_1 L_2 + \lambda_2 L_3
where L is the loss function of the target detection network; L_1 is the focal loss function of the target detection network; L_2 is the position loss function; L_3 is the pixel offset loss function; \lambda_1 is the weight of the position loss function (it can take the value 1); and \lambda_2 is the weight of the pixel offset loss function (it can take the value 0.1).
Further, L_1 can be characterized by the following expression:
L_1 = -\frac{1}{N} \sum_{m} \begin{cases} \left(1 - \hat{Y}_m\right)^{\alpha} \log \hat{Y}_m, & \text{if } Y_m = 1 \\ \left(1 - Y_m\right)^{\beta} \, \hat{Y}_m^{\alpha} \log\left(1 - \hat{Y}_m\right), & \text{otherwise} \end{cases}
where N is the total number of target individuals in the target object; \hat{Y}_m is the first predicted position information; Y_m is the second position information; and \alpha and \beta are detection parameters of the target detection network, which can be set according to the specific application scenario. "if Y_m = 1" indicates that the second position information contains the m-th target individual of the target object, and "otherwise" indicates that the second position information does not contain the m-th target individual of the target object.
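Illustratively, a loss of this piecewise form can be sketched in PyTorch as follows; the heatmap reading of Y_m and \hat{Y}_m and the values alpha = 2 and beta = 4 are assumptions made for illustration:

```python
import torch

def focal_loss(y_hat, y, alpha=2.0, beta=4.0, eps=1e-6):
    """Piecewise focal loss matching the expression above.

    y_hat: predicted heatmap in (0, 1); y: ground-truth heatmap equal to 1
    exactly at target-individual positions.
    """
    y_hat = y_hat.clamp(eps, 1.0 - eps)    # keep log() finite
    pos = y.eq(1).float()                  # the "if Y_m == 1" branch
    neg = 1.0 - pos                        # the "otherwise" branch
    pos_term = pos * (1 - y_hat).pow(alpha) * torch.log(y_hat)
    neg_term = neg * (1 - y).pow(beta) * y_hat.pow(alpha) * torch.log(1 - y_hat)
    n = pos.sum().clamp(min=1.0)           # N: number of target individuals
    return -(pos_term + neg_term).sum() / n
```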
Further, L_2 can be characterized by the following expression:
L_2 = 1 - \frac{|A \cap B|}{|A \cup B|} + \frac{\left|A^c \setminus (A \cup B)\right|}{|A^c|}
where A is the second position information; B is the first predicted position information; |A ∪ B| is the union position information; |A ∩ B| is the intersection position information; and |A^c| is the minimum closure area, that is, the area of the smallest region enclosing both A and B.
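Illustratively, under the corner-format box assumption used earlier, this position loss can be sketched as below; the box layout and the reading of the minimum closure as the smallest enclosing box are assumptions:

```python
def position_loss(box_a, box_b):
    """Sketch of L2 = 1 - IoU + |A^c \\ (A∪B)| / |A^c| for corner-format boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter          # assumed nonzero
    # A^c: the minimum closure, i.e. the smallest box enclosing both A and B
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    closure = (cx2 - cx1) * (cy2 - cy1)
    return 1.0 - inter / union + (closure - union) / closure
```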
Further, L_3 refers to the pixel offset incurred while downsampling the first image to be detected in step S1031 (likewise, the second image to be detected also needs to be downsampled).
In this embodiment, by introducing the target detection network and the Kalman filter model, the shallow features of the target detection network can serve well as appearance features for target tracking; the Kalman filter model then uses the appearance features determined by the target detection network to predict the position of the target object in the next image to be detected. The computation is fast, and the accuracy of target tracking is improved.
In an embodiment, after step S50, that is, after the Hungarian algorithm determines the first tracking matching result of the target object in the second image to be detected, the method includes:

When the first tracking matching result is a matching failure, incrementing the total number of matching failures by one.

Understandably, the total number of matching failures is the total number of times the first tracking matching result has been a matching failure.

When, within a preset detection time, the total number of matching failures is less than a preset failure threshold, acquiring a third image to be detected in the video to be detected, and third position information of the target object in the third image to be detected; the third image to be detected is the image that is temporally adjacent to the second image to be detected and comes after it.

Optionally, the preset failure threshold can be 3 or 4 times, and the preset detection time can be 2 minutes, 5 minutes, and so on. Understandably, the second and third images to be detected can be any two consecutive frames of the video to be detected; for example, if the first image to be detected is the first frame of the video, then the second image to be detected is its second frame and the third image to be detected its third frame.

When, within the preset detection time, the total number of matching failures is less than the preset failure threshold, the matching error was likely caused by the target object being temporarily occluded, so target tracking continues on the images corresponding to subsequent frames: the third image to be detected is acquired from the video and input to the target detection network, which performs target detection on it to obtain the third position information of the target object in the third image to be detected.
According to the second position information, predicting second predicted position information of the target object in the third image to be detected by using the Kalman filter model, and determining a third ROI region corresponding to the second predicted position information.

Specifically, after the second position information of the target object in the second image to be detected is acquired, the Kalman filter model predicts from it the position of the target object in the third image to be detected, that is, the second predicted position information; the region associated with this second predicted position information is then extracted from the third image, giving the third ROI region.

According to the third position information, performing ROI region extraction on the third image to be detected to obtain a fourth ROI region.

Specifically, after the third position information of the target object in the third image to be detected is acquired, the region associated with the third position information is extracted from the third image, giving the fourth ROI region.

Determining a second minimum cosine distance between the third ROI region and the fourth ROI region, and determining a second position coincidence degree between the third position information and the second predicted position information.

Specifically, after the fourth ROI region is obtained, the second minimum cosine distance between the third and fourth ROI regions is determined; its value can range from 0 to 1, and a larger second minimum cosine distance indicates a higher degree of feature similarity between the third and fourth ROI regions. At the same time, the second position coincidence degree between the third position information and the second predicted position information is determined; its value can also range from 0 to 1, and a higher second position coincidence degree indicates a greater degree of correlation between the third position information and the second predicted position information.

According to the second minimum cosine distance and the second position coincidence degree, determining a second tracking matching result of the target object in the third image to be detected by using the Hungarian algorithm.

Understandably, the second tracking matching result may indicate a matching success, meaning that the second predicted position information matches the third position information (for example, both contain the features of the target object); or it may indicate a matching failure, meaning that the second predicted position information does not match the third position information (for example, the second predicted position information contains the features of the target object while the third position information does not, or vice versa).

Specifically, after the second minimum cosine distance and the second position coincidence degree are determined, they are used together as the tracking cost, and the Hungarian algorithm determines the second tracking matching result of the target object in the third image to be detected. Illustratively, the second minimum cosine distance is compared with the preset cosine threshold (set to, e.g., 0.9 or 0.95); if it is greater than or equal to that threshold, the second position coincidence degree is compared with the preset position coincidence threshold (set to, e.g., 0.9 or 0.95), and if it is greater than or equal to that threshold, the second tracking matching result is determined to be a matching success. If the second minimum cosine distance is less than the preset cosine threshold, and/or the second position coincidence degree is less than the preset position coincidence threshold, the second tracking matching result is determined to be a matching failure.
In an embodiment, after the Hungarian algorithm determines the second tracking matching result of the target object in the third image to be detected, the method includes:

When the second tracking matching result is a matching failure, incrementing the total number of matching failures by one.

Specifically, after the Hungarian algorithm determines the second tracking matching result of the target object in the third image to be detected, if that result is a matching failure, the total number of matching failures is incremented by one.

When, within the preset detection time, the total number of matching failures is greater than or equal to the preset failure threshold, deleting the tracking ID associated with the target object and confirming that tracking of the target object has ended.

Understandably, the tracking ID is a unique ID assigned to each target object before target tracking begins; if the target object comprises multiple target individuals, each target individual can be assigned its own tracking ID.

When, within the preset detection time, the total number of matching failures is greater than or equal to the preset failure threshold, that is, when the tracking matching results of three consecutive tracking attempts have all been failures, the failures were not caused by the target object being briefly occluded; rather, the target object has left the detection area, causing the matching to fail. The tracking ID associated with the target object can therefore be deleted, and subsequent tracking need not continue for this object, that is, tracking of the target object is confirmed to have ended, which reduces the computational complexity. Further, when the total number of matching failures within the preset detection time is greater than or equal to the preset failure threshold, the total can be reset to zero.
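Illustratively, the occlusion-versus-departure bookkeeping described above can be sketched as follows; the Track class, the dictionary of active tracks, and the threshold of 3 are assumptions made for illustration:

```python
class Track:
    """Bookkeeping for one tracked individual."""
    def __init__(self, track_id):
        self.track_id = track_id
        self.fail_count = 0  # total number of matching failures

def update_track(track, matched, active_tracks, fail_threshold=3):
    """Apply the occlusion-versus-departure rule to one track.

    Returns True while the track should keep being followed.
    """
    if matched:
        track.fail_count = 0       # a matching success clears the count
        return True
    track.fail_count += 1          # accumulate one matching failure
    if track.fail_count >= fail_threshold:
        # consecutive failures: treat the object as having left the
        # detection area and delete its tracking ID
        active_tracks.pop(track.track_id, None)
        track.fail_count = 0       # reset the total, as described above
        return False
    return True                    # likely a brief occlusion; keep tracking
```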
In this embodiment, by judging whether the total number of matching failures within the preset detection time is greater than or equal to the preset failure threshold, it can be determined whether the target object was briefly occluded or is no longer in the detection area, improving the accuracy of target tracking.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present application in any way.

In an embodiment, a target detection network-based target tracking apparatus is provided, corresponding one-to-one to the target detection network-based target tracking method in the above embodiments. As shown in FIG. 5, the apparatus includes a first position information acquisition module 10, a first position information prediction module 20, a first ROI region extraction module 30, a first position coincidence degree determination module 40, and a first tracking matching module 50. The functional modules are described in detail as follows:
The first position information acquisition module 10 is configured to acquire first position information of a target object in a first image to be detected, and second position information of the target object in a second image to be detected; the second image to be detected is an image in a video to be detected that is temporally adjacent to the first image to be detected and comes after it.

The first position information prediction module 20 is configured to predict, according to the first position information, first predicted position information of the target object in the second image to be detected by using a Kalman filter model, and to determine a first ROI region corresponding to the first predicted position information.

The first ROI region extraction module 30 is configured to perform ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region.

The first position coincidence degree determination module 40 is configured to determine a first minimum cosine distance between the first ROI region and the second ROI region, and to determine a first position coincidence degree between the second position information and the first predicted position information.

The first tracking matching module 50 is configured to determine, according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected by using the Hungarian algorithm.
Preferably, as shown in FIG. 6, the first position information acquisition module 10 includes:

a video acquisition sub-module 101, configured to acquire a video to be detected, the video containing multiple frames of images to be detected;

an image recording sub-module 102, configured to record any one frame of the video to be detected as the first image to be detected;

a target detection sub-module 103, configured to perform target detection on the selected first image to be detected through the target detection network to obtain the first position information, and to perform target detection on the second image to be detected through the target detection network to obtain the second position information.
优选地,如图7所示,所述目标检测子模块103,包括:Preferably, as shown in FIG. 7 , the target detection sub-module 103 includes:
下采样处理单元1031,用于将所述第一待检测图像输入至所述目标检测网络中的骨干网络,以对所述第一待检测图像进行下采样处理,得到与所述第一待检测图像对应的多个待检测特征图层;The downsampling processing unit 1031 is configured to input the first image to be detected into the backbone network in the target detection network, so as to perform downsampling processing on the first image to be detected, and obtain a Multiple feature layers to be detected corresponding to the image;
图层处理单元1032,用于对各所述待检测特征图层依次进行图层处理,得到与所述第一待检测图像对应的目标特征图层;A layer processing unit 1032, configured to sequentially perform layer processing on each of the feature layers to be detected to obtain a target feature layer corresponding to the first image to be detected;
a position feature extraction unit 1033, configured to perform position feature extraction on the target feature layer to obtain the first position information. A sketch of the downsampling backbone follows.
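As an illustration of unit 1031, a toy backbone might downsample the input image in four strided stages, one per feature layer to be detected; the channel widths and strides below are assumptions, not values taken from this disclosure.

```python
# Toy backbone sketch: four stride-2 stages produce the first to fourth
# feature layers at 1/2, 1/4, 1/8 and 1/16 of the input resolution.
import torch.nn as nn

class ToyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512]     # illustrative channel widths
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(4)
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                # first..fourth feature layers
        return feats
```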
Preferably, the feature layers to be detected include a first feature layer, a second feature layer, a third feature layer, and a fourth feature layer, and the layer processing unit includes:
a first layer processing subunit, configured to perform convolution processing on the fourth feature layer and upsample the convolved fourth feature layer to obtain a fifth feature layer having the same dimensions as the third feature layer;
a second layer processing subunit, configured to dimensionally superimpose the fifth feature layer and the third feature layer to obtain a first superimposed layer, then perform convolution processing on the first superimposed layer and upsample the convolved first superimposed layer to obtain a sixth feature layer having the same dimensions as the second feature layer;
a third layer processing subunit, configured to dimensionally superimpose the sixth feature layer and the second feature layer to obtain a second superimposed layer, then perform convolution processing on the second superimposed layer and upsample the convolved second superimposed layer to obtain a seventh feature layer having the same dimensions as the first feature layer;
a fourth layer processing subunit, configured to dimensionally superimpose the seventh feature layer and the first feature layer to obtain a third superimposed layer, then perform convolution processing on the third superimposed layer and upsample the convolved third superimposed layer to obtain the target feature layer. One plausible reading of this fusion is sketched after this list.
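The four subunits describe a top-down fusion of the backbone outputs. The sketch below is one plausible reading, assuming that "dimension superposition" means channel-wise concatenation (element-wise addition would be another valid reading) and using 1x1 convolutions with nearest-neighbor upsampling as stand-ins for the unspecified convolution and upsampling operations.

```python
# Sketch of the layer processing subunits: convolve and upsample the
# deepest layer, superimpose it with the next shallower layer, and repeat
# until the target feature layer is produced.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFusion(nn.Module):
    def __init__(self, c1=64, c2=128, c3=256, c4=512):
        super().__init__()
        self.conv4 = nn.Conv2d(c4, c3, 1)       # fourth layer -> fifth layer
        self.conv3 = nn.Conv2d(c3 + c3, c2, 1)  # first superimposed layer
        self.conv2 = nn.Conv2d(c2 + c2, c1, 1)  # second superimposed layer
        self.conv1 = nn.Conv2d(c1 + c1, c1, 1)  # third superimposed layer

    def forward(self, f1, f2, f3, f4):
        up = lambda t: F.interpolate(t, scale_factor=2, mode="nearest")
        f5 = up(self.conv4(f4))                        # same dims as f3
        f6 = up(self.conv3(torch.cat([f5, f3], 1)))    # same dims as f2
        f7 = up(self.conv2(torch.cat([f6, f2], 1)))    # same dims as f1
        return up(self.conv1(torch.cat([f7, f1], 1)))  # target feature layer
```

With the toy backbone above, `ToyFusion()(*ToyBackbone()(img))` would produce the target feature layer on which position feature extraction is performed.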
Preferably, the first position coincidence degree determination module 40 includes:
an intersection and union position information determination submodule, configured to determine intersection position information between the second position information and the predicted position information, and to determine union position information between the second position information and the predicted position information;
a position coincidence degree determination submodule, configured to determine the position coincidence degree according to the intersection position information and the union position information. A worked example follows.
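As a worked example with assumed coordinates: if the detected box is [0, 0, 10, 10] and the predicted box is [5, 5, 15, 15], the intersection position information is the region [5, 5, 10, 10] with area 25, the union area is 100 + 100 - 25 = 175, and the position coincidence degree is therefore 25 / 175, approximately 0.14.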
Preferably, the target tracking apparatus based on the target detection network further includes:
a first matching failure total accumulation module, configured to increment the total number of matching failures by one when the first tracking matching result is a matching failure result;
a second position information acquisition module, configured to, when the total number of matching failures is less than a preset failure threshold, acquire a third image to be detected in the video to be detected and third position information of the target object in the third image to be detected, where the third image to be detected is an image temporally adjacent to, and later than, the second image to be detected;
a second position information prediction module, configured to predict, according to the second position information and using the Kalman filter model, second predicted position information of the target object in the third image to be detected, and to determine a third ROI region corresponding to the second predicted position information;
a second ROI region extraction module, configured to perform ROI region extraction on the third image to be detected according to the third position information to obtain a fourth ROI region;
a second position coincidence degree determination module, configured to determine a second minimum cosine distance between the third ROI region and the fourth ROI region, and to determine a second position coincidence degree between the second position information and the predicted position information;
a second tracking matching module, configured to determine, by means of the Hungarian algorithm and according to the second minimum cosine distance and the second position coincidence degree, a second tracking matching result of the target object in the third image to be detected. A sketch of the Kalman prediction step used by these modules follows this list.
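The prediction modules rely on a standard Kalman predict/update cycle. The sketch below assumes a constant-velocity state [cx, cy, vx, vy] observed through position-only measurements; the state layout and the noise magnitudes q and r are illustrative assumptions.

```python
# Minimal Kalman filter sketch for the position information prediction
# modules: predict the next position, then correct with the detection.
import numpy as np

def kalman_predict(x, P, dt=1.0, q=1.0):
    """Predict step: x' = F x, P' = F P F^T + Q (constant velocity)."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Q = q * np.eye(4)
    return F @ x, F @ P @ F.T + Q

def kalman_update(x, P, z, r=1.0):
    """Update step with a position-only measurement z = [cx, cy]."""
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    R = r * np.eye(2)
    y = z - H @ x                     # innovation
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P
```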
Preferably, the target tracking apparatus based on the target detection network includes:
a second matching failure total accumulation module, configured to increment the total number of matching failures by one when the second tracking matching result is a matching failure result;
a tracking end confirmation module, configured to, when the total number of matching failures is greater than or equal to the preset failure threshold, delete the tracking ID associated with the target object and confirm that tracking of the target object has ended. This lifecycle logic is sketched below.
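The failure-counting and tracking-end behavior can be sketched as a small state machine; the threshold value and the choice to reset the counter after a successful match are assumptions, since the disclosure only specifies accumulation and deletion once the threshold is reached.

```python
# Sketch of the matching-failure counter and tracking-end confirmation.
class Track:
    def __init__(self, track_id, max_failures=3):
        self.track_id = track_id          # tracking ID associated with the target
        self.failures = 0                 # total number of matching failures
        self.max_failures = max_failures  # preset failure threshold (assumed value)
        self.active = True

    def report(self, matched):
        if matched:
            self.failures = 0             # reset on success (assumption)
        else:
            self.failures += 1            # accumulate one matching failure
            if self.failures >= self.max_failures:
                self.active = False       # delete tracking ID: tracking ends
```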
For the specific limitations of the target tracking apparatus based on the target detection network, reference may be made to the limitations of the target tracking method based on the target detection network described above, which are not repeated here. Each module in the above target tracking apparatus based on the target detection network may be implemented in whole or in part by software, by hardware, or by a combination of the two. The above modules may be embedded in, or independent of, a processor in a computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to each of the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the readable storage medium. The database of the computer device stores the data used by the target tracking method based on the target detection network in the above embodiments. The network interface of the computer device communicates with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement a target tracking method based on a target detection network. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
acquiring first position information of a target object in a first image to be detected and second position information of the target object in a second image to be detected, where the second image to be detected is an image in the video to be detected that is temporally adjacent to, and later than, the first image to be detected;
predicting, according to the first position information and using a Kalman filter model, first predicted position information of the target object in the second image to be detected, and determining a first ROI region corresponding to the first predicted position information;
performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
determining a first minimum cosine distance between the first ROI region and the second ROI region, and determining a first position coincidence degree between the second position information and the first predicted position information;
determining, by means of the Hungarian algorithm and according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring first position information of a target object in a first image to be detected and second position information of the target object in a second image to be detected, where the second image to be detected is an image in the video to be detected that is temporally adjacent to, and later than, the first image to be detected;
predicting, according to the first position information and using a Kalman filter model, first predicted position information of the target object in the second image to be detected, and determining a first ROI region corresponding to the first predicted position information;
performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
determining a first minimum cosine distance between the first ROI region and the second ROI region, and determining a first position coincidence degree between the second position information and the first predicted position information;
determining, by means of the Hungarian algorithm and according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is only used as an example. In practical applications, the above functions may be assigned to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.

Claims (20)

1. A target tracking method based on a target detection network, comprising:
    acquiring first position information of a target object in a first image to be detected and second position information of the target object in a second image to be detected, wherein the second image to be detected is an image in a video to be detected that is temporally adjacent to, and later than, the first image to be detected;
    predicting, according to the first position information and using a Kalman filter model, first predicted position information of the target object in the second image to be detected, and determining a first ROI region corresponding to the first predicted position information;
    performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
    determining a first minimum cosine distance between the first ROI region and the second ROI region, and determining a first position coincidence degree between the second position information and the first predicted position information;
    determining, by means of the Hungarian algorithm and according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected.
2. The target tracking method based on a target detection network according to claim 1, wherein acquiring the first position information of the target object in the first image to be detected and the second position information of the target object in the second image to be detected comprises:
    acquiring the video to be detected, the video to be detected containing multiple frames of images to be detected;
    recording any frame of the images to be detected in the video to be detected as the first image to be detected;
    performing target detection on the first image to be detected through the target detection network to obtain the first position information, and performing target detection on the second image to be detected through the target detection network to obtain the second position information.
3. The target tracking method based on a target detection network according to claim 2, wherein performing target detection on the selected first image to be detected through the target detection network to obtain the first position information comprises:
    inputting the first image to be detected into a backbone network of the target detection network so as to downsample the first image to be detected, obtaining multiple feature layers to be detected corresponding to the first image to be detected;
    performing layer processing on each of the feature layers to be detected in turn to obtain a target feature layer corresponding to the first image to be detected;
    performing position feature extraction on the target feature layer to obtain the first position information.
4. The target tracking method based on a target detection network according to claim 3, wherein the feature layers to be detected include a first feature layer, a second feature layer, a third feature layer, and a fourth feature layer, and performing layer processing on each of the feature layers to be detected in turn to obtain the target feature layer corresponding to the first image to be detected comprises:
    performing convolution processing on the fourth feature layer, and upsampling the convolved fourth feature layer to obtain a fifth feature layer having the same dimensions as the third feature layer;
    dimensionally superimposing the fifth feature layer and the third feature layer to obtain a first superimposed layer, then performing convolution processing on the first superimposed layer and upsampling the convolved first superimposed layer to obtain a sixth feature layer having the same dimensions as the second feature layer;
    dimensionally superimposing the sixth feature layer and the second feature layer to obtain a second superimposed layer, then performing convolution processing on the second superimposed layer and upsampling the convolved second superimposed layer to obtain a seventh feature layer having the same dimensions as the first feature layer;
    dimensionally superimposing the seventh feature layer and the first feature layer to obtain a third superimposed layer, then performing convolution processing on the third superimposed layer and upsampling the convolved third superimposed layer to obtain the target feature layer.
5. The target tracking method based on a target detection network according to claim 1, wherein determining the position coincidence degree between the second position information and the first predicted position information comprises:
    determining intersection position information between the second position information and the first predicted position information, and determining union position information between the second position information and the first predicted position information;
    determining the position coincidence degree according to the intersection position information and the union position information.
6. The target tracking method based on a target detection network according to claim 1, wherein after determining, by means of the Hungarian algorithm, the first tracking matching result of the target object in the second image to be detected, the method comprises:
    incrementing the total number of matching failures by one when the first tracking matching result is a matching failure result;
    when, within a preset detection time, the total number of matching failures is less than a preset failure threshold, acquiring a third image to be detected in the video to be detected and third position information of the target object in the third image to be detected, wherein the third image to be detected is an image temporally adjacent to, and later than, the second image to be detected;
    predicting, according to the second position information and using the Kalman filter model, second predicted position information of the target object in the third image to be detected, and determining a third ROI region corresponding to the second predicted position information;
    performing ROI region extraction on the third image to be detected according to the third position information to obtain a fourth ROI region;
    determining a second minimum cosine distance between the third ROI region and the fourth ROI region, and determining a second position coincidence degree between the second position information and the predicted position information;
    determining, by means of the Hungarian algorithm and according to the second minimum cosine distance and the second position coincidence degree, a second tracking matching result of the target object in the third image to be detected.
7. The target tracking method based on a target detection network according to claim 6, wherein after determining, by means of the Hungarian algorithm, the second tracking matching result of the target object in the third image to be detected, the method comprises:
    incrementing the total number of matching failures by one when, within the preset detection time, the second tracking matching result is a matching failure result;
    when the total number of matching failures is greater than or equal to the preset failure threshold, deleting the tracking ID associated with the target object and confirming that tracking of the target object has ended.
8. A target tracking apparatus based on a target detection network, comprising:
    a first position information acquisition module, configured to acquire first position information of a target object in a first image to be detected and second position information of the target object in a second image to be detected, wherein the second image to be detected is an image in a video to be detected that is temporally adjacent to, and later than, the first image to be detected;
    a first position information prediction module, configured to predict, according to the first position information and using a Kalman filter model, first predicted position information of the target object in the second image to be detected, and to determine a first ROI region corresponding to the first predicted position information;
    a first ROI region extraction module, configured to perform ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
    a first position coincidence degree determination module, configured to determine a first minimum cosine distance between the first ROI region and the second ROI region, and to determine a first position coincidence degree between the second position information and the first predicted position information;
    a first tracking matching module, configured to determine, by means of the Hungarian algorithm and according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected.
9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    acquiring first position information of a target object in a first image to be detected and second position information of the target object in a second image to be detected, wherein the second image to be detected is an image in a video to be detected that is temporally adjacent to, and later than, the first image to be detected;
    predicting, according to the first position information and using a Kalman filter model, first predicted position information of the target object in the second image to be detected, and determining a first ROI region corresponding to the first predicted position information;
    performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
    determining a first minimum cosine distance between the first ROI region and the second ROI region, and determining a first position coincidence degree between the second position information and the first predicted position information;
    determining, by means of the Hungarian algorithm and according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected.
10. The computer device according to claim 9, wherein acquiring the first position information of the target object in the first image to be detected and the second position information of the target object in the second image to be detected comprises:
    acquiring the video to be detected, the video to be detected containing multiple frames of images to be detected;
    recording any frame of the images to be detected in the video to be detected as the first image to be detected;
    performing target detection on the first image to be detected through the target detection network to obtain the first position information, and performing target detection on the second image to be detected through the target detection network to obtain the second position information.
11. The computer device according to claim 10, wherein performing target detection on the selected first image to be detected through the target detection network to obtain the first position information comprises:
    inputting the first image to be detected into a backbone network of the target detection network so as to downsample the first image to be detected, obtaining multiple feature layers to be detected corresponding to the first image to be detected;
    performing layer processing on each of the feature layers to be detected in turn to obtain a target feature layer corresponding to the first image to be detected;
    performing position feature extraction on the target feature layer to obtain the first position information.
12. The computer device according to claim 11, wherein the feature layers to be detected include a first feature layer, a second feature layer, a third feature layer, and a fourth feature layer, and performing layer processing on each of the feature layers to be detected in turn to obtain the target feature layer corresponding to the first image to be detected comprises:
    performing convolution processing on the fourth feature layer, and upsampling the convolved fourth feature layer to obtain a fifth feature layer having the same dimensions as the third feature layer;
    dimensionally superimposing the fifth feature layer and the third feature layer to obtain a first superimposed layer, then performing convolution processing on the first superimposed layer and upsampling the convolved first superimposed layer to obtain a sixth feature layer having the same dimensions as the second feature layer;
    dimensionally superimposing the sixth feature layer and the second feature layer to obtain a second superimposed layer, then performing convolution processing on the second superimposed layer and upsampling the convolved second superimposed layer to obtain a seventh feature layer having the same dimensions as the first feature layer;
    dimensionally superimposing the seventh feature layer and the first feature layer to obtain a third superimposed layer, then performing convolution processing on the third superimposed layer and upsampling the convolved third superimposed layer to obtain the target feature layer.
13. The computer device according to claim 9, wherein determining the position coincidence degree between the second position information and the first predicted position information comprises:
    determining intersection position information between the second position information and the first predicted position information, and determining union position information between the second position information and the first predicted position information;
    determining the position coincidence degree according to the intersection position information and the union position information.
14. The computer device according to claim 9, wherein after determining, by means of the Hungarian algorithm, the first tracking matching result of the target object in the second image to be detected, the steps comprise:
    incrementing the total number of matching failures by one when the first tracking matching result is a matching failure result;
    when, within a preset detection time, the total number of matching failures is less than a preset failure threshold, acquiring a third image to be detected in the video to be detected and third position information of the target object in the third image to be detected, wherein the third image to be detected is an image temporally adjacent to, and later than, the second image to be detected;
    predicting, according to the second position information and using the Kalman filter model, second predicted position information of the target object in the third image to be detected, and determining a third ROI region corresponding to the second predicted position information;
    performing ROI region extraction on the third image to be detected according to the third position information to obtain a fourth ROI region;
    determining a second minimum cosine distance between the third ROI region and the fourth ROI region, and determining a second position coincidence degree between the second position information and the predicted position information;
    determining, by means of the Hungarian algorithm and according to the second minimum cosine distance and the second position coincidence degree, a second tracking matching result of the target object in the third image to be detected.
15. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring first position information of a target object in a first image to be detected and second position information of the target object in a second image to be detected, wherein the second image to be detected is an image in a video to be detected that is temporally adjacent to, and later than, the first image to be detected;
    predicting, according to the first position information and using a Kalman filter model, first predicted position information of the target object in the second image to be detected, and determining a first ROI region corresponding to the first predicted position information;
    performing ROI region extraction on the second image to be detected according to the second position information to obtain a second ROI region;
    determining a first minimum cosine distance between the first ROI region and the second ROI region, and determining a first position coincidence degree between the second position information and the first predicted position information;
    determining, by means of the Hungarian algorithm and according to the first minimum cosine distance and the first position coincidence degree, a first tracking matching result of the target object in the second image to be detected.
16. The readable storage medium according to claim 15, wherein acquiring the first position information of the target object in the first image to be detected and the second position information of the target object in the second image to be detected comprises:
    acquiring the video to be detected, the video to be detected containing multiple frames of images to be detected;
    recording any frame of the images to be detected in the video to be detected as the first image to be detected;
    performing target detection on the first image to be detected through the target detection network to obtain the first position information, and performing target detection on the second image to be detected through the target detection network to obtain the second position information.
17. The readable storage medium according to claim 16, wherein performing target detection on the selected first image to be detected through the target detection network to obtain the first position information comprises:
    inputting the first image to be detected into a backbone network of the target detection network so as to downsample the first image to be detected, obtaining multiple feature layers to be detected corresponding to the first image to be detected;
    performing layer processing on each of the feature layers to be detected in turn to obtain a target feature layer corresponding to the first image to be detected;
    performing position feature extraction on the target feature layer to obtain the first position information.
18. The readable storage medium according to claim 17, wherein the feature layers to be detected include a first feature layer, a second feature layer, a third feature layer, and a fourth feature layer, and performing layer processing on each of the feature layers to be detected in turn to obtain the target feature layer corresponding to the first image to be detected comprises:
    performing convolution processing on the fourth feature layer, and upsampling the convolved fourth feature layer to obtain a fifth feature layer having the same dimensions as the third feature layer;
    dimensionally superimposing the fifth feature layer and the third feature layer to obtain a first superimposed layer, then performing convolution processing on the first superimposed layer and upsampling the convolved first superimposed layer to obtain a sixth feature layer having the same dimensions as the second feature layer;
    dimensionally superimposing the sixth feature layer and the second feature layer to obtain a second superimposed layer, then performing convolution processing on the second superimposed layer and upsampling the convolved second superimposed layer to obtain a seventh feature layer having the same dimensions as the first feature layer;
    dimensionally superimposing the seventh feature layer and the first feature layer to obtain a third superimposed layer, then performing convolution processing on the third superimposed layer and upsampling the convolved third superimposed layer to obtain the target feature layer.
19. The readable storage medium according to claim 15, wherein determining the position coincidence degree between the second position information and the first predicted position information comprises:
    determining intersection position information between the second position information and the first predicted position information, and determining union position information between the second position information and the first predicted position information;
    determining the position coincidence degree according to the intersection position information and the union position information.
20. The readable storage medium according to claim 15, wherein after determining, by means of the Hungarian algorithm, the first tracking matching result of the target object in the second image to be detected, the steps comprise:
    incrementing the total number of matching failures by one when the first tracking matching result is a matching failure result;
    when, within a preset detection time, the total number of matching failures is less than a preset failure threshold, acquiring a third image to be detected in the video to be detected and third position information of the target object in the third image to be detected, wherein the third image to be detected is an image temporally adjacent to, and later than, the second image to be detected;
    predicting, according to the second position information and using the Kalman filter model, second predicted position information of the target object in the third image to be detected, and determining a third ROI region corresponding to the second predicted position information;
    performing ROI region extraction on the third image to be detected according to the third position information to obtain a fourth ROI region;
    determining a second minimum cosine distance between the third ROI region and the fourth ROI region, and determining a second position coincidence degree between the second position information and the predicted position information;
    determining, by means of the Hungarian algorithm and according to the second minimum cosine distance and the second position coincidence degree, a second tracking matching result of the target object in the third image to be detected.
PCT/CN2021/096757 2021-04-22 2021-05-28 Target detection network-based target tracking method and apparatus, device, and medium WO2022222227A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110434628.2 2021-04-22
CN202110434628.2A CN113159032B (en) 2021-04-22 2021-04-22 Target tracking method, device, equipment and medium based on target detection network

Publications (1)

Publication Number Publication Date
WO2022222227A1

Family

ID=76869290

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096757 WO2022222227A1 (en) 2021-04-22 2021-05-28 Target detection network-based target tracking method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN113159032B (en)
WO (1) WO2022222227A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688699B (en) * 2021-08-09 2024-03-08 平安科技(深圳)有限公司 Target object detection method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9811916B1 (en) * 2013-09-25 2017-11-07 Amazon Technologies, Inc. Approaches for head tracking
CN110796686A (en) * 2019-10-29 2020-02-14 浙江大华技术股份有限公司 Target tracking method and device and storage device
CN110866428A (en) * 2018-08-28 2020-03-06 杭州海康威视数字技术股份有限公司 Target tracking method and device, electronic equipment and storage medium
CN111127513A (en) * 2019-12-02 2020-05-08 北京交通大学 Multi-target tracking method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635657B (en) * 2018-11-12 2023-01-06 平安科技(深圳)有限公司 Target tracking method, device, equipment and storage medium
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
CN110517292A (en) * 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN112651994A (en) * 2020-12-18 2021-04-13 零八一电子集团有限公司 Ground multi-target tracking method

Also Published As

Publication number Publication date
CN113159032B (en) 2023-06-30
CN113159032A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN109635657B (en) Target tracking method, device, equipment and storage medium
CN110569721B (en) Recognition model training method, image recognition method, device, equipment and medium
EP4148669A2 (en) Target tracking method for panoramic video, readable storage medium, and computer device
CN109344789B (en) Face tracking method and device
WO2019237516A1 (en) Target tracking method and apparatus, computer device, and storage medium
WO2015180100A1 (en) Facial landmark localization using coarse-to-fine cascaded neural networks
CN111402294B (en) Target tracking method, target tracking device, computer-readable storage medium and computer equipment
US10990829B2 (en) Stitching maps generated using simultaneous localization and mapping
CN111191533B (en) Pedestrian re-recognition processing method, device, computer equipment and storage medium
CN112989962B (en) Track generation method, track generation device, electronic equipment and storage medium
WO2021051547A1 (en) Violent behavior detection method and system
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
WO2022222227A1 (en) Target detection network-based target tracking method and apparatus, device, and medium
WO2021227723A1 (en) Target detection method and apparatus, computer device and readable storage medium
CN111639513A (en) Ship shielding identification method and device and electronic equipment
CN112381071A (en) Behavior analysis method of target in video stream, terminal device and medium
US10963720B2 (en) Estimating grouped observations
WO2020024394A1 (en) Background elimination method and device, computer device and storage medium
CN113256683A (en) Target tracking method and related equipment
CN110706257B (en) Identification method of effective characteristic point pair, and camera state determination method and device
CN112070035A (en) Target tracking method and device based on video stream and storage medium
CN112927258A (en) Target tracking method and device
CN116091781B (en) Data processing method and device for image recognition
US20220122341A1 (en) Target detection method and apparatus, electronic device, and computer storage medium
CN114387296A (en) Target track tracking method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21937452

Country of ref document: EP

Kind code of ref document: A1