CN113762231B - End-to-end multi-pedestrian posture tracking method and device and electronic equipment - Google Patents
- Publication number: CN113762231B
- Application number: CN202111328115.XA
- Authority: CN (China)
- Prior art keywords: matrix, similarity, image frame, similarity matrix, feature
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
- G06N3/04: Neural networks; architecture, e.g. interconnection topology
- G06N3/08: Neural networks; learning methods
Abstract
The application discloses an end-to-end multi-pedestrian pose tracking method, apparatus, electronic device, and medium for improving pose tracking accuracy. The method comprises the following steps: acquiring the target detection frames of each frame of the video through pedestrian detection and pose estimation; then performing the following procedure for each frame: obtaining the hierarchical convolution features of each target detection frame in the current frame and the previous frame with a feature extraction network; obtaining final features from the hierarchical convolution features with a network based on a channel attention mechanism; obtaining a similarity matrix from the feature matrices containing the final features with a similarity estimation network; assigning values to the augmented row and augmented column to obtain a forward similarity matrix and a reverse similarity matrix; generating a data association matrix between the current frame and the previous frame from the forward and reverse similarity matrices; and finally assigning identities to the target detection frames of the current frame according to the data association matrix to obtain the tracking result of the current frame.
Description
Technical Field
The present application relates to the field of computer vision technologies, and in particular to an end-to-end multi-pedestrian pose tracking method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Multi-target pose tracking aims to estimate the poses of multiple persons in a continuous video and to reliably associate the identities of those persons across different video frames. It has wide application in visual fields such as behavior recognition, multimedia analysis, and event detection.
At present, a pose tracking method generally detects all pedestrian frames in a video image, estimates the human-body key points within those pedestrian frames, and finally realizes identity association of pedestrian frames between different video frames by constructing a data association.
However, the existing data association is usually realized by three independent steps (feature learning, similarity estimation, and identity allocation), and the similarity estimation and identity allocation steps usually adopt manually designed models. The generalization capability of such models is low, effective data association cannot be achieved during pose tracking, and the pose tracking accuracy is therefore low.
Disclosure of Invention
The embodiments of the application provide an end-to-end multi-pedestrian pose tracking method, apparatus, electronic device, and computer-readable storage medium, which can address the low accuracy of existing pose tracking.
In a first aspect, an embodiment of the present application provides an end-to-end multi-pedestrian pose tracking method, including:
acquiring a video to be detected, and performing pedestrian detection and pose estimation on it to obtain the target detection frames of each image frame in the video to be detected, wherein an image frame comprises a plurality of target detection frames;
inputting, for each image frame to be processed, the image pair and the target detection frames of each image frame in the image pair into a feature extraction network, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multiple layers of convolution features, and the image pair comprises the image frame to be processed and the previous frame of the image frame to be processed;
inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into a network based on a channel attention mechanism, and obtaining the final channel-attention-based feature of the channel feature output by that network, wherein each layer of convolution features comprises a plurality of channel features;
processing the feature matrix of each image frame in the image pair to obtain a feature flux, and inputting the feature flux into a similarity estimation network to obtain the similarity matrix output by the similarity estimation network, wherein the feature matrix comprises the final channel-attention-based features;
assigning values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row of the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column of the similarity matrix to obtain the reverse similarity matrix;
generating a forward prediction matrix according to the forward similarity matrix, and generating a reverse prediction matrix according to the reverse similarity matrix;
generating a data association matrix between the image frame to be processed and the previous frame of the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix;
and according to the data association matrix, assigning identities to the target detection frames of the image frame to be processed based on the target detection frames of the previous frame of the image frame to be processed, so as to obtain the tracking result of the image frame to be processed.
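To make the final step concrete, here is a toy identity-allocation routine: each row of the data association matrix scores one target detection frame of the current frame against the target detection frames of the previous frame plus one augmented "new target" column, and an identity is assigned by row-wise argmax. The greedy argmax rule, the matrix values, and the ID bookkeeping are illustrative assumptions, not the patent's exact allocation procedure.

```python
import numpy as np

def assign_identities(A, prev_ids, next_new_id):
    """A: (n_curr, n_prev + 1) association matrix; row i scores current
    detection i against each previous detection, with the last column
    standing for "new target". Returns one identity per current
    detection and the updated fresh-ID counter."""
    ids = []
    for row in A:
        j = int(np.argmax(row))
        if j == len(prev_ids):        # augmented column won: new target
            ids.append(next_new_id)
            next_new_id += 1
        else:                         # inherit the matched identity
            ids.append(prev_ids[j])
    return ids, next_new_id

A = np.array([[0.90, 0.05, 0.05],     # matches previous detection 0
              [0.10, 0.10, 0.80]])    # augmented column wins: new target
ids, counter = assign_identities(A, prev_ids=[7, 3], next_new_id=10)
```

A globally optimal assignment (e.g. the Hungarian algorithm) could replace the greedy argmax when two rows compete for the same previous detection.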
As can be seen from the above, in the embodiment of the present application, the features of each target detection frame are extracted through the feature extraction network and the network based on the channel attention mechanism; a similarity estimation network generates the similarity matrix when measuring similarity, and the similarity matrix is then augmented to obtain a forward similarity matrix and a reverse similarity matrix. This solves the problems of "old targets disappearing" and "new targets appearing" during tracking and provides a bidirectionally perceived data association between the target detection frames of the current frame and the previous frame, thereby realizing effective data association and improving the accuracy of pose tracking.
In some possible implementation manners of the first aspect, assigning values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row of the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column of the similarity matrix to obtain the reverse similarity matrix, includes:
calculating the peak-to-sidelobe ratio of each row of the similarity matrix as $\mathrm{PSR}_i = (\max_j S_{i,j} - \mu_i) / \sigma_i$, where $\mathrm{PSR}_i$ is the peak-to-sidelobe ratio of the i-th row of the similarity matrix $S$, $\mu_i$ is the mean of the elements in row i, and $\sigma_i$ is the standard deviation of the elements in row i;
judging whether the peak-to-sidelobe ratio of the i-th row is smaller than a first preset threshold;
when the peak-to-sidelobe ratio of the i-th row is smaller than the first preset threshold, setting the i-th element of the augmented column to 1; when it is greater than or equal to the first preset threshold, setting the i-th element of the augmented column to 0;
calculating the peak-to-sidelobe ratio of each column of the similarity matrix as $\mathrm{PSR}_j = (\max_i S_{i,j} - \mu_j) / \sigma_j$, where $\mathrm{PSR}_j$ is the peak-to-sidelobe ratio of the j-th column, $\mu_j$ is the mean of the elements in column j, and $\sigma_j$ is the standard deviation of the elements in column j;
judging whether the peak-to-sidelobe ratio of the j-th column is smaller than a second preset threshold;
when the peak-to-sidelobe ratio of the j-th column is smaller than the second preset threshold, setting the j-th element of the augmented row to 1; when it is greater than or equal to the second preset threshold, setting the j-th element of the augmented row to 0.
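The augmentation procedure can be sketched in NumPy. The PSR formula used here, (peak minus mean) divided by standard deviation, is the standard definition and is an assumption, as are the default thresholds; the patent's original formula images are not reproduced in this text.

```python
import numpy as np

def augment_similarity(S, thr_row=1.0, thr_col=1.0):
    """Build the forward (extra column) and reverse (extra row)
    similarity matrices from an n x m similarity matrix S. A row or
    column whose peak-to-sidelobe ratio (PSR) falls below its threshold
    has no clear matching peak, so its augmented entry is set to 1
    (likely disappeared old target / arriving new target); otherwise 0.
    PSR = (max - mean) / std is an assumed, standard definition."""
    n, m = S.shape
    fwd = np.zeros((n, m + 1))
    fwd[:, :m] = S
    bwd = np.zeros((n + 1, m))
    bwd[:n, :] = S
    for i in range(n):
        row = S[i]
        psr = (row.max() - row.mean()) / (row.std() + 1e-12)
        fwd[i, m] = 1.0 if psr < thr_row else 0.0
    for j in range(m):
        col = S[:, j]
        psr = (col.max() - col.mean()) / (col.std() + 1e-12)
        bwd[n, j] = 1.0 if psr < thr_col else 0.0
    return fwd, bwd

S = np.array([[10., 0., 0., 0., 0.],   # sharp peak: clearly matched
              [1.,  1., 1., 1., 1.]])  # flat row: no match anywhere
fwd, bwd = augment_similarity(S)
```

With these values, the flat second row gets a 1 in the augmented column, flagging a target with no counterpart in the other frame.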
In some possible implementations of the first aspect, generating a forward prediction matrix according to the forward similarity matrix, and generating a reverse prediction matrix according to the reverse similarity matrix includes:
applying a softmax function point-wise along each row of the forward similarity matrix to generate the forward prediction matrix;
and applying a softmax function point-wise along each column of the reverse similarity matrix to generate the reverse prediction matrix.
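The row-wise and column-wise softmax normalizations just described can be sketched directly; the matrix values below are made up for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stabilized softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative 2x3 forward similarity matrix: 2 current-frame targets,
# 2 previous-frame targets plus one augmented "disappeared" column.
S_fwd = np.array([[5.0, 0.0, 0.0],
                  [0.0, 4.0, 1.0]])
P_fwd = softmax(S_fwd, axis=1)   # forward prediction: each row sums to 1

# Illustrative 3x2 reverse similarity matrix with an augmented last row.
S_bwd = np.array([[5.0, 0.0],
                  [0.0, 4.0],
                  [0.5, 0.5]])
P_bwd = softmax(S_bwd, axis=0)   # reverse prediction: each column sums to 1
```

Each row of the forward prediction matrix is thus a probability distribution over candidate matches (including "disappeared"), and symmetrically for the columns of the reverse prediction matrix.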
In some possible implementations of the first aspect, generating a data association matrix between the image frame to be processed and a frame preceding the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix includes:
removing the last column of the forward prediction matrix to obtain a target forward prediction matrix;
removing the last row of the reverse prediction matrix to obtain a target reverse prediction matrix;
generating the data association matrix between the image frame to be processed and the previous frame of the image frame to be processed by combining the target forward prediction matrix and the target reverse prediction matrix;
wherein $A$ is the data association matrix, $P^{f}$ is the target forward prediction matrix, and $P^{b}$ is the target reverse prediction matrix.
In some possible implementations of the first aspect, inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into the network based on the channel attention mechanism, and obtaining the final channel-attention-based feature of the channel feature output by that network, includes:
for each channel feature $u_c$ of each layer of convolution features, inputting $u_c$ into the network based on the channel attention mechanism, and obtaining the final channel-attention-based feature $\tilde{u}_c$ output by the network;
wherein $\tilde{u}_c = s_c \cdot u_c$, $s_c = \sigma\left(W_2\,\delta(W_1 z)\right)$, and $z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i,j)$; $u_c(i,j)$ denotes the value of the c-th channel feature at position $(i,j)$; $\sigma$ and $\delta$ denote the sigmoid function and the ReLU function, respectively; $W_1$ denotes the weight set of the first fully connected layer and $W_2$ the weight set of the second fully connected layer; the resolution of the c-th channel feature is $H \times W$.
The network based on the channel attention mechanism comprises, connected in sequence, a global pooling layer, a first fully connected layer, a ReLU function, a second fully connected layer, and a sigmoid function.
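The pooling / FC / ReLU / FC / sigmoid pipeline just described matches a squeeze-and-excitation style channel attention block, which can be sketched in NumPy as follows; the weight shapes and the reduction ratio are illustrative assumptions.

```python
import numpy as np

def channel_attention(u, w1, w2):
    """Channel attention over a feature map u of shape (C, H, W):
    global average pooling -> FC -> ReLU -> FC -> sigmoid, then each
    channel of u is rescaled by its attention weight."""
    z = u.mean(axis=(1, 2))                # squeeze: (C,) channel descriptors
    h = np.maximum(w1 @ z, 0.0)            # first FC layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))    # second FC layer + sigmoid, in (0, 1)
    return s[:, None, None] * u            # excite: rescale each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                    # r is an assumed reduction ratio
u = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))      # illustrative random weights
w2 = rng.standard_normal((C, C // r))
out = channel_attention(u, w1, w2)
```

Since every attention weight lies in (0, 1), each channel is reweighted rather than amplified, which is how the block expresses differences between feature channels.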
By contrast, prior-art methods treat every feature channel identically, so the differences between feature channels cannot be exploited effectively, which reduces the generalization capability of the features in real scenes.
In this implementation, the network based on the channel attention mechanism captures the differences between feature channels, increasing the discriminative power and generalization capability of the features.
In some possible implementations of the first aspect, before acquiring the video to be detected, the method further includes:
acquiring a training data set;
training a data association network model by adopting a training data set, wherein the data association network model comprises a feature extraction network, a network based on a channel attention mechanism and a bidirectional perception network, and the bidirectional perception network comprises a similarity estimation network;
calculating the overall loss through the overall loss function of the data association network model, and iteratively training the data association network model according to the overall loss;
wherein the overall loss function combines a forward matching loss, computed from image frame I_{t-1} to image frame I_t between the forward prediction matrix and the forward labeled association matrix, with a reverse matching loss computed between the reverse prediction matrix and the reverse labeled association matrix; the target forward prediction matrix and the target reverse prediction matrix also enter the overall loss.
In this implementation, feature learning, similarity measurement, and target identity allocation are integrated into a unified data association network model, and an overall loss function is designed for this model so that end-to-end model training and testing can be realized, which further improves the subsequent pose tracking accuracy.
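The patent's exact loss formula is not reproduced in this text; a common choice for matching losses of this kind is a cross-entropy between each row- or column-normalized prediction matrix and its binary labeled association matrix, summed over both directions. The sketch below assumes that form and is not the patent's definitive formula.

```python
import numpy as np

def bidirectional_matching_loss(P_fwd, A_fwd, P_bwd, A_bwd, eps=1e-12):
    """Hypothetical overall loss: cross-entropy of the forward
    prediction matrix against the forward labeled association matrix,
    plus the same for the reverse direction."""
    l_fwd = -(A_fwd * np.log(P_fwd + eps)).sum()
    l_bwd = -(A_bwd * np.log(P_bwd + eps)).sum()
    return l_fwd + l_bwd

A = np.eye(2)                                   # ground-truth matches
P_good = np.array([[0.9, 0.1], [0.1, 0.9]])     # confident and correct
P_bad = np.array([[0.1, 0.9], [0.9, 0.1]])      # confident and wrong
loss_good = bidirectional_matching_loss(P_good, A, P_good.T, A.T)
loss_bad = bidirectional_matching_loss(P_bad, A, P_bad.T, A.T)
```

As expected for a cross-entropy, correct confident predictions incur a much smaller loss than confidently wrong ones.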
In a second aspect, an embodiment of the present application provides an end-to-end multi-person pose tracking apparatus, including:
the pedestrian detection and pose estimation module is used for acquiring a video to be detected, performing pedestrian detection and pose estimation on it, and obtaining the target detection frames of each image frame in the video to be detected, wherein an image frame comprises a plurality of target detection frames;
the multilayer convolution feature extraction module is used for inputting, for each image frame to be processed, the image pair and the target detection frames of each image frame in the image pair into the feature extraction network, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multiple layers of convolution features, and the image pair comprises the image frame to be processed and the previous frame of the image frame to be processed;
the channel-attention-based feature extraction module is used for inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into the network based on the channel attention mechanism, and obtaining the final channel-attention-based feature of each channel feature output by that network, wherein each layer of convolution features comprises a plurality of channel features;
the similarity matrix generation module is used for processing the feature matrix of each image frame in the image pair to obtain a feature flux, and inputting the feature flux into the similarity estimation network to obtain the similarity matrix output by the similarity estimation network, wherein the feature matrix comprises the final channel-attention-based features;
the assignment module is used for assigning values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row of the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column of the similarity matrix to obtain the reverse similarity matrix;
the prediction matrix generation module is used for generating a forward prediction matrix according to the forward similarity matrix and a reverse prediction matrix according to the reverse similarity matrix;
the data association matrix generation module is used for generating a data association matrix between the image frame to be processed and the previous frame of the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix;
and the identity allocation module is used for assigning identities to the target detection frames of the image frame to be processed based on the target detection frames of the previous frame of the image frame to be processed according to the data association matrix, so as to obtain the tracking result of the image frame to be processed.
In some possible implementations of the second aspect, the assignment module is specifically configured to:
calculating the peak-to-sidelobe ratio of each row of the similarity matrix as $\mathrm{PSR}_i = (\max_j S_{i,j} - \mu_i) / \sigma_i$, where $\mathrm{PSR}_i$ is the peak-to-sidelobe ratio of the i-th row of the similarity matrix $S$, $\mu_i$ is the mean of the elements in row i, and $\sigma_i$ is the standard deviation of the elements in row i;
judging whether the peak-to-sidelobe ratio of the i-th row is smaller than a first preset threshold;
when the peak-to-sidelobe ratio of the i-th row is smaller than the first preset threshold, setting the i-th element of the augmented column to 1; when it is greater than or equal to the first preset threshold, setting the i-th element of the augmented column to 0;
calculating the peak-to-sidelobe ratio of each column of the similarity matrix as $\mathrm{PSR}_j = (\max_i S_{i,j} - \mu_j) / \sigma_j$, where $\mathrm{PSR}_j$ is the peak-to-sidelobe ratio of the j-th column, $\mu_j$ is the mean of the elements in column j, and $\sigma_j$ is the standard deviation of the elements in column j;
judging whether the peak-to-sidelobe ratio of the j-th column is smaller than a second preset threshold;
when the peak-to-sidelobe ratio of the j-th column is smaller than the second preset threshold, setting the j-th element of the augmented row to 1; when it is greater than or equal to the second preset threshold, setting the j-th element of the augmented row to 0.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method according to any one of the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program is executed by a processor to implement the method according to any one of the above first aspects.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on an electronic device, causes the electronic device to perform the method of any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart of an end-to-end multi-person pose tracking method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a network based on a channel attention mechanism according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a data association model training process according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of an end-to-end multi-pedestrian pose tracking apparatus provided by an embodiment of the present application;
fig. 5 is a block diagram schematically illustrating a structure of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to" determining "or" in response to detecting ". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The end-to-end multi-pedestrian pose tracking method provided by the embodiments of the present application can be applied to electronic equipment such as monitoring equipment, for example a video-analysis all-in-one machine. The embodiments of the present application place no limit on the specific type of electronic device. For example, in a monitoring scene, pedestrian pose tracking is realized by the end-to-end multi-pedestrian pose tracking method of the embodiments of the present application.
Referring to fig. 1, a flow chart of an end-to-end multi-person pose tracking method provided by an embodiment of the present application is schematically shown, where the method may include the following steps:
s101, acquiring a video to be detected, and performing pedestrian detection and attitude estimation on the video to be detected to obtain target detection frames of each image frame in the video to be detected, wherein the image frame comprises a plurality of target detection frames.
Illustratively, the pedestrian frames in all image frames of the video to be detected are detected by a pedestrian detector, and the human-body key points in each pedestrian frame are then estimated with a pose estimation network to obtain the target detection frames. That is, a target detection frame comprises a pedestrian frame and human-body key points.
It is understood that the video to be detected comprises a plurality of image frames, each image frame may contain multiple pedestrians, and each image frame contains at least one target detection frame.
The pedestrian detector can be obtained by training the target detection network Fast-RCNN on the existing COCO dataset, and the pose estimation network can be obtained by training the pose estimation network HRNet on the pose estimation subset of the existing COCO dataset and the pose tracking dataset PoseTrack.
Step S102, for each image frame to be processed, inputting the image pair and the target detection frames of each image frame in the image pair into a feature extraction network, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multiple layers of convolution features, and the image pair comprises the image frame to be processed and the previous frame of the image frame to be processed.
After the target detection frames of each image frame have been detected, a data association matrix between each pair of adjacent frames needs to be constructed in order to realize inter-frame data association and pose tracking.
The data association process can be divided into three stages: feature learning, similarity measurement, and identity allocation. In the embodiment of the present application, the feature learning stage may comprise steps S102 and S103.
In a specific application, the image frames are processed iteratively: for each image frame, the data association matrix between it and the previous image frame is obtained, and identities are then assigned to the target detection frames of that frame according to this matrix. Repeating these steps yields the pose tracking result of the whole video to be detected.
For each image frame, it is first paired with its previous frame to form an image pair, i.e. the image pair comprises the image frame I_t and the previous image frame I_{t-1}.
Then, the image pair and the target detection frames on each of its image frames are input into the feature extraction network, and the hierarchical convolution feature of each target detection frame output by the network is obtained. Since each image frame contains a plurality of target detection frames, the output of the feature extraction network may include the hierarchical convolution features of the target detection frames of image frame I_t and the hierarchical convolution features of the target detection frames of image frame I_{t-1}. Each target detection frame corresponds to one hierarchical convolution feature. A hierarchical convolution feature consists of features extracted from multiple convolutional layers and therefore comprises multiple layers of convolution features.
In the embodiment of the present application, a parameter-shared dual-stream convolutional network can be used as the feature extraction network. The dual-stream convolutional network comprises two branches: one branch extracts the features of image frame I_t, and the other branch extracts the features of image frame I_{t-1}.
Illustratively, a deep convolutional network is constructed based on VGG16 as the feature extraction network: the first 34 layers adopt the original VGG16, and several groups of "convolutional layer - BN layer - ReLU layer" are appended after it. Some architectural parameters of the deep convolutional network are shown in Table 1 below; for simplicity, the BN and ReLU layers are omitted from the table.
TABLE 1
| Ind | In-sz | In-ch | Out-sz | Out-ch | Str | Kern | Pad | R-sz | R-ch |
|-----|-------|-------|--------|--------|-----|------|-----|------|------|
| 15 | 225x225 | 256 | 225x225 | 256 | - | - | - | 1x1 | 60 |
| 22 | 113x113 | 512 | 113x113 | 512 | - | - | - | 1x1 | 80 |
| 34 | 56x56 | 1024 | 56x56 | 1024 | - | - | - | 1x1 | 100 |
| 35 | 56x56 | 1024 | 56x56 | 256 | 1 | 1x1 | 0 | 1x1 | - |
| 38 | 56x56 | 512 | 28x28 | 512 | 2 | 3x3 | 1 | 1x1 | 80 |
| 41 | 28x28 | 512 | 28x28 | 128 | 1 | 1x1 | 0 | 1x1 | - |
| 44 | 14x14 | 128 | 14x14 | 256 | 2 | 3x3 | 1 | 1x1 | 60 |
| 47 | 14x14 | 256 | 14x14 | 128 | 1 | 1x1 | 0 | 1x1 | - |
| 50 | 14x14 | 128 | 12x12 | 256 | 2 | 3x3 | 1 | 1x1 | 50 |
| 53 | 12x12 | 256 | 12x12 | 128 | 1 | 1x1 | 0 | 1x1 | - |
| 56 | 14x14 | 128 | 10x10 | 256 | 2 | 3x3 | 1 | 1x1 | 40 |
| 59 | 10x10 | 256 | 10x10 | 128 | 1 | 1x1 | 0 | 1x1 | - |
| 62 | 10x10 | 128 | 5x5 | 256 | 2 | 3x3 | 1 | 1x1 | 30 |
| 65 | 5x5 | 256 | 5x5 | 128 | 1 | 1x1 | 0 | 1x1 | - |
| 68 | 5x5 | 128 | 5x5 | 256 | 2 | 3x3 | 0 | 1x1 | 20 |

(Ind: layer index; In-sz / In-ch: input size and channels; Out-sz / Out-ch: output size and channels; Str: stride; Kern: kernel size; Pad: padding; R-sz / R-ch: size and channel count of the per-layer feature representation.)
By way of example and not limitation, the convolution features of layers 15, 22, 34, 38, 44, 50, 56, 62 and 68 of the deep convolutional network in Table 1 can be selected as the basic feature representations of the target detection frame. That is, the hierarchical convolution feature of each target detection frame includes the convolution features extracted by layers 15, 22, 34, 38, 44, 50, 56, 62 and 68.
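As an illustrative sketch (not the patent's implementation; `layers` is a hypothetical list of layer callables standing in for the VGG16-based network), collecting the hierarchical convolution feature amounts to keeping the outputs of the selected layer indices during the forward pass:

```python
def hierarchical_features(layers, x, picks=(15, 22, 34, 38, 44, 50, 56, 62, 68)):
    """Run an ordered list of layer callables and keep the outputs of the
    selected 1-based indices (per Table 1) as the hierarchical convolution
    feature of a detection box."""
    feats = []
    for idx, layer in enumerate(layers, start=1):
        x = layer(x)          # forward through one layer
        if idx in picks:
            feats.append(x)   # retain this layer's output
    return feats
```

The sketch is agnostic to what each layer actually computes; in practice the callables would be the convolutional blocks of the network.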
Step S103, respectively inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into the network based on the channel attention mechanism, and obtaining the final feature of the channel feature, described based on channel attention, output by the network, wherein each layer of convolution features comprises a plurality of channel features.
It will be appreciated that each layer of the convolutional network includes a plurality of channels, and thus each layer of the convolutional features includes a plurality of channel features. Namely, for any target detection frame, the corresponding hierarchical convolution feature comprises multiple layers of convolution features, and each layer of convolution feature comprises multiple channel features.
In a specific application, the channel feature may be used as an input of the network based on the channel attention mechanism, and an output of the network based on the channel attention mechanism is a final feature of the channel feature described based on the channel attention mechanism.
In some embodiments, referring to the schematic architecture diagram of the channel-attention-based network shown in fig. 2, the network includes a global pooling layer (global pooling), a first fully connected layer (FC), a ReLU activation, a second fully connected layer (FC) and a sigmoid function connected in this order; the input of the network is a channel feature map of resolution W×H, and the output is a feature map of the same resolution described based on channel attention.
Based on the channel-attention-based network shown in fig. 2, for each channel feature u_c of each layer of convolution features U, the channel feature u_c is input into the network, and the network outputs the final feature ũ_c of that channel, described based on channel attention.
In the network shown in fig. 2, the channel attention descriptor is constructed in the embodiment of the present application by global average pooling: z_c = (1/(W×H)) Σ_{i=1..W} Σ_{j=1..H} u_c(i, j), where u_c(i, j) indicates the value of the c-th channel feature u_c at position (i, j).
In addition, in order to further capture the inter-channel variability of the features, in the embodiment of the present application a sigmoid-based gating mechanism is constructed after the global pooling layer using two fully connected layers and one ReLU layer, so as to generate the final channel attention descriptor: s = σ(W_2 δ(W_1 z)); the final feature is then ũ_c = s_c · u_c.
Here σ(·) and δ(·) denote the sigmoid function and the ReLU function, respectively; W_1 denotes the weights of the first fully connected layer, which down-samples the descriptor with a certain reduction ratio, and W_2 denotes the weights of the second fully connected layer, which implements the up-sampling; the resolution of the c-th channel feature is W×H.
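The channel attention mechanism described above (global average pooling followed by a two-layer gate) can be sketched in NumPy as follows; the weight shapes and any reduction ratio are illustrative assumptions:

```python
import numpy as np

def channel_attention(u, w1, w2):
    """Re-weight each channel of a feature map u of shape (C, H, W) by a
    channel-attention gate, as described in the text.
    w1: (C//r, C) weights of the first FC layer (down-sampling, ratio r).
    w2: (C, C//r) weights of the second FC layer (up-sampling)."""
    # channel descriptor: global average pooling, z_c = mean over (i, j)
    z = u.mean(axis=(1, 2))                                      # (C,)
    # gate: s = sigmoid(W2 . relu(W1 . z))
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))    # (C,)
    # final feature described by channel attention: scale each channel by s_c
    return u * s[:, None, None]
```

With zero gate weights the sigmoid outputs 0.5 for every channel, so each channel is simply halved, which makes the scaling behaviour easy to verify.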
It is worth pointing out that the prior art treats every feature channel identically, so the differences between feature channels cannot be exploited effectively, which reduces the generalization capability of the features in real scenes. In the embodiment of the present application, the channel-attention-based network is used to capture the differences between feature channels, increasing the discrimination and generalization capability of the features.
And step S104, processing the feature matrix of each image frame in the image pair to obtain feature flux, inputting the feature flux into the similarity estimation network to obtain a similarity matrix output by the similarity estimation network, wherein the feature matrix comprises final features based on channel attention description.
In a specific application, for the image frame I_t, each channel feature of each layer of convolution features of each target detection frame on the image frame is input into the channel-attention-based network in the manner of step S103 to obtain the final features described based on channel attention, and these final features constitute the feature matrix M_t of the image frame I_t.
Similarly, the feature matrix M_{t-1} of the image frame I_{t-1} is composed of the final features based on the channel attention description.
After obtaining the feature matrices of the image frames I_t and I_{t-1}, the feature matrices may be spliced along the feature channel into a tensor, i.e., all the rows of M_t and M_{t-1} are spliced along the tensor depth in an N×N arrangement to obtain the feature flux, where N is the maximum number of pedestrians on any image frame of the video to be detected.
Then, the feature flux is input into the similarity estimation network to obtain the similarity matrix S output by the similarity estimation network. Illustratively, the similarity estimation network includes five 1×1 convolutional layers whose numbers of feature channels are 512, 256, 128, 64 and 1, respectively. The similarity estimation network thus implements a gradual dimensionality reduction along the depth to generate a similarity matrix S of size N×N.
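A minimal NumPy sketch of the feature-flux construction and the similarity estimation network follows; the exact pairing order in the N×N arrangement is an assumption for illustration, and each 1×1 convolution is realized as a per-position matrix multiplication over the channel axis:

```python
import numpy as np

def feature_flux(m_prev, m_cur):
    """Tile two feature matrices of shape (N, d) into an (N, N, 2d) tensor:
    position (i, j) stacks feature i of the previous frame with feature j
    of the current frame along the channel (depth) axis."""
    n, d = m_prev.shape
    a = np.repeat(m_prev[:, None, :], n, axis=1)   # (N, N, d)
    b = np.repeat(m_cur[None, :, :], n, axis=0)    # (N, N, d)
    return np.concatenate([a, b], axis=2)          # (N, N, 2d)

def similarity_net(flux, weights):
    """A stack of 1x1 'convolutions' is a per-position MLP: each (i, j)
    cell of the flux is mapped 2d -> ... -> 1 with ReLU between layers,
    yielding an N x N similarity matrix S."""
    x = flux
    for k, w in enumerate(weights):
        x = x @ w                                  # 1x1 conv = matmul on channels
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)                 # ReLU between layers
    return x[:, :, 0]                              # (N, N)
```

In the patent's configuration the `weights` list would have five entries reducing the channels 2d → 512 → 256 → 128 → 64 → 1.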
It is worth pointing out that, for similarity estimation, the embodiment of the present application learns the similarity instead of using existing hand-crafted calculation methods, which improves the generalization capability of the model.
And step S105, assigning values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix.
It should be noted that the similarity estimation network estimates a similarity matrix S for the N×N pedestrian pairs of two image frames. If the similarity is not analyzed further, the situations of "old target disappears" and "new target appears" cannot be taken into account.
In the embodiment of the application, in order to solve the typical situations of 'old target disappearing' and 'new target coming' in the target tracking process, the similarity matrix is further augmented to obtain a forward correlation similarity matrix and a reverse correlation similarity matrix.
In order to construct the forward similarity matrix, after the similarity matrix is obtained, a column is adaptively added to the similarity matrix to obtain the forward similarity matrix to be assigned, S̄_f. "To be assigned" means that the elements of the newly added column (called the augmented column) have not yet been given values.
After obtaining the forward similarity matrix to be assigned S̄_f, the elements of the augmented column are assigned according to the peak-to-sidelobe ratio of each row of the similarity matrix, so as to obtain the assigned forward similarity matrix S_f.
In some embodiments, the peak-to-sidelobe ratio of each row of the similarity matrix is calculated by the formula PSR_i = (max_j S_{i,j} − μ_i) / σ_i, where PSR_i is the peak-to-sidelobe ratio of the i-th row of the similarity matrix and μ_i and σ_i are the mean and standard deviation of the elements in row i; it is then judged whether the peak-to-sidelobe ratio of the row is less than a first preset threshold.
Finally, when the peak-to-sidelobe ratio of the i-th row is less than the first preset threshold, the i-th element of the augmented column is set to 1; when the peak-to-sidelobe ratio of the i-th row is greater than or equal to the first preset threshold, the i-th element of the augmented column is set to 0.
Similarly, in order to construct the reverse similarity matrix, after the similarity matrix is obtained, a row is adaptively added to the similarity matrix to obtain the reverse similarity matrix to be assigned, S̄_b. "To be assigned" means that the elements of the newly added row (called the augmented row) have not yet been given values.
After obtaining the reverse similarity matrix to be assigned S̄_b, the elements of the augmented row are assigned according to the peak-to-sidelobe ratio of each column of the similarity matrix, so as to obtain the assigned reverse similarity matrix S_b.
In some embodiments, the peak-to-sidelobe ratio of each column of the similarity matrix is calculated by the formula PSR_j = (max_i S_{i,j} − μ_j) / σ_j, where PSR_j is the peak-to-sidelobe ratio of the j-th column of the similarity matrix and μ_j and σ_j are the mean and standard deviation of the elements in column j; it is then judged whether the peak-to-sidelobe ratio of the j-th column is less than a second preset threshold. When it is less than the second preset threshold, the j-th element of the augmented row is set to 1; when it is greater than or equal to the second preset threshold, the j-th element of the augmented row is set to 0.
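The forward/reverse augmentation described above can be sketched as follows; the threshold values and the small epsilon guarding a zero standard deviation are illustrative assumptions:

```python
import numpy as np

def augment_similarity(s, tau_row=4.0, tau_col=4.0):
    """Augment an N x N similarity matrix S with one extra column
    (forward matrix, flagging 'old target disappears') and one extra row
    (reverse matrix, flagging 'new target appears'), assigned by the
    peak-to-sidelobe ratio PSR = (max - mean) / std."""
    n = s.shape[0]

    def psr(v):
        return (v.max() - v.mean()) / (v.std() + 1e-12)

    # forward: one augmented column; row i gets 1 when its PSR is low
    # (no confident match in the current frame), else 0
    col = np.array([[1.0 if psr(s[i, :]) < tau_row else 0.0] for i in range(n)])
    s_fwd = np.hstack([s, col])                    # (N, N+1)
    # reverse: one augmented row, per-column PSR against the second threshold
    row = np.array([[1.0 if psr(s[:, j]) < tau_col else 0.0 for j in range(n)]])
    s_bwd = np.vstack([s, row])                    # (N+1, N)
    return s_fwd, s_bwd
```

A row with one strong peak has a high PSR (confident match), while a flat row has a near-zero PSR and therefore triggers the augmented entry.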
And S106, generating a forward prediction matrix according to the forward similarity matrix, and generating a reverse prediction matrix according to the reverse similarity matrix.
In the embodiment of the present application, considering the requirements of the loss function design, after the forward and reverse similarity matrices S_f and S_b are obtained, the softmax function is applied point-by-point to the rows of the forward similarity matrix S_f to generate the forward prediction matrix A_f; similarly, the softmax function is applied point-by-point to the columns of the reverse similarity matrix S_b to generate the reverse prediction matrix A_b.
The forward prediction matrix A_f encodes the matching of targets from image frame I_{t-1} to image frame I_t, including the case where a target leaves the picture in image frame I_t, i.e., "old target disappears". Similarly, the reverse prediction matrix A_b encodes the matching of targets from image frame I_t to image frame I_{t-1}, including the case where a new target appears in image frame I_t that was not present in I_{t-1}, i.e., "new target appears".
And step S107, generating a data association matrix between the image frame to be processed and the previous frame of the image frame to be processed according to the forward prediction matrix and the backward prediction matrix.
In some embodiments, after obtaining the forward prediction matrix A_f and the reverse prediction matrix A_b, the last column of the forward prediction matrix A_f may be removed to obtain the target forward prediction matrix Ā_f, and the last row of the reverse prediction matrix A_b removed to obtain the target reverse prediction matrix Ā_b.
Then, the data association matrix A between the image frame to be processed I_t and its previous frame I_{t-1} is generated by the formula A = (Ā_f + Ā_b) / 2.
And S108, according to the data incidence matrix, carrying out identity distribution on each target detection frame of the image frame to be processed based on each target detection frame of the previous frame of the image frame to be processed so as to obtain a tracking result of the image frame to be processed.
In a particular application, after obtaining the data association matrix between the image frames I_t and I_{t-1}, the similarity between all pedestrians on the previous frame and all pedestrians on the current frame is available, and identity assignment can then be performed based on the data association matrix.
For example, suppose the previous frame includes 10 pedestrian frames, each with a corresponding identity ID. Because the data association matrix represents the similarity between all pedestrians on the previous frame and all pedestrians on the current frame, for any pedestrian frame D on the previous frame, the pedestrian frame E on the current frame that has the highest similarity with D and has not yet been assigned an ID is regarded as the same pedestrian; that is, pedestrian frame E on the current frame is assigned the same ID as pedestrian frame D on the previous frame.
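The identity assignment described above can be sketched as a greedy matching on the data association matrix; the 0.5 acceptance threshold and the strongest-first visiting order are illustrative assumptions:

```python
import numpy as np

def assign_ids(assoc, prev_ids, next_new_id=0):
    """Greedy identity assignment: each previous-frame box hands its ID to
    the unclaimed current-frame box with the highest association score,
    provided the score clears a (hypothetical) threshold; remaining
    current-frame boxes get fresh IDs ('new target appears')."""
    n_prev, n_cur = assoc.shape
    cur_ids = [None] * n_cur
    taken = set()
    # visit previous boxes in order of their best score (strongest first)
    order = np.argsort(-assoc.max(axis=1))
    for i in order:
        j = int(np.argmax(assoc[i]))
        if j not in taken and assoc[i, j] > 0.5:
            cur_ids[j] = prev_ids[i]
            taken.add(j)
    for j in range(n_cur):
        if cur_ids[j] is None:        # unclaimed box: new identity
            cur_ids[j] = next_new_id
            next_new_id += 1
    return cur_ids
```

A globally optimal alternative would be Hungarian matching (e.g., `scipy.optimize.linear_sum_assignment`); the greedy form is kept here for brevity.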
By executing steps S102 to S108 for each image frame in the video to be detected, the motion trajectories and postures of all pedestrians in the whole video to be detected can be obtained, i.e., the posture tracking result.
As can be seen from the above, in the embodiment of the present application, the features of each target detection frame are extracted through the feature extraction network and the channel-attention-based network, the similarity estimation network is adopted to generate the similarity matrix when measuring similarity, and the similarity matrix is further augmented to obtain the forward and reverse similarity matrices. This solves the problems of "old target disappears" and "new target appears" during tracking and provides a bidirectionally perceived data association mode for the target detection frames between the current frame and the previous frame, thereby realizing effective data association and improving the accuracy of posture tracking.
Based on any of the above embodiments, referring to a flow diagram of a data association model training process provided in the embodiment of the present application shown in fig. 3, before acquiring a video to be detected, the method further includes the following steps:
and S301, acquiring a training data set.
And S302, training the data association network model by adopting a training data set.
The data association network model in the embodiment of the application is specifically a network based on channel attention bidirectional awareness, that is, the data association network model includes a feature extraction network, a network based on a channel attention mechanism, and a bidirectional awareness network, and the bidirectional awareness network includes a similarity estimation network.
In other words, in the embodiment of the application, feature learning, similarity measurement and target identity allocation are integrated into a unified data association network model, and compared with the existing implementation that three independent steps of feature learning, similarity estimation and identity allocation are adopted, the method improves the coupling between modules in the testing process of the network, and implements end-to-end network learning and testing.
In the training process, a labeled incidence matrix in the training process needs to be constructed.
First, for ease of understanding, let the maximum number N of target detection frames in an image frame be 6. In the training process, the data association matrix G is a binary matrix encoding the matching relations of all pedestrians in a pair of adjacent image frames (I_t, I_{t-1}).
Specifically, all the pedestrians in image frame I_{t-1} are first randomly ordered to obtain ID-1, ID-2, ID-3, …, ID-N. Similarly, all the pedestrians in image frame I_t are also randomly ordered.
Suppose the pedestrian with ID-3 in image I_{t-1}, when moving to image frame I_t, corresponds to the pedestrian with ID-2 on that frame. Then the value at row 3, column 2 of the data association matrix G is set to 1, and to 0 otherwise. In order to indicate the situations "old target leaves the picture" and "new target enters the picture" in the current frame, the embodiment of the present application expands the matrix G by one row and one column to obtain G̃, i.e., the final labeled incidence matrix.
Suppose the ID-2 pedestrian of image frame I_{t-1} disappears on image frame I_t, i.e., image frame I_t does not include the ID-2 pedestrian of I_{t-1}; then in the matrix G̃ the element in row 2 of the last column is set to 1, and to 0 otherwise.
Suppose the ID-4 pedestrian on image frame I_t is a newly appearing target, i.e., the ID-4 pedestrian of image frame I_t does not appear in image frame I_{t-1}; then in the matrix G̃ the element in column 4 of the last row is set to 1, and to 0 otherwise.
Based on this principle, the complete augmented labeled incidence matrix can be obtained. Then, the last row of the matrix G̃ is removed to obtain the forward labeled incidence matrix G̃_f, which represents the forward match from the previous frame to the current frame; the last column of G̃ is removed to obtain the reverse labeled incidence matrix G̃_b, which represents the reverse match from the current frame to the previous frame.
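The construction of the augmented labeled incidence matrix and its forward/reverse variants can be sketched directly from the ID lists of the two frames:

```python
import numpy as np

def build_labels(prev_ids, cur_ids, n):
    """Build the augmented labeled incidence matrix of size (N+1) x (N+1)
    from the pedestrian IDs of two adjacent frames: entry (i, j) is 1 when
    previous-frame pedestrian i and current-frame pedestrian j share an ID;
    the extra column flags 'old target disappears' and the extra row flags
    'new target appears'."""
    g = np.zeros((n + 1, n + 1))
    for i, pid in enumerate(prev_ids):
        if pid in cur_ids:
            g[i, cur_ids.index(pid)] = 1.0
        else:
            g[i, n] = 1.0                  # left the picture
    for j, cid in enumerate(cur_ids):
        if cid not in prev_ids:
            g[n, j] = 1.0                  # newly appeared
    g_fwd = g[:-1, :]                      # forward labels: last row removed
    g_bwd = g[:, :-1]                      # reverse labels: last column removed
    return g, g_fwd, g_bwd
```

The test below mirrors the text's examples: a disappearing ID marks the last column, a newly appearing ID marks the last row.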
The labeled incidence matrix, the forward labeled incidence matrix and the reverse labeled incidence matrix obtained by labeling can be used for calculating loss values in the training process.
Step S303, calculating the overall loss through the overall loss function of the data association network model, and iteratively training the data association network model according to the overall loss.
L_f represents the forward matching loss from image frame I_{t-1} to image frame I_t; the smaller it is during training, the better: L_f = −Σ(G̃_f ⊙ log A_f) / Σ G̃_f.
L_b represents the reverse matching loss from image frame I_t to image frame I_{t-1}; the smaller it is during training, the better: L_b = −Σ(G̃_b ⊙ log A_b) / Σ G̃_b;
L_c is the bidirectional consistency matching loss, i.e., it encourages the results of the forward and reverse matches to be as consistent as possible: L_c = ‖Ā_f − Ā_b‖_1. Here A_f is the forward prediction matrix, G̃_f the forward labeled incidence matrix, G̃_b the reverse labeled incidence matrix, A_b the reverse prediction matrix, Ā_f the target forward prediction matrix, and Ā_b the target reverse prediction matrix.
In order to achieve end-to-end model training and model testing, four sub-loss functions are designed, and the overall loss function of the network is obtained by linearly weighting them.
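A sketch of how the sub-losses might be combined; the masked negative log-likelihood form of the matching losses, the L1 consistency term, the unit weights and the omission of the fourth sub-loss are all assumptions of this illustration:

```python
import numpy as np

def overall_loss(a_fwd, a_bwd, g_fwd, g_bwd, eps=1e-12):
    """Illustrative overall loss: masked negative log-likelihood for the
    forward and reverse matching losses, plus an L1 bidirectional
    consistency term on the trimmed prediction matrices.
    a_fwd: (N, N+1) forward prediction; a_bwd: (N+1, N) reverse prediction;
    g_fwd / g_bwd: the corresponding labeled incidence matrices."""
    l_f = -(g_fwd * np.log(a_fwd + eps)).sum() / max(g_fwd.sum(), 1.0)
    l_b = -(g_bwd * np.log(a_bwd + eps)).sum() / max(g_bwd.sum(), 1.0)
    # consistency between the trimmed (target) prediction matrices
    l_c = np.abs(a_fwd[:, :-1] - a_bwd[:-1, :]).mean()
    return l_f + l_b + l_c
```

With perfect one-hot predictions that match the labels, every term vanishes, which is the sanity check used below.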
The training steps of the data association model are similar to the above steps S101 to S108. For example, after detecting pedestrian frames and human body key points of each video in a training data set by adopting a pedestrian detector and an attitude estimation network, inputting an image pair and a target detection frame of each image frame into a feature extraction network for each image frame in the video, and then obtaining a feature matrix of each image frame based on the attention mechanism network; then, based on the feature matrix of each image frame, a similarity matrix is obtained by using a similarity estimation network; and finally, acquiring a data association matrix based on the similarity matrix, calculating the overall loss value of the network according to the network output and the overall loss function, iteratively updating network parameters according to the overall loss value, and obtaining a trained data association model when the overall loss value tends to be stable.
Therefore, the embodiment of the application integrates the feature learning, the similarity measurement and the target identity distribution into a unified data association network model, and designs an overall loss function for the data association network model to realize end-to-end model training and model testing, thereby further improving the subsequent attitude tracking accuracy.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Corresponding to the end-to-end multi-pedestrian posture tracking method described in the above embodiments, fig. 4 shows a structural block diagram of an end-to-end multi-pedestrian posture tracking apparatus provided in the embodiments of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown.
Referring to fig. 4, the apparatus includes:
the pedestrian detection and posture estimation module 41 is configured to acquire a video to be detected, perform pedestrian detection and posture estimation on the video to be detected, and acquire target detection frames of each image frame in the video to be detected, where the image frame includes a plurality of target detection frames;
the multilayer convolution feature extraction module 42 is configured to, for each image frame to be processed, input the image pair and the target detection frame of each image frame in the image pair to the feature extraction network, and obtain a hierarchical convolution feature of each target detection frame output by the feature extraction network, where the hierarchical convolution feature includes multilayer convolution features, and the image pair includes the image frame to be processed and a previous frame of the image frame to be processed;
a feature extraction module 43 based on channel attention, configured to input each channel feature of each layer of convolution features of each hierarchical convolution feature to a network based on a channel attention mechanism, respectively, and obtain a final feature based on channel attention description of the channel feature output by the network based on the channel attention mechanism, where each layer of convolution features includes a plurality of channel features;
the similarity matrix generation module 44 is configured to process a feature matrix of each image frame in the image pair to obtain a feature flux, input the feature flux to the similarity estimation network, and obtain a similarity matrix output by the similarity estimation network, where the feature matrix includes a final feature based on channel attention description;
an assignment module 45, configured to assign values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and to assign values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix;
a prediction matrix generation module 46, configured to generate a forward prediction matrix according to the forward similarity matrix, and generate a backward prediction matrix according to the backward similarity matrix;
a data correlation matrix generation module 47, configured to generate a data correlation matrix between the image frame to be processed and a previous frame of the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix;
and the identity distribution module 48 is configured to perform identity distribution on each target detection frame of the image frame to be processed based on each target detection frame of a previous frame of the image frame to be processed according to the data association matrix, so as to obtain a tracking result of the image frame to be processed.
In some possible implementations, the assignment module is specifically configured to:
calculating the peak-to-sidelobe ratio of each row of the similarity matrix by the formula PSR_i = (max_j S_{i,j} − μ_i) / σ_i, where PSR_i is the peak-to-sidelobe ratio of the i-th row of the similarity matrix and μ_i and σ_i are the mean and standard deviation of the elements in row i;
judging whether the peak value sidelobe ratio of the ith row is smaller than a first preset threshold value or not;
when the peak-to-sidelobe ratio of the i-th row is less than the first preset threshold, setting the i-th element of the augmented column to 1; when the peak-to-sidelobe ratio of the i-th row is greater than or equal to the first preset threshold, setting the i-th element of the augmented column to 0;
calculating the peak-to-sidelobe ratio of each column of the similarity matrix by the formula PSR_j = (max_i S_{i,j} − μ_j) / σ_j, where PSR_j is the peak-to-sidelobe ratio of the j-th column of the similarity matrix and μ_j and σ_j are the mean and standard deviation of the elements in column j;
judging whether the peak sidelobe ratio of the jth column is smaller than a second preset threshold value or not;
when the peak-to-sidelobe ratio of the j-th column is less than the second preset threshold, setting the j-th element of the augmented row to 1; and when the peak-to-sidelobe ratio of the j-th column is greater than or equal to the second preset threshold, setting the j-th element of the augmented row to 0.
In some possible implementations, the prediction matrix generation module is specifically configured to: performing point-by-point operation on the rows of the forward similarity matrix by using a softmax function to generate a forward prediction matrix; and performing point-by-point operation on the columns of the reverse similarity matrix by using a softmax function to generate a reverse prediction matrix.
In some possible implementations, the data association matrix generation module is specifically configured to: removing the last column of the forward prediction matrix to obtain a target forward prediction matrix;
removing the last row of the reverse prediction matrix to obtain a target reverse prediction matrix;
generating the data association matrix between the image frame to be processed and the previous frame of the image frame to be processed by the formula A = (Ā_f + Ā_b) / 2;
wherein A is the data association matrix, Ā_f is the target forward prediction matrix, and Ā_b is the target reverse prediction matrix.
In some possible implementations, the channel attention-based feature extraction module is specifically configured to:
convolution feature for each layerEach channel feature ofCharacterization of the channelInputting the channel characteristics into the network based on the channel attention mechanism, and obtaining the channel characteristics of the network output based on the channel attention mechanismBased on the final characteristics of the channel attention description;
Wherein z_c = (1/(W×H)) Σ_{i=1..W} Σ_{j=1..H} u_c(i, j), s = σ(W_2 δ(W_1 z)) and ũ_c = s_c · u_c; u_c(i, j) indicates the value of the c-th channel feature u_c at position (i, j); σ(·) and δ(·) denote the sigmoid function and the ReLU function, respectively; W_1 denotes the weights of the first fully connected layer and W_2 the weights of the second fully connected layer; the resolution of the c-th channel feature is W×H;
The network based on the channel attention mechanism comprises a global pooling layer, a first fully connected layer, a ReLU function, a second fully connected layer and a sigmoid function connected in sequence.
In some possible implementations, the system further includes a training module configured to:
acquiring a training data set;
training a data association network model by adopting a training data set, wherein the data association network model comprises a feature extraction network, a network based on a channel attention mechanism and a bidirectional perception network, and the bidirectional perception network comprises a similarity estimation network;
calculating the overall loss through an overall loss function of the data correlation network model, and iteratively training the data correlation network model according to the overall loss;
L_f represents the forward matching loss from image frame I_{t-1} to image frame I_t: L_f = −Σ(G̃_f ⊙ log A_f) / Σ G̃_f;
wherein A_f is the forward prediction matrix, G̃_f the forward labeled incidence matrix, G̃_b the reverse labeled incidence matrix, A_b the reverse prediction matrix, Ā_f the target forward prediction matrix, and Ā_b the target reverse prediction matrix.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the method embodiment in the embodiment of the present application, which may be referred to in the method embodiment section specifically, and are not described herein again.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: at least one processor 50 (only one shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, the processor 50 implementing the steps in any of the various object tracking method embodiments described above when executing the computer program 52.
The electronic device 5 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The electronic device may include, but is not limited to, a processor 50, a memory 51. Those skilled in the art will appreciate that fig. 5 is merely an example of the electronic device 5, and does not constitute a limitation of the electronic device 5, and may include more or less components than those shown, or combine some of the components, or different components, such as an input-output device, a network access device, etc.
The processor 50 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may in some embodiments be an internal storage unit of the electronic device 5, such as a hard disk or a memory of the electronic device 5. The memory 51 may also be an external storage device of the electronic device 5 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the electronic device 5. The memory 51 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 51 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
An embodiment of the present application further provides an electronic device, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application further provide a computer program product which, when run on an electronic device, causes the electronic device to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing apparatus/terminal apparatus, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash disk, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus, electronic device and method may be implemented in other ways. For example, the above-described apparatus/electronic device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
Claims (10)
1. An end-to-end multi-pedestrian pose tracking method, comprising:
acquiring a video to be detected, and performing pedestrian detection and pose estimation on the video to be detected to obtain target detection frames of the image frames in the video to be detected, wherein each image frame comprises a plurality of the target detection frames;
inputting an image pair and the target detection frame of each image frame in the image pair into a feature extraction network aiming at each image frame to be processed, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multiple layers of convolution features, and the image pair comprises the image frame to be processed and the previous frame of the image frame to be processed;
inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into a channel attention mechanism-based network respectively, and obtaining a channel attention description-based final feature of the channel feature output by the channel attention mechanism-based network, wherein each layer of convolution features comprises a plurality of channel features;
processing a feature matrix of each image frame in the image pair to obtain a feature flux, inputting the feature flux to a similarity estimation network, and obtaining a similarity matrix output by the similarity estimation network, wherein the feature matrix comprises the final features based on the channel attention description;
assigning values to elements of the augmented column in the forward similarity matrix according to the peak-to-side lobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and assigning values to elements of the augmented row in the reverse similarity matrix according to the peak-to-side lobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix;
generating a forward prediction matrix according to the forward similarity matrix, and generating a reverse prediction matrix according to the reverse similarity matrix;
generating a data association matrix between the image frame to be processed and the previous frame of the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix;
and according to the data association matrix, performing identity assignment on each target detection frame of the image frame to be processed based on each target detection frame of the previous frame of the image frame to be processed, so as to obtain a tracking result of the image frame to be processed.
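For illustration only, the identity-assignment step recited above can be sketched as follows; the greedy highest-score-first matching rule, the 0.5 acceptance threshold, and all function names are assumptions, not the claimed implementation:

```python
import numpy as np

def assign_identities(assoc, prev_ids, next_new_id):
    """Greedy identity assignment from a data association matrix.

    assoc[i, j] ~ confidence that detection j in the current frame matches
    detection i in the previous frame (rows: previous frame, columns:
    current frame). The 0.5 acceptance threshold is an assumption.
    """
    ids = [None] * assoc.shape[1]
    used = set()
    # Visit (i, j) pairs from the highest association score downward.
    order = np.dstack(np.unravel_index(
        np.argsort(assoc, axis=None)[::-1], assoc.shape))[0]
    for i, j in order:
        if ids[j] is None and i not in used and assoc[i, j] > 0.5:
            ids[j] = prev_ids[i]          # propagate the existing identity
            used.add(i)
    for j in range(len(ids)):             # unmatched detections get new ids
        if ids[j] is None:
            ids[j] = next_new_id
            next_new_id += 1
    return ids
```

A Hungarian (optimal) assignment could replace the greedy loop; the claim only requires that identities be assigned from the data association matrix.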
2. The method of claim 1, wherein assigning values to elements of the augmented column in the forward similarity matrix based on the peak-to-side lobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and assigning values to elements of the augmented row in the reverse similarity matrix based on the peak-to-side lobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix, comprises:
calculating, by the formula PSR_i = (max(S_i) - mean(S_i)) / σ_i, a peak-to-side lobe ratio for each row of the similarity matrix, wherein PSR_i is the peak-to-side lobe ratio of the ith row of the similarity matrix, mean(S_i) is the mean of the elements in row i, and σ_i is the standard deviation of the elements in row i;
judging whether the peak value sidelobe ratio of the ith row is smaller than a first preset threshold value or not;
when the peak-to-side lobe ratio of the ith row is smaller than the first preset threshold, setting the ith element of the augmented column to 1; when the peak-to-side lobe ratio of the ith row is greater than or equal to the first preset threshold, setting the ith element of the augmented column to 0;
calculating, by the formula PSR_j = (max(S_j) - mean(S_j)) / σ_j, a peak-to-side lobe ratio for each column of the similarity matrix, wherein PSR_j is the peak-to-side lobe ratio of the jth column of the similarity matrix, mean(S_j) is the mean of the elements in column j, and σ_j is the standard deviation of the elements in column j;
judging whether the peak sidelobe ratio of the jth column is smaller than a second preset threshold value or not;
when the peak-to-side lobe ratio of the jth column is smaller than the second preset threshold, setting the jth element of the augmented row to 1; when the peak-to-side lobe ratio of the jth column is greater than or equal to the second preset threshold, setting the jth element of the augmented row to 0;
wherein S is the similarity matrix, S_i is the ith row of the similarity matrix S, and S_j is the jth column of the similarity matrix S.
3. The method of claim 1, wherein generating a forward prediction matrix based on the forward similarity matrix and generating a reverse prediction matrix based on the reverse similarity matrix comprises:
applying a softmax function point-by-point to each row of the forward similarity matrix to generate the forward prediction matrix;
applying the softmax function point-by-point to each column of the reverse similarity matrix to generate the reverse prediction matrix.
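A sketch of claim 3's row-wise and column-wise softmax, assuming the standard numerically stabilized form; function names are illustrative only:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stabilized softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prediction_matrices(S_fwd, S_bwd):
    # Forward prediction matrix: softmax over each row of the forward
    # similarity matrix, so each previous-frame target gets a distribution
    # over current-frame targets plus the augmented "unmatched" column.
    P_fwd = softmax(S_fwd, axis=1)
    # Reverse prediction matrix: softmax over each column of the reverse
    # similarity matrix.
    P_bwd = softmax(S_bwd, axis=0)
    return P_fwd, P_bwd
```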
4. The method of claim 1, wherein generating a data association matrix between the image frame to be processed and the previous frame of the image frame to be processed based on the forward prediction matrix and the reverse prediction matrix comprises:
removing the last column of the forward prediction matrix to obtain a target forward prediction matrix;
removing the last row of the reverse prediction matrix to obtain a target reverse prediction matrix;
generating, by a formula combining the target forward prediction matrix and the target reverse prediction matrix, the data association matrix between the image frame to be processed and the previous frame of the image frame to be processed.
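A sketch of claim 4's construction. The trimming of the augmented column/row follows the claim; the elementwise average used to fuse the two matrices is purely an assumption, since the claim's exact combining formula is not reproduced in this text:

```python
import numpy as np

def data_association(P_fwd, P_bwd):
    # Target forward prediction matrix: drop the augmented last column.
    M_fwd = P_fwd[:, :-1]
    # Target reverse prediction matrix: drop the augmented last row.
    M_bwd = P_bwd[:-1, :]
    # Fuse the two target matrices; the elementwise average here is an
    # assumed stand-in for the claim's unreproduced combining formula.
    return (M_fwd + M_bwd) / 2.0
```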
5. The method of claim 1, wherein separately inputting the respective channel features of each layer of convolution features of each of the hierarchical convolution features to a channel attention mechanism-based network, obtaining a channel attention description-based final feature of the channel features output by the channel attention mechanism-based network, comprises:
for each channel feature u_c of each layer of convolution features u, inputting the channel feature u_c to the channel attention mechanism-based network, and obtaining the final feature, based on the channel attention description, of the channel feature u_c output by the channel attention mechanism-based network;
wherein z_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j) is the result of globally pooling the cth channel feature; u_c(i, j) denotes the value of the cth channel feature u_c at position (i, j); the channel attention weight vector is s = σ(W_2 δ(W_1 z)), and the final feature based on the channel attention description is s_c · u_c; σ and δ respectively denote the sigmoid function and the ReLU layer function; W_1 denotes the set of weights of the first fully-connected layer, and W_2 denotes the set of weights of the second fully-connected layer; the resolution of the cth channel feature is H × W;
The channel attention mechanism-based network comprises a global pooling layer, the first fully-connected layer, the ReLU layer, the second fully-connected layer, and the sigmoid function, connected in sequence.
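The layer sequence recited above matches a squeeze-and-excitation-style channel attention block, which can be sketched as follows; the square weight shapes and the absence of a channel-reduction ratio are simplifying assumptions:

```python
import numpy as np

def channel_attention(u, W1, W2):
    """SE-style channel attention over a feature map u of shape (C, H, W).

    Layer order follows the claim: global pooling -> first fully-connected
    layer -> ReLU -> second fully-connected layer -> sigmoid, then the
    resulting per-channel weights rescale the input channels.
    """
    z = u.mean(axis=(1, 2))                  # global average pooling: (C,)
    h = np.maximum(W1 @ z, 0.0)              # first FC layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))      # second FC layer + sigmoid
    return u * s[:, None, None]              # channel-wise reweighting
```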
6. The method according to any one of claims 1 to 5, further comprising, before acquiring the video to be detected:
acquiring a training data set;
training a data association network model by adopting a training data set, wherein the data association network model comprises the feature extraction network, the channel attention mechanism-based network and a bidirectional perception network, and the bidirectional perception network comprises the similarity estimation network;
calculating the overall loss through the overall loss function of the data association network model, and iteratively training the data association network model according to the overall loss;
wherein one term of the overall loss function represents the loss of the forward matching from image frame I_{t-1} to image frame I_t, and the loss function is expressed in terms of the forward prediction matrix, the forward labeled association matrix, the reverse labeled association matrix, the reverse prediction matrix, the target forward prediction matrix, and the target reverse prediction matrix;
wherein the forward labeled association matrix is the matrix obtained by removing the last row of the final labeled association matrix G_A, the reverse labeled association matrix is the matrix obtained by removing the last column of the final labeled association matrix G_A, and the final labeled association matrix G_A is the matrix obtained by augmenting the data association matrix by one column and one row.
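For illustration, a bidirectional matching loss of the general shape described in claim 6 can be sketched as follows; the cross-entropy form and the normalization are assumptions, as the claim's exact loss expression is not reproduced in this text:

```python
import numpy as np

def match_loss(P, G, eps=1e-12):
    # Cross-entropy between a predicted association matrix P and a binary
    # labeled association matrix G, normalized by the number of labeled
    # matches. This specific form is an assumption for illustration.
    return -(G * np.log(P + eps)).sum() / max(G.sum(), 1.0)

def overall_loss(P_fwd, G_fwd, P_bwd, G_bwd):
    # Overall loss: forward matching loss plus reverse matching loss.
    return match_loss(P_fwd, G_fwd) + match_loss(P_bwd, G_bwd)
```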
7. An end-to-end multi-pedestrian pose tracking apparatus, comprising:
the pedestrian detection and pose estimation module is used for acquiring a video to be detected, and performing pedestrian detection and pose estimation on the video to be detected to obtain target detection frames of all image frames in the video to be detected, wherein each image frame comprises a plurality of the target detection frames;
the multilayer convolution feature extraction module is used for inputting an image pair and the target detection frame of each image frame in the image pair into a feature extraction network aiming at each image frame to be processed, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multilayer convolution features, and the image pair comprises the image frame to be processed and the previous frame of the image frame to be processed;
a feature extraction module based on channel attention, configured to input, to a network based on a channel attention mechanism, respective channel features of each layer of convolution features of each of the hierarchical convolution features, respectively, and obtain a final feature, output by the network based on the channel attention mechanism, of channel feature description based on channel attention, where each layer of the convolution features includes a plurality of the channel features;
a similarity matrix generation module, configured to process a feature matrix of each image frame in the image pair to obtain a feature flux, input the feature flux to a similarity estimation network, and obtain a similarity matrix output by the similarity estimation network, where the feature matrix includes the final feature based on the channel attention description;
the assignment module is used for assigning values to elements of the augmented column in the forward similarity matrix according to the peak-to-side lobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and for assigning values to elements of the augmented row in the reverse similarity matrix according to the peak-to-side lobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix;
the prediction matrix generation module is used for generating a forward prediction matrix according to the forward similarity matrix and generating a reverse prediction matrix according to the reverse similarity matrix;
the data association matrix generation module is used for generating a data association matrix between the image frame to be processed and the previous frame of the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix;
and the identity assignment module is used for performing, according to the data association matrix, identity assignment on each target detection frame of the image frame to be processed based on each target detection frame of the previous frame of the image frame to be processed, so as to obtain a tracking result of the image frame to be processed.
8. The apparatus of claim 7, wherein the assignment module is specifically configured to:
calculating, by the formula PSR_i = (max(S_i) - mean(S_i)) / σ_i, a peak-to-side lobe ratio for each row of the similarity matrix, wherein PSR_i is the peak-to-side lobe ratio of the ith row of the similarity matrix, mean(S_i) is the mean of the elements in row i, and σ_i is the standard deviation of the elements in row i;
judging whether the peak value sidelobe ratio of the ith row is smaller than a first preset threshold value or not;
when the peak-to-side lobe ratio of the ith row is smaller than the first preset threshold, setting the ith element of the augmented column to 1; when the peak-to-side lobe ratio of the ith row is greater than or equal to the first preset threshold, setting the ith element of the augmented column to 0;
calculating, by the formula PSR_j = (max(S_j) - mean(S_j)) / σ_j, a peak-to-side lobe ratio for each column of the similarity matrix, wherein PSR_j is the peak-to-side lobe ratio of the jth column of the similarity matrix, mean(S_j) is the mean of the elements in column j, and σ_j is the standard deviation of the elements in column j;
judging whether the peak sidelobe ratio of the jth column is smaller than a second preset threshold value or not;
when the peak-to-side lobe ratio of the jth column is smaller than the second preset threshold, setting the jth element of the augmented row to 1; when the peak-to-side lobe ratio of the jth column is greater than or equal to the second preset threshold, setting the jth element of the augmented row to 0;
wherein S is the similarity matrix, S_i is the ith row of the similarity matrix S, and S_j is the jth column of the similarity matrix S.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111328115.XA CN113762231B (en) | 2021-11-10 | 2021-11-10 | End-to-end multi-pedestrian posture tracking method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113762231A CN113762231A (en) | 2021-12-07 |
CN113762231B true CN113762231B (en) | 2022-03-22 |
Family
ID=78784822
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111328115.XA Active CN113762231B (en) | 2021-11-10 | 2021-11-10 | End-to-end multi-pedestrian posture tracking method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113762231B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117315521A (en) * | 2022-06-22 | 2023-12-29 | 脸萌有限公司 | Method, apparatus, device and medium for processing video based on contrast learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109978921A (en) * | 2019-04-01 | 2019-07-05 | 南京信息工程大学 | A kind of real-time video target tracking algorithm based on multilayer attention mechanism |
CN111160203A (en) * | 2019-12-23 | 2020-05-15 | 中电科新型智慧城市研究院有限公司 | Loitering and lingering behavior analysis method based on head and shoulder model and IOU tracking |
CN111881840A (en) * | 2020-07-30 | 2020-11-03 | 北京交通大学 | Multi-target tracking method based on graph network |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109214238B (en) * | 2017-06-30 | 2022-06-28 | 阿波罗智能技术(北京)有限公司 | Multi-target tracking method, device, equipment and storage medium |
CN108573496B (en) * | 2018-03-29 | 2020-08-11 | 淮阴工学院 | Multi-target tracking method based on LSTM network and deep reinforcement learning |
CN109919981B (en) * | 2019-03-11 | 2022-08-02 | 南京邮电大学 | Multi-feature fusion multi-target tracking method based on Kalman filtering assistance |
US20210319420A1 (en) * | 2020-04-12 | 2021-10-14 | Shenzhen Malong Technologies Co., Ltd. | Retail system and methods with visual object tracking |
CN112001225B (en) * | 2020-07-06 | 2023-06-23 | 西安电子科技大学 | Online multi-target tracking method, system and application |
CN111882581B (en) * | 2020-07-21 | 2022-10-28 | 青岛科技大学 | Multi-target tracking method for depth feature association |
CN113034543B (en) * | 2021-03-18 | 2022-05-03 | 德清阿尔法创新研究院 | 3D-ReID multi-target tracking method based on local attention mechanism |
Non-Patent Citations (1)
Title |
---|
Automatic Selection of Keyframes from Angiogram Videos; T. Syeda-Mahmood et al.; 2010 20th International Conference on Pattern Recognition; 2010-10-07; pp. 4008-4011 *
Also Published As
Publication number | Publication date |
---|---|
CN113762231A (en) | 2021-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Jia et al. | Segment, magnify and reiterate: Detecting camouflaged objects the hard way | |
CN110414432B (en) | Training method of object recognition model, object recognition method and corresponding device | |
CN112597941B (en) | Face recognition method and device and electronic equipment | |
CN109086811B (en) | Multi-label image classification method and device and electronic equipment | |
Vankadari et al. | Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation | |
CN112861575A (en) | Pedestrian structuring method, device, equipment and storage medium | |
CN110555405B (en) | Target tracking method and device, storage medium and electronic equipment | |
WO2023010758A1 (en) | Action detection method and apparatus, and terminal device and storage medium | |
CN115240121B (en) | Joint modeling method and device for enhancing local features of pedestrians | |
CN111104925B (en) | Image processing method, image processing apparatus, storage medium, and electronic device | |
CN112749666B (en) | Training and action recognition method of action recognition model and related device | |
CN112784750B (en) | Fast video object segmentation method and device based on pixel and region feature matching | |
Luo et al. | Traffic analytics with low-frame-rate videos | |
US20230334893A1 (en) | Method for optimizing human body posture recognition model, device and computer-readable storage medium | |
CN111209774A (en) | Target behavior recognition and display method, device, equipment and readable medium | |
WO2021169642A1 (en) | Video-based eyeball turning determination method and system | |
Wang et al. | Skip-connection convolutional neural network for still image crowd counting | |
CN115375917B (en) | Target edge feature extraction method, device, terminal and storage medium | |
Malav et al. | DHSGAN: An end to end dehazing network for fog and smoke | |
CN114170558B (en) | Method, system, apparatus, medium, and article for video processing | |
CN113762231B (en) | End-to-end multi-pedestrian posture tracking method and device and electronic equipment | |
Li et al. | Robust foreground segmentation based on two effective background models | |
CN116523959A (en) | Moving object detection method and system based on artificial intelligence | |
EP4332910A1 (en) | Behavior detection method, electronic device, and computer readable storage medium | |
CN115810152A (en) | Remote sensing image change detection method and device based on graph convolution and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |