CN113762231B - End-to-end multi-pedestrian posture tracking method and device and electronic equipment - Google Patents


Info

Publication number
CN113762231B
Authority
CN
China
Prior art keywords
matrix
similarity
image frame
similarity matrix
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111328115.XA
Other languages
Chinese (zh)
Other versions
CN113762231A (en)
Inventor
阮威健
何耀彬
史周安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smart City Research Institute Of China Electronics Technology Group Corp
Original Assignee
Smart City Research Institute Of China Electronics Technology Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smart City Research Institute Of China Electronics Technology Group Corp filed Critical Smart City Research Institute Of China Electronics Technology Group Corp
Priority to CN202111328115.XA priority Critical patent/CN113762231B/en
Publication of CN113762231A publication Critical patent/CN113762231A/en
Application granted granted Critical
Publication of CN113762231B publication Critical patent/CN113762231B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an end-to-end multi-pedestrian pose tracking method, apparatus, electronic device and medium, which are used to improve pose tracking accuracy. The method comprises the following steps: obtaining the target detection frames of each frame of the video through pedestrian detection and pose estimation; performing the following procedure for each frame: using a feature extraction network to obtain the hierarchical convolution features of each target detection frame in the current frame and the previous frame; based on the hierarchical convolution features, obtaining final features with a network based on a channel attention mechanism; obtaining a similarity matrix from a feature matrix containing the final features and a similarity estimation network; assigning values to the augmented column and the augmented row to obtain a forward similarity matrix and a reverse similarity matrix; generating a data association matrix between the current frame and the previous frame based on the forward and reverse similarity matrices; and finally performing identity assignment for each target detection frame of the current image frame according to the data association matrix to obtain the tracking result of the current frame.

Description

End-to-end multi-pedestrian posture tracking method and device and electronic equipment
Technical Field
The present application relates to the field of computer vision, and in particular to an end-to-end multi-pedestrian pose tracking method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Multi-target pose tracking aims to estimate the poses of multiple persons in a continuous video and to correctly associate the identities of those persons across different video frames. It has wide application in visual fields such as behavior recognition, multimedia analysis and event detection.
Current pose tracking methods generally detect all pedestrian boxes in a video image, estimate the human key points in each pedestrian box, and finally realize identity association of pedestrian boxes between different video frames by constructing a data association.
However, existing data association is usually realized by three independent steps: feature learning, similarity estimation and identity assignment, with the latter two usually implemented by manually designed models. The generalization ability of such models is low, so effective data association cannot be achieved during pose tracking, and the pose tracking accuracy is therefore low.
Disclosure of Invention
The embodiments of the present application provide an end-to-end multi-pedestrian pose tracking method and apparatus, an electronic device, and a computer-readable storage medium, which can solve the problem of the low accuracy of existing pose tracking.
In a first aspect, an embodiment of the present application provides an end-to-end multi-pedestrian pose tracking method, including:
acquiring a video to be detected, and performing pedestrian detection and pose estimation on the video to be detected to obtain the target detection frames of each image frame in the video to be detected, wherein an image frame comprises a plurality of target detection frames;
for each image frame to be processed, inputting an image pair and the target detection frames of each image frame in the image pair into a feature extraction network, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multiple layers of convolution features, and the image pair comprises the image frame to be processed and the frame preceding it;
inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into a network based on a channel attention mechanism, and obtaining the final feature of that channel feature, described based on channel attention, output by the network based on the channel attention mechanism, wherein each layer of convolution features comprises a plurality of channel features;
processing the feature matrices of the two image frames in the image pair to obtain a feature flux, and inputting the feature flux into a similarity estimation network to obtain the similarity matrix output by the similarity estimation network, wherein a feature matrix comprises the final features described based on channel attention;
assigning values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix;
generating a forward prediction matrix from the forward similarity matrix, and generating a reverse prediction matrix from the reverse similarity matrix;
generating a data association matrix between the image frame to be processed and the frame preceding it according to the forward prediction matrix and the reverse prediction matrix;
and according to the data association matrix, assigning an identity to each target detection frame of the image frame to be processed based on the target detection frames of the frame preceding it, so as to obtain the tracking result of the image frame to be processed.
As can be seen from the above, in the embodiments of the present application the features of each target detection frame are extracted through the feature extraction network and the network based on the channel attention mechanism, a similarity estimation network is adopted to generate the similarity matrix when measuring similarity, and the similarity matrix is further augmented to obtain the forward and reverse similarity matrices. This solves the problems of "old target disappearing" and "new target arriving" during tracking and provides a bidirectionally perceived data association between the target detection frames of the current frame and the previous frame, thereby realizing effective data association and improving pose tracking accuracy.
In some possible implementations of the first aspect, assigning values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix, includes:
calculating the peak-to-sidelobe ratio of each row of the similarity matrix as PSR_i = (max_j S_ij - μ_i) / σ_i, where PSR_i is the peak-to-sidelobe ratio of the ith row of the similarity matrix S, and μ_i and σ_i are the mean and the standard deviation of the elements in row i;
judging whether the peak-to-sidelobe ratio of the ith row is smaller than a first preset threshold;
when the peak-to-sidelobe ratio of the ith row is smaller than the first preset threshold, setting the ith element of the augmented column to 1; when the peak-to-sidelobe ratio of the ith row is greater than or equal to the first preset threshold, setting the ith element of the augmented column to 0;
calculating the peak-to-sidelobe ratio of each column of the similarity matrix as PSR_j = (max_i S_ij - μ_j) / σ_j, where PSR_j is the peak-to-sidelobe ratio of the jth column of the similarity matrix S, and μ_j and σ_j are the mean and the standard deviation of the elements in column j;
judging whether the peak-to-sidelobe ratio of the jth column is smaller than a second preset threshold;
and when the peak-to-sidelobe ratio of the jth column is smaller than the second preset threshold, setting the jth element of the augmented row to 1; when the peak-to-sidelobe ratio of the jth column is greater than or equal to the second preset threshold, setting the jth element of the augmented row to 0.
In some possible implementations of the first aspect, generating a forward prediction matrix from the forward similarity matrix and a reverse prediction matrix from the reverse similarity matrix includes:
applying a softmax function point by point to each row of the forward similarity matrix to generate the forward prediction matrix;
and applying a softmax function point by point to each column of the reverse similarity matrix to generate the reverse prediction matrix.
In some possible implementations of the first aspect, generating a data association matrix between the image frame to be processed and a frame preceding the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix includes:
removing the last column of the forward prediction matrix to obtain a target forward prediction matrix;
removing the last row of the reverse prediction matrix to obtain a target reverse prediction matrix;
and generating the data association matrix between the image frame to be processed and the frame preceding it from the target forward prediction matrix and the target reverse prediction matrix.
In some possible implementations of the first aspect, inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into the network based on the channel attention mechanism, and obtaining the final feature of the channel feature, described based on channel attention, output by that network, includes:
for each channel feature u_c of each layer of convolution features U, inputting the channel feature u_c into the network based on the channel attention mechanism, and obtaining as the network output the final feature x_c of the channel feature, described based on channel attention, where x_c = s_c · u_c, s = σ(W2 δ(W1 z)) and z_c = (1/(H×W)) Σ_i Σ_j u_c(i, j); here u_c(i, j) denotes the value of the cth channel feature u_c at position (i, j), σ and δ respectively denote the sigmoid function and the ReLU layer function, W1 denotes the set of weights of the first fully-connected layer, W2 denotes the set of weights of the second fully-connected layer, and the resolution of the cth channel feature is H×W.
The network based on the channel attention mechanism comprises, connected in sequence, a global pooling layer, a first fully-connected layer, a ReLU layer, a second fully-connected layer and a sigmoid function.
By comparison, the prior art treats every feature channel as identical, so the differences between feature channels cannot be effectively exploited, which reduces the generalization ability of the features in real scenes.
In this implementation, the network based on the channel attention mechanism is used to capture the differences between feature channels, increasing the discriminative power and generalization ability of the features.
In some possible implementations of the first aspect, before acquiring the video to be detected, the method further includes:
acquiring a training data set;
training a data association network model by adopting a training data set, wherein the data association network model comprises a feature extraction network, a network based on a channel attention mechanism and a bidirectional perception network, and the bidirectional perception network comprises a similarity estimation network;
calculating the overall loss through the overall loss function of the data association network model, and iteratively training the data association network model according to the overall loss;
wherein the overall loss function is a linear weighting of a forward matching loss, a reverse matching loss and a bidirectional consistency matching loss, with λ a weight coefficient;
the forward matching loss L_f represents the loss of the forward matching from image frame I_{t-1} to image frame I_t and is defined in terms of the forward prediction matrix and the forward labelled association matrix;
the reverse matching loss L_b represents the loss of the reverse matching from image frame I_t to image frame I_{t-1} and is defined in terms of the reverse prediction matrix and the reverse labelled association matrix;
and the bidirectional consistency matching loss L_c encourages the forward and reverse matching results to be as consistent as possible, and is defined in terms of the target forward prediction matrix and the target reverse prediction matrix.
In this implementation, feature learning, similarity measurement and target identity assignment are integrated into a unified data association network model, and an overall loss function is designed for this model so as to realize end-to-end model training and testing, which further improves the subsequent pose tracking accuracy.
In a second aspect, an embodiment of the present application provides an end-to-end multi-person pose tracking apparatus, including:
the pedestrian detection and pose estimation module is used to acquire a video to be detected, perform pedestrian detection and pose estimation on it, and obtain the target detection frames of each image frame in the video to be detected, wherein an image frame comprises a plurality of target detection frames;
the multi-layer convolution feature extraction module is used to input, for each image frame to be processed, an image pair and the target detection frames of each image frame in the image pair into the feature extraction network, and to obtain the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multiple layers of convolution features, and the image pair comprises the image frame to be processed and the frame preceding it;
the channel-attention-based feature extraction module is used to input each channel feature of each layer of convolution features of each hierarchical convolution feature into the network based on the channel attention mechanism, and to obtain the final feature of that channel feature, described based on channel attention, output by the network, wherein each layer of convolution features comprises a plurality of channel features;
the similarity matrix generation module is used to process the feature matrices of the two image frames in the image pair to obtain a feature flux, input the feature flux into the similarity estimation network, and obtain the similarity matrix output by the similarity estimation network, wherein a feature matrix comprises the final features described based on channel attention;
the assignment module is used to assign values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and to assign values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix;
the prediction matrix generation module is used to generate a forward prediction matrix from the forward similarity matrix and a reverse prediction matrix from the reverse similarity matrix;
the data association matrix generation module is used to generate a data association matrix between the image frame to be processed and the frame preceding it according to the forward prediction matrix and the reverse prediction matrix;
and the identity assignment module is used to assign, according to the data association matrix, an identity to each target detection frame of the image frame to be processed based on the target detection frames of the frame preceding it, so as to obtain the tracking result of the image frame to be processed.
In some possible implementations of the second aspect, the assignment module is specifically configured to:
calculating the peak-to-sidelobe ratio of each row of the similarity matrix as PSR_i = (max_j S_ij - μ_i) / σ_i, where PSR_i is the peak-to-sidelobe ratio of the ith row of the similarity matrix S, and μ_i and σ_i are the mean and the standard deviation of the elements in row i;
judging whether the peak-to-sidelobe ratio of the ith row is smaller than a first preset threshold;
when the peak-to-sidelobe ratio of the ith row is smaller than the first preset threshold, setting the ith element of the augmented column to 1; when the peak-to-sidelobe ratio of the ith row is greater than or equal to the first preset threshold, setting the ith element of the augmented column to 0;
calculating the peak-to-sidelobe ratio of each column of the similarity matrix as PSR_j = (max_i S_ij - μ_j) / σ_j, where PSR_j is the peak-to-sidelobe ratio of the jth column of the similarity matrix S, and μ_j and σ_j are the mean and the standard deviation of the elements in column j;
judging whether the peak-to-sidelobe ratio of the jth column is smaller than a second preset threshold;
and when the peak-to-sidelobe ratio of the jth column is smaller than the second preset threshold, setting the jth element of the augmented row to 1; when the peak-to-sidelobe ratio of the jth column is greater than or equal to the second preset threshold, setting the jth element of the augmented row to 0.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method according to any one of the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program is executed by a processor to implement the method according to any one of the above first aspects.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on an electronic device, causes the electronic device to perform the method of any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of an end-to-end multi-person pose tracking method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a network based on a channel attention mechanism according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a data association model training process according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of an end-to-end multi-pedestrian pose tracking apparatus provided by an embodiment of the present application;
fig. 5 is a block diagram schematically illustrating a structure of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The end-to-end multi-pedestrian pose tracking method provided by the embodiments of the application can be applied to electronic devices such as monitoring equipment, for example a video analysis all-in-one machine. The embodiments of the application place no limit on the specific type of electronic device. For example, in a monitoring scenario, pedestrian pose tracking is realized by the end-to-end multi-pedestrian pose tracking method of the embodiments of the application.
Referring to fig. 1, a flow chart of an end-to-end multi-person pose tracking method provided by an embodiment of the present application is schematically shown, where the method may include the following steps:
s101, acquiring a video to be detected, and performing pedestrian detection and attitude estimation on the video to be detected to obtain target detection frames of each image frame in the video to be detected, wherein the image frame comprises a plurality of target detection frames.
Illustratively, the pedestrian boxes in all image frames of the video to be detected are detected by a pedestrian detector, and the human key points in each pedestrian box are then estimated with a pose estimation network to obtain the target detection frames. That is, a target detection frame comprises a pedestrian box and its human key points.
It is understood that the video to be detected includes a plurality of image frames, each image frame may include a plurality of pedestrians, and each image frame includes at least one object detection frame.
The pedestrian detector can be obtained by training the target detection network fast-RCNN on the existing COCO data set, and the pose estimation network can be obtained by training the pose estimation network HRNet on the pose estimation subset of the existing COCO data set and the pose tracking data set PoseTrack.
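This per-frame detection stage can be sketched as follows. The sketch is only illustrative: `detector`, `pose_estimator` and their call conventions are hypothetical placeholders, not interfaces defined by the patent or by the trained models mentioned above.

```python
# Illustrative sketch only: `detector` and `pose_estimator` are hypothetical callables
# standing in for the pre-trained pedestrian detector and pose estimation network.
def detect_targets(video_frames, detector, pose_estimator):
    """Return, per image frame, its target detection frames (pedestrian box + key points)."""
    detections = []
    for frame in video_frames:
        boxes = detector(frame)                                    # pedestrian boxes in this frame
        keypoints = [pose_estimator(frame, box) for box in boxes]  # human key points per box
        detections.append(list(zip(boxes, keypoints)))             # target detection frames
    return detections
```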
Step S102, for each image frame to be processed, inputting an image pair and the target detection frames of each image frame in the image pair into a feature extraction network, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multiple layers of convolution features, and the image pair comprises the image frame to be processed and the frame preceding it.
After the target detection frame of each image frame is detected, in order to realize data association between frames and realize pose tracking, a data association matrix of two adjacent frames needs to be constructed.
The data association process can be divided into three processes of feature learning, similarity measurement and identity allocation. In the embodiment of the present application, the feature learning process may include step S102 and step S103.
In a specific application, each image frame is processed iteratively: the data association matrix between the image frame and its preceding frame is obtained, and identity assignment is then performed for each target detection frame of that image frame according to the data association matrix. Cycling through these steps yields the pose tracking result of the video to be detected.
For each image frame, the frame preceding it is first selected to form an image pair, i.e. the image pair comprises the image frame I_t and the previous image frame I_{t-1}.
Then, the image pair and the target detection frames on each image frame of the pair are input into the feature extraction network, and the hierarchical convolution feature of each target detection frame output by the feature extraction network is obtained. Since each image frame includes a plurality of target detection frames, the output of the feature extraction network includes the hierarchical convolution features of each target detection frame of image frame I_t and of image frame I_{t-1}. Each target detection frame corresponds to one hierarchical convolution feature. The hierarchical convolution features are extracted from multiple convolution layers and therefore comprise multiple layers of convolution features.
In the embodiments of the application, a parameter-shared dual-stream convolutional network can be used as the feature extraction network; it comprises two branches, one for extracting the features of image frame I_t and the other for extracting the features of image frame I_{t-1}.
Illustratively, a deep convolutional network constructed on the basis of VGG16 is used as the feature extraction network: its first 34 layers adopt the original VGG16, and several groups of "convolution layer-BN layer-ReLU layer" are spliced behind the original VGG16. Some of the architectural parameters of the deep convolutional network are shown in Table 1.
Among them, the BN layer and the ReLU layer are not presented in table 1 below for simplicity.
TABLE 1
Layer  In-size   In-ch  Out-size  Out-ch  Stride  Kernel  Pad  R-sz  R-ch
15     225x225   256    225x225   256     -       -       -    1x1   60
22     113x113   512    113x113   512     -       -       -    1x1   80
34     56x56     1024   56x56     1024    -       -       -    1x1   100
35     56x56     1024   56x56     256     1       1x1     0    1x1   -
38     56x56     512    28x28     512     2       3x3     1    1x1   80
41     28x28     512    28x28     128     1       1x1     0    1x1   -
44     14x14     128    14x14     256     2       3x3     1    1x1   60
47     14x14     256    14x14     128     1       1x1     0    1x1   -
50     14x14     128    12x12     256     2       3x3     1    1x1   50
53     12x12     256    12x12     128     1       1x1     0    1x1   -
56     14x14     128    10x10     256     2       3x3     1    1x1   40
59     10x10     256    10x10     128     1       1x1     0    1x1   -
62     10x10     128    5x5       256     2       3x3     1    1x1   30
65     5x5       256    5x5       128     1       1x1     0    1x1   -
68     5x5       128    5x5       256     2       3x3     0    1x1   20
By way of example and not limitation, the convolution features of the 15th, 22nd, 34th, 38th, 44th, 50th, 56th, 62nd and 68th layers of the deep convolutional network of Table 1 can be selected as the basic feature representations of the target detection box. That is, the hierarchical convolution feature of each target detection box includes the convolution features extracted from the 15th, 22nd, 34th, 38th, 44th, 50th, 56th, 62nd and 68th layers.
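A minimal sketch of extracting such multi-layer per-box features is given below. It assumes PyTorch/torchvision, a stock VGG16 backbone with illustrative layer indices and 1x1 RoI pooling; it is not the exact Table 1 architecture (the extra convolution/BN/ReLU groups and the per-layer channel reductions are omitted).

```python
# Sketch under stated assumptions (PyTorch/torchvision; illustrative layer indices,
# not the Table 1 architecture). Pools each selected layer's feature map over every
# detection box to obtain that box's hierarchical convolution features.
import torch
import torchvision

class HierarchicalFeatureExtractor(torch.nn.Module):
    def __init__(self, layer_ids=(15, 22, 29)):   # hypothetical VGG16 layer indices
        super().__init__()
        self.backbone = torchvision.models.vgg16(weights=None).features
        self.layer_ids = set(layer_ids)

    def forward(self, image, boxes):
        # image: (1, 3, H, W); boxes: (K, 4) as (x1, y1, x2, y2) in image coordinates
        feats, x = [], image
        for idx, layer in enumerate(self.backbone):
            x = layer(x)
            if idx in self.layer_ids:
                scale = x.shape[-1] / image.shape[-1]                # spatial scale of this layer
                pooled = torchvision.ops.roi_align(
                    x, [boxes], output_size=1, spatial_scale=scale)  # (K, C, 1, 1)
                feats.append(pooled.flatten(1))                      # one (K, C) feature per layer
        return feats   # list of per-layer, per-box convolution features
```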
Step S103, inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into the network based on the channel attention mechanism, and obtaining the final feature of the channel feature, described based on channel attention, output by that network, wherein each layer of convolution features comprises a plurality of channel features.
It will be appreciated that each layer of the convolutional network includes a plurality of channels, and thus each layer of the convolutional features includes a plurality of channel features. Namely, for any target detection frame, the corresponding hierarchical convolution feature comprises multiple layers of convolution features, and each layer of convolution feature comprises multiple channel features.
In a specific application, the channel feature may be used as an input of the network based on the channel attention mechanism, and an output of the network based on the channel attention mechanism is a final feature of the channel feature described based on the channel attention mechanism.
In some embodiments, referring to the schematic architecture of the network based on the channel attention mechanism shown in fig. 2, the network comprises, connected in sequence, a global pooling layer, a first fully connected layer (FC), a ReLU layer, a second fully connected layer (FC) and a sigmoid function. Its input is a channel feature of resolution H×W, and its output is a feature map of resolution H×W described based on channel attention.
Based on the network shown in fig. 2, for each channel feature u_c of each layer of convolution features U, the channel feature u_c is input into the network based on the channel attention mechanism, and the final feature x_c of the channel feature u_c, described based on channel attention, is obtained as the network output.
In the network shown in fig. 2, the channel attention descriptor is constructed in the embodiments of the application by global average pooling: z_c = (1/(H×W)) Σ_i Σ_j u_c(i, j), where u_c(i, j) denotes the value of the cth channel feature u_c at position (i, j).
In addition, in order to further capture the differences between feature channels, after the global pooling layer the embodiments of the application construct a valve (gating) mechanism based on the sigmoid function using two fully-connected layers and one ReLU layer, generating the final channel attention descriptor s = σ(W2 δ(W1 z)), where σ and δ respectively denote the sigmoid function and the ReLU layer function, W1 denotes the set of weights of the first fully-connected layer, which down-samples the feature with a certain reduction ratio, and W2 denotes the set of weights of the second fully-connected layer, which implements the up-sampling; the resolution of the cth channel feature is H×W. Then, for the cth channel, the final feature described based on channel attention is x_c = s_c · u_c.
it is worth pointing out that, in the prior art, each feature channel is considered as the same, so that the difference between the feature channels cannot be effectively utilized, and the generalization capability of the features in a real scene is reduced. In the embodiment of the application, the network based on the attention mechanism is used for capturing the difference between the characteristic channels, so that the discrimination and generalization capability of the characteristics are increased.
Step S104, processing the feature matrices of the two image frames in the image pair to obtain a feature flux, and inputting the feature flux into the similarity estimation network to obtain the similarity matrix output by the similarity estimation network, wherein a feature matrix comprises the final features described based on channel attention.
In a specific application, for image frame I_t, each channel feature of each layer of convolution features of every target detection frame on the frame is input into the network based on the channel attention mechanism in the manner of step S103, obtaining the final feature of each target detection frame described based on channel attention; these final features then form the feature matrix M_t of image frame I_t. Similarly, the final features described based on channel attention form the feature matrix M_{t-1} of image frame I_{t-1}.
After the feature matrices of image frames I_t and I_{t-1} are obtained, they can be spliced along the feature channel into a tensor: the rows of M_t and M_{t-1} are arranged in an N×N layout and concatenated along the depth of the tensor to obtain the feature flux, where N is the maximum number of pedestrians in any image frame of the video to be detected.
The feature flux is then input into the similarity estimation network, and the similarity matrix S output by the similarity estimation network is obtained. Illustratively, the similarity estimation network comprises five 1×1 convolution layers whose numbers of feature channels are 512, 256, 128, 64 and 1, respectively. The similarity estimation network thus implements a gradual dimensionality reduction along the depth and generates a similarity matrix S of size N×N.
It is worth pointing out that, for similarity estimation, the embodiments of the application differ from existing manual calculation methods, and the model generalization ability can be improved.
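The sketch below illustrates building the feature flux and the 1×1-convolution similarity estimation network. It assumes that the per-box final features of each frame have been flattened into d-dimensional rows of M_{t-1} and M_t, with each frame zero-padded to N boxes; these layout details are assumptions rather than specifications from the patent.

```python
# Sketch under stated assumptions (PyTorch; per-box features flattened to d dimensions,
# frames padded to the maximum pedestrian count N).
import torch

class SimilarityEstimator(torch.nn.Module):
    def __init__(self, d, channels=(512, 256, 128, 64, 1)):
        super().__init__()
        layers, in_ch = [], 2 * d
        for out_ch in channels:
            layers += [torch.nn.Conv2d(in_ch, out_ch, kernel_size=1), torch.nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.net = torch.nn.Sequential(*layers[:-1])   # drop the ReLU after the last 1x1 conv

    def forward(self, m_prev, m_cur):
        # m_prev, m_cur: (N, d) feature matrices M_{t-1} and M_t
        N, d = m_prev.shape
        flux = torch.cat([m_prev[:, None, :].expand(N, N, d),   # row i: box i of frame I_{t-1}
                          m_cur[None, :, :].expand(N, N, d)],   # column j: box j of frame I_t
                         dim=-1)                                 # feature flux, (N, N, 2d)
        flux = flux.permute(2, 0, 1).unsqueeze(0)                # (1, 2d, N, N)
        return self.net(flux).squeeze(0).squeeze(0)              # similarity matrix S, (N, N)
```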
Step S105, assigning values to the elements of the augmented column in the forward similarity matrix according to the peak-to-sidelobe ratio of each row in the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of the augmented row in the reverse similarity matrix according to the peak-to-sidelobe ratio of each column in the similarity matrix to obtain the reverse similarity matrix.
It should be noted that the similarity estimation network estimates a similarity matrix S for the N×N pedestrian pairs of the two image frames. If this similarity is not analysed further, the situations of "old target disappearing" and "new target appearing" cannot be taken into account.
In the embodiments of the application, in order to handle these typical situations of "old target disappearing" and "new target arriving" during target tracking, the similarity matrix is further augmented to obtain a forward similarity matrix and a reverse similarity matrix.
In order to construct the forward similarity matrix, after the similarity matrix is obtained, a column is adaptively appended to it, giving the forward similarity matrix to be assigned, in which the elements of the newly added column (called the augmented column) have not yet been assigned values. The elements of the augmented column are then assigned according to the peak-to-sidelobe ratio of each row of the similarity matrix, yielding the assigned forward similarity matrix.
In some embodiments, the peak-to-sidelobe ratio of each row of the similarity matrix is calculated as PSR_i = (max_j S_ij - μ_i) / σ_i, where PSR_i is the peak-to-sidelobe ratio of the ith row of the similarity matrix and μ_i and σ_i are the mean and the standard deviation of the elements in row i. It is then judged whether the peak-to-sidelobe ratio of the ith row is smaller than a first preset threshold.
Finally, when the peak-to-sidelobe ratio of the ith row is smaller than the first preset threshold, the ith element of the augmented column is set to 1; when it is greater than or equal to the first preset threshold, the ith element of the augmented column is set to 0.
Similarly, in order to construct the reverse similarity matrix, after the similarity matrix is obtained, a row is adaptively appended to it, giving the reverse similarity matrix to be assigned, in which the elements of the newly added row (called the augmented row) have not yet been assigned values. The elements of the augmented row are then assigned according to the peak-to-sidelobe ratio of each column of the similarity matrix, yielding the assigned reverse similarity matrix.
In some embodiments, the peak-to-sidelobe ratio of each column of the similarity matrix is calculated as PSR_j = (max_i S_ij - μ_j) / σ_j, where PSR_j is the peak-to-sidelobe ratio of the jth column of the similarity matrix and μ_j and σ_j are the mean and the standard deviation of the elements in column j. It is then judged whether the peak-to-sidelobe ratio of the jth column is smaller than a second preset threshold; when it is smaller than the second preset threshold, the jth element of the augmented row is set to 1, and when it is greater than or equal to the second preset threshold, the jth element of the augmented row is set to 0.
And S106, generating a forward prediction matrix according to the forward similarity matrix, and generating a reverse prediction matrix according to the reverse similarity matrix.
In the embodiments of the application, considering the requirements of the loss function design, after the forward similarity matrix and the reverse similarity matrix are obtained, a softmax function is applied point by point to each row of the forward similarity matrix to generate the forward prediction matrix A_f; similarly, a softmax function is applied point by point to each column of the reverse similarity matrix to generate the reverse prediction matrix A_b.
The forward prediction matrix A_f can encode the matching of targets from image frame I_{t-1} to image frame I_t, including the case where a target leaves the picture in image frame I_t, i.e. "old target disappearing". Similarly, the reverse prediction matrix A_b can encode the matching of targets from image frame I_t to image frame I_{t-1}, including the case where a new target appears in image frame I_t that is not present in image frame I_{t-1}, i.e. "new target arriving".
Step S107, generating a data association matrix between the image frame to be processed and the frame preceding it according to the forward prediction matrix and the reverse prediction matrix.
In some embodiments, after the forward prediction matrix A_f and the reverse prediction matrix A_b are obtained, the last column of the forward prediction matrix A_f may be removed to obtain the target forward prediction matrix, and the last row of the reverse prediction matrix A_b may be removed to obtain the target reverse prediction matrix. The data association matrix between the image frame to be processed I_t and its preceding frame I_{t-1} is then generated from the target forward prediction matrix and the target reverse prediction matrix.
Step S108, according to the data association matrix, assigning an identity to each target detection frame of the image frame to be processed based on the target detection frames of the frame preceding it, so as to obtain the tracking result of the image frame to be processed.
In a specific application, after the data association matrix between image frame I_t and image frame I_{t-1} is obtained, the similarity between every pedestrian in the previous frame and every pedestrian in the current frame is available, and identity assignment can then be performed based on the data association matrix.
For example, suppose the previous frame contains 10 pedestrian boxes, each with a corresponding identity ID. Because the data association matrix represents the similarity between all pedestrians in the previous frame and all pedestrians in the current frame, for any pedestrian box D in the previous frame, the pedestrian box E in the current frame that has the highest similarity with D and to which an ID is to be assigned is regarded as the same pedestrian; that is, pedestrian box E in the current frame is assigned the same ID as pedestrian box D in the previous frame.
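A simplified greedy sketch of this assignment is given below; handling of ties, of already-assigned boxes and of unmatched current-frame boxes (which would receive new IDs) is omitted here and is not prescribed by the example above.

```python
# Simplified sketch: each previous-frame box passes its ID to its most similar
# current-frame box (ties, conflicts and brand-new targets are not handled here).
def assign_identities(D, prev_ids):
    # D: (N, N) data association matrix; prev_ids: IDs of the previous-frame detection boxes
    cur_ids = {}
    for i, pid in enumerate(prev_ids):
        j = int(D[i].argmax())     # current-frame box most similar to previous-frame box i
        cur_ids[j] = pid           # that box inherits the same identity
    return cur_ids                 # remaining current-frame boxes would get fresh IDs
```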
By executing steps S102 to S108 for each image frame in the video to be detected, the motion trajectories and poses of all pedestrians in the whole video to be detected are obtained, i.e. the pose tracking result.
As can be seen from the above, in the embodiments of the present application the features of each target detection frame are extracted through the feature extraction network and the network based on the channel attention mechanism, a similarity estimation network is adopted to generate the similarity matrix when measuring similarity, and the similarity matrix is further augmented to obtain the forward and reverse similarity matrices. This solves the problems of "old target disappearing" and "new target arriving" during tracking and provides a bidirectionally perceived data association between the target detection frames of the current frame and the previous frame, thereby realizing effective data association and improving pose tracking accuracy.
Based on any of the above embodiments, referring to a flow diagram of a data association model training process provided in the embodiment of the present application shown in fig. 3, before acquiring a video to be detected, the method further includes the following steps:
and S301, acquiring a training data set.
And S302, training the data association network model by adopting a training data set.
The data association network model in the embodiments of the application is specifically a channel-attention bidirectional perception network; that is, the data association network model includes a feature extraction network, a network based on a channel attention mechanism, and a bidirectional perception network, and the bidirectional perception network includes a similarity estimation network.
In other words, in the embodiment of the application, feature learning, similarity measurement and target identity allocation are integrated into a unified data association network model, and compared with the existing implementation that three independent steps of feature learning, similarity estimation and identity allocation are adopted, the method improves the coupling between modules in the testing process of the network, and implements end-to-end network learning and testing.
During training, a labelled association matrix needs to be constructed.
First, for ease of understanding, the maximum number N of target detection frames in an image frame is set to 6. During training, the data association matrix G is a binary matrix encoding the matching relations of all pedestrians in a pair of adjacent image frames (I_t, I_{t-1}).
Specifically, all pedestrians in image frame I_{t-1} are first randomly ordered to obtain ID-1, ID-2, ID-3, ..., ID-N. Similarly, all pedestrians in image frame I_t are also randomly ordered.
Suppose the pedestrian with ID-3 in image frame I_{t-1} corresponds, after moving, to the pedestrian with ID-2 in image frame I_t. Then the element in row 3, column 2 of the data association matrix G is set to "1"; otherwise it is 0. In order to indicate the situations of "old target leaving the picture" and "new target entering the picture" in the current frame, the embodiments of the application expand the matrix G by one row and one column to obtain the augmented matrix, i.e. the final labelled association matrix.
Suppose the pedestrian with ID-2 in image frame I_{t-1} disappears in image frame I_t, i.e. image frame I_t does not contain that pedestrian; then, in the augmented matrix, the element in row 2 of the last column is set to "1", otherwise it is 0.
Suppose the pedestrian with ID-4 in image frame I_t is a newly appearing target, i.e. it does not appear in image frame I_{t-1}; then, in the augmented matrix, the element in column 4 of the last row is set to "1", otherwise it is 0.
Based on this principle, the complete augmented labelled association matrix is obtained. Removing its last row gives the forward labelled association matrix, which represents the forward matching from the previous frame to the current frame; removing its last column gives the reverse labelled association matrix, which represents the reverse matching from the current frame to the previous frame.
The labelled association matrix, the forward labelled association matrix and the reverse labelled association matrix obtained in this way are used to calculate the loss values during training.
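The construction of the augmented labelled association matrix and its forward/reverse versions can be sketched as follows; the `matches` dictionary (previous-frame index to current-frame index) is an assumed input representation of the ground-truth correspondences described above.

```python
# Sketch: build the augmented labelled association matrix and its forward/reverse versions.
import torch

def build_label_matrices(matches, N):
    # matches: dict {previous-frame box index: current-frame box index} (assumed representation)
    G = torch.zeros(N + 1, N + 1)
    for i, j in matches.items():
        G[i, j] = 1.0                              # same pedestrian in both frames
    for i in range(N):
        if i not in matches:
            G[i, N] = 1.0                          # old target left the picture
    matched_cur = set(matches.values())
    for j in range(N):
        if j not in matched_cur:
            G[N, j] = 1.0                          # new target entered the picture
    G_forward = G[:-1, :]                          # forward labelled association matrix, (N, N+1)
    G_reverse = G[:, :-1]                          # reverse labelled association matrix, (N+1, N)
    return G_forward, G_reverse
```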
Step S303, calculating the overall loss through the overall loss function of the data association network model, and iteratively training the data association network model according to the overall loss.
The overall loss function is a linear weighting of the sub-losses, where λ is a weight coefficient.
L_f represents the forward matching loss from image frame I_{t-1} to image frame I_t, defined in terms of the forward prediction matrix A_f and the forward labelled association matrix; the smaller it is during training, the better.
L_b represents the reverse matching loss from image frame I_t to image frame I_{t-1}, defined in terms of the reverse prediction matrix A_b and the reverse labelled association matrix; the smaller it is during training, the better.
L_c is the bidirectional consistency matching loss, which encourages the forward and reverse matching results to be as consistent as possible; it is defined in terms of the target forward prediction matrix and the target reverse prediction matrix.
In order to achieve end-to-end model training and model testing, four sub-loss functions are designed, and then the overall loss function of the network is obtained in a linear weighting mode.
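A simplified sketch of such a linearly weighted loss is given below. Writing the forward and reverse matching losses as cross-entropy terms and the bidirectional consistency loss as an L1 term, as well as the weight value, are assumptions rather than the patent's exact formulas.

```python
# Sketch with assumed loss forms (cross-entropy matching terms, L1 consistency term).
import torch

def overall_loss(A_f, A_b, A_f_target, A_b_target, G_forward, G_reverse, lam=0.5):
    eps = 1e-8
    L_f = -(G_forward * torch.log(A_f + eps)).sum() / G_forward.sum()   # forward matching loss
    L_b = -(G_reverse * torch.log(A_b + eps)).sum() / G_reverse.sum()   # reverse matching loss
    L_c = (A_f_target - A_b_target).abs().mean()                        # bidirectional consistency loss
    return L_f + L_b + lam * L_c
```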
The training steps of the data association model are similar to steps S101 to S108 above. For example, after detecting the pedestrian boxes and human key points of each video in the training data set with the pedestrian detector and the pose estimation network, for each image frame of a video the image pair and the target detection frames of each image frame are input into the feature extraction network, and the feature matrix of each image frame is obtained through the network based on the channel attention mechanism; a similarity matrix is then obtained with the similarity estimation network based on the feature matrices; finally, the data association matrix is obtained based on the similarity matrix, the overall loss value of the network is calculated from the network output and the overall loss function, the network parameters are updated iteratively according to the overall loss value, and the trained data association model is obtained when the overall loss value becomes stable.
Therefore, the embodiment of the application integrates the feature learning, the similarity measurement and the target identity distribution into a unified data association network model, and designs an overall loss function for the data association network model to realize end-to-end model training and model testing, thereby further improving the subsequent attitude tracking accuracy.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Corresponding to the end-to-end multi-pedestrian pose tracking method described in the above embodiments, fig. 4 shows a structural block diagram of the end-to-end multi-pedestrian pose tracking apparatus provided in the embodiments of the present application; for convenience of explanation, only the parts related to the embodiments of the present application are shown.
Referring to fig. 4, the apparatus includes:
the pedestrian detection and posture estimation module 41 is configured to acquire a video to be detected, perform pedestrian detection and posture estimation on the video to be detected, and acquire target detection frames of each image frame in the video to be detected, where the image frame includes a plurality of target detection frames;
the multilayer convolution feature extraction module 42 is configured to, for each image frame to be processed, input the image pair and the target detection frame of each image frame in the image pair to the feature extraction network, and obtain a hierarchical convolution feature of each target detection frame output by the feature extraction network, where the hierarchical convolution feature includes multilayer convolution features, and the image pair includes the image frame to be processed and a previous frame of the image frame to be processed;
a feature extraction module 43 based on channel attention, configured to input each channel feature of each layer of convolution features of each hierarchical convolution feature to a network based on a channel attention mechanism, respectively, and obtain a final feature based on channel attention description of the channel feature output by the network based on the channel attention mechanism, where each layer of convolution features includes a plurality of channel features;
the similarity matrix generation module 44 is configured to process a feature matrix of each image frame in the image pair to obtain a feature flux, input the feature flux to the similarity estimation network, and obtain a similarity matrix output by the similarity estimation network, where the feature matrix includes a final feature based on channel attention description;
an assignment module 45, configured to assign values to the elements of an augmented column appended to the similarity matrix according to the peak-to-sidelobe ratio of each row of the similarity matrix to obtain a forward similarity matrix, and assign values to the elements of an augmented row appended to the similarity matrix according to the peak-to-sidelobe ratio of each column of the similarity matrix to obtain a reverse similarity matrix;
a prediction matrix generation module 46, configured to generate a forward prediction matrix according to the forward similarity matrix, and generate a backward prediction matrix according to the backward similarity matrix;
a data correlation matrix generation module 47, configured to generate a data correlation matrix between the image frame to be processed and a previous frame of the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix;
and the identity distribution module 48 is configured to perform identity distribution on each target detection frame of the image frame to be processed based on each target detection frame of a previous frame of the image frame to be processed according to the data association matrix, so as to obtain a tracking result of the image frame to be processed.
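As an illustration of the identity assignment step, the sketch below uses a simple greedy rule over the data association matrix; the patent does not prescribe this particular rule, and the threshold and function names are assumptions.

    import numpy as np

    def assign_identities(A, prev_ids, next_id, score_thresh=0.5):
        """A: data association matrix, shape (N_prev, N_cur); prev_ids: identity of
        each previous-frame target; next_id: first unused identity number."""
        cur_ids = []
        used = set()
        for j in range(A.shape[1]):                    # one current-frame target per column
            i = int(np.argmax(A[:, j]))
            if A[i, j] >= score_thresh and i not in used:
                cur_ids.append(prev_ids[i])            # inherit identity from the matched target
                used.add(i)
            else:
                cur_ids.append(next_id)                # unmatched target gets a new identity
                next_id += 1
        return cur_ids, next_id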
In some possible implementations, the assignment module is specifically configured to:
calculating the peak-to-sidelobe ratio of each row of the similarity matrix by the formula PSR_i = (max(S_i) - μ_i) / σ_i, wherein PSR_i is the peak-to-sidelobe ratio of the i-th row of the similarity matrix, max(S_i) is the peak value of the i-th row, and μ_i and σ_i are the mean and the standard deviation of the elements in the i-th row;
judging whether the peak-to-sidelobe ratio of the i-th row is smaller than a first preset threshold;
when the peak-to-sidelobe ratio of the i-th row is smaller than the first preset threshold, setting the i-th element of the augmented column to 1; when the peak-to-sidelobe ratio of the i-th row is greater than or equal to the first preset threshold, setting the i-th element of the augmented column to 0;
calculating the peak-to-sidelobe ratio of each column of the similarity matrix by the formula PSR_j = (max(S_j) - μ_j) / σ_j, wherein PSR_j is the peak-to-sidelobe ratio of the j-th column of the similarity matrix, max(S_j) is the peak value of the j-th column, and μ_j and σ_j are the mean and the standard deviation of the elements in the j-th column;
judging whether the peak-to-sidelobe ratio of the j-th column is smaller than a second preset threshold;
when the peak-to-sidelobe ratio of the j-th column is smaller than the second preset threshold, setting the j-th element of the augmented row to 1; and when the peak-to-sidelobe ratio of the j-th column is greater than or equal to the second preset threshold, setting the j-th element of the augmented row to 0.
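The following sketch illustrates the augmentation step; the peak-to-sidelobe ratio is written in its conventional form (peak minus mean, divided by standard deviation), and the threshold values are assumptions since the patent leaves them as preset parameters.

    import numpy as np

    def augment_similarity(S, row_thresh=4.0, col_thresh=4.0):
        """S: similarity matrix between previous-frame and current-frame targets.
        Returns the forward similarity matrix (extra column appended, driven by row
        PSRs) and the reverse similarity matrix (extra row appended, driven by
        column PSRs)."""
        def psr(v):
            return (v.max() - v.mean()) / (v.std() + 1e-12)

        row_psr = np.array([psr(r) for r in S])           # one PSR per row
        col_psr = np.array([psr(c) for c in S.T])         # one PSR per column

        aug_col = (row_psr < row_thresh).astype(S.dtype)  # 1 if the row has no confident match
        aug_row = (col_psr < col_thresh).astype(S.dtype)  # 1 if the column has no confident match

        forward = np.hstack([S, aug_col[:, None]])        # shape (N_prev, N_cur + 1)
        reverse = np.vstack([S, aug_row[None, :]])        # shape (N_prev + 1, N_cur)
        return forward, reverse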
In some possible implementations, the prediction matrix generation module is specifically configured to: apply a softmax function to each row of the forward similarity matrix to generate the forward prediction matrix; and apply a softmax function to each column of the reverse similarity matrix to generate the reverse prediction matrix.
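A minimal sketch of the row-wise and column-wise softmax described above (the function names are illustrative only):

    import numpy as np

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stable softmax
        return e / e.sum(axis=axis, keepdims=True)

    def prediction_matrices(forward_sim, reverse_sim):
        P_f = softmax(forward_sim, axis=1)   # each previous-frame target distributes over current-frame targets (+ augmented column)
        P_b = softmax(reverse_sim, axis=0)   # each current-frame target distributes over previous-frame targets (+ augmented row)
        return P_f, P_b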
In some possible implementations, the data association matrix generation module is specifically configured to: removing the last column of the forward prediction matrix to obtain a target forward prediction matrix;
removing the last row of the reverse prediction matrix to obtain a target reverse prediction matrix;
generating, according to a formula over the target forward prediction matrix and the target reverse prediction matrix, the data association matrix between the image frame to be processed and the previous frame of the image frame to be processed, wherein A denotes the data association matrix, P_f' denotes the target forward prediction matrix, and P_b' denotes the target reverse prediction matrix.
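The sketch below illustrates how the two target prediction matrices could be fused into a data association matrix; the element-wise average is an assumption made only for illustration, since the actual fusion formula appears as an equation image in the original text.

    import numpy as np

    def data_association_matrix(P_f, P_b):
        """P_f: forward prediction matrix with augmented column, shape (N_prev, N_cur + 1).
        P_b: reverse prediction matrix with augmented row, shape (N_prev + 1, N_cur)."""
        P_f_t = P_f[:, :-1]           # target forward prediction matrix (augmented column removed)
        P_b_t = P_b[:-1, :]           # target reverse prediction matrix (augmented row removed)
        return 0.5 * (P_f_t + P_b_t)  # assumed fusion; both operands have shape (N_prev, N_cur)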
In some possible implementations, the channel attention-based feature extraction module is specifically configured to:
for each channel feature F_c of each layer of convolution features F, input the channel feature F_c into the network based on the channel attention mechanism, and obtain the final feature F_c' of the channel feature F_c based on channel attention description output by the network based on the channel attention mechanism, wherein
    z_c = (1 / (W × H)) · Σ_{i=1..W} Σ_{j=1..H} F_c(i, j),
    s = σ(W_2 · δ(W_1 · z)),
    F_c' = s_c · F_c,
F_c(i, j) denotes the value of the c-th channel feature F_c at position (i, j); σ and δ denote the sigmoid function and the ReLU function, respectively; W_1 denotes the set of weights of the first fully-connected layer, and W_2 denotes the set of weights of the second fully-connected layer; the resolution of the c-th channel feature is W × H. The network based on the channel attention mechanism comprises a global pooling layer, the first fully-connected layer, the ReLU function, the second fully-connected layer and the sigmoid function, which are connected in sequence.
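The channel attention network described above follows a squeeze-and-excitation style structure (global pooling, two fully-connected layers with a ReLU in between, then a sigmoid for channel-wise reweighting). The PyTorch sketch below is illustrative only; the reduction ratio r is an assumption not stated in the text.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels, r=16):
            super().__init__()
            hidden = max(channels // r, 1)
            self.pool = nn.AdaptiveAvgPool2d(1)       # global average pooling
            self.fc1 = nn.Linear(channels, hidden)    # first fully-connected layer (W_1)
            self.relu = nn.ReLU(inplace=True)
            self.fc2 = nn.Linear(hidden, channels)    # second fully-connected layer (W_2)
            self.sigmoid = nn.Sigmoid()

        def forward(self, x):                         # x: (B, C, H, W) convolution features
            b, c, _, _ = x.shape
            z = self.pool(x).view(b, c)               # per-channel descriptor z_c
            s = self.sigmoid(self.fc2(self.relu(self.fc1(z))))  # channel attention weights s_c
            return x * s.view(b, c, 1, 1)             # final features, reweighted channel by channel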
In some possible implementations, the system further includes a training module configured to:
acquiring a training data set;
training a data association network model by adopting a training data set, wherein the data association network model comprises a feature extraction network, a network based on a channel attention mechanism and a bidirectional perception network, and the bidirectional perception network comprises a similarity estimation network;
calculating the overall loss through the overall loss function of the data association network model, and iteratively training the data association network model according to the overall loss;
wherein the overall loss function is a linear weighted combination of the sub-losses, with λ denoting the weight coefficients; L_f denotes the forward matching loss from image frame I_{t-1} to image frame I_t; L_b denotes the reverse matching loss from image frame I_t to image frame I_{t-1}; L_c is the bidirectional consistency matching loss; P_f is the forward prediction matrix, G_f is the forward labeled association matrix, G_b is the reverse labeled association matrix, P_b is the reverse prediction matrix, P_f' is the target forward prediction matrix, and P_b' is the target reverse prediction matrix.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the method embodiment in the embodiment of the present application, which may be referred to in the method embodiment section specifically, and are not described herein again.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: at least one processor 50 (only one shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, the processor 50 implementing the steps in any of the various object tracking method embodiments described above when executing the computer program 52.
The electronic device 5 may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. The electronic device may include, but is not limited to, a processor 50 and a memory 51. Those skilled in the art will appreciate that fig. 5 is merely an example of the electronic device 5 and does not constitute a limitation of the electronic device 5, which may include more or fewer components than those shown, combine some components, or have different components, such as an input-output device, a network access device, etc.
The processor 50 may be a Central Processing Unit (CPU); it may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may in some embodiments be an internal storage unit of the electronic device 5, such as a hard disk or a memory of the electronic device 5. The memory 51 may also be an external storage device of the electronic device 5 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the electronic device 5. The memory 51 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 51 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
An embodiment of the present application further provides an electronic device, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on an electronic device, enables the electronic device to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus, electronic device and method may be implemented in other ways. For example, the above-described apparatus/electronic device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. An end-to-end multi-pedestrian pose tracking method, comprising:
acquiring a video to be detected, and performing pedestrian detection and attitude estimation on the video to be detected to obtain target detection frames of image frames in the video to be detected, wherein the image frames comprise a plurality of the target detection frames;
inputting an image pair and the target detection frame of each image frame in the image pair into a feature extraction network aiming at each image frame to be processed, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multiple layers of convolution features, and the image pair comprises the image frame to be processed and the previous frame of the image frame to be processed;
inputting each channel feature of each layer of convolution features of each hierarchical convolution feature into a channel attention mechanism-based network respectively, and obtaining a channel attention description-based final feature of the channel feature output by the channel attention mechanism-based network, wherein each layer of convolution features comprises a plurality of channel features;
processing a feature matrix of each image frame in the image pair to obtain a feature flux, inputting the feature flux to a similarity estimation network, and obtaining a similarity matrix output by the similarity estimation network, wherein the feature matrix comprises the final features based on the channel attention description;
assigning values to the elements of an augmented column appended to the similarity matrix according to the peak-to-sidelobe ratio of each row of the similarity matrix to obtain a forward similarity matrix, and assigning values to the elements of an augmented row appended to the similarity matrix according to the peak-to-sidelobe ratio of each column of the similarity matrix to obtain a reverse similarity matrix;
generating a forward prediction matrix according to the forward similarity matrix, and generating a reverse prediction matrix according to the reverse similarity matrix;
generating a data association matrix between the image frame to be processed and a previous frame of the image frame to be processed according to the forward prediction matrix and the backward prediction matrix;
and according to the data association matrix, carrying out identity distribution on each target detection frame of the image frame to be processed based on each target detection frame of the previous frame of the image frame to be processed, so as to obtain a tracking result of the image frame to be processed.
2. The method of claim 1, wherein assigning values to the elements of the augmented column according to the peak-to-sidelobe ratio of each row of the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of the augmented row according to the peak-to-sidelobe ratio of each column of the similarity matrix to obtain the reverse similarity matrix, comprises:
calculating a peak-to-sidelobe ratio for each row of the similarity matrix by the formula PSR_i = (max(S_i) - μ_i) / σ_i, wherein PSR_i is the peak-to-sidelobe ratio of the i-th row of the similarity matrix, max(S_i) is the peak value of the i-th row, and μ_i and σ_i are the mean and the standard deviation of the elements in the i-th row;
judging whether the peak-to-sidelobe ratio of the i-th row is smaller than a first preset threshold;
when the peak-to-sidelobe ratio of the i-th row is smaller than the first preset threshold, setting the i-th element of the augmented column to 1; when the peak-to-sidelobe ratio of the i-th row is greater than or equal to the first preset threshold, setting the i-th element of the augmented column to 0;
calculating a peak-to-sidelobe ratio for each column of the similarity matrix by the formula PSR_j = (max(S_j) - μ_j) / σ_j, wherein PSR_j is the peak-to-sidelobe ratio of the j-th column of the similarity matrix, max(S_j) is the peak value of the j-th column, and μ_j and σ_j are the mean and the standard deviation of the elements in the j-th column;
judging whether the peak-to-sidelobe ratio of the j-th column is smaller than a second preset threshold;
when the peak-to-sidelobe ratio of the j-th column is smaller than the second preset threshold, setting the j-th element of the augmented row to 1; when the peak-to-sidelobe ratio of the j-th column is greater than or equal to the second preset threshold, setting the j-th element of the augmented row to 0;
wherein S is the similarity matrix, S_i is the i-th row of the similarity matrix S, and S_j is the j-th column of the similarity matrix S.
3. The method of claim 1, wherein generating a forward prediction matrix based on the forward similarity matrix and generating a reverse prediction matrix based on the reverse similarity matrix comprises:
applying a softmax function to each row of the forward similarity matrix to generate the forward prediction matrix;
applying the softmax function to each column of the reverse similarity matrix to generate the reverse prediction matrix.
4. The method of claim 1, wherein generating a data correlation matrix between the image frame to be processed and a frame preceding the image frame to be processed based on the forward prediction matrix and the reverse prediction matrix comprises:
removing the last column of the forward prediction matrix to obtain a target forward prediction matrix;
removing the last row of the reverse prediction matrix to obtain a target reverse prediction matrix;
generating, according to a formula over the target forward prediction matrix and the target reverse prediction matrix, the data correlation matrix between the image frame to be processed and the previous frame of the image frame to be processed, wherein A denotes the data correlation matrix, P_f' denotes the target forward prediction matrix, and P_b' denotes the target reverse prediction matrix.
5. The method of claim 1, wherein separately inputting the respective channel features of each layer of convolution features of each of the hierarchical convolution features to a channel attention mechanism-based network, obtaining a channel attention description-based final feature of the channel features output by the channel attention mechanism-based network, comprises:
for each channel feature F_c of each layer of convolution features F, inputting the channel feature F_c into the channel attention mechanism-based network, and obtaining the final feature F_c' of the channel feature F_c based on channel attention description output by the channel attention mechanism-based network, wherein
    z_c = (1 / (W × H)) · Σ_{i=1..W} Σ_{j=1..H} F_c(i, j),
    s = σ(W_2 · δ(W_1 · z)),
    F_c' = s_c · F_c,
F_c(i, j) denotes the value of the c-th channel feature F_c at position (i, j); σ and δ denote the sigmoid function and the ReLU function, respectively; W_1 denotes the set of weights of the first fully-connected layer, and W_2 denotes the set of weights of the second fully-connected layer; the resolution of the c-th channel feature is W × H; and the channel attention mechanism-based network comprises a global pooling layer, the first fully-connected layer, the ReLU function, the second fully-connected layer and the sigmoid function, which are connected in sequence.
6. The method according to any one of claims 1 to 5, further comprising, before acquiring the video to be detected:
acquiring a training data set;
training a data association network model by adopting a training data set, wherein the data association network model comprises the feature extraction network, the channel attention mechanism-based network and a bidirectional perception network, and the bidirectional perception network comprises the similarity estimation network;
calculating the overall loss through an overall loss function of the data association network model, and iteratively training the data association network model according to the overall loss;
wherein the overall loss function is a linear weighted combination of the sub-losses, with λ denoting the weight coefficients; L_f denotes the forward matching loss from image frame I_{t-1} to image frame I_t; L_b denotes the reverse matching loss from image frame I_t to image frame I_{t-1}; L_c denotes the bidirectional consistency matching loss; P_f is the forward prediction matrix, G_f is the forward labeled association matrix, G_b is the reverse labeled association matrix, P_b is the reverse prediction matrix, P_f' is the target forward prediction matrix, and P_b' is the target reverse prediction matrix;
wherein the forward labeled association matrix is the matrix obtained by removing the last row of the final labeled association matrix G_A, the reverse labeled association matrix is the matrix obtained by removing the last column of the final labeled association matrix G_A, and the final labeled association matrix G_A is the matrix obtained by augmenting the data association matrix by one column and one row.
7. An end-to-end multi-pedestrian pose tracking apparatus, comprising:
the pedestrian detection and posture estimation module is used for acquiring a video to be detected, and performing pedestrian detection and posture estimation on the video to be detected to obtain target detection frames of all image frames in the video to be detected, wherein the image frames comprise a plurality of target detection frames;
the multilayer convolution feature extraction module is used for inputting an image pair and the target detection frame of each image frame in the image pair into a feature extraction network aiming at each image frame to be processed, and obtaining the hierarchical convolution feature of each target detection frame output by the feature extraction network, wherein the hierarchical convolution feature comprises multilayer convolution features, and the image pair comprises the image frame to be processed and the previous frame of the image frame to be processed;
a feature extraction module based on channel attention, configured to input, to a network based on a channel attention mechanism, respective channel features of each layer of convolution features of each of the hierarchical convolution features, respectively, and obtain a final feature, output by the network based on the channel attention mechanism, of channel feature description based on channel attention, where each layer of the convolution features includes a plurality of the channel features;
a similarity matrix generation module, configured to process a feature matrix of each image frame in the image pair to obtain a feature flux, input the feature flux to a similarity estimation network, and obtain a similarity matrix output by the similarity estimation network, where the feature matrix includes the final feature based on the channel attention description;
the assignment module is used for assigning values to the elements of an augmented column appended to the similarity matrix according to the peak-to-sidelobe ratio of each row of the similarity matrix to obtain the forward similarity matrix, and assigning values to the elements of an augmented row appended to the similarity matrix according to the peak-to-sidelobe ratio of each column of the similarity matrix to obtain the reverse similarity matrix;
the prediction matrix generation module is used for generating a forward prediction matrix according to the forward similarity matrix and generating a reverse prediction matrix according to the reverse similarity matrix;
a data correlation matrix generation module, configured to generate a data correlation matrix between the image frame to be processed and a previous frame of the image frame to be processed according to the forward prediction matrix and the reverse prediction matrix;
and the identity distribution module is used for carrying out identity distribution on each target detection frame of the image frame to be processed based on each target detection frame of the previous frame of the image frame to be processed according to the data correlation matrix so as to obtain a tracking result of the image frame to be processed.
8. The apparatus of claim 7, wherein the assignment module is specifically configured to:
calculating a peak-to-sidelobe ratio for each row of the similarity matrix by the formula PSR_i = (max(S_i) - μ_i) / σ_i, wherein PSR_i is the peak-to-sidelobe ratio of the i-th row of the similarity matrix, max(S_i) is the peak value of the i-th row, and μ_i and σ_i are the mean and the standard deviation of the elements in the i-th row;
judging whether the peak-to-sidelobe ratio of the i-th row is smaller than a first preset threshold;
when the peak-to-sidelobe ratio of the i-th row is smaller than the first preset threshold, setting the i-th element of the augmented column to 1; when the peak-to-sidelobe ratio of the i-th row is greater than or equal to the first preset threshold, setting the i-th element of the augmented column to 0;
calculating a peak-to-sidelobe ratio for each column of the similarity matrix by the formula PSR_j = (max(S_j) - μ_j) / σ_j, wherein PSR_j is the peak-to-sidelobe ratio of the j-th column of the similarity matrix, max(S_j) is the peak value of the j-th column, and μ_j and σ_j are the mean and the standard deviation of the elements in the j-th column;
judging whether the peak-to-sidelobe ratio of the j-th column is smaller than a second preset threshold;
when the peak-to-sidelobe ratio of the j-th column is smaller than the second preset threshold, setting the j-th element of the augmented row to 1; when the peak-to-sidelobe ratio of the j-th column is greater than or equal to the second preset threshold, setting the j-th element of the augmented row to 0;
wherein S is the similarity matrix, S_i is the i-th row of the similarity matrix S, and S_j is the j-th column of the similarity matrix S.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN202111328115.XA 2021-11-10 2021-11-10 End-to-end multi-pedestrian posture tracking method and device and electronic equipment Active CN113762231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111328115.XA CN113762231B (en) 2021-11-10 2021-11-10 End-to-end multi-pedestrian posture tracking method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111328115.XA CN113762231B (en) 2021-11-10 2021-11-10 End-to-end multi-pedestrian posture tracking method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113762231A CN113762231A (en) 2021-12-07
CN113762231B true CN113762231B (en) 2022-03-22

Family

ID=78784822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111328115.XA Active CN113762231B (en) 2021-11-10 2021-11-10 End-to-end multi-pedestrian posture tracking method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113762231B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315521A (en) * 2022-06-22 2023-12-29 脸萌有限公司 Method, apparatus, device and medium for processing video based on contrast learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism
CN111160203A (en) * 2019-12-23 2020-05-15 中电科新型智慧城市研究院有限公司 Loitering and lingering behavior analysis method based on head and shoulder model and IOU tracking
CN111881840A (en) * 2020-07-30 2020-11-03 北京交通大学 Multi-target tracking method based on graph network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214238B (en) * 2017-06-30 2022-06-28 阿波罗智能技术(北京)有限公司 Multi-target tracking method, device, equipment and storage medium
CN108573496B (en) * 2018-03-29 2020-08-11 淮阴工学院 Multi-target tracking method based on LSTM network and deep reinforcement learning
CN109919981B (en) * 2019-03-11 2022-08-02 南京邮电大学 Multi-feature fusion multi-target tracking method based on Kalman filtering assistance
US20210319420A1 (en) * 2020-04-12 2021-10-14 Shenzhen Malong Technologies Co., Ltd. Retail system and methods with visual object tracking
CN112001225B (en) * 2020-07-06 2023-06-23 西安电子科技大学 Online multi-target tracking method, system and application
CN111882581B (en) * 2020-07-21 2022-10-28 青岛科技大学 Multi-target tracking method for depth feature association
CN113034543B (en) * 2021-03-18 2022-05-03 德清阿尔法创新研究院 3D-ReID multi-target tracking method based on local attention mechanism

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism
CN111160203A (en) * 2019-12-23 2020-05-15 中电科新型智慧城市研究院有限公司 Loitering and lingering behavior analysis method based on head and shoulder model and IOU tracking
CN111881840A (en) * 2020-07-30 2020-11-03 北京交通大学 Multi-target tracking method based on graph network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic Selection of Keyframes from Angiogram Videos; T. Syeda-Mahmood et al.; 2010 20th International Conference on Pattern Recognition; 2010-10-07; pp. 4008-4011 *

Also Published As

Publication number Publication date
CN113762231A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
Jia et al. Segment, magnify and reiterate: Detecting camouflaged objects the hard way
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
CN112597941B (en) Face recognition method and device and electronic equipment
CN109086811B (en) Multi-label image classification method and device and electronic equipment
Vankadari et al. Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN110555405B (en) Target tracking method and device, storage medium and electronic equipment
WO2023010758A1 (en) Action detection method and apparatus, and terminal device and storage medium
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN112749666B (en) Training and action recognition method of action recognition model and related device
CN112784750B (en) Fast video object segmentation method and device based on pixel and region feature matching
Luo et al. Traffic analytics with low-frame-rate videos
US20230334893A1 (en) Method for optimizing human body posture recognition model, device and computer-readable storage medium
CN111209774A (en) Target behavior recognition and display method, device, equipment and readable medium
WO2021169642A1 (en) Video-based eyeball turning determination method and system
Wang et al. Skip-connection convolutional neural network for still image crowd counting
CN115375917B (en) Target edge feature extraction method, device, terminal and storage medium
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
CN114170558B (en) Method, system, apparatus, medium, and article for video processing
CN113762231B (en) End-to-end multi-pedestrian posture tracking method and device and electronic equipment
Li et al. Robust foreground segmentation based on two effective background models
CN116523959A (en) Moving object detection method and system based on artificial intelligence
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN115810152A (en) Remote sensing image change detection method and device based on graph convolution and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant