CN113674313A - Pedestrian tracking method and device, storage medium and electronic equipment


Info

Publication number
CN113674313A
CN113674313A (application CN202110756866.5A)
Authority
CN
China
Prior art keywords
face
video frame
feature vector
human body
frame image
Prior art date
Legal status
Pending
Application number
CN202110756866.5A
Other languages
Chinese (zh)
Inventor
王建雄
刘竞爽
鲍一平
Current Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd, Beijing Megvii Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN202110756866.5A
Publication of CN113674313A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248 - Analysis of motion using feature-based methods involving reference images or patches
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30196 - Human being; Person
    • G06T 2207/30201 - Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a pedestrian tracking method and apparatus, a storage medium, and an electronic device. The pedestrian tracking method includes: extracting at least one piece of face data and at least one piece of human body data from each of at least two video frame images in a video; matching each face feature vector and each human body feature vector of each video frame image to determine a face-to-person binding result for each video frame image, where the face-to-person binding result includes a face frame, the human body frame corresponding to the face frame, the face feature vector, and the human body feature vector corresponding to the face feature vector; and determining the same pedestrian in two video frame images that are n frames apart based on the human body feature vectors in the face-to-person binding results of those two images. The method and apparatus can improve the accuracy of face binding within a single video frame image.

Description

Pedestrian tracking method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer application technologies, and in particular, to a pedestrian tracking method, apparatus, storage medium, and electronic device.
Background
Currently, existing pedestrian tracking methods work as follows: frames are extracted from a video to obtain a plurality of video frame images. Face detection and human body detection are then performed on each of the video frame images to obtain the face frames and human body frames of each video frame image. The face frames and human body frames in each video frame image are then bound by a post-processing method (for example, based on the Intersection over Union (IoU) value of each face frame and human body frame) to obtain face-body binding pairs. Finally, a tracking algorithm is run on adjacent video frame images to track the same pedestrian, thereby realizing face-binding tracking.
In the process of implementing the invention, the inventors found that the prior art has the following problem: because the face frames and human body frames in each video frame image are bound through a post-processing method, the accuracy of face binding is low. For example, in a scene with a dense crowd, two pedestrians may be close to each other in a video frame image, so their face frames and human body frames are also close to each other, and it is then difficult to accurately determine, based on the IoU value alone, which pedestrian a target face frame belongs to.
Disclosure of Invention
An object of the embodiments of the present application is to provide a pedestrian tracking method, apparatus, storage medium, and electronic device, so as to solve the problem in the prior art that the face binding accuracy is not high.
In a first aspect, an embodiment of the present application provides a pedestrian tracking method, including: extracting at least one piece of face data and at least one piece of human body data from each of at least two video frame images in a video, where each piece of face data includes a face frame and a face feature vector corresponding to the face framed by the face frame, and each piece of human body data includes a human body frame and a human body feature vector corresponding to the human body framed by the human body frame; matching each face feature vector and each human body feature vector of each video frame image to determine a face-to-person binding result for each video frame image, where the face-to-person binding result includes a face frame, the human body frame corresponding to the face frame, the face feature vector, and the human body feature vector corresponding to the face feature vector; and determining the same pedestrian in two video frame images that are n frames apart based on the human body feature vectors in the face-to-person binding results of the two video frame images, where n is an integer greater than or equal to 0.
Therefore, in the embodiment of the present application, each face feature vector and each human body feature vector of each video frame image are matched to determine the face-to-person binding result of each video frame image, which can improve the accuracy of face binding within a single video frame image. In addition, the embodiment of the present application determines the same pedestrian in two video frame images that are n frames apart based on the human body feature vectors in the face-to-person binding results of the two images, which can significantly improve both the accuracy and the speed of pedestrian tracking.
In one possible embodiment, matching each face feature vector and each human body feature vector of each video frame image to determine the face-to-person binding result of each video frame image includes: calculating the distance between a current face feature vector and a current human body feature vector; when the distance between the current face feature vector and the current human body feature vector is less than or equal to a first preset distance, determining that the current face frame corresponding to the current face feature vector and the current human body frame corresponding to the current human body feature vector are the face frame and human body frame of the same pedestrian; and obtaining a face-to-person binding result from the current face frame, the current human body frame, the current face feature vector, and the current human body feature vector.
Therefore, the embodiment of the application can accurately acquire the face-to-person binding result through the distance between the face feature vector and the human body feature vector.
In one possible embodiment, after matching each face feature vector and each human body feature vector of each video frame image to determine the face-to-person binding result of each video frame image, the pedestrian tracking method further includes: calculating the intersection-over-union (IoU) value of each face frame and each human body frame in each video frame image; determining a face-body binding pair in each video frame image according to the IoU values, where the face-body binding pair includes a face frame and the human body frame corresponding to the face frame; verifying the face-to-person binding result of each video frame image against the face-body binding pairs in that image, so as to find any mismatched face-to-person binding result among the face-to-person binding results of each video frame image; and deleting the mismatched face-to-person binding results from the face-to-person binding results of each video frame image.
Therefore, the embodiment of the present application can also verify the face-to-person binding results using the face-body binding pairs determined from the IoU values, so as to eliminate face-to-person binding results in which frames that are far apart were wrongly bound through feature vectors, which further improves the accuracy of face binding.
In one possible embodiment, the two video frame images that are n frames apart are the ith video frame image and the (i+n)th video frame image, where i is an integer greater than or equal to 1. Determining the same pedestrian in the two video frame images that are n frames apart based on the human body feature vectors in their face-to-person binding results includes: calculating the distance between a human body feature vector in the face-to-person binding result of the ith video frame image and a human body feature vector in the face-to-person binding result of the (i+n)th video frame image; and when that distance is less than or equal to a second preset distance, determining that the corresponding human body frame in the ith video frame image and the corresponding human body frame in the (i+n)th video frame image are the human body frames of the same pedestrian.
Therefore, the embodiment of the present application can accurately determine whether two frames contain the same pedestrian according to the distance between a human body feature vector in the face-to-person binding result of the ith video frame image and a human body feature vector in the face-to-person binding result of the (i+n)th video frame image.
In one possible embodiment, extracting at least one piece of face data and at least one piece of human body data from each of at least two video frame images in a video includes: extracting the at least one piece of face data and the at least one piece of human body data of each video frame image through a trained detection model.
In one possible embodiment, the training process of the detection model includes: calculating, with a first loss function, the distance between each face feature vector and each human body feature vector of each training image in at least one training image used for training the detection model to be trained, so as to obtain a first loss function value, where the first loss function is used to shorten the distance between the face feature vector and the human body feature vector of the same pedestrian in each training image and to lengthen the distance between the face feature vector and the human body feature vector of different pedestrians in each training image; calculating, with a second loss function, the distance between the human body feature vectors of the same pedestrian in two training images that are n frames apart, so as to obtain a second loss function value, where the second loss function is used to shorten the distance between the human body feature vectors of the same pedestrian in the two training images and to lengthen the distance between the human body feature vectors of different pedestrians in the two training images; and adjusting the parameters of the detection model to be trained according to the first loss function value and the second loss function value, so as to obtain the trained detection model.
Therefore, through this training process, the face feature vectors and human body feature vectors output by the trained detection model can be used to identify whether a face and a human body belong to the same pedestrian.
In one possible embodiment, the two video frame images that are n frames apart are two adjacent video frame images.
Therefore, the embodiment of the present application can realize pedestrian tracking through two adjacent video frame images.
In a second aspect, an embodiment of the present application provides a pedestrian tracking apparatus, including: an extraction module, configured to extract at least one piece of face data and at least one piece of human body data from each of at least two video frame images in a video, where each piece of face data includes a face frame and a face feature vector corresponding to the face framed by the face frame, and each piece of human body data includes a human body frame and a human body feature vector corresponding to the human body framed by the human body frame; a matching module, configured to match each face feature vector and each human body feature vector of each video frame image to determine a face-to-person binding result for each video frame image, where the face-to-person binding result includes a face frame, the human body frame corresponding to the face frame, the face feature vector, and the human body feature vector corresponding to the face feature vector; and a first determining module, configured to determine the same pedestrian in two video frame images that are n frames apart based on the human body feature vectors in the face-to-person binding results of the two video frame images, where n is an integer greater than or equal to 0.
In a possible embodiment, the matching module is specifically configured to: calculate the distance between a current face feature vector and a current human body feature vector; when the distance between the current face feature vector and the current human body feature vector is less than or equal to a first preset distance, determine that the current face frame corresponding to the current face feature vector and the current human body frame corresponding to the current human body feature vector are the face frame and human body frame of the same pedestrian; and obtain a face-to-person binding result from the current face frame, the current human body frame, the current face feature vector, and the current human body feature vector.
In one possible embodiment, the pedestrian tracking apparatus further includes: a first calculating module, configured to calculate the intersection-over-union (IoU) value of each face frame and each human body frame in each video frame image after each face feature vector and each human body feature vector of each video frame image have been matched to determine the face-to-person binding result of each video frame image; a second determining module, configured to determine a face-body binding pair in each video frame image according to the IoU values, where the face-body binding pair includes a face frame and the human body frame corresponding to the face frame; a searching module, configured to verify the face-to-person binding result of each video frame image against the face-body binding pairs in that image, so as to find any mismatched face-to-person binding result among the face-to-person binding results of each video frame image; and a deleting module, configured to delete the mismatched face-to-person binding results from the face-to-person binding results of each video frame image.
In one possible embodiment, the two video frame images that are n frames apart are the ith video frame image and the (i+n)th video frame image, where i is an integer greater than or equal to 1. The first determining module is specifically configured to: calculate the distance between a human body feature vector in the face-to-person binding result of the ith video frame image and a human body feature vector in the face-to-person binding result of the (i+n)th video frame image; and when that distance is less than or equal to a second preset distance, determine that the corresponding human body frame in the ith video frame image and the corresponding human body frame in the (i+n)th video frame image are the human body frames of the same pedestrian.
In a possible embodiment, the extraction module is specifically configured to extract the at least one piece of face data and the at least one piece of human body data of each video frame image through a trained detection model.
In one possible embodiment, the pedestrian tracking apparatus further includes: a second calculating module, configured to calculate, with a first loss function, the distance between each face feature vector and each human body feature vector of each training image in at least one training image used for training the detection model to be trained, so as to obtain a first loss function value, where the first loss function is used to shorten the distance between the face feature vector and the human body feature vector of the same pedestrian in each training image and to lengthen the distance between the face feature vector and the human body feature vector of different pedestrians in each training image; a third calculating module, configured to calculate, with a second loss function, the distance between the human body feature vectors of the same pedestrian in two training images that are n frames apart, so as to obtain a second loss function value, where the second loss function is used to shorten the distance between the human body feature vectors of the same pedestrian in the two training images and to lengthen the distance between the human body feature vectors of different pedestrians in the two training images; and an adjusting module, configured to adjust the parameters of the detection model to be trained according to the first loss function value and the second loss function value, so as to obtain the trained detection model.
In one possible embodiment, the two video frame images that are n frames apart are two adjacent video frame images.
In a third aspect, an embodiment of the present application provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program performs the method according to the first aspect or any optional implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the method of the first aspect or any of the alternative implementations of the first aspect.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any possible implementation manner of the first aspect.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of a pedestrian tracking method in the prior art;
FIG. 2 is a flow chart of a pedestrian tracking method provided by an embodiment of the present application;
FIG. 3 is a detailed flow chart of a pedestrian tracking method provided by an embodiment of the present application;
FIG. 4 is a block diagram of a pedestrian tracking apparatus provided by an embodiment of the present application;
FIG. 5 is a block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Face-binding tracking is a technology for locating faces and human bodies in a video, binding the face and the human body that belong to the same pedestrian into a pair, and tracking the human body over time. Through face-binding tracking, the trajectory of a pedestrian in the video can be obtained, and the face and human body pictures of the pedestrian at any moment can be retrieved, which has wide application prospects in fields such as autonomous driving, security, new retail, and virtual/augmented reality.
Referring to fig. 1, fig. 1 shows a flow chart of a pedestrian tracking method in the prior art. The pedestrian tracking method shown in fig. 1 includes:
and performing frame extraction on the video to be subjected to frame extraction to obtain an image sequence set. The image sequence set comprises 1 st video frame image to Nth video frame image, wherein N is a positive integer. And then, respectively carrying out face detection and human body detection on each video frame image in the image sequence set to obtain a face frame and a human body frame of each video frame image. Subsequently, a post-processing method can be used to bind the face frame and the body frame of each video frame image to obtain a face-body binding pair. Then, on the adjacent video frame images, the tracking of the same pedestrian is completed by using a tracking algorithm (for example, an IoU tracker algorithm or a Re-ID tracker algorithm), so that the face binding tracking can be completed.
However, the existing pedestrian tracking method has at least the following problems:
because the face frames and human body frames in each video frame image are bound by a post-processing method, the accuracy of face binding is low;
the tracking of the same pedestrian is completed through an IoU tracker algorithm in the existing pedestrian tracking method, so that the problem of low tracking accuracy is caused. For example, in a scene with dense crowds, tracking by the IoU tracker algorithm may match a pedestrian of a previous video frame image to another pedestrian of a next video frame image;
the existing pedestrian tracking method completes the tracking of the same pedestrian through the Re-ID tracker algorithm, so that the problem of low tracking efficiency is caused. For example, the Re-ID tracker algorithm is used to perform tracking on the same pedestrian by extracting a human body frame in the video frame image, and then inputting the human body frame into an existing feature extraction model to obtain corresponding human body features (e.g., wearing information or age information). However, under the condition that there are many pedestrians in the video frame image, the human body frames in the video frame image can be extracted only by extracting for many times, and each human body frame needs to be subjected to feature extraction by the existing feature extraction model, that is, feature extraction also needs to be extracted for many times, thereby causing the problem of low tracking efficiency.
That is, the existing pedestrian tracking methods have at least the following two problems: the accuracy of face binding within a single video frame image is not high, and it is difficult to meet the requirements of both high precision and high speed.
Based on this, an embodiment of the present application provides a pedestrian tracking method. At least one piece of face data and at least one piece of human body data are extracted from each of at least two video frame images in a video, where each piece of face data includes a face frame and a face feature vector corresponding to the face framed by the face frame, and each piece of human body data includes a human body frame and a human body feature vector corresponding to the human body framed by the human body frame. Each face feature vector and each human body feature vector of each video frame image are then matched to determine a face-to-person binding result for each video frame image, where the face-to-person binding result includes a face frame, the human body frame corresponding to the face frame, the face feature vector, and the human body feature vector corresponding to the face feature vector. Finally, the same pedestrian in two video frame images that are n frames apart is determined based on the human body feature vectors in the face-to-person binding results of the two images, where n is an integer greater than or equal to 0. Therefore, by matching each face feature vector and each human body feature vector of each of the at least two video frame images to determine the face-to-person binding result of each video frame image, the embodiment of the present application can improve the accuracy of face binding within a single video frame image.
In addition, the embodiment of the present application determines the same pedestrian in two video frame images that are n frames apart based on the human body feature vectors in their face-to-person binding results, which can significantly improve both the accuracy and the speed of pedestrian tracking.
To facilitate understanding of the embodiments of the present application, some terms referred to in the embodiments of the present application are explained below:
feature Embedding (FE): it is a vector of length M.
It should be understood that the specific value of M may be set according to actual requirements, and the embodiments of the present application are not limited thereto.
For example, M may be 8.
It should also be understood that a feature embedding may also be referred to as a feature vector or as feature data.
Correspondingly, a face feature embedding may also be referred to as a face feature vector or face feature data, and a human body feature embedding may also be referred to as a human body feature vector or human body feature data.
Furthermore, feature embedding also has the following properties:
for a face frame (and the face frame may have a corresponding face feature vector) and a body frame (and the body frame may have a corresponding body feature vector) belonging to the same pedestrian, the distance between the face feature vector and the body feature vector in the same video frame image is less than or equal to a first preset distance, that is, the distance between the face feature vector and the body feature vector is relatively close;
for the face frames and the body frames belonging to different pedestrians, the distance between the face feature vector and the body feature vector in the same video frame image is larger than a first preset distance, namely the distance between the face feature vector and the body feature vector is relatively far;
for the human body frames and the human body frames belonging to the same pedestrian, the distance between the human body feature vectors in the two adjacent video frame images is less than or equal to a second preset distance, namely the distance between the human body feature vectors is relatively short;
for the human body frames and the human body frames belonging to different pedestrians, the distance between the human body feature vectors in the two adjacent video frame images is larger than or equal to a second preset distance, namely the distance between the human body special effect vectors is relatively far.
It should be noted that, among the above four properties of feature embedding, the first two are used when binding faces and human bodies within a single video frame image, and the last two are used when tracking human bodies across different video frame images.
It should be understood that the specific distance of the first preset distance and the specific distance of the second preset distance may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
It should also be understood that the distance between different feature vectors (e.g., the distance between a face feature vector and a human body feature vector, or the distance between two human body feature vectors) may be a Euclidean distance, a cosine similarity of vectors, etc.
That is to say, the specific form corresponding to the distance between different feature vectors may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
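It should be understood that the following sketch is illustrative only and is not part of the present application; it merely shows, under the assumption of the example embedding length M = 8 mentioned above, how the two distance forms could be computed:

```python
import numpy as np

def euclidean_distance(a, b):
    """L2 distance between two feature embeddings of length M."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity, so a smaller value still means 'closer'."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

face_embedding = np.random.rand(8)   # example face feature vector, M = 8
body_embedding = np.random.rand(8)   # example human body feature vector, M = 8
print(euclidean_distance(face_embedding, body_embedding))
print(cosine_distance(face_embedding, body_embedding))
```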
Referring to fig. 2, fig. 2 shows a flowchart of a pedestrian tracking method according to an embodiment of the present application. The pedestrian tracking method shown in fig. 2 may be performed by a pedestrian tracking apparatus, which may correspond to the pedestrian tracking apparatus shown in fig. 4 below, and the pedestrian tracking apparatus may be various devices capable of performing the method, such as a computer or an electronic device, and the like, and the embodiment of the present application is not limited thereto. The pedestrian tracking method shown in fig. 2 includes:
step S210, at least one face data and at least one body data of each video frame image in at least two video frame images in the video are extracted. Each human body data in the at least one human body data comprises a human body frame and a human body feature vector corresponding to a human body framed by the human body frame.
It should be understood that the facial feature vector is used to represent facial features in the corresponding video frame image.
It should also be understood that the human feature vector is used to represent human features in the corresponding video frame image.
It should also be understood that, although the above description takes as an example face data consisting of a face frame and the face feature vector corresponding to the face framed by the face frame, those skilled in the art should understand that the face data may also include other data, and the embodiments of the present application are not limited thereto.
Correspondingly, the human body data may include other data besides the human body frame and the human body feature vector corresponding to the human body selected by the human body frame, and the embodiment of the present application is not limited thereto.
It should also be understood that the specific manner of extracting at least one face data and at least one body data of each of at least two video frame images in a video may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, at least one face data and at least one body data in each video frame image in the video are extracted through a trained detection model.
It should also be understood that the training process of the detection model may be set according to actual requirements, and the embodiments of the present application are not limited thereto.
Optionally, a first loss function is used to calculate the distance between each face feature vector and each human body feature vector of each training image in at least one training image used for training the detection model to be trained, so as to obtain a first loss function value. The first loss function is used to shorten the distance between the face feature vector and the human body feature vector of the same pedestrian in each training image and to lengthen the distance between the face feature vector and the human body feature vector of different pedestrians in each training image (in other words, the first loss function pulls the face and human body feature vectors of the same pedestrian close together and pushes those of different pedestrians apart). A second loss function is then used to calculate the distance between the human body feature vectors of the same pedestrian in two training images that are n frames apart, so as to obtain a second loss function value. The second loss function is used to shorten the distance between the human body feature vectors of the same pedestrian in the two training images (for example, to shorten the distance between the human body feature vector of a first pedestrian in the mth training image and the human body feature vector of that same pedestrian in the (m+n)th training image) and to lengthen the distance between the human body feature vectors of different pedestrians in the two training images. Finally, the parameters of the detection model to be trained are adjusted according to the first loss function value and the second loss function value, so as to obtain the trained detection model. Here, n is an integer greater than or equal to 0.
It should be understood that the specific value of n may be set according to actual requirements, and the embodiments of the present application are not limited thereto.
For example, n may be 0, that is, two training images spaced by n frames are two training images adjacent to each other.
In order to facilitate understanding of the embodiment of the present application, two adjacent training images are taken as an example in the following description.
Specifically, in order to make the face feature vectors and human body feature vectors output by the trained detection model satisfy the properties of feature embedding described above, a training image set containing face frames, human body frames, and the corresponding binding information (for example, a first face frame and a first human body frame that belong to the same pedestrian) needs to be provided.
When at least one training image in the training image set is input into the detection model to be trained (or the initial detection model), the initial face data and initial human body data output by the detection model to be trained can be obtained. The initial face data includes an initial face frame and an initial face feature vector corresponding to the initial face frame; the initial human body data includes an initial human body frame and an initial human body feature vector corresponding to the initial human body frame.
Further, for each training image, the distance between each initial face feature vector and each initial human body feature vector in the single training image may be calculated based on the first loss function (for example, the distance between a first initial face feature vector and a first initial human body feature vector is calculated by the first loss function), and the sum of the distances between the initial face feature vectors and the initial human body feature vectors in each training image may be calculated to obtain the first loss function value (also called the first distance sum) corresponding to each training image.
In addition, for different training images, since the same pedestrian may exist in two adjacent training images and different pedestrians may also exist in the two adjacent training images, whether the same pedestrian exists in the two adjacent training images can be determined based on the binding relationship in the training image set.
Further, for different pedestrians in two adjacent training images, a distance between body feature vectors of different pedestrians in the two adjacent training images may be calculated based on the first loss function (e.g., calculating a distance between a body feature vector of a first person in the first training image and a body feature vector of a second person in the second training image), and a sum of distances between body feature vectors of different pedestrians in all two adjacent training images is calculated to obtain a third loss function value (or referred to as a third distance sum) between different pedestrians in the two adjacent training images; for the same pedestrian in the two adjacent training images, the distance between the body feature vectors of the same pedestrian in the two adjacent training images may be calculated based on the second loss function (e.g., the distance between the body feature vector of the first person in the first training image and the body feature vector of the first person in the second training image is calculated), and the sum of the distances between the body feature vectors of the same pedestrian in all the two adjacent training images is calculated to obtain a second loss function value (or referred to as a second distance sum) between the same pedestrian in the two adjacent training images.
All of the first loss function values, all of the second loss function values, and all of the third loss function values may then be added to obtain a total loss function value.
Finally, the total loss function value can be used to adjust the parameters in the detection model to be trained (or the detection model in training) so as to obtain the trained detection model.
It should be noted that, although the above describes an example of obtaining a total loss function value by adding all the first loss function values, all the second loss function values, and all the third loss function values, it should be understood by those skilled in the art that the determination manner of the loss function values may also be set according to actual requirements, and the embodiments of the present application are not limited thereto.
For example, for each training image, the distance between each of the initial face feature vectors in the single training image may be calculated based on a first loss function (e.g., the distance between the first initial face feature vector and the second initial face feature vector is calculated by the first loss function), and the sum of the distances between each of the initial face feature vectors in the single training image may be calculated to obtain a fourth loss function value (alternatively referred to as a fourth sum of distances).
Then, all of the first loss function values, all of the second loss function values, all of the third loss function values, and all of the fourth loss function values may be added to obtain the total loss function value.
It should be understood that the specific function corresponding to the first loss function may be set according to actual requirements, as long as the first loss function is ensured to be used for determining the distance of the feature vector of the pedestrian (for example, the distance between the face feature vector and the human body feature vector; for example, the distance between the human body feature vectors in different images), and the embodiment of the present application is not limited thereto.
For example, the first loss function may include a first loss sub-function and a second loss sub-function, and so on. The first loss sub-function is used for calculating the distance between the initial face feature vector and the initial human body feature vector, and the second loss sub-function is used for calculating the distance between one initial face feature vector and the other initial face feature vector.
It should also be understood that the specific function corresponding to the second loss function may also be set according to actual requirements, as long as the second loss function is ensured to be used for determining the distance of the body feature vector of the same pedestrian, and the embodiment of the present application is not limited thereto.
It should be noted that, although the above description is given by taking two training images adjacent to each other in the front and back direction as an example to train the detection model to be trained, it should be understood by those skilled in the art that the detection model to be trained may also be trained by other training images with an interval n (for example, n may be 1, or may also be 3, etc.), and the embodiment of the present application is not limited to this.
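It should be understood that the present application does not prescribe concrete formulas for the first loss function and the second loss function. The following sketch only illustrates the pull-close/push-apart behaviour described above; the margin-based contrastive form, the margin value, and the way the two loss values are summed are assumptions of the illustration rather than features of the application:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_pedestrian, margin=1.0):
    """Pull embeddings of the same pedestrian together and push embeddings of
    different pedestrians at least `margin` apart.
    emb_a, emb_b: (N, M) tensors; same_pedestrian: (N,) boolean tensor."""
    dist = F.pairwise_distance(emb_a, emb_b)
    same = same_pedestrian.float()
    pos = same * dist.pow(2)                             # shorten distances of the same pedestrian
    neg = (1.0 - same) * F.relu(margin - dist).pow(2)    # lengthen distances of different pedestrians
    return (pos + neg).mean()

# First loss: face vs. human body embeddings within one training image.
# loss1 = contrastive_loss(face_embs, body_embs, same_id_mask)
# Second loss: human body embeddings of two training images n frames apart.
# loss2 = contrastive_loss(body_embs_frame_m, body_embs_frame_m_plus_n, same_id_mask2)
# total = loss1 + loss2   # used to adjust the parameters of the detection model
```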
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Specifically, each video frame image extracted from the same video is input into a trained detection model to obtain at least one face data and at least one body data output by the trained detection model. Each piece of face data in the at least one piece of face data comprises a confidence coefficient of a current face, a current face frame and a face feature vector corresponding to the current face selected by the current face frame; each human body data in the at least one human body data comprises the confidence coefficient of the current human body, the current human body frame and the human body feature vector corresponding to the current human body selected by the current human body frame.
For example, when the at least one piece of face data is denoted Faces (i.e., the face data set is denoted Faces), each element f_i in the face data set is:

f_i = (s_i, b_i, fe_i)

where f_i denotes the ith piece of face data in the face data set, s_i denotes the confidence of the ith face, b_i denotes the face frame of the ith face, and fe_i denotes the face feature vector corresponding to the ith face.
For another example, when the at least one piece of human body data is denoted Persons (i.e., the human body data set is denoted Persons), each element p_j in the human body data set is:

p_j = (w_j, r_j, g_j)

where p_j denotes the jth piece of human body data in the human body data set, w_j denotes the confidence of the jth human body, r_j denotes the human body frame of the jth human body, and g_j denotes the human body feature vector corresponding to the jth human body.
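It should be understood that the following data-structure sketch is illustrative only; the field names are assumptions, and only the tuple contents f_i = (s_i, b_i, fe_i) and p_j = (w_j, r_j, g_j) come from the description above:

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class FaceData:
    confidence: float                        # s_i: confidence of the i-th face
    box: Tuple[float, float, float, float]   # b_i: face frame (x1, y1, x2, y2)
    embedding: np.ndarray                    # fe_i: face feature vector of length M

@dataclass
class BodyData:
    confidence: float                        # w_j: confidence of the j-th human body
    box: Tuple[float, float, float, float]   # r_j: human body frame (x1, y1, x2, y2)
    embedding: np.ndarray                    # g_j: human body feature vector of length M

# faces   = [FaceData(...), ...]   # the face data set "Faces"
# persons = [BodyData(...), ...]   # the human body data set "Persons"
```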
Step S220, matching each face feature vector and each human body feature vector of each video frame image to determine the face-to-person binding result of each video frame image. The face-to-person binding result includes a face frame, the human body frame corresponding to the face frame, the face feature vector, and the human body feature vector corresponding to the face feature vector.
Specifically, in the process of binding faces within a single video frame image based on feature vectors, the distances between all face feature vectors and all human body feature vectors of each video frame image can be calculated. When a distance is less than or equal to the first preset distance, the face feature vector and the human body feature vector corresponding to that distance can be determined to belong to the same pedestrian, and further the face frame corresponding to that face feature vector and the human body frame corresponding to that human body feature vector can be determined to correspond to the same pedestrian.
Then, the face frame, the face feature vector corresponding to the face frame, the body frame corresponding to the face frame, and the body feature vector corresponding to the body frame may be combined into a face-to-person binding result.
That is to say, the distance between the current face feature vector and the current human body feature vector is calculated; when the distance is less than or equal to the first preset distance, the current face frame corresponding to the current face feature vector and the current human body frame corresponding to the current human body feature vector are determined to be the face frame and human body frame of the same pedestrian; and finally, the face-to-person binding result is obtained from the current face frame, the current human body frame, the current face feature vector, and the current human body feature vector.
For example, when the face-to-person binding results are denoted FP (i.e., the face binding set is denoted FP), FP is defined as follows:

FP = {z_k | z_k = (f_i, p_j)}

where FP denotes the face binding set and z_k denotes the kth face-to-person binding result, which is composed of the ith piece of face data and the jth piece of human body data, i.e., the ith piece of face data and the jth piece of human body data correspond to the same pedestrian.
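It should be understood that the following sketch is illustrative only; the greedy nearest-neighbour assignment and the value of the first preset distance are assumptions, and only the requirement that bound pairs lie within the first preset distance comes from the description above (the FaceData/BodyData sketch given earlier is reused):

```python
import numpy as np

def bind_faces_to_bodies(faces, persons, first_preset_distance=0.5):
    """Build the face-to-person binding results of one video frame image:
    each face is bound to the nearest body embedding within the first preset
    distance; bodies left without a face are kept as body-only results."""
    results, used = [], set()
    for face in faces:
        dists = [np.linalg.norm(face.embedding - p.embedding) for p in persons]
        if not dists:
            break
        j = int(np.argmin(dists))
        if dists[j] <= first_preset_distance and j not in used:
            results.append((face, persons[j]))   # z_k = (f_i, p_j)
            used.add(j)
    # A pedestrian whose face is not visible still yields a body-only result,
    # so that the human body feature vector can be used for tracking later.
    results += [(None, p) for j, p in enumerate(persons) if j not in used]
    return results
```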
It should be noted that, in practice, some pedestrians may have only a human body without a visible face (for example, the face is occluded, or only the back of the pedestrian appears in the image), so a face-to-person binding result may contain only a human body frame without a face frame. In the subsequent tracking process, since tracking uses the human body feature vector corresponding to the human body frame, such body-only face-to-person binding results can also be used for tracking.
Therefore, compared with the existing scheme of binding the face frame and the human body frame in each video frame image through a post-processing method, binding based on the face feature vectors and human body feature vectors makes use of the face features and human body features themselves, so it can judge more accurately whether a face frame and a human body frame belong to the same person.
In addition, considering that the face-to-person binding results determined from the face feature vectors and human body feature vectors may still contain errors, the intersection-over-union (IoU) value of each face frame and each human body frame in each video frame image can be calculated; a face-body binding pair in each video frame image is then determined according to the IoU values, where the face-body binding pair includes a face frame and the human body frame corresponding to the face frame; the face-body binding pairs in each video frame image are then used to verify the face-to-person binding results of that image, so as to find any mismatched face-to-person binding result; and finally the mismatched face-to-person binding results are deleted from the face-to-person binding results of each video frame image.
That is to say, the embodiment of the present application can also verify the face-to-person binding results through the face-body binding pairs determined from the IoU values, so as to reduce face-to-person binding results in which frames that are far apart were wrongly bound through feature vectors, thereby further improving the accuracy of face binding.
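It should be understood that the following verification sketch is illustrative only and simplified: it checks the IoU of each bound face frame and human body frame directly instead of first building separate face-body binding pairs, and the IoU threshold is an assumed value:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def verify_bindings(results, iou_threshold=0.3):
    """Drop embedding-based bindings whose face frame barely overlaps the bound
    human body frame; body-only results are kept as they are."""
    return [(f, p) for f, p in results
            if f is None or iou(f.box, p.box) >= iou_threshold]
```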
Step S230, determining the same pedestrian in two video frame images that are n frames apart based on the human body feature vectors in the face-to-person binding results of the two video frame images. Here, n is an integer greater than or equal to 0.
It should be understood that the specific value of n may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, if the value of n is 0, two video frame images separated by n frames are two adjacent video frame images.
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Specifically, when n is 0, each face-to-person binding result contains a human body feature vector. The distance between a human body feature vector in the face-to-person binding result of the previous video frame image and a human body feature vector in the face-to-person binding result of the next video frame image can be calculated; when that distance is less than or equal to the second preset distance, the corresponding human body frame in the face-to-person binding result of the previous video frame image and the corresponding human body frame in the face-to-person binding result of the next video frame image are determined to be the human body frames of the same pedestrian.
In addition, since tracking is realized based on human body features rather than positional relationships, the accuracy of the embodiment of the present application is better than that of an IoU tracker. Moreover, since the human body feature vectors are already obtained in step S210, there is no need to rely on an additional feature extraction model, which speeds up tracking.
For example, for the ith video frame image and the (i+1)th video frame image, the distances between all human body feature vectors in the ith video frame image and all human body feature vectors in the (i+1)th video frame image are calculated. If a distance is less than or equal to the second preset distance, the two human bodies are considered to be a matched pedestrian (i.e., the same pedestrian appears in both the ith and the (i+1)th video frame images). If the distance is greater than the second preset distance, they are not matched (for example, a pedestrian present in the ith video frame image disappears in the (i+1)th video frame image because that pedestrian left the picture or is occluded by another object). If a pedestrian does not exist in the ith video frame image but appears in the (i+1)th video frame image, the pedestrian has entered the picture or the occlusion has weakened. These steps can then be repeated for the other pairs of adjacent video frame images, so that the tracking results of all pedestrians in the video are obtained.
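It should be understood that the following cross-frame matching sketch is illustrative only; the greedy nearest-neighbour assignment and the value of the second preset distance are assumptions:

```python
import numpy as np

def match_same_pedestrian(results_i, results_i_plus_n, second_preset_distance=0.5):
    """Match body feature vectors between the face-to-person binding results of
    the ith and (i+n)th video frame images. Returns index pairs (k, l) judged
    to be the same pedestrian; unmatched indices correspond to pedestrians
    leaving or entering the picture, or being occluded."""
    matches, used = [], set()
    for k, (_, body_prev) in enumerate(results_i):
        best, best_dist = None, float("inf")
        for l, (_, body_next) in enumerate(results_i_plus_n):
            if l in used:
                continue
            d = float(np.linalg.norm(body_prev.embedding - body_next.embedding))
            if d < best_dist:
                best, best_dist = l, d
        if best is not None and best_dist <= second_preset_distance:
            matches.append((k, best))
            used.add(best)
    return matches
```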
Subsequently, the human body frame and face frame of each pedestrian in the video can be obtained, so that the face frame and human body frame of each pedestrian at any moment in the video can be retrieved, thereby realizing face-binding tracking.
It should be noted that, although two adjacent video frame images are taken as an example above, those skilled in the art should understand that the same pedestrian in two video frame images that are n frames apart can also be determined based on the human body feature vectors in the face-to-person binding results of those two images (for example, n may be 2 or 3), and the embodiment of the present application is not limited thereto.
Therefore, the embodiment of the application can perform face detection and human body detection on a single video frame image through a trained detection model to obtain face data and human body data. And then, carrying out face binding through the face characteristic vector in the face data and the human body characteristic vector in the human body data to obtain a face-human binding result. And then, the human body characteristic vectors in the face-to-person binding results of two video frame images at intervals of n frames can be used for completing human body matching of the same pedestrian, so that human body tracking is realized. And finally, finishing face binding tracking according to the face-to-person binding result and the human body tracking result.
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Referring to fig. 3, fig. 3 is a specific flowchart illustrating a pedestrian tracking method according to an embodiment of the present application. The pedestrian tracking method shown in fig. 3 includes:
and performing frame extraction on the video to be subjected to frame extraction to obtain an image sequence set. The image sequence set comprises 1 st video frame image to Nth video frame image, wherein N is a positive integer. And then, extracting at least one piece of face data and human body data in each video frame image in the image sequence set. Subsequently, each face feature vector and each body feature vector of each video frame image can be matched to determine a face-to-person binding result of each video frame image.
And then, determining the same pedestrian in the two adjacent video frame images based on the human body feature vector in the face-to-person binding result of the two adjacent video frame images so as to realize human body tracking.
It should be understood that the above-mentioned pedestrian tracking method is only exemplary, and various modifications, adaptations or variations thereof may be made by those skilled in the art according to the above-mentioned method and are within the scope of the present application.
Referring to fig. 4, fig. 4 shows a structural block diagram of a pedestrian tracking apparatus 400 provided in an embodiment of the present application. It should be understood that the pedestrian tracking apparatus 400 corresponds to the above method embodiment and can perform the steps related to the above method embodiment; for the specific functions of the pedestrian tracking apparatus 400, reference may be made to the above description, and a detailed description is appropriately omitted here to avoid repetition. The pedestrian tracking apparatus 400 includes at least one software functional module that can be stored in a memory in the form of software or firmware or solidified in the operating system (OS) of the pedestrian tracking apparatus 400. Specifically, the pedestrian tracking apparatus 400 includes:
an extracting module 410, configured to extract at least one face data and at least one body data of each of at least two video frame images in a video, where each face data in the at least one face data includes a face frame and a face feature vector corresponding to a face framed by the face frame, and each body data in the at least one body data includes a body frame and a body feature vector corresponding to a body framed by the body frame;
a matching module 420, configured to match each face feature vector and each human body feature vector of each video frame image to determine a face-to-person binding result of each video frame image, where the face-to-person binding result includes a face frame and a human body frame corresponding to the face frame, and the face feature vector and the human body feature vector corresponding to the face feature vector;
the first determining module 430 is configured to determine, based on human body feature vectors in face-to-person binding results of two video frame images of interval n frames, the same pedestrian in the two video frame images of interval n frames, where n is an integer greater than or equal to 0.
In a possible embodiment, the matching module 420 is specifically configured to: calculating the distance between the current face feature vector and the current human body feature vector; under the condition that the distance between the current face feature vector and the current human body feature vector is smaller than or equal to a first preset distance, determining that a current face frame corresponding to the current face feature vector and a current human body frame corresponding to the current human body feature vector are a face frame and a human body frame of the same pedestrian; and acquiring a face-to-person binding result according to the current face frame, the current body frame, the current face feature vector and the current body feature vector.
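For illustration only, a minimal sketch of this within-frame binding step follows; the data layout (dictionaries with "box" and "feat" entries), the value of FIRST_PRESET_DISTANCE, and the first-match strategy are assumptions rather than part of the described embodiment.

```python
import numpy as np

FIRST_PRESET_DISTANCE = 0.4  # assumed value

def bind_faces_to_bodies(face_data, body_data, threshold=FIRST_PRESET_DISTANCE):
    """Bind each face to a body in a single frame by feature-vector distance.

    face_data / body_data: lists of dicts with "box" (x1, y1, x2, y2) and
    "feat" (1-D numpy array) entries.
    """
    bindings = []
    for face in face_data:
        for body in body_data:
            dist = np.linalg.norm(face["feat"] - body["feat"])
            if dist <= threshold:  # same pedestrian: keep the box pair and features
                bindings.append({
                    "face_box": face["box"], "body_box": body["box"],
                    "face_feat": face["feat"], "body_feat": body["feat"],
                })
                break
    return bindings
```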
In one possible embodiment, the pedestrian tracking apparatus further includes: a first calculating module (not shown) for calculating the intersection-over-union (IoU) value of each face frame and each human body frame in each video frame image after each face feature vector and each human body feature vector of each video frame image are matched to determine the face-to-person binding result in each video frame image; a second determining module (not shown) for determining a face-body binding pair in each video frame image according to the IoU value, wherein the face-body binding pair includes a face frame and a human body frame corresponding to the face frame; a searching module (not shown) for verifying the face-to-person binding result of each video frame image by using the face-body binding pair in each video frame image, so as to find any mismatched face-to-person binding result among the face-to-person binding results of each video frame image; and a deleting module (not shown) for deleting the mismatched face-to-person binding result from the face-to-person binding results of each video frame image. A sketch of such an IoU check is given below.
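The IoU check could look roughly like the sketch below; the box format (x1, y1, x2, y2) and the verification threshold are illustrative assumptions, not values stated in this text.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def verify_binding(binding, iou_threshold=0.1):
    """Treat a binding whose face box barely overlaps its body box as a mismatch."""
    return iou(binding["face_box"], binding["body_box"]) >= iou_threshold
```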
In one possible embodiment, the two video frame images spaced by n frames include the ith video frame image and the (i + n)th video frame image, i being an integer greater than or equal to 1; the first determining module 430 is specifically configured to: calculating the distance between the human body feature vector in the face-to-person binding result of the ith video frame image and the human body feature vector in the face-to-person binding result of the (i + n)th video frame image; and, under the condition that the distance between the human body feature vector in the face-to-person binding result of the ith video frame image and the human body feature vector in the face-to-person binding result of the (i + n)th video frame image is less than or equal to a second preset distance, determining that the human body frame in the face-to-person binding result of the ith video frame image and the human body frame in the face-to-person binding result of the (i + n)th video frame image are human body frames of the same pedestrian.
In a possible embodiment, the extracting module 410 is specifically configured to: and extracting at least one piece of face data and at least one piece of human body data of each video frame image through the trained detection model.
In one possible embodiment, the pedestrian tracking apparatus further includes: a second calculation module (not shown) for calculating, by using a first loss function, the distance between each face feature vector and each human body feature vector of each training image in at least one training image used for training the detection model to be trained, so as to obtain a first loss function value; the first loss function is used for reducing the distance between the face feature vector and the human body feature vector of the same pedestrian in each training image and increasing the distance between the face feature vector and the human body feature vector of different pedestrians in each training image; a third calculation module (not shown) for calculating, by using a second loss function, the distance between the body feature vectors of the same pedestrian in two training images spaced by n frames, so as to obtain a second loss function value; the second loss function is used for reducing the distance between the body feature vectors of the same pedestrian in the two training images spaced by n frames and increasing the distance between the body feature vectors of different pedestrians in the two training images spaced by n frames; and an adjusting module (not shown) for adjusting parameters of the detection model to be trained according to the first loss function value and the second loss function value, so as to obtain the trained detection model. A hedged sketch of these two training objectives follows.
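Purely as a hedged sketch: the text does not give the exact form of the two losses, so a margin-based contrastive formulation is assumed below; the margin value, the 0/1 same-pedestrian mask, and the variable names are illustrative and not taken from this document.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, same_pedestrian, margin=1.0):
    """feat_a, feat_b: (N, D) tensors; same_pedestrian: (N,) float tensor of 0/1."""
    dist = F.pairwise_distance(feat_a, feat_b)
    pull = same_pedestrian * dist.pow(2)                           # pull same-ID pairs together
    push = (1.0 - same_pedestrian) * F.relu(margin - dist).pow(2)  # push different IDs apart
    return (pull + push).mean()

# First loss: face feature vector vs. body feature vector within one training image.
#   loss1 = contrastive_loss(face_feats, body_feats, same_id_mask)
# Second loss: body feature vector in frame i vs. body feature vector in frame i+n.
#   loss2 = contrastive_loss(body_feats_i, body_feats_i_plus_n, same_id_mask_across_frames)
# The detection model's parameters would then be updated on loss1 + loss2.
```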
In one possible embodiment, the two video frame images spaced by n frames are two video frame images adjacent to each other.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
Referring to fig. 5, fig. 5 is a block diagram illustrating an electronic device 500 according to an embodiment of the present disclosure. The electronic device 500 may include a processor 510, a communication interface 520, a memory 530, and at least one communication bus 540. The communication bus 540 is used for realizing direct connection communication of these components. The communication interface 520 in the embodiment of the present application is used for communicating signaling or data with other devices. The processor 510 may be an integrated circuit chip having signal processing capabilities. The processor 510 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. The general-purpose processor may be a microprocessor, or the processor 510 may be any conventional processor or the like.
The memory 530 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 530 stores computer-readable instructions which, when executed by the processor 510, enable the electronic device 500 to perform the steps of the above-described method embodiments.
The electronic device 500 may further include a memory controller, an input-output unit, an audio unit, and a display unit.
The memory 530, the memory controller, the processor 510, the peripheral interface, the input/output unit, the audio unit, and the display unit are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, these elements may be electrically coupled to each other via one or more communication buses 540. The processor 510 is used to execute executable modules stored in the memory 530. Also, the electronic device 500 is configured to perform the following method: extracting at least one face data and at least one body data of each of at least two video frame images in a video, wherein each face data in the at least one face data comprises a face frame and a face feature vector corresponding to a face framed by the face frame, and each body data in the at least one body data comprises a body frame and a body feature vector corresponding to a body framed by the body frame; matching each face feature vector and each human body feature vector of each video frame image to determine a face-to-person binding result of each video frame image, wherein the face-to-person binding result comprises the face frame and a human body frame corresponding to the face frame as well as the face feature vector and a human body feature vector corresponding to the face feature vector; and determining the same pedestrian in the two video frame images at the interval of n frames based on the human body feature vector in the face-to-person binding result of the two video frame images at the interval of n frames, wherein n is an integer greater than or equal to 0.
The input and output unit is used for a user to provide input data so as to realize interaction between the user and the server (or the local terminal). The input and output unit may be, but is not limited to, a mouse, a keyboard, and the like.
The audio unit provides an audio interface to the user, which may include one or more microphones, one or more speakers, and audio circuitry.
The display unit provides an interactive interface (e.g. a user interface) between the electronic device and a user, or is used for displaying image data for the user's reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. In the case of a touch display, the display can be a capacitive touch screen or a resistive touch screen supporting single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations generated simultaneously at one or more positions on the touch display, and the sensed touch operations are sent to the processor for calculation and processing.
It will be appreciated that the configuration shown in FIG. 5 is merely illustrative and that the electronic device 500 may include more or fewer components than shown in FIG. 5 or may have a different configuration than shown in FIG. 5. The components shown in fig. 5 may be implemented in hardware, software, or a combination thereof.
The present application also provides a storage medium having a computer program stored thereon, which, when executed by a processor, performs the method of the method embodiments.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A pedestrian tracking method, comprising:
extracting at least one face data and at least one body data of each of at least two video frame images in a video, wherein each face data in the at least one face data comprises a face frame and a face feature vector corresponding to a face framed by the face frame, and each body data in the at least one body data comprises a body frame and a body feature vector corresponding to a body framed by the body frame;
matching each face feature vector and each human body feature vector of each video frame image to determine a face-to-person binding result of each video frame image, wherein the face-to-person binding result comprises the face frame and a human body frame corresponding to the face frame as well as the face feature vector and a human body feature vector corresponding to the face feature vector;
and determining the same pedestrian in the two video frame images at the interval of n frames based on the human body feature vector in the face-to-person binding result of the two video frame images at the interval of n frames, wherein n is an integer greater than or equal to 0.
2. The pedestrian tracking method according to claim 1, wherein the matching each face feature vector and each body feature vector of each video frame image to determine a face-to-person binding result in each video frame image comprises:
calculating the distance between the current face feature vector and the current human body feature vector;
under the condition that the distance between the current face feature vector and the current human body feature vector is smaller than or equal to a first preset distance, determining that a current face frame corresponding to the current face feature vector and a current human body frame corresponding to the current human body feature vector are a face frame and a human body frame of the same pedestrian;
and acquiring the face-to-person binding result according to the current face frame, the current human body frame, the current face feature vector and the current human body feature vector.
3. The pedestrian tracking method according to claim 1 or 2, wherein after the matching of each face feature vector and each body feature vector of each video frame image to determine a face-to-person binding result in each video frame image, the pedestrian tracking method further comprises:
calculating the intersection ratio IOU value of each face frame and each human body frame in each video frame image;
determining a human face and human body binding pair in each video frame image according to the IOU value, wherein the human face and human body binding pair comprises the human face frame and a human body frame corresponding to the human face frame;
verifying the face-to-person binding result of each video frame image by using the human face and human body binding pair in each video frame image, so as to find a mismatched face-to-person binding result among the face-to-person binding results of each video frame image;
and deleting the mismatched face-to-person binding result from the face-to-person binding results of each video frame image.
4. The pedestrian tracking method according to claim 1, wherein the two video frame images spaced by n frames include an i-th video frame image and an i + n-th video frame image, i being an integer equal to or greater than 1;
the determining the same pedestrian in the two video frame images of the interval n frames based on the human body feature vector in the face-to-person binding result of the two video frame images of the interval n frames comprises:
calculating the distance between the human body characteristic vector in the face-to-person binding result of the ith video frame image and the human body characteristic vector in the face-to-person binding result of the (i + n) th video frame image;
and under the condition that the distance between the human body feature vector in the face-to-person binding result of the ith video frame image and the human body feature vector in the face-to-person binding result of the (i + n)th video frame image is less than or equal to a second preset distance, determining that the human body frame in the face-to-person binding result of the ith video frame image and the human body frame in the face-to-person binding result of the (i + n)th video frame image are human body frames of the same pedestrian.
5. The pedestrian tracking method according to any one of claims 1 to 4, wherein the extracting at least one face data and at least one body data of each of at least two video frame images in the video comprises:
and extracting at least one piece of face data and at least one piece of human body data of each video frame image through a trained detection model.
6. The pedestrian tracking method according to claim 5, wherein the training process of the detection model includes:
calculating, by using a first loss function, the distance between each face feature vector and each human body feature vector of each training image in at least one training image used for training a detection model to be trained, so as to obtain a first loss function value; wherein the first loss function is used for reducing the distance between the face feature vector and the human body feature vector of the same pedestrian in each training image and increasing the distance between the face feature vector and the human body feature vector of different pedestrians in each training image;
calculating, by using a second loss function, the distance between the body feature vectors of the same pedestrian in two training images spaced by n frames, so as to obtain a second loss function value; wherein the second loss function is used for reducing the distance between the body feature vectors of the same pedestrian in the two training images spaced by n frames and increasing the distance between the body feature vectors of different pedestrians in the two training images spaced by n frames;
and adjusting parameters of the detection model to be trained according to the first loss function value and the second loss function value so as to obtain the trained detection model.
7. The pedestrian tracking method according to claim 6, wherein the two video frame images spaced by n frames are two video frame images that are adjacent to each other.
8. A pedestrian tracking apparatus, comprising:
an extraction module, configured to extract at least one face data and at least one body data of each of at least two video frame images in a video, wherein each face data in the at least one face data comprises a face frame and a face feature vector corresponding to a face framed by the face frame, and each body data in the at least one body data comprises a body frame and a body feature vector corresponding to a body framed by the body frame;
a matching module, configured to match each face feature vector and each human body feature vector of each video frame image to determine a face-to-person binding result of each video frame image, where the face-to-person binding result includes the face frame and a human body frame corresponding to the face frame, and the face feature vector and a human body feature vector corresponding to the face feature vector;
the first determining module is used for determining the same pedestrian in the two video frame images of the interval n frames based on the human body feature vector in the face-person binding result of the two video frame images of the interval n frames, wherein n is an integer greater than or equal to 0.
9. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the pedestrian tracking method according to any one of claims 1 to 7.
10. An electronic device, characterized in that the electronic device comprises: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the pedestrian tracking method of any one of claims 1 to 7.
CN202110756866.5A 2021-07-05 2021-07-05 Pedestrian tracking method and device, storage medium and electronic equipment Pending CN113674313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756866.5A CN113674313A (en) 2021-07-05 2021-07-05 Pedestrian tracking method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113674313A true CN113674313A (en) 2021-11-19

Family

ID=78538594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756866.5A Pending CN113674313A (en) 2021-07-05 2021-07-05 Pedestrian tracking method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113674313A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination