CN111223127B - Human body joint point-based 2D video multi-person tracking method, system, medium and equipment - Google Patents

Human body joint point-based 2D video multi-person tracking method, system, medium and equipment

Info

Publication number
CN111223127B
Authority
CN
China
Prior art keywords
frame
similarity
similarity relation
joint point
identity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010045947.XA
Other languages
Chinese (zh)
Other versions
CN111223127A (en)
Inventor
金雪梅
彭琪钧
朱绘霖
曹伟
陈佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University
Priority to CN202010045947.XA
Publication of CN111223127A
Application granted
Publication of CN111223127B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body joint point-based 2D video multi-person tracking method, system, medium and equipment, wherein the method comprises the following steps: cutting an acquired video file to obtain a frame sequence set; extracting the human body joint point features of every person in each frame of the frame sequence set; training a similarity relation model with videos of known person trajectories, using the human body joint point features to calculate the similarity relations between the body features of the same person and between those of different persons in adjacent frames; selecting similarity relations as a training set, and learning with a neural network algorithm a similarity relation model for features belonging to the same person. During tracking, the identities of the persons in the first n frames are initialized; the similarity relation between each person of unknown identity in the current frame and the persons of known identity in the preceding frames is calculated frame by frame and input into the similarity relation model, which outputs the probability that the current person has the corresponding identity; the identity of the unknown person is thus determined and the trajectory information obtained. The invention has the advantage of robust tracking capability.

Description

Human body joint point-based 2D video multi-person tracking method, system, medium and equipment
Technical Field
The invention belongs to the field of computer vision and machine learning, and particularly relates to a human body joint point-based 2D video multi-person tracking method, system, medium and device.
Background
At present, multi-person tracking in 2D video mostly comprises two parts: person detection and data association. As the basis of data association, person detection usually extracts feature information of a person, such as color, shape, texture and position, and persons are then associated according to similarity relations to realize tracking. In some scenarios, such as sports matches, colors and textures are not unique, which greatly diminishes the ability to associate by color. In scenes where merging and collision are likely to occur, devices that rely on shape and position for tracking often suffer from person ID switching. Moreover, existing tracking methods can only track the trajectories of people in the video and cannot extract their behavior information.
Therefore, it is desirable to provide a device and method with robust tracking capability that can extract human behavior information while tracking trajectories in 2D video, and that can be applied to the many fields requiring behavior recognition.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a human body joint point-based 2D video multi-person tracking method and system that have robust tracking capability and can extract the behavior information of persons, preparing for further human behavior recognition.
The purpose of the invention is realized by the following technical scheme: a human body joint point-based 2D video multi-person tracking method comprises the following steps:
cutting the acquired video file to obtain a frame sequence set;
identifying and extracting human body joint point characteristics of all people in each frame of the frame sequence set;
performing similarity relation model training with videos of known person trajectories, and calculating, from the human body joint point features, the similarity relations of the body features of the same person in adjacent frames and the similarity relations of the body features of different persons in adjacent frames; selecting similarity relations as a training set, and learning, with a neural network algorithm, a similarity relation model belonging to the same person identity;
during tracking, the identities of the persons in the first n frames are initialized; the similarity relation between the current person of unknown identity and the persons of known identity in the preceding frames is calculated frame by frame and input into the similarity relation model, which outputs the probability that the current person has the corresponding identity; the identity of the unknown person is thereby determined and the joint point trajectory information of the person is obtained.
Preferably, the similarity relation comprises 3 parameters, namely the Pearson correlation coefficient $P_{corr}$, the mean distance between feature points $D_{mean}$, and the standard deviation of the distance between feature points $D_{std}$, calculated as follows:

let the similarity between the feature $k_i^m$ of the m-th person in the i-th frame and the feature $k_j^n$ of the n-th person in the j-th frame be $S(k_i^m, k_j^n)$;

select the joint point pixel position coordinates coexisting in the two arrays $k_i^m$ and $k_j^n$ to obtain the coexisting joint point pixel position features $\hat{k}_i^m$ and $\hat{k}_j^n$;

calculate the Pearson correlation coefficient of $\hat{k}_i^m$ and $\hat{k}_j^n$:

$$P_{corr} = \frac{\operatorname{cov}(\hat{k}_i^m, \hat{k}_j^n)}{\sigma_{\hat{k}_i^m}\,\sigma_{\hat{k}_j^n}},$$

wherein $\operatorname{cov}(\hat{k}_i^m, \hat{k}_j^n)$ represents the covariance between $\hat{k}_i^m$ and $\hat{k}_j^n$, and $\sigma_{\hat{k}_i^m}$ and $\sigma_{\hat{k}_j^n}$ respectively represent the standard deviations of $\hat{k}_i^m$ and $\hat{k}_j^n$;

calculate the mean distance between $\hat{k}_i^m$ and $\hat{k}_j^n$:

$$D_{mean} = \frac{1}{cn}\sum_{c=1}^{cn} d_c,$$

wherein $cn$ is the number of coexisting joint points and $d_c$ represents the absolute value of the pixel-coordinate distance of the corresponding joint points;

calculate the standard deviation of the distance between $\hat{k}_i^m$ and $\hat{k}_j^n$:

$$D_{std} = \sqrt{\frac{1}{cn}\sum_{c=1}^{cn}\left(d_c - \mu_{d_c}\right)^2},$$

wherein $\mu_{d_c}$ represents the mean value of $d_c$.
Preferably, the similarity relation model is constructed as follows:

take from the set K of human body joint points the joint point coordinate set $\{k_i^j \mid i = 1,\dots,n;\ j = 1,\dots,M_i\}$ of n consecutive frames with a known tracking strategy (i.e., the person ID in every frame is known);

take the similarity parameters between the two sets of joint point coordinates whose object ID is R in two adjacent frames as the positive training set $S_p$, with the sample label set to 1; take the similarity parameters between an object with ID R and objects whose ID is not R in two adjacent frames as the negative training set, with the label set to 0;

set the number of layers $l$ of the neural network and the number of neurons $n_l$ per layer, and input the training set into the configured neural network for repeated iterative training to obtain the model ω.
A human body joint point-based 2D video multi-person tracking system comprises:
the video cutting unit is used for cutting the acquired video file to obtain a frame sequence set;
the joint point feature extraction unit is used for identifying and extracting a human body joint point feature set of all people in each frame of the frame sequence set;
the similarity calculation unit is used for performing similarity relation model training with videos of known person trajectories, and for calculating, from the human body joint point features, the similarity relations of the body features of the same person in adjacent frames and the similarity relations of the body features of different persons in adjacent frames;
the model training unit is used for selecting similarity relations as a training set and learning, with a neural network algorithm, a similarity relation model belonging to the same person identity;
and the tracking unit is used for initializing the identities of the persons in the first n frames during tracking, calculating frame by frame the similarity relation between the current person of unknown identity and the persons of known identity in the preceding frames, inputting the similarity relations into the similarity relation model, outputting the probability that the current person has the corresponding identity, determining the identity of the unknown person, and obtaining the joint point trajectory information of the person.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention uses the human body joint point characteristics to calculate the similarity relation between two groups of characteristics, and uses the neural network to learn the relation model between the similarity and the person identity ID, and then inputs the similarity between the known object and the observed object between frames into the model, calculates the tracking probability, and confirms the identity ID of the observed object. The joint point information has the advantages of clear and simple characteristics and low possibility of being influenced by appearance factors, the system uses multi-parameter comparison similarity, the tracking precision is higher, the system is more stable, and even if character fusion and short-time shielding conditions occur, the character ID can be identified by comparing the similarity of adjacent frames, so that the track is tracked.
Drawings
FIG. 1 is a block diagram of the apparatus according to the present embodiment.
FIG. 2 is a flowchart of the method of the present embodiment.
Fig. 3 is a flow chart of feature similarity calculation in the method of the present embodiment.
Fig. 4 is a flowchart of a training set acquisition method in the method of the present embodiment.
FIG. 5 is a flowchart of model training in the method of the present embodiment.
FIG. 6 is a flowchart of the tracking process in the method of the present embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
As shown in fig. 2, the method for tracking multiple persons in 2D video based on human body joint points of the present embodiment includes the following steps:
(1) Collect the video;
(2) Cut the collected video into a frame sequence according to the frame rate to obtain the frame sequence set $P = \{p^{(i)} \mid i = 1, 2, \dots, N\}$, where $N$ is the total number of frames of the video and $p^{(i)}$ is the i-th image in the frame sequence set $P$.
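As a concrete illustration of step (2), the following is a minimal sketch using OpenCV's `cv2.VideoCapture`; the patent does not name a specific decoding tool, so the choice of OpenCV is an assumption:

```python
import cv2

def cut_video_to_frames(video_path):
    """Cut a video file into the frame sequence set P = {p^(i) | i = 1..N}."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()  # frames come out at the video's native frame rate
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames  # len(frames) == N
```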
(3) The coordinates of the human joint points in consecutive video frames can be obtained with a tool capable of extracting human joint points from 2D or 3D video, such as OpenPose, an open-source tool for non-commercial identification of human joint points in 2D video. This step identifies the human body joint points of all persons in the frame sequence set P and extracts the joint point position coordinates, obtaining the set of human body joint points of all persons in the video, $K = \{k_i^j \mid i = 1, \dots, N;\ j = 1, \dots, M_i\}$, where $M_i$ is the total number of persons identified in the i-th frame image and $k_i^j$ denotes the joint point coordinates of the j-th person in the i-th frame image.
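A sketch of how step (3) might assemble the set K, assuming a hypothetical `detect_joints(frame)` wrapper around a pose estimator such as OpenPose (the wrapper name and its output convention are assumptions, not the OpenPose API):

```python
import numpy as np

def build_joint_set(frames, detect_joints):
    """Build K: for each frame i, a list of M_i per-person joint arrays k_i^j.

    detect_joints(frame) is assumed to return one 1x134 array per detected
    person (67 joint points x 2 pixel coordinates), with np.nan marking
    joints that were not detected in that frame.
    """
    K = []
    for frame in frames:
        people = detect_joints(frame)  # M_i persons found in this frame
        K.append([np.asarray(p, dtype=float).ravel() for p in people])
    return K
```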
(4) Compute the similarity $S(k_i^m, k_j^n)$ between the feature $k_i^m$ of the m-th person in the i-th frame and the feature $k_j^n$ of the n-th person in the j-th frame; the function $S(\cdot, \cdot)$ measures the similarity relation between $k_i^m$ and $k_j^n$. The similarity relation S comprises 3 parameters, namely the Pearson correlation coefficient $P_{corr}$, the mean distance between feature points $D_{mean}$, and the standard deviation of the distance between feature points $D_{std}$. Fig. 3 is a flowchart of the similarity calculation unit; the details are as follows:
(4-1) The features $k_i^m$ and $k_j^n$ are 1 × 134 arrays containing the 2-dimensional pixel position coordinates of 67 human body joint points of the head, trunk, limbs and hands. Since it is uncertain which joint point coordinates are present in an array k, in order to guarantee the accuracy of the similarity function, this embodiment selects the joint point pixel position coordinates coexisting in the two arrays $k_i^m$ and $k_j^n$, obtaining the coexisting joint point pixel position features $\hat{k}_i^m$ and $\hat{k}_j^n$.
(4-2) Calculate the Pearson correlation coefficient of $\hat{k}_i^m$ and $\hat{k}_j^n$:

$$P_{corr} = \frac{\operatorname{cov}(\hat{k}_i^m, \hat{k}_j^n)}{\sigma_{\hat{k}_i^m}\,\sigma_{\hat{k}_j^n}}.$$

(4-3) Calculate the mean distance $D_{mean}$ between $\hat{k}_i^m$ and $\hat{k}_j^n$; $D_{mean}$ indicates how far apart, on average, the pixel coordinate positions of the joint points shared by the m-th person in the i-th frame and the n-th person in the j-th frame are:

$$D_{mean} = \frac{1}{cn}\sum_{c=1}^{cn} d_c,$$

where $cn$ is the number of common joint points and $d_c$ denotes the absolute value of the pixel-coordinate distance of the corresponding joint point.

(4-4) Calculate the standard deviation of the distance between $\hat{k}_i^m$ and $\hat{k}_j^n$:

$$D_{std} = \sqrt{\frac{1}{cn}\sum_{c=1}^{cn}\left(d_c - \mu_{d_c}\right)^2},$$

where $\mu_{d_c}$ denotes the mean value of $d_c$.
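The three similarity parameters of steps (4-1) to (4-4) reduce to a few lines of numpy; the sketch below assumes missing joints are encoded as `np.nan` and reads $d_c$ as the Euclidean pixel distance between corresponding joints:

```python
import numpy as np

def similarity(k_m, k_n):
    """Compute S = (P_corr, D_mean, D_std) for two 1x134 joint arrays."""
    # (4-1) keep only coordinates that coexist in both arrays
    mask = ~np.isnan(k_m) & ~np.isnan(k_n)
    km, kn = k_m[mask], k_n[mask]
    # (4-2) Pearson correlation coefficient of the coexisting coordinates
    p_corr = np.corrcoef(km, kn)[0, 1]
    # d_c over the cn common joints, assuming interleaved (x, y) layout
    dc = np.hypot(km[0::2] - kn[0::2], km[1::2] - kn[1::2])
    # (4-3) mean distance and (4-4) standard deviation of the distance
    return p_corr, dc.mean(), dc.std()
```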
(5) Train the similarity relation model ω with videos of known person trajectories; the specific structure is shown in fig. 5:

(5-1) Take from the set K of human joint points the joint point coordinate set $\{k_i^j \mid i = 1,\dots,n;\ j = 1,\dots,M_i\}$ of n consecutive frames with a known tracking strategy.

(5-2) Referring to fig. 4, take the similarity parameters between the two sets of joint point coordinates whose object ID is R in two adjacent frames as the positive training set $S_p$, with the sample label set to 1; take the similarity parameters between an object with ID R and objects whose ID is not R in two adjacent frames as the negative training set, with the label set to 0.

(5-3) Set the number of layers $l$ of the neural network and the number of neurons $n_l$ per layer, and input the training set into the configured neural network for repeated iterative training to obtain the model ω.
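A sketch of steps (5-1) to (5-3); the patent does not specify a network framework, so scikit-learn's `MLPClassifier` stands in here, and `ids[i][m]` (the known ID of person m in frame i) is an assumed bookkeeping structure:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def build_training_set(K, ids):
    """Label 1 for same-ID pairs in adjacent frames, 0 for different-ID pairs."""
    X, y = [], []
    for i in range(len(K) - 1):
        for m, k_m in enumerate(K[i]):
            for n, k_n in enumerate(K[i + 1]):
                X.append(similarity(k_m, k_n))
                y.append(1 if ids[i][m] == ids[i + 1][n] else 0)
    return np.array(X), np.array(y)

# Model omega: the layer count l and neurons n_l per layer are free choices.
omega = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000)
# X, y = build_training_set(K, ids); omega.fit(X, y)
```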
(6) The trajectory is tracked using the similarity relation model ω.
As shown in fig. 6, assume that frames 1 to i have a known strategy, i.e., the person ID in each frame is known. Taking the person with ID R as an example, the specific tracking steps are as follows:
(6-1) Calculate the similarity relation S between the joint point features of the person with ID R in the i-th frame and the joint point features of every person in the (i+1)-th frame.
(6-2) Input each similarity relation into the model ω, which outputs the probability $P_{pre}$ that the person's ID is R.
(6-3) Having computed, for every person identified in the (i+1)-th frame, the probability $P_{pre}$ of having ID R, select the largest of these probabilities and judge whether it is greater than or equal to a preset threshold G. If so, continue to detect the person with ID R in the (i+2)-th frame, and so on. If the probability $P_{pre}$ is smaller than the preset threshold G, judge that no person with ID R exists in the (i+1)-th frame, compute the similarity probabilities of ID R between the i-th frame and the (i+2)-th frame, and so on; if there is still no person with ID R by the (i+10)-th frame, judge that R has left the video and terminate the tracking of R.
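Putting steps (6-1) to (6-3) together, the loop below follows one identity R frame by frame, using the `similarity` helper and `omega` classifier sketched above; `max_gap = 10` mirrors the ten-frame rule for deciding that R has left the video:

```python
import numpy as np

def track_id(K, omega, ref_feat, start, threshold, max_gap=10):
    """Follow one identity R from frame `start`; return (frame, person) matches.

    ref_feat holds R's joint features from the last frame where R was found.
    """
    track, gap, i = [], 0, start + 1
    while i < len(K) and gap < max_gap:
        if K[i]:
            feats = np.array([similarity(ref_feat, k) for k in K[i]])
            probs = omega.predict_proba(feats)[:, 1]  # P_pre for each candidate
            best = int(np.argmax(probs))
            if probs[best] >= threshold:              # (6-3) accept the match
                track.append((i, best))
                ref_feat, gap = K[i][best], 0
                i += 1
                continue
        gap += 1  # no person with ID R here; keep comparing against later frames
        i += 1
    return track
```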
Example 2
As shown in fig. 1, the 2D video multi-person tracking system based on human body joints of the present embodiment can be divided into the following modules according to functions:
the video cutting unit is used for cutting the acquired video file to obtain a frame sequence set;
the joint point feature extraction unit is used for identifying and extracting a human body joint point feature set of all people in each frame of the frame sequence set;
the similarity calculation unit is used for performing similarity relation model training with videos of known person trajectories, and for calculating, from the human body joint point features, the similarity relations of the body features of the same person in adjacent frames and the similarity relations of the body features of different persons in adjacent frames;
the model training unit is used for selecting similarity relations as a training set and learning, with a neural network algorithm, a similarity relation model belonging to the same person identity;
and the tracking unit is used for initializing the identities of the persons in the first n frames during tracking, calculating frame by frame the similarity relation between the current person of unknown identity and the persons of known identity in the preceding frames, inputting the similarity relations into the similarity relation model, outputting the probability that the current person has the corresponding identity, determining the identity of the unknown person, and obtaining the joint point trajectory information of the person. In this module, the tracking form may vary: for example, the similarity relations between the features of a person of unknown ID and a person of known ID in the preceding 10 frames may be calculated to give the probabilities that the unknown person is that known person, and either the maximum probability or the mean of the maximum probabilities may be selected.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the system, the apparatus, the device, or the unit described above may refer to the corresponding process in the foregoing method embodiment 1, and details are not described herein again.
Those of ordinary skill in the art will appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions in actual implementation, or units with the same function may be grouped into one unit, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A 2D video multi-person tracking method based on human body joint points, characterized by comprising the following steps:
cutting the acquired video file to obtain a frame sequence set;
identifying and extracting human body joint point characteristics of all people in each frame of the frame sequence set;
performing similarity relation model training with videos of known person trajectories, and calculating, from the human body joint point features, the similarity relations of the body features of the same person in adjacent frames and the similarity relations of the body features of different persons in adjacent frames; selecting similarity relations as a training set, and learning, with a neural network algorithm, a similarity relation model belonging to the same person identity; the similarity relation comprises 3 parameters, namely the Pearson correlation coefficient $P_{corr}$, the mean distance between feature points $D_{mean}$, and the standard deviation of the distance between feature points $D_{std}$, calculated as follows:

letting the similarity between the feature $k_i^m$ of the m-th person in the i-th frame and the feature $k_j^n$ of the n-th person in the j-th frame be $S(k_i^m, k_j^n)$;

selecting the joint point pixel position coordinates coexisting in the two arrays $k_i^m$ and $k_j^n$ to obtain the coexisting joint point pixel position features $\hat{k}_i^m$ and $\hat{k}_j^n$;

computing the Pearson correlation coefficient of $\hat{k}_i^m$ and $\hat{k}_j^n$, $P_{corr} = \operatorname{cov}(\hat{k}_i^m, \hat{k}_j^n) / (\sigma_{\hat{k}_i^m} \sigma_{\hat{k}_j^n})$, wherein $\operatorname{cov}(\hat{k}_i^m, \hat{k}_j^n)$ represents the covariance between $\hat{k}_i^m$ and $\hat{k}_j^n$, and $\sigma_{\hat{k}_i^m}$ and $\sigma_{\hat{k}_j^n}$ respectively represent the standard deviations of $\hat{k}_i^m$ and $\hat{k}_j^n$;

computing the mean distance between $\hat{k}_i^m$ and $\hat{k}_j^n$, $D_{mean} = \frac{1}{cn} \sum_{c=1}^{cn} d_c$, wherein $cn$ is the number of coexisting joint points and $d_c$ represents the absolute value of the pixel-coordinate distance of the corresponding joint points;

computing the standard deviation of the distance between $\hat{k}_i^m$ and $\hat{k}_j^n$, $D_{std} = \sqrt{\frac{1}{cn} \sum_{c=1}^{cn} (d_c - \mu_{d_c})^2}$, wherein $\mu_{d_c}$ represents the mean value of $d_c$;
during tracking, initializing the identities of the persons in the first n frames; calculating frame by frame the similarity relation between the current person of unknown identity and the persons of known identity in the preceding frames; inputting the similarity relations into the similarity relation model, which outputs the probability that the current person has the corresponding identity; and determining the identity of the unknown person to obtain the joint point trajectory information of the person.
2. The human body joint point-based 2D video multi-person tracking method of claim 1, wherein the similarity relation model is constructed as follows:

taking from the set K of human body joint points the joint point coordinate set $\{k_i^j \mid i = 1,\dots,n;\ j = 1,\dots,M_i\}$ of n consecutive frames with a known tracking strategy;

taking the similarity parameters between the two sets of joint point coordinates whose object ID is R in two adjacent frames as the positive training set $S_p$, with the sample label set to 1; taking the similarity parameters between an object with ID R and objects whose ID is not R in two adjacent frames as the negative training set, with the label set to 0;

setting the number of layers $l$ of the neural network and the number of neurons $n_l$ per layer, and inputting the training set into the configured neural network for repeated iterative training to obtain the model ω.
3. A human body joint point-based 2D video multi-person tracking system, characterized by comprising:
the video cutting unit is used for cutting the acquired video file to obtain a frame sequence set;
the joint point feature extraction unit is used for identifying and extracting a human body joint point feature set of all people in each frame of the frame sequence set;
the similarity calculation unit is used for performing similarity relation model training with videos of known person trajectories, and for calculating, from the human body joint point features, the similarity relations of the body features of the same person in adjacent frames and the similarity relations of the body features of different persons in adjacent frames;
the model training unit is used for selecting similarity relations as a training set and learning, with a neural network algorithm, a similarity relation model belonging to the same person identity;
and the tracking unit is used for initializing the identities of the persons in the first n frames during tracking, calculating frame by frame the similarity relation between the current person of unknown identity and the persons of known identity in the preceding frames, inputting the similarity relations into the similarity relation model, outputting the probability that the current person has the corresponding identity, determining the identity of the unknown person, and obtaining the joint point trajectory information of the person.
4. A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the human body joint point-based 2D video multi-person tracking method according to any one of claims 1-2.
5. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the human body joint point-based 2D video multi-person tracking method according to any one of claims 1-2.
CN202010045947.XA 2020-01-16 2020-01-16 Human body joint point-based 2D video multi-person tracking method, system, medium and equipment Active CN111223127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010045947.XA CN111223127B (en) 2020-01-16 2020-01-16 Human body joint point-based 2D video multi-person tracking method, system, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010045947.XA CN111223127B (en) 2020-01-16 2020-01-16 Human body joint point-based 2D video multi-person tracking method, system, medium and equipment

Publications (2)

Publication Number Publication Date
CN111223127A CN111223127A (en) 2020-06-02
CN111223127B true CN111223127B (en) 2023-04-07

Family

ID=70826006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010045947.XA Active CN111223127B (en) 2020-01-16 2020-01-16 Human body joint point-based 2D video multi-person tracking method, system, medium and equipment

Country Status (1)

Country Link
CN (1) CN111223127B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106559511A (en) * 2016-10-18 2017-04-05 上海优刻得信息科技有限公司 Cloud system, high in the clouds public service system and the exchanging visit method for cloud system
CN107392097A (en) * 2017-06-15 2017-11-24 中山大学 A 3D human body joint point localization method based on monocular color video
CN109086706A (en) * 2018-07-24 2018-12-25 西北工业大学 Applied to the action identification method based on segmentation manikin in man-machine collaboration
CN109919977A (en) * 2019-02-26 2019-06-21 鹍骐科技(北京)股份有限公司 A kind of video motion personage tracking and personal identification method based on temporal characteristics


Also Published As

Publication number Publication date
CN111223127A (en) 2020-06-02

Similar Documents

Publication Publication Date Title
Nadeem et al. Human actions tracking and recognition based on body parts detection via Artificial neural network
CN110472554B (en) Table tennis action recognition method and system based on attitude segmentation and key point features
Miao et al. Identifying visible parts via pose estimation for occluded person re-identification
CN110135375B (en) Multi-person attitude estimation method based on global information integration
CN108052896B (en) Human body behavior identification method based on convolutional neural network and support vector machine
CN111339990B (en) Face recognition system and method based on dynamic update of face features
CN108447080B (en) Target tracking method, system and storage medium based on hierarchical data association and convolutional neural network
CN109086706B (en) Motion recognition method based on segmentation human body model applied to human-computer cooperation
Xu et al. Online dynamic gesture recognition for human robot interaction
Zeng et al. Silhouette-based gait recognition via deterministic learning
CN102682302B (en) Human body posture identification method based on multi-characteristic fusion of key frame
CN108256421A (en) A kind of dynamic gesture sequence real-time identification method, system and device
CN110674785A (en) Multi-person posture analysis method based on human body key point tracking
KR100969298B1 (en) Method For Social Network Analysis Based On Face Recognition In An Image or Image Sequences
CN113326835B (en) Action detection method and device, terminal equipment and storage medium
CN109934195A (en) A kind of anti-spoofing three-dimensional face identification method based on information fusion
CN109902565B (en) Multi-feature fusion human behavior recognition method
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
KR20130013122A (en) Apparatus and method for detecting object pose
CN114067358A (en) Human body posture recognition method and system based on key point detection technology
CN112989889B (en) Gait recognition method based on gesture guidance
CN111914643A (en) Human body action recognition method based on skeleton key point detection
CN111126515A (en) Model training method based on artificial intelligence and related device
Arif et al. Human pose estimation and object interaction for sports behaviour
Batool et al. Telemonitoring of daily activities based on multi-sensors data fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant