CN112800825B - Key point-based association method, system and medium

Key point-based association method, system and medium

Info

Publication number
CN112800825B
Authority
CN
China
Prior art keywords
pair
head
body frame
human body
association
Prior art date
Legal status
Active
Application number
CN202011451402.5A
Other languages
Chinese (zh)
Other versions
CN112800825A (en)
Inventor
陈长升
齐竟雄
何翔
Current Assignee
Yuncong Technology Group Co Ltd
Original Assignee
Yuncong Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Yuncong Technology Group Co Ltd
Priority to CN202011451402.5A
Publication of CN112800825A
Application granted
Publication of CN112800825B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/60: Analysis of geometric attributes
    • G06T7/62: Analysis of geometric attributes of area, perimeter, diameter or volume
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image processing, and in particular to a key point-based association method, system, and medium. It addresses the technical problem of how to associate, accurately and quickly, the different parts detected in a complex scene, so that each correctly associated pair of parts belongs to the same target. Aimed chiefly at pedestrian detection and recognition, the method first filters out head-body frame pairs that cannot possibly match according to an association precondition, then computes an association loss degree for each remaining head-body frame pair from the body key points obtained when the body frames were detected, and finally performs one-to-one association with a globally optimal matching algorithm; each correctly associated pair is taken to belong to the same pedestrian target. This scheme reduces the computational load of the algorithm, improves pairing accuracy, and supports accurate and reliable head-body association in diverse dense scenes.

Description

Key point-based association method, system and medium
Technical Field
The present invention relates to the field of image processing, and in particular to a key point-based association method, system, and medium.
Background
With growing public attention to safety and the rapid increase in the number and coverage of surveillance cameras, intelligent security video surveillance plays an increasingly important role. In the research field of high-quality, high-performance full structuring of security video, to keep the detector model simple and efficient, head-body association for pedestrian targets is usually performed only on the target frames, such as the head frames and body frames, detected in each video frame, without designing a dedicated model structure for it. Therefore, after the detector returns its detections, an additional module is needed to associate a first part and a second part of a target, such as a human head and a human body.
In this research field, taking the detection (recognition) of a target such as a pedestrian as an example, current practice is to perform association by computing the intersection area of a head frame and a body frame. In scenes with dense pedestrians, however, this approach performs very poorly. For example, for two pedestrians occluding each other front-to-back, the detector may return a single body frame and two head frames, and computing intersection areas alone cannot guarantee a correct association. Accurate and reliable solutions for head-body frame association in dense scenes are therefore urgently needed in the full structuring of high-quality, high-performance security video.
Disclosure of Invention
In view of the above shortcomings, the present invention provides a key point-based association method, system, and medium, so as to solve, or partially solve, the technical problem of how to associate the different components of a target more accurately and reliably in dense scenes, using the key points of those components, in order to identify the target.
In a first aspect, the present invention provides a method for association based on key points, including: forming pairs of different detected components of a plurality of targets, and calculating the association loss degree of each pair according to key points of the different components, wherein the association loss degree is used for expressing the degree of association error of two components in each pair; and determining whether the pair belongs to the same target or not according to the association loss degree.
Before calculating the association loss degree of each pair according to the key points of the different components, the method further comprises: judging whether the pair satisfies a preset association precondition; and if so, calculating the association loss degree of the pair according to the key points of the different components.
Wherein the different components comprise a first part and a second part, and forming pairs of the detected components of the multiple targets specifically comprises pairing each first part with each second part. The association precondition is one of the following: in each pair, the ratio of the extent of the first part contained by the second part to the extent of the first part is greater than or equal to a preset first threshold in both the horizontal and vertical directions; or, in each pair, the ratio of the distance between a vertex of the first part and the same vertex of the second part to the width or height of the first part is greater than, or smaller than, a preset second threshold; or, in each pair, the ratio of the area of the intersection of the first part and the second part to the area of the first part is greater than or equal to a preset third threshold.
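The first variant of the association precondition can be sketched in code. This is an illustrative implementation, not the patent's own: the `(x1, y1, x2, y2)` box format and the function names are assumptions, and the 75% threshold is the value the patent gives as an example.

```python
# Sketch of the first association precondition: the first part (head frame)
# must be contained by the second part (body frame) by at least a threshold
# fraction (75% in the patent's example) in BOTH the horizontal and vertical
# directions. Boxes are (x1, y1, x2, y2); format and names are assumptions.

def containment_ratios(head, body):
    """Fraction of the head box's width and height that lies inside the body box."""
    hx1, hy1, hx2, hy2 = head
    bx1, by1, bx2, by2 = body
    overlap_w = max(0.0, min(hx2, bx2) - max(hx1, bx1))
    overlap_h = max(0.0, min(hy2, by2) - max(hy1, by1))
    return overlap_w / (hx2 - hx1), overlap_h / (hy2 - hy1)

def satisfies_precondition(head, body, threshold=0.75):
    rx, ry = containment_ratios(head, body)
    return rx >= threshold and ry >= threshold
```

A pair that fails this check is discarded before any key point computation, which is where the claimed reduction in computation comes from.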
The method further comprises: when the different components are detected by a deep learning-based estimation model, obtaining the key points of the different components at the same time. Calculating the association loss degree of a pair according to the key points of the different components specifically comprises: computing, from the key points of the second part in each pair, the center point of a line connecting those key points, and determining whether this center point lies within the first part of the corresponding pair; if so, computing the distance between the center point of the connecting line and the center point of the first part and taking this distance as the association loss degree of the pair; or, if so, computing the relative distance or angle between a key point of the second part and the center point of the first part as the association loss degree of the pair; or, if so, inputting a key point of the second part and the center point of the first part into a deep learning model and taking the computed result as the association loss degree of the pair.
Wherein a pair does not belong to the same target if it does not satisfy the preset association precondition, or if the center point of the connecting line of the pair is not within the first part of the pair.
Determining whether a pair belongs to the same target according to the association loss degree specifically comprises: performing globally optimal matching over all pairs according to the association loss degree of each pair to obtain the combination of matched pairs whose total association loss degree is minimal, and determining that each matched pair in that combination corresponds to one target.
Wherein the target is a pedestrian; the different components include: a human head frame and a human body frame from a frame of picture; the pairing is a human head-human body frame pair formed by any human head frame and any human body frame in pairs; wherein the first threshold is 75%; wherein the vertex position is the top left vertex of each of the human head frame and the human body frame; wherein a ratio between an area of the intersection and an area of the first portion is: the ratio of the area of the intersection of the human head frame and the human body frame to the area of the human head frame.
Wherein the key points comprise the head-top key point and the neck key point of the human body frame, obtained when the human body frame is detected by a deep learning-based posture estimation model. Calculating the association loss degree of a pair specifically comprises, when the center point of the line connecting the head-top key point and the neck key point of the human body frame in a head-body frame pair lies within the human head frame of that pair: calculating the distance between the center point of the connecting line and the center point of the human head frame as the association loss degree of the head-body frame pair; or calculating the relative distance or angle between the head-top key point or neck key point and the center point of the human head frame of the pair as the association loss degree; or inputting the head-top key point and/or neck key point together with the center point of the human head frame of the pair into a deep learning model, which computes and outputs the association loss degree of the pair.
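The first loss variant above can be sketched as follows. This is a hedged illustration under assumed data formats (boxes as `(x1, y1, x2, y2)`, key points as `(x, y)` tuples); the patent does not prescribe a representation.

```python
import math

# Sketch of the first association-loss variant: the midpoint of the segment
# joining the head-top and neck key points of the body frame must lie inside
# the paired head frame; if so, the loss is that midpoint's distance to the
# head frame's center, otherwise the pair is unmatchable (infinite loss).

def association_loss(head_box, head_top, neck):
    hx1, hy1, hx2, hy2 = head_box
    # Midpoint of the head-top/neck segment
    mx = (head_top[0] + neck[0]) / 2.0
    my = (head_top[1] + neck[1]) / 2.0
    if not (hx1 <= mx <= hx2 and hy1 <= my <= hy2):
        return math.inf  # midpoint outside the head frame: not the same person
    cx, cy = (hx1 + hx2) / 2.0, (hy1 + hy2) / 2.0
    return math.hypot(mx - cx, my - cy)
```

A smaller value means the body frame's head region sits closer to the head frame's center, i.e. the pair is more likely to be the same pedestrian.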
In a second aspect, the present invention provides a key point-based association system, including: the calculation module is used for calculating the association loss degree of each pair formed by different components according to the detected key points of the different components of the targets; and the matching module is used for determining whether the pair belongs to the same target or not according to the association loss degree.
Wherein, still include: the judging module is used for judging whether each pair formed by the different components meets a preset association precondition; if so, the pairing is input to a computing module.
Wherein the different components comprise a first part and a second part; each pair formed by the different components specifically comprises each first part paired with each second part of the detected multiple targets. The association precondition is one of the following: in each pair, the ratio of the extent of the first part contained by the second part to the extent of the first part is greater than or equal to a preset first threshold in both the horizontal and vertical directions; or, in each pair, the ratio of the distance between a vertex of the first part and the same vertex of the second part to the width or height of the first part is greater than, or smaller than, a preset second threshold; or, in each pair, the ratio of the area of the intersection of the first part and the second part to the area of the first part is greater than or equal to a preset third threshold.
Wherein, still include: when the different components are detected through an estimation model based on deep learning, simultaneously obtaining key points of the different components; the calculation module specifically further performs the following operations: calculating the central point of a connecting line based on the key points according to the key points of the second part in each pair, and determining whether the central point of the connecting line is in the first part of the corresponding pair; if so, calculating the distance between the center point of the connecting line and the center point of the first part, and taking the distance as the association loss degree of the pair; or, if yes, calculating the relative distance or angle between the key point of the second part and the central point of the first part as the association loss degree of the pair; or, if yes, inputting the key point of the second part and the central point of the first part into a deep learning model, and taking the calculation result as the association loss degree of the pair.
The judging module specifically performs the following operation: if a pair does not satisfy the preset association precondition, or if the center point of the connecting line of the pair is not within the first part of the pair, the pair does not belong to the same target.
The matching module specifically executes the following operations: performing global optimal matching on all pairs according to the association loss degree of each pair to obtain a matching pair combination with the minimum association loss degree sum; determining that each matching pair in the combination of matching pairs is a target.
Wherein the target is a pedestrian; the different components include: a human head frame and a human body frame from a frame of picture; the pairing is a human head-human body frame pair formed by any human head frame and any human body frame in pairs; wherein the first threshold is 75%; wherein the vertex position is the top left vertex of each of the human head frame and the human body frame; wherein a ratio between an area of the intersection and an area of the first portion is: the ratio of the area of the intersection of the human head frame and the human body frame to the area of the human head frame.
Wherein the key points comprise the head-top key point and the neck key point of the human body frame, obtained when the human body frame is detected by a deep learning-based posture estimation model. The calculation module specifically performs the following operations when the center point of the line connecting the head-top key point and the neck key point of the human body frame in a head-body frame pair lies within the human head frame of that pair: calculating the distance between the center point of the connecting line and the center point of the human head frame as the association loss degree of the head-body frame pair; or calculating the relative distance or angle between the head-top key point or neck key point and the center point of the human head frame of the pair as the association loss degree; or inputting the head-top key point and/or neck key point together with the center point of the human head frame of the pair into a deep learning model, which computes and outputs the association loss degree of the pair.
In a third aspect, a video full-structured system is provided, which includes: a pre-trained detector model, a subsequent video full-structured unit, and any of the keypoint-based correlation systems of the second aspect; the detector model detects a first part and a second part from a frame of picture, provides the first part and the second part for the association system based on the key point for association, and outputs the correctly associated pairs to the subsequent video full-structured unit for structured processing.
In a fourth aspect, a security system is provided, comprising: a network hard disk video recorder and/or a network camera, and the video full-structured system of the third aspect; the network hard disk video recorder and/or the network camera provide the captured pictures to the video full-structured system.
In a fifth aspect, there is provided a processing apparatus comprising a processor and a memory, the memory being adapted to store a plurality of program codes, the program codes being adapted to be loaded and run by the processor to perform the steps of the keypoint-based correlation method of any of the preceding first aspects.
In a sixth aspect, a computer-readable storage medium is provided, which stores a plurality of program codes, wherein the program codes are adapted to be loaded and executed by a processor to perform the key point-based association method according to any of the preceding first aspects.
One or more technical schemes of the invention at least have one or more of the following beneficial effects:
By employing an association precondition based on a given relation between target components (a specific example: the human head frame being contained by the human body frame by at least 75% in both the horizontal and vertical directions), a large number of component pairs that obviously do not belong to the same target, such as head-body frame pairs not belonging to the same pedestrian, can be filtered out, reducing the amount of computation. Further, since the key points of the target's components come from a deep learning-based estimation model (for a pedestrian target, the human body key point information comes from a deep learning-based posture estimation model), and such models are data-driven and iteratively optimized, the association performance improves as the model improves: in the pedestrian case, the accuracy of head-body association rises with the posture estimation model, and accurate, reliable association of a target's components is supported in a variety of dense scenes (crowded scenes with many occlusions, varying degrees of shading, and so on), for example the head-body association of a single pedestrian target. The phenomenon of missed targets is effectively controlled and improved, and accuracy exceeds 99.2% in actual tests.
Furthermore, based on the key points of the target's components, in particular the human body key point information (head top point and neck point) in pedestrian detection, the technical scheme of the invention achieves more accurate and reliable head-body association in dense scenes.
Drawings
Embodiments of the invention are described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram for one embodiment of a keypoint-based association method, in accordance with the present invention;
FIG. 2 is a schematic diagram of human keypoint information obtained by a deep learning-based pose estimation model, such as a pedestrian target, in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of calculating the degree of correlation loss of a head-body frame pair when the correlation preconditions are satisfied, as exemplified by a pedestrian target, in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of the main structure of a video full-structured system applied according to an embodiment of the present invention;
FIG. 5 is a block diagram of the main structure of a security system applied according to an embodiment of the present invention;
FIG. 6 is a block diagram of one embodiment of a keypoint-based correlation system in accordance with the present invention;
FIG. 7 is an example of setting the 2nd optional condition of the association precondition in the solution according to the invention;
FIG. 8 is an example of the 2nd alternative way of calculating the association loss degree in the solution according to the invention.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module" or "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, and memory; it may comprise software components such as program code; or it may be a combination of software and hardware. The processor may be a central processing unit, microprocessor, image processor, digital signal processor, or any other suitable processor. The processor has data and/or signal processing functionality, and may be implemented in software, hardware, or a combination thereof. Non-transitory computer-readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random-access memory, and the like. The term "A and/or B" denotes all possible combinations of A and B, such as A alone, B alone, or A and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include only A, only B, or both A and B. The singular forms "a", "an" and "the" may include the plural forms as well.
Some terms in the present invention are explained below:
human body key point information: through the human body posture estimation deep learning model/the posture estimation model based on deep learning, when detecting single or multiple human bodies on a picture, human body part points can be calculated, such as 14 human body parts points in total as shown in an example of fig. 2. The method of the present invention mainly utilizes the 13 (vertex) and 14 (neck) key points.
Association loss degree: a floating-point value measuring the degree to which a head-body frame pair is wrongly associated. For example, one calculation method in the present invention measures the distance between the head-top/neck midpoint and the center point of the head frame.
The Hungarian algorithm: performs one-to-one association according to the association loss degrees of all pairs (such as the head-body frame pairs of pedestrian targets), so that the sum of the association loss degrees of the associated head-body frame pairs is minimal; the pairs thus associated one-to-one are determined to be correctly associated and are output. For example, with head frames A, B, C and body frames a, b, c, initial pairing forms every head frame with every body frame: Aa, Ab, Ac, Ba, Bb, Bc, Ca, Cb, Cc. Suppose the pairs Aa, Bb, Bc, Ca, Cb, Cc satisfy the precondition, with association loss degrees of, say, 5, 3, 15, 18, 9, 4. These are input to the globally optimal matching step, i.e. one-to-one association by the Hungarian algorithm, in which a first part can only be paired with one second part (a head frame is associated with exactly one body frame) and the sum of the association loss degrees of all associated pairs is minimal. Here the one-to-one association Aa, Bb, Cc gives the minimal sum (5 + 3 + 4 = 12), so the pairs Aa, Bb, Cc are correctly associated.
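The worked example above can be checked in code. For a 3 x 3 problem, an exhaustive search over permutations finds the same optimum the Hungarian algorithm would (the Hungarian algorithm only matters for efficiency on larger inputs); the loss values are those from the example, with filtered-out pairs given infinite loss.

```python
import math
from itertools import permutations

# Rows: head frames A, B, C; columns: body frames a, b, c.
# Pairs that failed the association precondition get an infinite loss.
INF = math.inf
cost = [
    [5.0,  INF,  INF],   # Aa = 5; Ab, Ac filtered out
    [INF,  3.0, 15.0],   # Bb = 3, Bc = 15; Ba filtered out
    [18.0, 9.0,  4.0],   # Ca = 18, Cb = 9, Cc = 4
]

def best_assignment(cost):
    """Brute-force one-to-one assignment minimizing the total loss."""
    n = len(cost)
    best_perm, best_sum = None, INF
    for perm in permutations(range(n)):
        s = sum(cost[i][perm[i]] for i in range(n))
        if s < best_sum:
            best_perm, best_sum = perm, s
    return best_perm, best_sum

perm, total = best_assignment(cost)
# perm == (0, 1, 2), i.e. Aa, Bb, Cc, with total loss 12.0
```

In practice a library routine such as SciPy's linear-sum-assignment solver would replace the brute-force search for larger numbers of detections.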
Target detection is used in various real security snapshot scenes to obtain detection frames for each component of a target. Taking a pedestrian target as an example, all pedestrians in each monitored picture are detected, and the detection result is usually a set of head frames and body frames. The vast majority of pictures from entrances and exits, railway stations, airport halls, and the like contain many heads and many bodies, and such scenes are often exactly the key areas that detection and monitoring must attend to. It is therefore necessary to determine which head frames and body frames belong to the head and body of the same pedestrian target: the head frames and body frames must be paired, and each pairing checked to confirm whether the paired head and body really belong to the same pedestrian. Determining whether a head belongs to a body, i.e. whether the two can truly be associated as the same pedestrian target, is a key step in target identification for the image. In the existing related art, matching is generally decided by the intersection ratio between a head frame and a body frame (for example, FIG. 4 shows an association algorithm based on intersection area/intersection ratio applied in a video full-structured system). This approach ignores the semantic information of the image itself and can only handle simple scenes, such as an image containing a single person, where identification is comparatively easy; it struggles with complex scenes with many people, many occlusions, or people seen from behind. The improved association method can be effectively applied to scenes with multiple targets, multiple occlusions, and complex backgrounds.
The key point-based association method is applied to associate the different components of a target more accurately and reliably in dense scenes. Taking head-body association for pedestrian targets as the main example, it shows how human body key point information (such as the head top point and neck point) enables more accurate and reliable head-body association in dense scenes. First, all head frames and body frames detected in each frame of picture/image are roughly paired into head-body frame pairs; the head frames can be obtained by a lightweight head detection network, a target detection algorithm, a regression algorithm, or the like, and the body frames by a deep learning-based posture estimation model (such as OpenPose, DeepCut, or Mask RCNN). Second, all head-body frame pairs are filtered with an association precondition, for example that the head frame is contained by the body frame by at least 75% in both the horizontal and vertical directions. Pairs that fail the precondition, which obviously do not belong to the same pedestrian, are filtered out in large numbers and assigned the maximum association loss degree, i.e. the head frame and body frame are treated as entirely unassociated; this reduces the computation of the algorithm and speeds up recognition. A head-body frame pair that satisfies the precondition may well be correctly associated and belong to the same pedestrian.
Third, the association loss degree is calculated for each head-body frame pair that may be correctly associated, using the key point information (head top point and neck point) obtained when the body frame was detected by the deep learning-based posture estimation model. Specifically: compute the center point of the (straight) line connecting the body frame's head top point and neck point; if this center point is not inside the head frame of the pair, set the pair's association loss degree to infinity; if it is inside, compute the distance from this center point to the center point of the head frame and take that distance as the pair's association loss degree. Fourth, the association loss degrees of all head-body frame pairs are input for globally optimal matching, and one-to-one association matching via the Hungarian algorithm yields the final result of whether each head-body frame pair is correctly associated, a correct association being a pair that belongs to the same pedestrian target. The whole algorithm flow, from detection and pairing through precondition filtering, association loss calculation, and final one-to-one Hungarian matching, as applied in a video full-structured system to replace association based on intersection area/intersection ratio, is shown in FIG. 4.
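The four steps above can be combined into one end-to-end sketch. Everything here is an illustrative assumption (box and key point formats, function name, the brute-force stand-in for the Hungarian algorithm); only the 75% threshold and the midpoint-distance loss come from the text.

```python
import math
from itertools import permutations

def associate(head_boxes, body_boxes, keypoints, threshold=0.75):
    """Pair head and body frames; keypoints[j] = (head_top, neck) for body j.
    Boxes are (x1, y1, x2, y2); points are (x, y). Returns matched index pairs."""
    n, m = len(head_boxes), len(body_boxes)
    cost = [[math.inf] * m for _ in range(n)]
    for i, (hx1, hy1, hx2, hy2) in enumerate(head_boxes):
        for j, (bx1, by1, bx2, by2) in enumerate(body_boxes):
            # Step 2: precondition -- head contained >= threshold in both axes
            ow = max(0.0, min(hx2, bx2) - max(hx1, bx1))
            oh = max(0.0, min(hy2, by2) - max(hy1, by1))
            if ow / (hx2 - hx1) < threshold or oh / (hy2 - hy1) < threshold:
                continue  # filtered out: loss stays infinite
            # Step 3: loss -- head-top/neck midpoint vs head-frame center
            (tx, ty), (nx, ny) = keypoints[j]
            mx, my = (tx + nx) / 2.0, (ty + ny) / 2.0
            if hx1 <= mx <= hx2 and hy1 <= my <= hy2:
                cx, cy = (hx1 + hx2) / 2.0, (hy1 + hy2) / 2.0
                cost[i][j] = math.hypot(mx - cx, my - cy)
    # Step 4: globally optimal one-to-one matching (brute force for small n)
    best, best_sum = None, math.inf
    for perm in permutations(range(m), n):
        s = sum(cost[i][perm[i]] for i in range(n))
        if s < best_sum:
            best, best_sum = perm, s
    return [(i, best[i]) for i in range(n)] if best is not None else []
```

For example, two side-by-side pedestrians whose head frames each sit inside their own body frame would come out as the identity matching, since the cross pairs fail the containment precondition.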
Thus, taking the pedestrian target as an example, the key-point information of the different components of a target supports accurate and reliable head-body association in diverse dense scenes. In addition, the human-body key-point information comes from a deep-learning-based posture estimation model, which is data-driven and iteratively optimized, so the performance of the head-body association algorithm can improve as the posture estimation model improves.
The implementation of the invention is described below with reference to FIG. 1, the main flow chart of an embodiment of the method according to the invention. The head-body association algorithm is used for detecting and analyzing pedestrians and pedestrian flows, for example for people counting, density analysis, and pedestrian tracking and re-identification in a monitored area.
Step S110, forming a pair of the detected different components, and judging whether the pair meets a preset association precondition.
In one embodiment, when detecting and identifying a target, an image to be detected and identified generally needs to be obtained first, for example by acquiring an image of the monitored area.
Specifically, one or more frames of images/pictures of the monitored area are acquired using an image acquisition unit. Each frame may contain one or more targets, such as target pedestrians, target vehicles, or target animals. Taking pedestrians as the example target, the numbers of pedestrian targets and of images used below merely describe the capture scene of this embodiment and are not limiting.
The multiple pictures may be consecutive pictures collected from the surveillance video or pictures sampled at intervals; they may also be data periodically extracted from the target video.
The surveillance video may come from a network camera in a security monitoring system or from a network hard-disk video recorder.
In one embodiment, each picture is examined.
Specifically, the different components (e.g., a first portion and a second portion) of all targets contained in each monitored-area picture can be detected by a detection algorithm model (e.g., the detector models shown in FIGS. 4 and 5). Taking pedestrians as an example, the different components may be the head frames (first portions) and body frames (second portions) of all pedestrians. In one embodiment, the different components of a target may be detected in different ways, and the first portion and the second portion may be detected by different respective detection models / machine learning models. For the target pedestrian: the first portion, e.g. the head frame, can be obtained by a lightweight head detection network, a target detection algorithm, a regression algorithm, or the like; the second portion, e.g. the body frame, can be obtained by a deep-learning-based model that also yields key-point information, for example a deep-learning-based posture estimation model that detects the body frame together with its corresponding key points. The posture estimation model includes, but is not limited to, OpenPose, DeepCut, Mask RCNN, etc. Taking pedestrian detection as the example of target detection: in order to track pedestrians, the region of each pedestrian in each picture needs to be extracted, so that the information of the corresponding pedestrian can be extracted from that region for the subsequent tracking process. The region of each pedestrian in the picture is that pedestrian's body frame, as shown in FIG. 2, and the frame around the region of each pedestrian's head is that pedestrian's head frame.
The head frames and body frames can be extracted based on a pre-trained detector model using known extraction methods, such as the YOLO algorithm, the Single Shot MultiBox Detector (SSD) algorithm, the Faster Region-based Convolutional Neural Network (Faster R-CNN) algorithm, and the like; the key-point information is then further calculated for each body frame by the posture estimation model.
The head frame and/or the body frame can be extracted according to preset extraction requirements, such as the size of the extracted frame and the pixel and/or sharpness requirements on the extracted picture.
For each pedestrian in each picture, only one head frame and one body frame are extracted.
As another example, each body frame may further contain the position information of a pedestrian, and each head frame contains the position information of that pedestrian's head. The position information comprises the coordinate information (x, y) and the width and height information w and h of the head frame / body frame, where x is the abscissa of the top-left vertex of the frame, y is the ordinate of that vertex, w is the width of the frame, and h is the height of the frame.
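The (x, y, w, h) frame format described above can be captured in a small helper; the `Box` class and its method names are illustrative, not part of the invention:

```python
from dataclasses import dataclass

# Illustrative sketch of the (x, y, w, h) frame format described above:
# (x, y) is the top-left vertex, w the width, h the height.
@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float

    def center(self):
        # Geometric center of the frame in image coordinates.
        return (self.x + self.w / 2, self.y + self.h / 2)

    def contains(self, px, py):
        # True if the point (px, py) falls inside the frame.
        return (self.x <= px <= self.x + self.w
                and self.y <= py <= self.y + self.h)
```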
Further, a plurality of human-body key points in each picture are obtained through a deep-learning-based human posture estimation model, which may alternatively be an Active Shape Model (ASM), an Active Appearance Model (AAM), or the like; 14 human-body key points, including the head top, the neck, the left and right shoulders, the left and right elbows, etc., are obtained by calculation, as shown in FIG. 2.
In one example of the present invention, the plurality of key points includes at least the head-top key point 13 and the neck key point 14 of the human body.
Further, the human body frame may include coordinate information of the plurality of human body key points, and the like.
In one embodiment, the detected different components are paired pairwise: each first portion is paired with each second portion. Taking the pedestrian target as an example, every detected head frame is paired with every detected body frame to form head-body frame pairs. For example, head frames A, B, C and body frames a, b, c can be paired as Aa, Ab, Ac, Ba, Bb, Bc, Ca, Cb, and Cc.
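The exhaustive pairing step can be sketched with `itertools.product`; the frame labels are illustrative:

```python
from itertools import product

heads = ["A", "B", "C"]   # detected head frames
bodies = ["a", "b", "c"]  # detected body frames

# Every head frame is paired with every body frame, giving 3 x 3 = 9 pairs.
pairs = ["".join(p) for p in product(heads, bodies)]
print(pairs)  # ['Aa', 'Ab', 'Ac', 'Ba', 'Bb', 'Bc', 'Ca', 'Cb', 'Cc']
```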
In one embodiment, in determining whether the pairing satisfies a preset association precondition, the association precondition may be any one of the following conditions:
1. The ratio of the extent of the first portion contained by the second portion in the horizontal direction to the extent of the first portion, and the corresponding ratio in the vertical direction, are both greater than or equal to a preset first threshold. Taking a pedestrian target as an example: at least 75% of the head frame of the head-body frame pair must be contained by the body frame in both the horizontal and the vertical direction, i.e. more than 75% of the head frame's horizontal extent falls inside the body frame, and more than 75% of its vertical extent also falls inside the body frame. The first threshold may be 75%.
2. The ratio of the distance between a vertex of the first portion and the corresponding vertex of the second portion to the width or height of the first portion is greater than a preset second threshold, or its ratio to the height of the second portion is less than a preset second threshold. The second threshold is set according to how the actual calculation is performed.
For example, when the target is a pedestrian, the judgment is made from the coordinate information (x, y) and the width/height information w, h of the head frame / body frame. Let the top-left vertex of the head frame in the head-body frame pair be (x_h, y_h) with width w_h and height h_h, and the top-left vertex of the body frame be (x_p, y_p) with width w_p and height h_p. In the vertical direction, (y_h - y_p) > -0.25*h_h (when the head frame sticks out above the body frame, the overhang must be less than one quarter of the head frame's height), i.e. the ratio (y_h - y_p)/h_h of the vertex distance to the height of the first portion is greater than the preset second threshold -0.25 (the minus sign indicates the direction of the overhang); or, for the distance in the horizontal direction, the right-side overhang satisfies (x_h - x_p) > -0.25*w_h or the left-side overhang satisfies w_p - (x_h - x_p) > 0.25*w_h, i.e. the ratio (x_h - x_p)/w_h or (w_p - (x_h - x_p))/w_h of the vertex distance to the width of the first portion is greater than the preset second threshold 0.25; or, in the vertical direction, (y_h - y_p) < 0.35*h_p (when the head frame does not stick out above the body frame, it must not sit too low in the body frame; for example, 0.35 of the body frame's height can be taken as the second threshold), i.e. the ratio (y_h - y_p)/h_p of the vertex distance to the height of the second portion is less than the preset second threshold 0.35. As shown in FIG. 7.
3. The ratio of the area of the intersection of the first portion and the second portion to the area of the first portion is greater than or equal to a preset third threshold. This condition is similar to the existing intersection-over-union association algorithm and is used to filter out first/second-portion pairs that for the most part cannot be matched at all. For example, when the target is a pedestrian, the ratio of the intersection of the head-frame area and the body-frame area of a head-body frame pair to the head-frame area must be greater than a preset area threshold.
In diverse complex scenes, conditions 2 and 3 do not express the relationship between a normal head frame and body frame as well as condition 1 does; condition 1 is optimal.
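The three candidate preconditions can be sketched together as follows. Boxes are plain (x, y, w, h) tuples, the function names are illustrative, the condition-3 threshold value is an assumption, and the condition-2 overhang formulas are a reconstruction of the description above (combined here with `and`, although the text phrases them as alternatives), so treat this as one plausible reading rather than the definitive rule:

```python
def overlap_ratio_1d(a_start, a_len, b_start, b_len):
    # Fraction of interval [a_start, a_start + a_len] covered by interval b.
    lo = max(a_start, b_start)
    hi = min(a_start + a_len, b_start + b_len)
    return max(0.0, hi - lo) / a_len

def condition_1(head, body, threshold=0.75):
    # Head frame at least `threshold` contained by the body frame in BOTH
    # the horizontal and the vertical direction.
    xh, yh, wh, hh = head
    xp, yp, wp, hp = body
    return (overlap_ratio_1d(xh, wh, xp, wp) >= threshold
            and overlap_ratio_1d(yh, hh, yp, hp) >= threshold)

def condition_2(head, body):
    # Overhang limits with the 0.25 / 0.35 thresholds from the text
    # (reconstructed interpretation).
    xh, yh, wh, hh = head
    xp, yp, wp, hp = body
    return ((yh - yp) > -0.25 * hh          # top overhang < h_h / 4
            and (xh - xp) > -0.25 * wh      # one side overhang < w_h / 4
            and wp - (xh - xp) > 0.25 * wh  # other side overhang limit
            and (yh - yp) < 0.35 * hp)      # head not too low in the body

def condition_3(head, body, threshold=0.5):
    # Intersection area over head area; `threshold` is an assumed value.
    xh, yh, wh, hh = head
    xp, yp, wp, hp = body
    iw = max(0.0, min(xh + wh, xp + wp) - max(xh, xp))
    ih = max(0.0, min(yh + hh, yp + hp) - max(yh, yp))
    return (iw * ih) / (wh * hh) >= threshold
```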
In one embodiment, if the association precondition is judged not to be satisfied, the association loss degree of the pair is set to infinity, i.e. the pair completely lacks any possibility of association and cannot be associated at all. For example, when the target is a pedestrian, the association loss degree of a head-body frame pair that does not satisfy the association precondition may be set to infinity and passed on to the subsequent global optimal matching process. The association loss degree measures how wrongly a head-body frame pair is associated. Conversely, if the precondition is judged to be satisfied, the process proceeds to step S120.
And step S120, if so, calculating the association loss degree of the pair according to the key points of different components.
In one embodiment, if the association precondition is judged to be satisfied, calculation of the association loss degree of the pair begins. The calculation uses the key-point information obtained when the different components were detected by the deep-learning-based estimation model, i.e. the key points of the different components. Specifically, the key-point information of a component is either computed while that component is being detected or computed after it has been detected. In the example of a pedestrian target, the key points include the head-top and neck key points of the body frame. The key-point information can be obtained by the deep-learning-based posture estimation model at the same time as the body frame is detected, or calculated from the body frame by the posture estimation model after the body frame has been detected.
In one embodiment, the calculation specifically comprises: calculating a second center point from the key points of the second portion of the pair, and judging whether the second center point lies in the first portion of the pair. If so, the distance between the second center point and the first center point (the center of the first portion) is calculated as the association loss degree of the pair; or, if so, the relative position/distance or the angle between the key points of the second portion and the center point of the first portion is calculated as the association loss degree of the pair; or, if so, the key points of the second portion and the center point of the first portion are input into a deep learning model, which calculates and outputs the association loss degree of the pair. Otherwise, i.e. if the calculated second center point of the pair is not in the first portion of the pair, the association loss degree of the pair is set to infinity.
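The first of these calculation modes (key-point midpoint versus head-frame center, later described as the optimal one) can be sketched as follows, assuming plain (x, y) key points and an (x, y, w, h) head frame; the function name is illustrative:

```python
import math

INF = float("inf")

def association_loss(head_box, head_top, neck):
    # head_box is (x, y, w, h); head_top and neck are the two body key
    # points as (x, y).  The loss is the distance from the key-point
    # midpoint to the head-box center, or infinity if the midpoint falls
    # outside the head box.
    x, y, w, h = head_box
    mx = (head_top[0] + neck[0]) / 2
    my = (head_top[1] + neck[1]) / 2
    if not (x <= mx <= x + w and y <= my <= y + h):
        return INF
    cx, cy = x + w / 2, y + h / 2
    return math.hypot(mx - cx, my - cy)
```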
In one embodiment, with a pedestrian as the target, the center point (second center point) of the line connecting the head-top key point and the neck key point of the body frame in a head-body frame pair is calculated, and it is judged whether this center point falls inside the head frame of the pair;
1. If so, the distance between this center point and the center point of the head frame is calculated as the association loss degree of the head-body frame pair; alternatively,
2. If so, the relative distance or angle between the head-top key point or the neck key point of the body frame in the head-body frame pair and the center point of the head frame in the pair is calculated as the association loss degree of the pair, as shown in FIG. 8: the relative position/distance or the angle between the head-top and neck points of the body frame and the center point of the head frame is calculated; with the angle range set to 0-180 degrees, angle BOA is obviously larger than angle BKA. The angle taken as a negative number serves as the association loss degree (the larger the angle, the smaller the loss degree, whereas in the main scheme the point distance serves as the loss degree and the smaller the distance, the smaller the loss). Alternatively, the sums distance OA + distance OB and distance KA + distance KB can be calculated and compared in the same way (the smaller the sum, the closer the key points are to the head-frame center).
3. If so, the head-top and/or neck key points of the head-body frame pair and the center point of its head frame are input into a deep learning model, which calculates and outputs the association loss degree of the pair.
Furthermore, a normalization operation can be performed on the association loss degree, with a threshold set between 0 and 1; however, whether or not the association loss degree is normalized has no influence on the association result, and the threshold is difficult to set in complex scenes. In addition, calculation mode 2, which computes the relative positions or angles of the head-top point, the neck point, and the head frame, is more complex to calculate, its parameters are difficult to tune, and its accuracy is inferior to mode 1; calculation mode 3, which uses a deep learning model, needs a large amount of labeled data and is costly. Relatively speaking, therefore, calculation mode 1 is optimal, and normalization is optional.
Referring to FIG. 3, an example of pedestrian detection and identification illustrates how the association loss degree of a head-body frame pair is calculated with the optimal calculation mode when the pair satisfies the association precondition:
step S301, calculating coordinate information of the center point of the human head frame.
The coordinate information of the center point of the head frame is calculated from the position information (coordinates, width, and height) of the head frame.
Step S302, calculating coordinate information of the center point of the head of the human body frame.
The coordinate information of the head center point of the body frame is calculated from the coordinate information of the head-top key point and the neck key point in the body frame.
Step S303, judging whether the head center point of the human body frame is positioned in the human head frame.
The association loss degree calculation unit judges whether the head center point of the body frame is located in the head frame; if not, the process goes to step S304, otherwise to step S305.
Step S304, if the head center point of the body frame is not in the head frame, the association loss degree of the head-body frame pair is set to infinity and provided to the global optimal matching algorithm for subsequent processing.
Step S305, the distance from the head center point of the body frame to the center point of the head frame is calculated as the association loss degree of the head-body frame pair and sent to the global optimal matching algorithm for subsequent processing.
In an alternative embodiment, the head frame may be input into a depth regression network that performs position regression on the head frame's center point to obtain its position. Similarly, the body frame may be input into the depth regression network to regress the position of the body frame's head center point. The depth regression network is obtained by deep learning on an existing feature training set (such as body frames and face frames).
And step S130, determining whether the pair belongs to the same target according to the association loss degree.
In one embodiment, global optimal matching is performed over the association loss degrees of all pairs, and the one-to-one assignment with the smallest sum of association loss degrees is obtained; its pairs are taken as the correctly associated pairs and output to indicate that each such pair belongs to the same target, thereby identifying the target.
The global optimal matching may adopt the Hungarian algorithm or another global optimal matching method, such as a deep learning model. However, a deep learning model has high computational complexity and would affect the overall performance of the video full-structuring system, whereas the Hungarian algorithm operates simply and efficiently in real scenes, so the Hungarian algorithm is the better match for complex scenes.
In the pedestrian example, the global optimal matching unit obtains the association result with the Hungarian algorithm. The Hungarian algorithm performs one-to-one association according to the association loss degrees of all head-body frame pairs: once A is associated with a, B can no longer be associated with a. All head-body frame pairs are input for global optimal matching; specifically, the nine pairs are associated one-to-one by the Hungarian algorithm, where Ab, Ac, and Ba have the maximum association loss degree (infinity), and Aa, Bb, Bc, Ca, Cb, and Cc have association loss degrees of 5, 3, 15, 18, 9, and 4 respectively (integers are used instead of floating point for ease of explanation). For example, associating Aa, Bb, and Cc one-to-one (so that Ba, Bc, Ca, Ab, Ac, and Cb can no longer be paired) gives a total association loss degree of 5+3+4=12; associating Aa, Bc, and Cb one-to-one (so that Ba, Bb, Ab, Ac, Ca, and Cc can no longer be paired) gives 5+15+9=29; and by analogy, any one-to-one assignment containing Ab, Ac, or Ba has an infinite sum. Therefore, after one-to-one association is completed, the assignment with the minimum sum of association loss degrees is Aa, Bb, Cc, so these pairs are correctly associated and can be output to indicate that each pair belongs to the same pedestrian, i.e. the pedestrian is identified.
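The worked example above can be reproduced with a brute-force one-to-one matcher; the Hungarian algorithm returns the same optimum in O(n^3), and exhaustive search is used here only to keep the sketch self-contained:

```python
from itertools import permutations

INF = float("inf")

# Cost matrix from the example above: rows are head frames A, B, C,
# columns are body frames a, b, c.  Filtered pairs get infinite loss.
cost = [
    [5.0,  INF,  INF],   # Aa=5,   Ab=inf, Ac=inf
    [INF,  3.0,  15.0],  # Ba=inf, Bb=3,   Bc=15
    [18.0, 9.0,  4.0],   # Ca=18,  Cb=9,   Cc=4
]

def best_assignment(cost):
    # Exhaustive search over all one-to-one assignments; the Hungarian
    # algorithm finds the same minimum-sum assignment more efficiently.
    n = len(cost)
    best, best_perm = INF, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best, best_perm

total, perm = best_assignment(cost)
print(total, perm)  # 12.0 (0, 1, 2)  ->  Aa, Bb, Cc, as in the text
```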
The implementation and application of the present invention when the detected target is a pedestrian will be further described below with reference to the foregoing method example, the video full-structuring system example of FIG. 4, and the security system application example of FIG. 5.
FIG. 4 shows how the association method improved by the invention replaces the association algorithm based on the head/body-frame intersection area in an existing video full-structuring system. In the overall system, an established video surveillance system comprises network cameras, network hard-disk video recorders, and the like. Each picture to be detected is extracted from these videos and input into the detector model (i.e. a device applying the detection algorithm model), which detects the head frames and body frames in the pictures and outputs them to the association algorithm module. The association algorithm module replaces the original association algorithm based on intersection area (the size of the intersection of the head-frame area and the body-frame area) with the processing of the present association method; the finally obtained one-to-one associated pairs, i.e. the head-body frame pairs belonging to the same targets, are then output to the subsequent video full-structuring module to process the identified target pedestrians.
FIG. 5 shows an example of the specific flow of the improved association method of the invention: the association algorithm module obtains the head frames and body frames output by the detector model, e.g. the detection results, including the key points and coordinate information, produced during each detection; the head frames and body frames are then initially paired (head-body frame pairing module), where any head frame may be paired with any body frame; the pairs are input into the association precondition judgment module to determine whether they satisfy the condition, the association loss degree of a head-body frame pair being set to infinity if it does not, and the pair entering the key-point-based association loss degree calculation module if it does, where its association loss degree is calculated from the human-body key points; after the association loss degrees are calculated, all pairs with infinite loss degree output by the judgment module, together with the pairs and loss degrees output by the key-point-based calculation module, are sent to the global optimal matching module, where one-to-one association matching is performed using the Hungarian algorithm or a similar method; the assignment with the minimum sum of association loss degrees is determined, and its one-to-one associated pairs are output as correctly associated, indicating that each such head-body frame pair belongs to the same pedestrian, who is thereby identified.
Further, the present invention also provides an embodiment of a correlation system based on key points, which is shown in the structural block diagram of fig. 6 and fig. 1 to 5.
The determining module 610 forms a pair with the detected different components, and determines whether the pair meets a preset association precondition.
In one embodiment, when detecting and identifying a target, an image to be detected and identified generally needs to be obtained first, for example by acquiring an image of the monitored area.
Specifically, one or more frames of images/pictures of the monitored area are acquired using an image acquisition unit. Each frame may contain one or more targets, such as target pedestrians, target vehicles, or target animals. Taking pedestrians as the example target, the numbers of pedestrian targets and of images used below merely describe the capture scene of this embodiment and are not limiting.
The multiple pictures may be consecutive pictures collected from the surveillance video or pictures sampled at intervals; they may also be data periodically extracted from the target video.
The surveillance video may come from a network camera in a security monitoring system or from a network hard-disk video recorder.
In one embodiment, each picture is examined.
Specifically, the different components (e.g., a first portion and a second portion) of all targets contained in each monitored-area picture can be detected by a detection algorithm model (e.g., the detector models shown in FIGS. 4 and 5). Taking pedestrians as an example, the different components may be the head frames (first portions) and body frames (second portions) of all pedestrians. In one embodiment, the different components of a target may be detected in different ways, and the first portion and the second portion may be detected by different respective detection models / machine learning models. For the target pedestrian: the first portion, e.g. the head frame, can be obtained by a lightweight head detection network, a target detection algorithm, a regression algorithm, or the like; the second portion, e.g. the body frame, can be obtained by a deep-learning-based model that also yields key-point information, for example a deep-learning-based posture estimation model that detects the body frame together with its corresponding key points. The posture estimation model includes, but is not limited to, OpenPose, DeepCut, Mask RCNN, etc. Taking pedestrian detection as the example of target detection: in order to track pedestrians, the region of each pedestrian in each picture needs to be extracted, so that the information of the corresponding pedestrian can be extracted from that region for the subsequent tracking process. The region of each pedestrian in the picture is that pedestrian's body frame, as shown in FIG. 2, and the frame around the region of each pedestrian's head is that pedestrian's head frame.
The head frames and body frames can be extracted based on a pre-trained detector model using known extraction methods, such as the YOLO algorithm, the Single Shot MultiBox Detector (SSD) algorithm, the Faster Region-based Convolutional Neural Network (Faster R-CNN) algorithm, and the like; the key-point information is then further calculated for each body frame by the posture estimation model.
The head frame and/or the body frame can be extracted according to preset extraction requirements, such as the size of the extracted frame and the pixel and/or sharpness requirements on the extracted picture.
For each pedestrian in each picture, only one head frame and one body frame are extracted.
As another example, each body frame may further contain the position information of a pedestrian, and each head frame contains the position information of that pedestrian's head. The position information comprises the coordinate information (x, y) and the width and height information w and h of the head frame / body frame, where x is the abscissa of the top-left vertex of the frame, y is the ordinate of that vertex, w is the width of the frame, and h is the height of the frame.
Further, a plurality of human-body key points in each picture are obtained through a deep-learning-based human posture estimation model, which may alternatively be an Active Shape Model (ASM), an Active Appearance Model (AAM), or the like; 14 human-body key points, including the head top, the neck, the left and right shoulders, the left and right elbows, etc., are obtained by calculation, as shown in FIG. 2.
In one example of the present invention, the plurality of key points includes at least the head-top key point 13 and the neck key point 14 of the human body.
Further, the human body frame may include coordinate information of the plurality of human body key points, and the like.
In one embodiment, the detected different components are paired pairwise: each first portion is paired with each second portion. Taking the pedestrian target as an example, every detected head frame is paired with every detected body frame to form head-body frame pairs. For example, head frames A, B, C and body frames a, b, c can be paired as Aa, Ab, Ac, Ba, Bb, Bc, Ca, Cb, and Cc.
In one embodiment, in determining whether the pairing satisfies a preset association precondition, the association precondition may be any one of the following conditions:
1. The ratio of the extent of the first portion contained by the second portion in the horizontal direction to the extent of the first portion, and the corresponding ratio in the vertical direction, are both greater than or equal to a preset first threshold. Taking a pedestrian target as an example: at least 75% of the head frame of the head-body frame pair must be contained by the body frame in both the horizontal and the vertical direction, i.e. more than 75% of the head frame's horizontal extent falls inside the body frame, and more than 75% of its vertical extent also falls inside the body frame. The first threshold may be 75%.
2. The ratio of the distance between a vertex of the first portion and the corresponding vertex of the second portion to the width or height of the first portion is greater than a preset second threshold, or its ratio to the height of the second portion is less than a preset second threshold. The second threshold is set according to how the actual calculation is performed.
For example, when the target is a pedestrian, the condition is evaluated from the coordinate information (x, y) and the width and height information (w, h) of the head frame and the body frame: let the top-left vertex of the head frame in a head-body frame pair be (x_h, y_h), with width w_h and height h_h, and let the top-left vertex of the body frame be (x_p, y_p), with width w_p and height h_p. In the vertical direction, (y_h - y_p) > -0.25*h_h (when the head frame extends beyond the body frame, the excess must be less than a quarter of the head-frame height); that is, the ratio (y_h - y_p)/h_h of the vertex distance to the height of the first portion is greater than the preset second threshold -0.25 (the "-" indicates the direction of the excess). Or, in the horizontal direction, the excess on the right satisfies (x_h - x_p) > -0.25*w_h, or the excess on the left satisfies w_p - (x_h - x_p) > 0.25*w_h; that is, the ratio (x_h - x_p)/w_h or (w_p - (x_h - x_p))/w_h of the vertex distance to the width of the first portion is greater than the preset second threshold 0.25. Or, in the vertical direction, (y_h - y_p) < 0.35*h_p (when the head frame does not extend beyond the body frame, it must not sit too low within the body frame; for example, 0.35 of the body-frame height may be taken as the second threshold); that is, the ratio (y_h - y_p)/h_p of the vertex distance to the height of the second portion is less than the preset second threshold 0.35. As shown in fig. 7.
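The sub-conditions of condition 2 can be transcribed directly from the inequalities above. The text chains them with "or"; the sketch below requires all of them, which is one conservative reading. Boxes as (x, y, w, h) with a top-left origin, the thresholds 0.25 and 0.35, and the function name are assumptions for illustration.

```python
# Illustrative check of association precondition 2, transcribing the
# vertex-distance inequalities stated in the text.  Boxes are (x, y, w, h).

def satisfies_condition_2(head, body):
    xh, yh, wh, hh = head
    xp, yp, wp, hp = body
    vertical_excess_ok = (yh - yp) > -0.25 * hh     # upward excess < hh / 4
    right_ok = (xh - xp) > -0.25 * wh               # right-hand excess bound
    left_ok = wp - (xh - xp) > 0.25 * wh            # left-hand excess bound
    vertical_position_ok = (yh - yp) < 0.35 * hp    # head not too low in body
    return (vertical_excess_ok and right_ok
            and left_ok and vertical_position_ok)
```

A head frame sitting near the top of the body frame passes, while one placed far down inside the body frame fails the 0.35*h_p bound.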
3. The ratio of the area of the intersection of the first portion and the second portion to the area of the first portion is greater than or equal to a preset third threshold. This condition resembles the conventional intersection-over-union (IoU) criterion and is used to filter out pairs whose first and second portions largely cannot match at all. For example, when the target is a pedestrian, the ratio of the intersection area of the head frame and the body frame of a head-body frame pair to the head-frame area must be greater than a preset area threshold.
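Condition 3 can be sketched as intersection area over head-frame area (rather than over the union, as in plain IoU). The patent only says "a preset area threshold"; the value 0.5 below is our placeholder, as are the names.

```python
# Sketch of association precondition 3: intersection area divided by the
# area of the first portion (head frame).  Boxes are (x, y, w, h).

def satisfies_condition_3(head, body, third_threshold=0.5):
    hx, hy, hw, hh = head
    bx, by, bw, bh = body
    ix = max(0.0, min(hx + hw, bx + bw) - max(hx, bx))  # intersection width
    iy = max(0.0, min(hy + hh, by + bh) - max(hy, by))  # intersection height
    return (ix * iy) / (hw * hh) >= third_threshold
```

Normalizing by the head-frame area (not the union) means a small head frame fully inside a large body frame still scores 1.0, which the usual IoU would heavily penalize.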
Under diversified complex scenes, conditions 2 and 3 characterize the relationship between a normal head frame and body frame less well than condition 1, so condition 1 is preferred.
In one embodiment, if the association precondition is determined not to be satisfied, the association loss degree of the pair is set to infinity, meaning that the pair completely lacks any association and cannot be associated at all. For example, when the target is a pedestrian, the association loss degree of a head-body frame pair that fails the precondition may be set to infinity and passed on to the subsequent global optimal matching. The association loss degree measures how wrongly a head-body frame pair is associated. Conversely, if the precondition is satisfied, processing proceeds to the calculation module 620.
If so, the calculation module 620 calculates the association loss degree of the pair according to the key points of the different components.
In one embodiment, once the association precondition is determined to be satisfied, calculation of the pair's association loss degree begins. The calculation uses key-point information obtained when the different components were detected via a deep-learning-based estimation model, i.e. the key points of the different components. Specifically, the key-point information of a component is obtained either concurrently with its detection or by computation after detection. Taking a pedestrian target as an example, the key points include the head-top (vertex) and neck key points of the human body frame. They can be produced by a deep-learning-based pose estimation model at the same time the body frame is detected, or computed from the body frame by the pose estimation model after detection.
In one embodiment, the calculation specifically comprises: computing a second centre point from the key points of the second portion of the pair, and determining whether this second centre point lies in the first portion of the pair. If so, either (a) the distance between the second centre point and the centre point of the first portion is calculated as the pair's association loss degree; or (b) the relative distance/position or angle between the key points of the second portion and the centre point of the first portion is calculated as the pair's association loss degree; or (c) the key points of the second portion and the centre point of the first portion are input to a deep learning model, whose output is the pair's association loss degree. If not, i.e. the computed second centre point does not lie in the first portion of the pair, the pair's association loss degree is set to infinity.
In one embodiment, with a pedestrian as the target, the centre point (second centre point) of the line connecting the vertex key point and the neck key point of the body frame of a head-body frame pair is computed, and it is judged whether this centre point falls inside the head frame of the pair. 1. If so, the distance between this centre point and the centre point of the head frame is taken as the pair's association loss degree. 2. Alternatively, if so, the relative position or angle between the vertex or neck key point of the body frame and the head frame of the pair is taken as the pair's association loss degree, as shown in fig. 8: the relative position/distance or angle between the vertex and neck key points of the body frame and the centre point of the head frame is computed; with the angle range set to 0-180°, angle BOA is clearly larger than angle BKA, and the negative of the angle is taken as the association loss degree (the larger the angle, the smaller the loss; by contrast, the main scheme uses the point distance as the loss, where a smaller distance means a smaller loss). Or the distance sums OA + OB and KA + KB are computed and the negative of the sum is taken as the association loss degree (the larger the sum, the smaller the loss). 3. Alternatively, if so, the vertex and/or neck key points of the pair and the centre point of the head frame of the pair are input to a deep learning model, which computes and outputs the pair's association loss degree.
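Calculation mode 1 above (midpoint of the vertex-neck line, then distance to the head-frame centre, with infinity when the midpoint misses the head frame) can be sketched as follows. Boxes as (x, y, w, h) with a top-left origin and the function name are illustrative assumptions.

```python
import math

def association_loss(head, body_keypoints):
    """Loss for one head-body pair: distance from the midpoint of the
    vertex-neck segment to the head-frame centre, or infinity if that
    midpoint falls outside the head frame.
    head = (x, y, w, h); body_keypoints = ((vx, vy), (nx, ny))."""
    (vx, vy), (nx, ny) = body_keypoints
    mx, my = (vx + nx) / 2.0, (vy + ny) / 2.0        # second centre point
    x, y, w, h = head
    if not (x <= mx <= x + w and y <= my <= y + h):  # midpoint outside head
        return math.inf
    cx, cy = x + w / 2.0, y + h / 2.0                # head-frame centre
    return math.hypot(mx - cx, my - cy)
```

A smaller returned value means a more plausible pairing; infinity marks a pair that cannot be associated, matching the precondition handling described earlier.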
Furthermore, a normalization operation may be applied to the association loss degree, with a threshold set between 0 and 1; however, normalization has no effect on the association result, and the threshold is difficult to set in complex scenes. In addition, calculation mode 2, which uses the relative positions or angles of the vertex, the neck point and the head frame, is more complex to compute, its parameters are hard to tune, and its accuracy is inferior to mode 1; calculation mode 3, which uses a deep learning model, needs a large amount of labelled data and is costly. That is, calculation mode 1 is optimal, with or without normalization. Since the human-body key-point information comes from a deep-learning-based pose estimation model, which is data-driven and iteratively optimizable, the performance of the head-body association method improves as the pose estimation model improves, supporting accurate and reliable head-body association in diversified dense scenes (regardless of the number of occluded persons, the degree of occlusion and the like).
The matching module 630 determines whether the pair belongs to the same target according to the association loss degree.
In one embodiment, global optimal matching is performed over the association loss degrees of all pairs, and the one-to-one association with the smallest sum of association loss degrees is obtained as the set of correctly associated pairs; outputting these pairs indicates that each belongs to the same target, thereby identifying the targets.
The global optimal matching may adopt the Hungarian algorithm or another global optimal matching method, such as a deep learning model. A deep learning model has high computational complexity and would affect the overall performance of the video full-structuring system, whereas the Hungarian algorithm is simple and efficient when applied to real scenes, so the Hungarian algorithm is the better match for complex scenes.
Taking the pedestrian example, the global optimal matching unit obtains the association result using the Hungarian algorithm. The Hungarian algorithm performs one-to-one association according to the association loss degrees of all head-body frame pairs: once A is associated with a, B can no longer be associated with a. All head-body frame pairs are input for global optimal matching. Specifically, the Hungarian algorithm associates the nine pairs one-to-one; Ab, Ac and Ba have the largest (infinite) association loss degrees, while Aa, Bb, Bc, Ca, Cb and Cc have association loss degrees of 5, 3, 15, 18, 9 and 4 respectively (integers rather than floating-point values are used for ease of explanation). For example, associating Aa, Bb and Cc one-to-one means Ba, Bc, Ca, Ab, Ac and Cb can no longer be paired, and the sum of the association loss degrees after association is 5 + 3 + 4 = 12. Associating Aa, Bc and Cb one-to-one means Ba, Bb, Ab, Ac, Ca and Cc can no longer be paired, and the sum is 5 + 15 + 9 = 29. By analogy, any one-to-one association containing Ab, Ac or Ba has an infinite sum. After the one-to-one association is completed, the association result with the minimum sum of association losses is therefore Aa, Bb, Cc; these pairs are correctly associated, so the result can be output to indicate that each pair belongs to the same pedestrian, i.e. the pedestrians are identified.
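The arithmetic of this example can be reproduced with an exhaustive search over one-to-one assignments, a stand-in for the Hungarian algorithm (which reaches the same optimum in polynomial time, e.g. via scipy's linear_sum_assignment for larger matrices). The row/column ordering and names below are our assumptions; the cost values are those quoted above.

```python
import itertools
import math

def best_assignment(cost):
    """Exhaustive one-to-one matching minimising the summed association
    loss.  Fine for this 3x3 example; use the Hungarian algorithm for
    larger cost matrices."""
    n = len(cost)
    best_sum, best_perm = math.inf, None
    for perm in itertools.permutations(range(n)):   # column chosen per row
        total = sum(cost[r][perm[r]] for r in range(n))
        if total < best_sum:
            best_sum, best_perm = total, perm
    return best_sum, best_perm

# Rows = head frames A, B, C; columns = body frames a, b, c.
# Ab, Ac and Ba fail the precondition and get infinite loss.
inf = math.inf
cost = [[5, inf, inf],
        [inf, 3, 15],
        [18, 9, 4]]
```

With this matrix, best_assignment picks Aa, Bb, Cc with total loss 12, matching the walk-through above.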
It will be appreciated by those skilled in the art that the present invention also provides a computer-readable storage medium. In one embodiment, the medium may be configured to store a program for performing the methods of the various method embodiments described above; the program may be loaded and executed by a processor to implement any of those methods. All or part of the flow of the above method embodiments may be implemented by a computer program instructing related hardware; the computer program may be stored in the computer-readable storage medium, and when executed by a processor it implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source-code, object-code, executable-file or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory, random access memory, an electrical carrier signal, a telecommunication signal or a software distribution medium. It should be noted that the content of the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media exclude electrical carrier signals and telecommunication signals. Furthermore, for convenience of illustration, only the parts related to the embodiments of the present invention are shown; for specific technical details not disclosed here, please refer to the method part of the embodiments of the present invention.
The storage device may be formed from various electronic devices; optionally, an embodiment of the present invention provides a non-transitory computer-readable storage medium.
It will be appreciated by those skilled in the art that the present invention also provides a processing apparatus/terminal device comprising a memory and a processor, wherein the memory has stored thereon a plurality of program codes adapted to be loaded and run by the processor to perform any of the methods described above.
Similarly, the present invention also provides a video full-structuring system comprising: a pre-trained detector model for detecting different components of targets, a subsequent video full-structuring unit, and any of the key-point-based association systems above. The association system performs one-to-one association of the pairs formed by the detected components to determine which pairs belong to the same target, and its output is provided to the subsequent video full-structuring module.
Similarly, the present invention also provides a security system, comprising: a network video recorder and/or a network camera, a detector module, and any of the key-point-based association systems above. The detector module acquires surveillance pictures from the network video recorder and/or network camera, extracts the first portion (head frame) and the second portion (body frame) of each target from a picture, and inputs them to the key-point-based association system, which outputs the one-to-one associated pairs, indicating that each pair belongs to the same target; pairing all head frames and body frames in this way yields the final head-body association result, which is then output.
Similarly, the present invention also provides a security system, comprising: the system comprises a network hard disk video recorder and/or a network camera, a detector module and the video full-structured system.
Further, it should be understood that, since the modules are merely intended to illustrate the functional units of the system of the present invention, the physical device corresponding to a module may be the processor itself, or part of the software, part of the hardware, or part of a combination of software and hardware within the processor. Thus, the number of individual modules in the figures is merely illustrative.
Those skilled in the art will appreciate that the various modules in the system may be adaptively split or combined. Such splitting or combining of specific modules does not cause the technical solutions to deviate from the principle of the present invention, and therefore, the technical solutions after splitting or combining will fall within the protection scope of the present invention.
So far, the technical solution of the present invention has been described with reference to one embodiment shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (20)

1. A method for association based on key points is characterized by comprising the following steps:
forming pairs of different detected components of a plurality of targets, and calculating the association loss degree of each pair according to key points of the different components, wherein the association loss degree is used for expressing the degree of association error of two components in each pair;
determining whether the pair belongs to the same target or not according to the association loss degree;
the different components include: a first portion and a second portion;
the pairing of different detected components of the multiple targets specifically includes: forming a pairwise pairing of each first portion with each second portion;
the calculating the association loss degree of each pair according to the key points of the different components comprises: calculating the central point of a connecting line based on the key points according to the key points of the second part in each pairing, and determining whether the central point of the connecting line is in the first part of the corresponding pairing; if so, calculating the distance between the center point of the connecting line and the center point of the first part, and taking the distance as the association loss degree of the pair; or, if yes, calculating the relative distance or angle between the key point of the second part and the central point of the first part as the association loss degree of the pair; or, if yes, inputting the key point of the second part and the central point of the first part into a deep learning model, and taking the calculation result as the association loss degree of the pair.
2. The method of claim 1, further comprising, prior to said calculating a degree of association loss for each of said pairs based on said keypoints of different components:
judging whether the pairing meets a preset association precondition;
and if so, calculating the association loss degree of each pairing according to the key points of the different components.
3. The method of claim 2,
the association precondition is as follows:
the ratio of the range size of the first part contained by the second part in the horizontal direction to the range size of the first part and the ratio of the range size of the first part contained by the second part in the vertical direction to the range size of the first part in each pair are both greater than or equal to a preset first threshold;
alternatively,
the ratio of the distance between the vertex position of the first part and the same vertex position of the second part in each pair to the width or height of the first part is larger than a preset second threshold value or smaller than a preset second threshold value;
alternatively,
the ratio between the area of the intersection of the first portion and the second portion in each pair and the area of the first portion is greater than or equal to a preset third threshold.
4. The method of claim 2, further comprising:
when the different components are detected through the estimation model based on deep learning, key points of the different components are obtained simultaneously.
5. The method of claim 4,
if the pair does not satisfy a preset association precondition, or if the center point of the connecting line is not in the first part of the pair, the pair does not belong to the same target.
6. The method of claim 1, wherein determining whether the pair belongs to the same target according to the association loss degree comprises:
performing global optimal matching on all pairs according to the association loss degree of each pair to obtain a matching pair combination with the minimum association loss degree sum;
determining that each matching pair in the combination of matching pairs is a target.
7. The method of claim 3,
the target is a pedestrian;
the different components include: a human head frame and a human body frame from a frame of picture;
the pairing is a human head-human body frame pair formed by any human head frame and any human body frame in pairs;
wherein the first threshold is 75%; wherein the vertex position is the top left vertex of each of the human head frame and the human body frame; wherein a ratio between an area of the intersection and an area of the first portion is: the ratio of the area of the intersection of the human head frame and the human body frame to the area of the human head frame.
8. The method of claim 7,
the key points include: the head top key point and the neck key point of the human body frame;
the key points are obtained when a human body frame is detected through a posture estimation model based on deep learning;
calculating the association loss degree of the pairing specifically comprises:
when the central point of the connecting line of the vertex key point and the neck key point of the body frame of each head-body frame pair is positioned in the head frame of the head-body frame pair:
calculating the distance between the central point of the connecting line and the central point of the human head frame as the correlation loss degree of the human head-human body frame pair;
alternatively,
calculating the relative distance or angle between the key point of the head top or the key point of the neck of each head-human body frame pair and the center point of the head frame of the head-human body frame pair as the correlation loss degree of the head-human body frame pair;
alternatively,
inputting the key point of the head top and/or the neck of each head-body frame pair and the central point of the head frame of the head-body frame pair into a deep learning model, and calculating and outputting the correlation loss degree of the head-body frame pair.
9. A keypoint-based correlation system, comprising:
the calculation module is used for calculating the association loss degree of each pair formed by different components according to the detected key points of the different components of the targets;
the matching module is used for determining whether the pair belongs to the same target or not according to the association loss degree;
the different components include: a first portion and a second portion;
each pair formed by the different components specifically includes: forming pairwise with each second portion of the detected plurality of targets;
the calculation module specifically further performs the following operations: calculating the central point of a connecting line based on the key points according to the key points of the second part in each pairing, and determining whether the central point of the connecting line is in the first part of the corresponding pairing; if so, calculating the distance between the center point of the connecting line and the center point of the first part, and taking the distance as the association loss degree of the pair; or, if yes, calculating the relative distance or angle between the key point of the second part and the central point of the first part as the association loss degree of the pair; or, if yes, inputting the key point of the second part and the central point of the first part into a deep learning model, and taking the calculation result as the association loss degree of the pair.
10. The system of claim 9, further comprising:
the judging module is used for judging whether each pair formed by the different components meets a preset association precondition; if so, the pairing is input to a computing module.
11. The system of claim 10, further comprising:
the association precondition is as follows:
the ratio of the range size of the first part contained by the second part in the horizontal direction to the range size of the first part and the ratio of the range size of the first part contained by the second part in the vertical direction to the range size of the first part in each pair are both greater than or equal to a preset first threshold;
alternatively,
the ratio of the distance between the vertex position of the first part and the same vertex position of the second part in each pair to the width or height of the first part is larger than a preset second threshold value or smaller than a preset second threshold value;
alternatively,
the ratio between the area of the intersection of the first portion and the second portion in each pair and the area of the first portion is greater than or equal to a preset third threshold.
12. The system of claim 10,
further comprising: when the different components are detected through the estimation model based on deep learning, key points of the different components are obtained simultaneously.
13. The system of claim 12, wherein the determining module further performs: if the pairing does not meet the preset association precondition, the pairing does not belong to the same target;
alternatively,
the calculation module specifically further performs the following operations: if the center point of the connecting line is not in the first part of the pair, the pair does not belong to the same object.
14. The system of claim 9, wherein the matching module specifically performs the following operations:
performing global optimal matching on all pairs according to the association loss degree of each pair to obtain a matching pair combination with the minimum association loss degree sum;
determining that each matching pair in the combination of matching pairs is a target.
15. The system of claim 11,
the target is a pedestrian;
the different components include: a human head frame and a human body frame from a frame of picture;
the pairing is a human head-human body frame pair formed by any human head frame and any human body frame in pairs;
wherein the first threshold is 75%;
wherein the vertex position is the top left vertex of each of the human head frame and the human body frame;
wherein a ratio between an area of the intersection and an area of the first portion is: the ratio of the area of the intersection of the human head frame and the human body frame to the area of the human head frame.
16. The system of claim 15,
the key points include: the head top key point and the neck key point of the human body frame;
the key points are obtained when a human body frame is detected through a posture estimation model based on deep learning;
the calculation module specifically executes the following operations:
when the central point of the connecting line of the key point of the head top and the key point of the neck of each head-human body frame pair is positioned in the head frame of the head-human body frames:
calculating the distance between the central point of the connecting line and the central point of the human head frame as the correlation loss degree of the human head-human body frame pair; or calculating the relative distance or angle between the key point of the head top or the key point of the neck of each human head-human body frame pair and the center point of the human head frame in the human head-human body frame pair as the correlation loss degree of the human head-human body frame pair; or inputting the key point of the head top and/or the key point of the neck of each head-human body frame pair and the central point of the head frame of the head-human body frame pair into a deep learning model, and calculating and outputting the correlation loss degree of the head-human body frame pair.
17. A video full-structured system, comprising:
a pre-trained detector model, a subsequent video full-structured unit, and a keypoint-based correlation system as claimed in any one of claims 9 to 16;
the detector model detects a first part and a second part from a frame of picture, provides the first part and the second part for the association system based on the key point for association, and outputs the correctly associated pairs to the subsequent video full-structured unit for structured processing.
18. A security system, comprising:
a network hard disk video recorder and/or a network camera, and the video full structured system according to claim 17;
and the network hard disk video recorder and/or the network camera provides the detected pictures for the video full-structured system.
19. A processing apparatus comprising a processor and a memory, the memory adapted to store a plurality of program codes, wherein the program codes are adapted to be loaded and run by the processor to perform the keypoint-based correlation method of any of claims 1 to 8.
20. A computer-readable storage medium storing a plurality of program codes, wherein,
the program code adapted to be loaded and run by a processor to perform the keypoint-based correlation method of any of claims 1 to 8.
CN202011451402.5A 2020-12-10 2020-12-10 Key point-based association method, system and medium Active CN112800825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011451402.5A CN112800825B (en) 2020-12-10 2020-12-10 Key point-based association method, system and medium

Publications (2)

Publication Number Publication Date
CN112800825A CN112800825A (en) 2021-05-14
CN112800825B true CN112800825B (en) 2021-12-03

Family

ID=75806654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011451402.5A Active CN112800825B (en) 2020-12-10 2020-12-10 Key point-based association method, system and medium

Country Status (1)

Country Link
CN (1) CN112800825B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192048A (en) * 2021-05-17 2021-07-30 广州市勤思网络科技有限公司 Multi-mode fused people number identification and statistics method
CN113591785A (en) * 2021-08-12 2021-11-02 北京爱笔科技有限公司 Human body part matching method, device, equipment and storage medium
CN114359373B (en) * 2022-01-10 2022-09-09 杭州巨岩欣成科技有限公司 Swimming pool drowning prevention target behavior identification method and device, computer equipment and storage medium
CN114022910B (en) 2022-01-10 2022-04-12 杭州巨岩欣成科技有限公司 Swimming pool drowning prevention supervision method and device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740516A (en) * 2018-12-29 2019-05-10 深圳市商汤科技有限公司 A kind of user identification method, device, electronic equipment and storage medium
CN110443190A (en) * 2019-07-31 2019-11-12 腾讯科技(深圳)有限公司 A kind of object identifying method and device
US10482348B1 (en) * 2012-01-22 2019-11-19 Sr2 Group, Llc System and method for tracking coherently structured feature dynamically defined with migratory medium
WO2020041999A1 (en) * 2018-08-29 2020-03-05 Intel Corporation Apparatus and method for feature point tracking using inter-frame prediction
CN111814885A (en) * 2020-07-10 2020-10-23 云从科技集团股份有限公司 Method, system, device and medium for managing image frames

Non-Patent Citations (2)

Title
"Learning an Efficient and Robust Graph Matching Procedure for Specific Object Recognition";Jerome Revaud等;《2010 20th International Conference on Pattern Recognition》;20101007;第754-757页 *
"人像属性识别关键技术研究进展及应用探索";康运锋等;《警察技术》;20180331(第2期);第12-16页 *

Also Published As

Publication number Publication date
CN112800825A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112800825B (en) Key point-based association method, system and medium
CN109819208B (en) Intensive population security monitoring management method based on artificial intelligence dynamic monitoring
US10573018B2 (en) Three dimensional scene reconstruction based on contextual analysis
CN103824070B (en) Rapid pedestrian detection method based on computer vision
WO2020233397A1 (en) Method and apparatus for detecting target in video, and computing device and storage medium
CN109255802B (en) Pedestrian tracking method, device, computer equipment and storage medium
CN110956114A (en) Face living body detection method, device, detection system and storage medium
CN113313097B (en) Face recognition method, terminal and computer readable storage medium
CN112200056B (en) Face living body detection method and device, electronic equipment and storage medium
Teutsch et al. Robust detection of moving vehicles in wide area motion imagery
Krinidis et al. A robust and real-time multi-space occupancy extraction system exploiting privacy-preserving sensors
CN112434566A (en) Passenger flow statistical method and device, electronic equipment and storage medium
CN111833380A (en) Multi-view image fusion space target tracking system and method
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN112766046B (en) Target detection method and related device
CN111654668B (en) Monitoring equipment synchronization method and device and computer terminal
KR101542206B1 (en) Method and system for tracking with extraction object using coarse to fine techniques
KR20200060868A (en) multi-view monitoring system using object-oriented auto-tracking function
Zhou et al. Fast road detection and tracking in aerial videos
Unno et al. Vehicle motion tracking using symmetry of vehicle and background subtraction
CN115830513A (en) Method, device and system for determining image scene change and storage medium
CN112802112B (en) Visual positioning method, device, server and storage medium
Yu et al. Accurate motion detection in dynamic scenes based on ego-motion estimation and optical flow segmentation combined method
CN110572618B (en) Illegal photographing behavior monitoring method, device and system
CN114694204A (en) Social distance detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant