CN115311690B

CN115311690B - End-to-end pedestrian structural information and dependency relationship detection method thereof

Info

Publication number: CN115311690B
Application number: CN202211229170.8A
Authority: CN
Inventors: 区英杰; 符桂铭; 谭焯康; 董万里
Original assignee: Guangzhou Embedded Machine Tech Co ltd
Current assignee: Guangzhou Embedded Machine Tech Co ltd
Priority date: 2022-10-08
Filing date: 2022-10-08
Publication date: 2022-12-23
Anticipated expiration: 2042-10-08
Also published as: CN115311690A

Abstract

The invention discloses an end-to-end method for detecting pedestrian structured information and its dependency relationship, comprising the following steps: first, a universal structured labeling tool is developed and used for data labeling while the network structure is designed; during training, the dependency relationship is enhanced and labels are assigned to the prediction frames; the model then performs inference, the coordinates of the pedestrian structured information rectangular frame are regressed from the model output, and finally the loss function is calculated to update the model. The invention is suitable for edge equipment: when the computing resources of the edge device are limited, the end-to-end design maximizes the effect under the available resources. The dependency relationship is output directly while the pedestrian rectangular frame and the pedestrian structured information rectangular frame are detected, which avoids subsequent logic judgment and gives higher accuracy. In addition, a universal attribute labeling tool is developed that labels both the detection frames and the dependency relationship between frames, which effectively improves labeling efficiency.

Description

End-to-end pedestrian structural information and dependency relationship detection method thereof

Technical Field

The invention relates to the field of computer vision, in particular to a method for detecting end-to-end pedestrian structural information and the dependency relationship thereof.

Background

Pedestrian detection has long been a research hotspot in the field of intelligent video monitoring. Pedestrian detection obtains pedestrian rectangular frames in images and video frames. In industrial park applications, not only the pedestrian position but also effective information about the pedestrian needs to be extracted; this generally comprises structured information such as whether a work cap is worn, whether work clothes are worn, and whether a mask is worn.

Currently, there are two general ways to acquire this structured information. The first is a detection method, which supports detection of attributes that are physical articles and gives the rectangular frame of the article: a target detection algorithm detects the pedestrian rectangular frame and, at the same time, detects the structured information and gives the position of its rectangular frame. The second is a classification method, which supports recognition of non-entity attributes such as age and gender.

In the first mode, the pedestrian rectangular frame and the pedestrian structured information rectangular frame are detected independently, and the dependency relationship between them must be determined afterwards. At present, the dependency relationship between two frames is determined by calculating the intersection over union (IOU) between them; however, if several persons overlap, this approach can be confused, resulting in inaccurate dependency determination.

Specifically, examples of prior art adopting the first mode include:
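The confusion described above can be illustrated with a minimal sketch. The coordinates are hypothetical and the `assign_by_iou` helper is an assumption for illustration, not code from either cited patent: a helmet worn by the person in front lies inside the boxes of both overlapping persons, and the highest IOU goes to the smaller box of the person behind.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def assign_by_iou(attr_box, pedestrian_boxes):
    """Return the index of the pedestrian box with the highest IOU."""
    scores = [iou(attr_box, p) for p in pedestrian_boxes]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scene: person 0 stands in front, person 1 is partly hidden.
person_front = (100, 100, 200, 400)
person_back = (120, 90, 200, 260)   # smaller visible box
helmet_of_front = (130, 100, 170, 140)  # annotated as belonging to person 0

owner = assign_by_iou(helmet_of_front, [person_front, person_back])
print(owner)  # the helmet is attributed to person 1, not to its wearer
```

Because the helmet box is contained in both pedestrian boxes, its IOU with each is simply the helmet area divided by the pedestrian box area, so the smaller box always wins regardless of who wears the helmet.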

1. Patent title: a pedestrian detection tracking method and device based on multi-attribute analysis; publication number: CN114092558A.

The patent redefines a pedestrian detection network: a preprocessed pedestrian image is input into a preset network structure, which extracts features from the image to obtain a feature map, detects and tracks each target according to the feature map, and outputs a detection and tracking result. The result comprises the detected target, the tracked target, the target's frontal face information and position information, whether the target is riding a bicycle, and so on. The advantages are: detection of the target, tracking of the target, judgment of whether the target contains the required information, and localization of that information are all completed by a single preset network structure, so multi-attribute analysis is realized and one model serves multiple purposes; in addition, whether pedestrians in the image are riding a bicycle and whether their frontal faces are visible can be determined without sending the captured pedestrians to another network for analysis.

The technical scheme has the following disadvantages:

(1) By adopting a multi-task model, the detection efficiency can be improved, but the model is difficult to train, and the training data labeling cost is high.

(2) The attribute regression frame is predicted relative to the center of the person: only the offsets of the upper-left and lower-right corners of the attribute target relative to the person's center are predicted to obtain the frame position. The center of the person is itself obtained by model prediction and has an error a, and the prediction of the attribute regression frame has an error b, so the indirectly obtained attribute regression frame has an accumulated error a + b.

(3) The augmentation used during training is too limited, and related tricks such as cropping are not adopted, which may affect the model's recall.

(4) The designed model network is customized and improved aiming at the face with small resolution and whether to ride the bicycle or not, and is not necessarily applicable to other attribute scenes.

2. Patent title: a pedestrian detection method, system and terminal device; publication number: CN110245564A.

The method comprises the steps that a pedestrian in a target image is identified through a multi-object convolution depth network model, and a first identification frame is added to the pedestrian in the target image; the training task of the multi-object convolution depth network model comprises a semantic task and pedestrian detection; identifying a specific object in the target image through the convolutional neural network VGG19, and adding a second identification frame to the specific object in the target image; and judging whether the first identification frame and the second identification frame are overlapped, if so, judging that the pedestrian carries a specific object, and triggering a preset monitoring event.

The technical scheme has the following disadvantages:

(1) Two models are used to detect pedestrians and specific articles separately, which increases detection time and memory usage.

(2) Whether the pedestrian carries a specific object is judged by whether the first identification frame and the second identification frame overlap, measured by the intersection over union (IOU). This approach can be confused if several persons overlap.

The disadvantages of the two patents are summarized as follows:

(1) The scheme is inefficient.

(2) The dependency relationship between the pedestrian rectangular frame and the pedestrian structured information rectangular frame has to be calculated from the intersection over union (IOU), and accuracy suffers when several persons overlap.

(3) The universality is not strong, and the attribute categories cannot be expanded.

(4) The regression mode of the model attribute box is indirect regression, and the accuracy can be influenced.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide an end-to-end pedestrian structured information and dependency relationship detection method that is suitable for edge equipment. When the computing resources of the edge device are limited (the TPU of an edge device has limited computing power compared with the GPU of a server), the end-to-end design of the invention maximizes the effect under the available resources.

The purpose of the invention is realized by the following technical scheme:

An end-to-end pedestrian structured information and dependency relationship detection method comprises the following steps:

S1, improving a Yolox model: S channels and 4*S channels are added respectively to the obj_output branch and the reg_output branch of the Decoupled Head of the Yolox model, wherein S is the number of categories of pedestrian structured information;

S2, before training the improved Yolox model, carrying out data annotation on the images of the training samples with an annotation tool, wherein the annotated information comprises the pedestrian rectangular frame, the pedestrian structured information rectangular frame, and the dependency relationship information between the pedestrian rectangular frame and the pedestrian structured information rectangular frame;

S3, training the improved Yolox model:

enhancing the image data of the training samples, enhancing the dependency relationship between the pedestrian rectangular frame and the pedestrian structured information rectangular frame, and assigning labels to the prediction frames;

inputting the enhanced image, performing inference with the improved Yolox model, and regressing the coordinates of the pedestrian structured information rectangular frame and the coordinates of the pedestrian rectangular frame from the output information of the improved Yolox model, while directly obtaining the dependency relationship between the pedestrian rectangular frame and the pedestrian structured information rectangular frame;

calculating the loss function, updating the improved Yolox model, and finishing training;

S4, inputting the image to be detected into the improved and trained Yolox model, which directly outputs, end to end, the coordinates of the pedestrian structured information rectangular frame and of the pedestrian rectangular frame, together with the dependency relationship between the pedestrian rectangular frame and the pedestrian structured information rectangular frame.

The improvement of the Yolox model specifically includes:

firstly, S output channels are added to the obj_output branch of the Decoupled Head of the Yolox model, giving an output size of H × W × (1 + S), and 4 × S channels are added to the reg_output branch of the Decoupled Head, giving an output size of H × W × (4 + 4 × S); the output size of the cls_output branch of the Decoupled Head is H × W × 1; wherein H is the height of the output feature map, and W is the width of the output feature map;

the inference performed by the improved Yolox model is specifically: the obj_output, reg_output and cls_output branches of the Decoupled Head are merged to obtain the final feature information, whose size is pred_num × dim_s; wherein pred_num = W × H represents the number of prediction frames, and dim_s = 1 + 1 + S + 4 + 4 × S represents the feature vector dimension of each prediction frame; each prediction frame then contains a feature vector of dimension dim_s:

[x y w h obj cls attr_1 ... attr_n x_1 y_1 w_1 h_1 ... x_n y_n w_n h_n]

wherein x is the x coordinate of the center point of the target frame, y is the y coordinate of the center point of the target frame, w is the width of the target frame, and h is the height of the target frame; obj is the score of the target frame; cls is the score of the target frame category; attr_n is the score of structured information n, and [x_n y_n w_n h_n] is the frame coordinate information of structured information n;

according to the feature vector, the pedestrian structured information is judged as follows: when the value of obj × cls meets the pedestrian rectangular frame score threshold, the current prediction frame is considered to contain a pedestrian, and [x y w h] is the coordinate information of the pedestrian rectangular frame; in that case, if the score attr_n of structured information n meets the structured information probability threshold, the pedestrian is considered to carry structured information n, and [x_n y_n w_n h_n] is the coordinate information of the rectangular frame of pedestrian structured information n; this completes the end-to-end pedestrian structured information detection.

The dependency relationship information between the pedestrian rectangular frame and the pedestrian structured information rectangular frame is labeled as follows: in the labeling tool, a line is drawn directly between the upper-left corners of the pedestrian rectangular frame and the pedestrian structured information rectangular frame to establish the dependency relationship; the id pairing information of the two frames is obtained from this line and stored in the dependency relationship label file.

After the image of the training sample is subjected to data annotation, the annotated data format comprises three parts: image data images, rectangular frame label information labels and dependency relationship label information images relevate; each part is divided into a training part and a testing part.

The enhancing of the image data of the training samples specifically includes: storing the dependency relationship label information of the pedestrian rectangular frames and the pedestrian structured information rectangular frames in a queue; then applying mosaic and mixup data enhancement to the image data together with the label information of the pedestrian rectangular frames and the pedestrian structured information rectangular frames; finally, checking again the pedestrian rectangular frames and pedestrian structured information rectangular frames that still exist after enhancement, and judging whether each dependency relationship still exists or has been newly added according to whether its pedestrian rectangular frame and structured information rectangular frame still exist or have been newly added, so that dependency relationships are deleted from or added to the original queue; the updated queue is the enhanced dependency relationship label information.

The labels are assigned to the prediction frames as follows: when labels are assigned to the prediction frames of the pedestrian rectangular frame, the label information of the real rectangular frame is used to divide the feature map on the decoupling head into positive and negative sample regions, that is, all prediction frames inside the real rectangular frame are taken as positive sample candidate frames and the rest are negative samples; when labels are assigned to the prediction frames of the pedestrian structured information rectangular frame, the real frame of the pedestrian structured information rectangular frame is not used to divide the feature map on the decoupling head, and the real frame of the pedestrian rectangular frame is used instead, so that label assignment stays consistent between the pedestrian rectangular frame and the pedestrian structured information rectangular frame; this accelerates convergence of the training model and avoids the situation in which the trained model detects the pedestrian rectangular frame but fails to detect the structured information.

The structured information rectangular frame regression method specifically comprises the following steps:

when the resolution of the input image is 640 × 640, the three Decoupled Heads output feature maps at different down-sampling scales, and the feature map sizes W × H are respectively 20 × 20 (after 5 down-sampling steps, i.e. a down-sampling factor of 32; the others follow the same principle), 40 × 40 and 80 × 80;

each cell of a feature map has a corresponding anchor frame (anchor) (the anchor frame establishes the link between the feature map and rectangular frames in actual pixel coordinates and accelerates model convergence during training); when the feature map W × H is 20 × 20, the anchor frame size is 32 × 32, consistent with the down-sampling factor; the improved Yolox model gives the rectangular frame coordinate information [x_n y_n w_n h_n] of structured information n under a certain cell (U_w, U_h), and the rectangular frame coordinates at the actual resolution are calculated by combining it with the anchor frame information of the feature map in which the cell is located; wherein x_n is the x offset of the center point of the rectangular frame relative to the current cell, and y_n is the y offset of the center point of the rectangular frame relative to the current cell;

assuming that the width of the anchor frame is anchor_w and the height of the anchor frame is anchor_h, the actual pixel coordinates of the structured information rectangular frame are as follows:

wherein (x_pixel, y_pixel) is the center point of the rectangular frame, w_pixel is the width of the rectangular frame, and h_pixel is the height of the rectangular frame;

in the above calculation formula, only [x_n y_n w_n h_n] is obtained by network prediction, and the rest is preset information, so that the pedestrian structured information rectangular frame is obtained directly from the improved Yolox model without using the pedestrian rectangular frame as an intermediate result, and the accumulation of intermediate errors is avoided.

For the loss function during model training, structured information foreground probability loss attr_obj_loss and pedestrian structured information rectangular frame regression loss attr_reg_loss are added on the basis of Yolox; attr_obj_loss uses the binary cross-entropy loss function BCEWithLogitsLoss, and attr_reg_loss uses the intersection-over-union loss function IOULoss; the two newly added losses are added to the original loss function to obtain the final loss function.

The activation function of the Yolox model is set to relu, and the channel coefficient of the decoupling head is set to 0.5.
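As a minimal sketch of how the two added terms enter the final loss, the following pure-Python stand-ins compute a binary cross-entropy on a raw logit and an IOU-based regression loss, then sum them with the original loss. The function names, the 1 - IOU form of the regression loss, and the unweighted sum are assumptions for illustration; the source only states which loss types are used and that they are added to the original loss.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit (numerically stable form)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def iou_loss(pred, gt, eps=1e-16):
    """1 - IOU for boxes given as (cx, cy, w, h); an assumed simple form."""
    px1, py1 = pred[0] - pred[2] / 2, pred[1] - pred[3] / 2
    px2, py2 = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx1, gy1 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2
    gx2, gy2 = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return 1.0 - inter / (union + eps)


def final_loss(base_loss, attr_obj_loss, attr_reg_loss):
    """Original Yolox loss plus the two structured-information terms."""
    return base_loss + attr_obj_loss + attr_reg_loss


# Hypothetical values for one positive sample of one structured-information class.
attr_obj = bce_with_logits(logit=2.0, target=1.0)
attr_reg = iou_loss(pred=(50, 50, 20, 20), gt=(52, 50, 20, 20))
print(final_loss(base_loss=1.0, attr_obj_loss=attr_obj, attr_reg_loss=attr_reg))
```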

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The method directly outputs the dependency relationship while detecting the pedestrian rectangular frame and the pedestrian structured information rectangular frame, thereby avoiding subsequent logic judgment and achieving higher accuracy. Meanwhile, a universal attribute labeling tool is developed that labels both the detection frames and the dependency relationship between the frames, so that labeling efficiency is effectively improved.

2. The method modifies the target detection network Yolox model, realizes the end-to-end output of the membership between the pedestrian rectangular frame and the pedestrian structural information rectangular frame, avoids the judgment of whether the subsequent frames are overlapped, and effectively improves the accuracy and efficiency of structural information detection through data enhancement and label distribution. Meanwhile, the detection of various structured information can be easily supported through the expansion of the model output channel, more time consumption is not caused, and the universality of the model is ensured. In addition, a new attribute labeling tool is developed, so that the detection frames can be labeled, the dependency relationship among the frames can be labeled, and the labeling efficiency is effectively improved.

Drawings

FIG. 1 is a flow chart of a method for detecting end-to-end pedestrian structural information and its dependency relationship;

FIG. 2 is a schematic diagram of a data enhancement scheme for dependencies;

FIG. 3 is a diagram illustrating the labeled data format.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the embodiments of the present invention are not limited thereto.

Referring to FIG. 1 to FIG. 3, a method for detecting end-to-end pedestrian structured information and its dependency relationship includes the following steps: first, a universal structured labeling tool is developed and used for data labeling while the network structure is designed; during training, the dependency relationship is enhanced and labels are assigned to the prediction frames; the model then performs inference, the coordinates of the pedestrian structured information rectangular frame are regressed from the model output, and finally the loss function is calculated to update the model.

The implementation process is as follows:

1. Network architecture design

Yolox takes yolov3_spp as its baseline model and improves it with tricks such as the Decoupled Head and the Anchor-free design, finally obtaining the Yolox-Darknet53 network structure. The Yolox paper points out that the detection heads of the Yolov3 to Yolov5 series may lack expressive power compared with a Decoupled Head, and that adopting the Decoupled Head not only improves accuracy but also speeds up network convergence. The decoupling has a further significance: the Yolox network architecture can be integrated with many algorithm tasks. For example, YOLOX + Yolact/CondInst/SOLO realizes on-device instance segmentation, and YOLOX + 34 outputs realizes on-device detection of 17 human body keypoints. This is also why the present invention adopts Yolox: integrated detection of pedestrian structured information can be realized by adding output channels.

The Decoupled Head has three branches: cls_output mainly predicts the category scores of the target frame, with output size W × H × C; obj_output mainly judges whether the target frame is foreground or background, with output size W × H × 1; reg_output mainly predicts the coordinate information (x, y, w, h) of the target frame, with output size W × H × 4. Here W is the width of the output feature map, H is its height, and C is the number of detection categories; for example, C is 1 when pedestrians are the only detection category. The three outputs are merged to obtain the final feature information of size pred_num × dim, where pred_num = W × H is the number of prediction frames and dim = C + 1 + 4 is the feature vector dimension of each prediction frame. Each prediction frame then contains a feature vector of dimension dim:

[cls_1 cls_2 ... cls_C obj x y w h]

wherein x is the x coordinate of the center point of the target frame; y is the y coordinate of the center point of the target frame; w is the width of the target frame; h is the height of the target frame; obj is the score of the target frame; cls_1 is the score of target frame category 1; cls_C is the score of target frame category C.

To add detection of pedestrian structured information, the following changes are made to the Decoupled Head: S output channels are added to the obj_output branch, giving an output size of W × H × (1 + S), and 4 × S channels are added to the reg_output branch, giving an output size of W × H × (4 + 4 × S). The number S of pedestrian structured information categories can be changed according to actual project requirements, which gives the algorithm a degree of universality; for example, when detecting whether a mask is worn and whether a work cap is worn, S is 2. Finally, the three outputs are merged to obtain the final feature information of size pred_num × dim_s, where pred_num = W × H is the number of prediction frames and dim_s = C + 1 + S + 4 + 4 × S is the feature vector dimension of each prediction frame. Each prediction frame then contains a feature vector of dimension dim_s:

[x y w h obj cls attr_1 ... attr_n x_1 y_1 w_1 h_1 ... x_n y_n w_n h_n]

wherein attr_n is the score of structured information n, and [x_n y_n w_n h_n] is the frame coordinate information of structured information n.

According to the feature vector, the pedestrian structured information is judged as follows: when the value of obj × cls meets the pedestrian rectangular frame score threshold, the current prediction frame is considered to contain a pedestrian, and [x y w h] is the coordinate information of the pedestrian rectangular frame; in that case, if the score attr_n of structured information n meets the structured information probability threshold, the pedestrian is considered to carry structured information n, and [x_n y_n w_n h_n] is the coordinate information of the rectangular frame of pedestrian structured information n. This completes the end-to-end pedestrian structured information detection.
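The layout of the merged feature vector can be sketched as a small parsing helper. It assumes a single detection category (C = 1, pedestrians only), so the vector has 6 + 5 × S entries; the function name `split_prediction` and the returned dictionary keys are hypothetical.

```python
def split_prediction(vec, num_attrs):
    """Split one prediction vector laid out as
    [x y w h obj cls attr_1..attr_S x_1 y_1 w_1 h_1 .. x_S y_S w_S h_S]."""
    expected = 6 + 5 * num_attrs  # 4 + 1 + 1 + S + 4*S with C = 1
    if len(vec) != expected:
        raise ValueError(f"expected {expected} values, got {len(vec)}")
    start = 6 + num_attrs  # index of x_1
    return {
        "box": tuple(vec[0:4]),
        "obj": vec[4],
        "cls": vec[5],
        "attr_scores": list(vec[6:start]),
        "attr_boxes": [
            tuple(vec[start + 4 * i:start + 4 * i + 4]) for i in range(num_attrs)
        ],
    }


# S = 2 (for example mask and work cap): the vector has 16 entries.
fields = split_prediction(list(range(16)), num_attrs=2)
print(fields["attr_scores"], fields["attr_boxes"])
```

Because each structured information box sits at a fixed position in the same vector as its pedestrian box, the pairing between the two is given by the layout itself and needs no post-hoc matching.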
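The judgment process can be sketched as follows. The threshold values of 0.5, the function name, and the use of "greater than or equal to" for "meets the threshold" are assumptions; the source does not give numeric thresholds.

```python
def detect_structured_info(box, obj, cls, attr_scores, attr_boxes,
                           person_thresh=0.5, attr_thresh=0.5):
    """Return the pedestrian box and its structured-information boxes,
    or None when the prediction does not contain a pedestrian."""
    if obj * cls < person_thresh:
        return None  # not a pedestrian: attribute scores are ignored
    attributes = {
        n: attr_boxes[n]
        for n, score in enumerate(attr_scores)
        if score >= attr_thresh
    }
    return {"pedestrian": box, "attributes": attributes}


# Hypothetical prediction with S = 2: attribute 0 present, attribute 1 absent.
result = detect_structured_info(
    box=(320.0, 240.0, 80.0, 200.0), obj=0.9, cls=0.95,
    attr_scores=[0.8, 0.1],
    attr_boxes=[(320.0, 150.0, 40.0, 30.0), (0.0, 0.0, 0.0, 0.0)],
)
print(result)
```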

2. Network improvements for edge end migration

In order to reduce the inference time after the model edge end transplantation, the following changes are further made to the network: the activation function is modified from silu to relu, and the decoupling head channel coefficient is modified from 1.0 to 0.5.

3. Design of pedestrian structured information rectangular frame regression method

For the improved network, when the input image resolution is 640 × 640, the three Decoupled Heads output feature maps at different down-sampling scales, and the feature map sizes W × H are 20 × 20 (after 5 down-sampling steps, i.e. a down-sampling factor of 32; the others follow the same principle), 40 × 40 and 80 × 80. Each cell of a feature map has a corresponding anchor frame (anchor); the anchor frame establishes the link between the feature map and rectangular frames in actual pixel coordinates and accelerates model convergence during training. When the feature map W × H is 20 × 20, the anchor frame size is 32 × 32, which corresponds to the down-sampling factor. According to the network structure design, the rectangular frame coordinate information [x_n y_n w_n h_n] of pedestrian structured information n under a certain cell (U_w, U_h) is given, and the rectangular frame coordinates at the actual resolution can be calculated by combining it with the anchor frame information of the feature map in which the cell is located. Here x_n is the x offset of the center point of the rectangular frame relative to the current cell, and y_n is the y offset of the center point of the rectangular frame relative to the current cell. Assuming that the width of the anchor frame is anchor_w and the height of the anchor frame is anchor_h, the actual pixel coordinates of the structured information rectangular frame are as follows:

where (x_pixel, y_pixel) is the center point of the rectangular frame, w_pixel is the width of the rectangular frame, and h_pixel is the height of the rectangular frame.

In the above calculation formula, only [x_n y_n w_n h_n] is obtained by network prediction, and the rest is preset information. This means the pedestrian structured information rectangular frame is obtained directly from the model, without intermediate results such as the pedestrian rectangular frame, so the accumulation of intermediate errors is avoided.
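A minimal sketch of such a decoding step is given below. It assumes the convention used by the public Yolox implementation (cell index plus predicted offset, scaled by the stride; exponential of the predicted size, scaled by the stride), with the anchor frame size standing in for the stride; the exact expression of the invention may differ, and the function name is hypothetical.

```python
import math


def decode_attr_box(x_n, y_n, w_n, h_n, u_w, u_h, anchor_w, anchor_h):
    """Map a predicted [x_n y_n w_n h_n] at cell (u_w, u_h) to pixel coordinates.

    Assumed form: centre = (cell index + offset) * anchor size,
                  size   = exp(prediction) * anchor size.
    """
    x_pixel = (u_w + x_n) * anchor_w
    y_pixel = (u_h + y_n) * anchor_h
    w_pixel = math.exp(w_n) * anchor_w
    h_pixel = math.exp(h_n) * anchor_h
    return x_pixel, y_pixel, w_pixel, h_pixel


# 20 x 20 feature map of a 640 x 640 input: the anchor frame is 32 x 32.
print(decode_attr_box(0.5, 0.5, 0.0, 0.0, u_w=3, u_h=4, anchor_w=32, anchor_h=32))
# → (112.0, 144.0, 32.0, 32.0)
```

Only the first four arguments come from the network; the cell index and anchor size are fixed by the feature map, which is why no pedestrian box prediction enters the calculation.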

4. Data enhancement scheme for dependencies

For a convolutional neural network, overfitting caused by overly uniform training data is avoided during training by data enhancement, a form of regularization that increases the diversity of the data set and thereby improves the generalization ability of the model. The data samples of the invention consist of three parts: image data, rectangular frame label information (covering pedestrian rectangular frames and pedestrian structured information rectangular frames), and dependency relationship label information between the pedestrian rectangular frames and the pedestrian structured information rectangular frames. The original Yolox data enhancement uses mosaic and mixup, which splice training images by random scaling, random cropping and random arrangement, so only two of the parts, the image data and the rectangular frame label information, are augmented.

On the basis of the original mosaic and mixup data enhancement, the invention adds enhancement of the dependency relationship between the pedestrian rectangular frame and the pedestrian structured information rectangular frame. During data enhancement each image is spliced, deleted and cropped, and as a result rectangular frame labels are added, cropped, distorted or removed; the corresponding dependency relationships are added or deleted in step.

The basic principle of the implementation is as follows: before data enhancement, the dependency relationship label information of the rectangular frames is stored in a queue; mosaic and mixup data enhancement is then applied to the image data and the rectangular frame label information; finally, the pedestrian rectangular frames and pedestrian structured information rectangular frames that still exist after enhancement are checked again, and whether each dependency relationship still exists or has been newly added is judged according to whether its pedestrian rectangular frame and structured information rectangular frame still exist or have been newly added, so that dependency relationships are deleted from or added to the original queue. The updated queue is the enhanced dependency relationship label information.
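The queue update can be sketched as follows. The representation of a dependency relationship as a (pedestrian id, structured information id) pair, the per-image id sets, and both function names are assumptions for illustration; the actual mosaic and mixup transforms are not shown, only their effect on which frames survive.

```python
from collections import deque


def update_relations(relations, surviving_ids):
    """Keep only pairs whose pedestrian and structured-info boxes both survive."""
    return deque(
        (p, s) for p, s in relations if p in surviving_ids and s in surviving_ids
    )


def merge_relations(per_image_relations, per_image_surviving):
    """Combine the surviving pairs of several source images (as spliced together
    by mosaic or mixup), tagging each id with its source image index so that
    ids from different images cannot collide."""
    merged = deque()
    for img_idx, (rels, alive) in enumerate(
        zip(per_image_relations, per_image_surviving)
    ):
        for p, s in update_relations(rels, alive):
            merged.append(((img_idx, p), (img_idx, s)))
    return merged


# Image 0: pedestrian 1 owns boxes 2 and 3, but box 3 was cropped away.
# Image 1: pedestrian 5 owns box 6; both survive and are newly added.
queue = merge_relations(
    per_image_relations=[[(1, 2), (1, 3)], [(5, 6)]],
    per_image_surviving=[{1, 2}, {5, 6}],
)
print(list(queue))
```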

5. Label allocation scheme for prediction boxes

In the design of the network structure, we mention that a feature map on a decoupling head has pred _ num prediction blocks output. The label distribution is that in the process of model training, real rectangular frame label information is distributed to each prediction frame, the corresponding relation between the prediction frames and the labeled real rectangular frame label information is established, namely, which prediction frames are used as foreground positive samples and which prediction frames are used as background negative samples are determined, then the loss between the prediction frames and the real rectangular frame label information is calculated by using a loss function, and finally the model weight is updated through back propagation.

In the original Yolox, the assignment of prediction frames is determined by using the real rectangular frame label information to divide the feature map on the decoupling head into positive and negative sample regions: all prediction frames inside a real rectangular frame are positive sample candidate frames, and the rest are negative samples. The prediction frames of the invention are handled in two cases: the prediction frame of a pedestrian rectangular frame (hereinafter P_Pred_Rect) and the prediction frame of a pedestrian structural information rectangular frame (hereinafter S_Pred_Rect). Let the real frame of the pedestrian rectangular frame be P_True_Rect and the real frame of the pedestrian structural information rectangular frame be S_True_Rect. When labels are assigned to P_Pred_Rect, the original Yolox label assignment scheme is used directly.

When labels are assigned to S_Pred_Rect, the following change is made: S_True_Rect is not used to divide the feature map on the decoupling head into regions; P_True_Rect is used instead. This keeps label assignment consistent between the pedestrian rectangular frame and the pedestrian structural information rectangular frame, accelerates convergence of the trained model, and avoids the case where the trained model detects the pedestrian rectangular frame but not its structural information. The main reason is that the detection result expresses whether a dependency relationship exists between the pedestrian rectangular frame and the pedestrian structural information rectangular frame, so the existence of a structural information rectangular frame is judged only on the basis of a detected pedestrian rectangular frame. If the S_True_Rect region were used to assign labels to S_Pred_Rect, prediction frames located inside the P_True_Rect region but outside the S_True_Rect region would be assigned as negative samples, and a model trained in this way could fail, for some prediction frames, to judge accurately whether a structural information rectangular frame exists after the pedestrian rectangular frame has been detected.

For example, consider a cell with coordinates (U_w, U_h) on the feature map that lies inside the P_True_Rect region but outside the S_True_Rect region. If the S_True_Rect region is used to assign labels to S_Pred_Rect, the S_Pred_Rect of this cell is assigned as a negative sample while its P_Pred_Rect is assigned as a positive sample. The prediction frame feature vector of the cell is then [x y w h 1 cls 0 x_1 y_1 w_1 h_1], so the structural information rectangular frame is predicted to be absent after the pedestrian rectangular frame has been detected. If the P_True_Rect region is used to assign labels to S_Pred_Rect instead, the S_Pred_Rect of the cell is assigned as a positive sample, and label assignment stays consistent between the pedestrian rectangular frame and the pedestrian structural information rectangular frame.
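The difference between the two region choices can be illustrated with a small sketch. It covers only the region-division step that produces positive sample candidates; any later selection among the candidates is omitted. The function names, the use of cell centers to test membership, and the toy 4 × 4 feature map are assumptions made for the example.

```python
def in_box(cx, cy, box):
    """True if point (cx, cy) lies inside box = (x1, y1, x2, y2), in pixels."""
    x1, y1, x2, y2 = box
    return x1 <= cx <= x2 and y1 <= cy <= y2


def candidate_masks(grid_w, grid_h, stride, p_true_rect, s_true_rect,
                    use_pedestrian_region=True):
    """Return (p_mask, s_mask): per-cell positive-candidate flags.

    p_mask marks cells whose center lies in P_True_Rect.
    s_mask also uses P_True_Rect when use_pedestrian_region is True
    (the scheme described in the text); otherwise it uses S_True_Rect.
    """
    region = p_true_rect if use_pedestrian_region else s_true_rect
    p_mask, s_mask = [], []
    for v in range(grid_h):
        for u in range(grid_w):
            cx, cy = (u + 0.5) * stride, (v + 0.5) * stride
            p_mask.append(in_box(cx, cy, p_true_rect))
            s_mask.append(in_box(cx, cy, region))
    return p_mask, s_mask


# Toy 4 x 4 feature map with stride 32 (a 128 x 128 input).
p_rect = (0, 0, 64, 128)  # pedestrian frame covers the left half
s_rect = (0, 0, 64, 32)   # structural information frame covers the top-left strip

p_a, s_a = candidate_masks(4, 4, 32, p_rect, s_rect, use_pedestrian_region=True)
p_b, s_b = candidate_masks(4, 4, 32, p_rect, s_rect, use_pedestrian_region=False)

# Cells that are positive for the pedestrian but negative for the attribute.
inconsistent_a = sum(p and not s for p, s in zip(p_a, s_a))
inconsistent_b = sum(p and not s for p, s in zip(p_b, s_b))
print(inconsistent_a, inconsistent_b)
```

With the P_True_Rect region no cell receives conflicting labels, whereas with the S_True_Rect region six of the eight pedestrian-positive cells are labeled as having no structural information.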

6. Training loss function loss scheme design

For the loss function used in model training, a structured information foreground probability loss attr_obj_loss and a pedestrian structural information rectangular frame regression loss attr_reg_loss are added on the basis of Yolox. attr_obj_loss uses the binary cross-entropy loss function BCEWithLogitsLoss, and attr_reg_loss uses the intersection-over-union loss function IOULoss. The two new losses are added to the original loss function to obtain the final loss function.
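A minimal sketch of how the two added terms could be combined with the original loss is given below, written in plain Python for a single prediction frame. Several points are assumptions rather than statements from the text: the `1 - IoU` form of the regression loss, the unweighted sum, the `(cx, cy, w, h)` box layout, and the choice to compute the regression loss only for structured information that is actually present.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in the numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ax2 = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay1, ay2 = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx1, bx2 = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by1, by2 = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def total_loss(base_loss, attr_logits, attr_targets, attr_pred_boxes, attr_true_boxes):
    """base_loss stands for the original Yolox loss; the two new terms are added to it."""
    attr_obj_loss = sum(
        bce_with_logits(z, t) for z, t in zip(attr_logits, attr_targets)
    )
    # Assumption: regress only the boxes of structured information that exists.
    attr_reg_loss = sum(
        1.0 - iou(p, g)
        for p, g, t in zip(attr_pred_boxes, attr_true_boxes, attr_targets)
        if t == 1
    )
    return base_loss + attr_obj_loss + attr_reg_loss


loss = total_loss(
    base_loss=1.0,
    attr_logits=[0.0],
    attr_targets=[1.0],
    attr_pred_boxes=[(10.0, 10.0, 4.0, 4.0)],
    attr_true_boxes=[(10.0, 10.0, 4.0, 4.0)],
)
print(loss)
```

Here the predicted box matches the real box exactly, so the regression term is zero and only the foreground probability term adds to the base loss.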

7. Universal structured marking tool development and data reading mode

To improve the efficiency of labeling structured information data, a new structured labeling tool, LabelImg-Attr, was also developed. With this tool, a line can be drawn directly between the upper-left corners of a pedestrian rectangular frame and a pedestrian structural information rectangular frame to establish the attribute relationship. The labeled data consists of three parts: image data (images), rectangular frame label information (labels) and dependency relationship label information (relevate). Each part is divided into a training part and a testing part.

The txt information under labels is shown below: the first column is the category, the second to fifth columns are the frame information, and the last column is the added id information.

2 0.374023 0.503125 0.041797 0.122917 0

1 0.363281 0.400694 0.028906 0.063889 1

0 0.073828 0.877431 0.110937 0.245139 2

0 0.369727 0.536111 0.091797 0.340278 3

2 0.084375 0.907292 0.078125 0.177083 4

The corresponding label information under relevate is as follows, wherein the first number of each line is host id, and the second number is attribute id.

3,0

3,1

2,4
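The two file formats above can be read and joined with a few lines of Python. The sketch below embeds the sample rows as strings instead of reading files; the function names and the returned data structures are assumptions made for the example, and the four frame values are kept as given because the text does not state their exact meaning.

```python
LABELS_TXT = """\
2 0.374023 0.503125 0.041797 0.122917 0
1 0.363281 0.400694 0.028906 0.063889 1
0 0.073828 0.877431 0.110937 0.245139 2
0 0.369727 0.536111 0.091797 0.340278 3
2 0.084375 0.907292 0.078125 0.177083 4
"""

RELEVATE_TXT = """\
3,0
3,1
2,4
"""


def parse_labels(text):
    """Map frame id -> (category, [four frame values]) from a labels file."""
    frames = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        category = int(parts[0])
        values = [float(v) for v in parts[1:5]]
        frame_id = int(parts[5])
        frames[frame_id] = (category, values)
    return frames


def parse_relations(text):
    """Return a list of (host_id, attribute_id) pairs from a relevate file."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            host_id, attr_id = (int(v) for v in line.split(","))
            pairs.append((host_id, attr_id))
    return pairs


def group_by_host(pairs):
    """Map each host id to the list of attribute ids attached to it."""
    grouped = {}
    for host_id, attr_id in pairs:
        grouped.setdefault(host_id, []).append(attr_id)
    return grouped


frames = parse_labels(LABELS_TXT)
grouped = group_by_host(parse_relations(RELEVATE_TXT))
print(grouped)
```

In this sample both host ids (3 and 2) refer to frames of category 0, and the frames attached to them have categories 2 and 1.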

The key points of the technical scheme of the embodiment are as follows:

(1) An end-to-end pedestrian structural detection model obtained by improving the Yolox model. A method of increasing the number of channels at the decoupled output head is proposed to realize pedestrian structural detection; the number of channels can be expanded with the number of structured information types, which ensures the generality of the algorithm. No existing patent proposes a comparable general structured information detection method. The improved model outputs the dependency relationship between the pedestrian rectangular frame and the structural information rectangular frame end to end. This avoids the prior-art approach of first using intersection over union to judge whether frames overlap and then judging whether a dependency relationship exists.

(2) A structured information rectangular frame regression method. The model directly predicts the offsets of the center point of the pedestrian structural information rectangular frame in the x and y directions relative to the current cell of the model output feature map, and the exponent information of the width and height of the pedestrian structural information rectangular frame relative to the width and height of the anchor frame of that cell. Combined with other preset information, these give the rectangular frame in actual pixel coordinates. In other words, the pedestrian structural information rectangular frame is obtained directly by the model; unlike the prior art, no intermediate results such as pedestrian rectangular frame information are needed, so accumulation of intermediate errors is avoided.

(3) A data enhancement scheme for the dependency relationship between the pedestrian rectangular frame and the pedestrian structural information rectangular frame. It effectively increases the diversity of dependency relationship data during training. No existing patent proposes a comparable data enhancement scheme.

(4) A label assignment scheme for the pedestrian structural information rectangular frame. Label assignment is kept consistent between the pedestrian rectangular frame and the pedestrian structural information rectangular frame, which accelerates convergence of the trained model and avoids the case where the trained model detects the pedestrian rectangular frame but not its structural information. No existing patent proposes a comparable label assignment scheme.

(5) A universal structured labeling tool. The invention develops a universal structured labeling tool, LabelImg-Attr, with which a line can be drawn directly between the upper-left corners of a pedestrian rectangular frame and a pedestrian structural information rectangular frame to establish the attribute relationship. No existing patent proposes a comparable labeling tool.
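Key points (1) and (2) can be illustrated by interpreting a single prediction frame feature vector. The sketch below is hedged in several ways: the decoding rule (center = (cell + offset) × stride, size = exp(value) × anchor side, with a square anchor whose side equals the stride) is an assumed Yolox-style rule consistent with the description, the scores are assumed to be probabilities already, and the function names and the thresholds of 0.5 are hypothetical.

```python
import math


def decode_box(raw, cell, stride):
    """Decode (x, y, w, h) predicted at grid cell (U_w, U_h) into pixel units."""
    x, y, w, h = raw
    u_w, u_h = cell
    return (
        (u_w + x) * stride,      # center x
        (u_h + y) * stride,      # center y
        math.exp(w) * stride,    # width relative to the anchor side
        math.exp(h) * stride,    # height relative to the anchor side
    )


def interpret(vector, num_attrs, cell, stride, ped_thresh=0.5, attr_thresh=0.5):
    """Split one prediction vector and apply the two thresholds.

    Layout: [x y w h obj cls attr_1..attr_n x_1 y_1 w_1 h_1 ... x_n y_n w_n h_n].
    Returns None if no pedestrian is found, otherwise
    (pedestrian_box, {attribute_index: attribute_box}).
    """
    assert len(vector) == 6 + 5 * num_attrs
    obj, cls = vector[4], vector[5]
    if obj * cls < ped_thresh:
        return None
    pedestrian_box = decode_box(vector[0:4], cell, stride)
    attr_boxes = {}
    for n in range(num_attrs):
        if vector[6 + n] >= attr_thresh:
            start = 6 + num_attrs + 4 * n
            attr_boxes[n + 1] = decode_box(vector[start:start + 4], cell, stride)
    return pedestrian_box, attr_boxes


# One prediction at cell (3, 5) of the 20 x 20 map (stride 32), with two attribute types.
vec = [
    0.5, 0.5, 0.0, 0.0, 0.9, 0.9,  # pedestrian frame, obj, cls
    0.8, 0.1,                      # attr_1 present, attr_2 absent
    0.25, 0.25, 0.0, 0.0,          # frame of attr_1
    0.0, 0.0, 0.0, 0.0,            # frame of attr_2 (ignored)
]
result = interpret(vec, num_attrs=2, cell=(3, 5), stride=32)
print(result)
```

Because the attribute frames are read from the same vector as the pedestrian frame, the dependency relationship is implied by the vector itself, and no overlap test between frames is needed.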

Compared with the prior art, the technical scheme of the embodiment has the following advantages:

(1) The model has universality. Based on Yolox, the expandable network output head channel is utilized to adapt to different types of structured information.

(2) End-to-end output. Based on the expandable network output head channels, the model outputs the dependency relationship between the pedestrian rectangular frame and the structural information rectangular frame end to end, which avoids subsequent overlap-based logic judgment.

(3) Application of newer target detection techniques. The mosaic data enhancement and the SimOTA label assignment are improved to suit training on structured information data.

(4) The pedestrian structural information rectangular frame is more accurate. The model directly predicts and regresses the offsets of the center point of the pedestrian structural information rectangular frame in the x and y directions relative to the current cell of the model output feature map, which removes intermediate errors.

(5) A universal attribute labeling tool is designed. The marking efficiency of the structured information data is effectively improved.

Those skilled in the art will understand that all or part of the steps in the embodiments may be implemented by a computer program instructing the relevant hardware, and that the program may be stored in a computer readable medium. Such a medium may include various media capable of storing program code, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk. In one embodiment, the disclosure proposes a computer readable medium in which a computer program is stored, the computer program being loaded and executed by a processing module to implement the method for detecting end-to-end pedestrian structural information and its dependency relationship.

Where no conflict arises, the embodiments or features mentioned herein may be combined with each other to form additional alternative embodiments, within the knowledge and ability of those skilled in the art. A limited number of alternative embodiments formed by a limited number of combinations of features not listed above still fall within the scope of the disclosed technology, as those skilled in the art will understand or infer from the figures and the description above.

Moreover, the descriptions of the various embodiments each have their own emphasis; for anything not described in detail, reference may be made to the prior art or to other related descriptions herein.

It is emphasized that the above embodiments are typical and preferred embodiments of the present disclosure. They are used only to explain the technical solutions of the present disclosure in detail for the convenience of the reader, and do not limit the protection scope or application of the present disclosure. Any modifications, equivalents, improvements and the like which come within the spirit and principle of the disclosure are intended to be covered by the scope of the disclosure.

Claims

1. An end-to-end pedestrian structural information and dependency relationship detection method is characterized by comprising the following steps:

the Yolox model is improved, and specifically comprises the following steps:

firstly, increasing the obj_output branch of the decoupling head (Decoupled Head) of the Yolox model by S channels, so that its output size is H × W × (1+S), and increasing the reg_output branch of the Decoupled Head of the Yolox model by 4×S channels, so that its output size is H × W × (4+4×S); the output size of the cls_output branch of the Decoupled Head of the Yolox model is H × W × 1; wherein H is the height of the output feature map, and W is the width of the output feature map;

marking the dependency relationship information between the pedestrian rectangular frame and the pedestrian structural information rectangular frame, wherein the dependency relationship is established by using a labeling tool to draw a line directly between the upper-left corners of the pedestrian rectangular frame and the pedestrian structural information rectangular frame; acquiring id pairing information of the pedestrian rectangular frame and the pedestrian structural information rectangular frame through the connecting line, wherein the pairing information can be stored in a dependency relationship label file;

s3, training the improved Yolox model:

enhancing the image data of the training sample, enhancing the dependency relationship between the pedestrian rectangular frame and the pedestrian structural information rectangular frame, and distributing labels for the prediction frames;

inputting the enhanced image, performing inference with the improved Yolox model, and using the output information of the improved Yolox model to regress the coordinates of the pedestrian structural information rectangular frame and the coordinates of the pedestrian rectangular frame, while directly obtaining the dependency relationship between the pedestrian rectangular frame and the pedestrian structural information rectangular frame;

performing inference with the improved Yolox model specifically includes: merging the obj_output branch, the reg_output branch and the cls_output branch of the Decoupled Head of the Yolox model to obtain final feature information of size pred_num × dim_s; wherein pred_num = W × H represents the number of prediction frames, and dim_s = 1 + S + 4 + 4×S + 1 represents the dimension of each prediction frame feature vector; each prediction frame now contains a feature vector of dimension dim_s:

[x y w h obj cls attr_1 ... attr_n x_1 y_1 w_1 h_1 ... x_n y_n w_n h_n]

wherein x is the x coordinate information of the center point of the target frame, y is the y coordinate information of the center point of the target frame, w is the width information of the target frame, and h is the height information of the target frame; obj is the score information of the target frame; cls is the score information of the target frame category; attr_n is the score information of structured information n, and [x_n y_n w_n h_n] is the frame coordinate information of structured information n;

according to the feature vector, pedestrian structured information is judged as follows: when the value of obj × cls meets the pedestrian rectangular frame score threshold, the current prediction frame is considered to contain pedestrian information, and [x y w h] is the coordinate information of the pedestrian rectangular frame; in that case, if the score attr_n of structured information n meets the structured information probability threshold, the pedestrian is considered to have structured information n, and [x_n y_n w_n h_n] is the coordinate information of the rectangular frame of pedestrian structured information n; this completes the end-to-end pedestrian structural information detection;

calculating a loss function loss, updating the improved Yolox model, and finishing training.

2. The end-to-end pedestrian structural information and dependency relationship detection method according to claim 1, wherein after the data annotation of the images of the training sample is completed, the annotated data format comprises three parts: image data (images), rectangular frame label information (labels) and dependency relationship label information (relevate); each part is divided into a training part and a testing part.

3. The method for detecting end-to-end pedestrian structural information and the dependency relationship thereof according to claim 1, wherein enhancing the image data of the training sample specifically comprises: storing the dependency relationship label information of the pedestrian rectangular frame and the pedestrian structural information rectangular frame in a queue; then performing mosaic and mixup data enhancement on the image data together with the label information of the pedestrian rectangular frame and the pedestrian structural information rectangular frame; finally sorting out again the pedestrian rectangular frames and pedestrian structural information rectangular frames that still exist after enhancement, and judging whether each dependency relationship still exists or is newly added according to whether the pedestrian rectangular frame and the structural information rectangular frame still exist or are newly added, thereby deleting or adding dependency relationships in the original queue, the updated queue data being the dependency relationship label information after enhancement.

4. The method for detecting end-to-end pedestrian structural information and the dependency relationship thereof according to claim 1, wherein the labels are assigned to the prediction frames in the following specific manner: when labels are assigned to the prediction frames of the pedestrian rectangular frame, the real rectangular frame label information is used to divide the feature map on the decoupling head into positive and negative sample regions, that is, all prediction frames inside the real rectangular frame are positive sample candidate frames, and the rest are negative samples; when labels are assigned to the prediction frames of the pedestrian structural information rectangular frame, the real frame of the pedestrian structural information rectangular frame is not used to divide the feature map on the decoupling head into regions, and the real frame of the pedestrian rectangular frame is used instead, so that label assignment is kept consistent between the pedestrian rectangular frame and the pedestrian structural information rectangular frame, convergence of the trained model is accelerated, and the case where the trained model detects the pedestrian rectangular frame but not the structural information is avoided.

5. The end-to-end pedestrian structured information and dependency relationship detection method according to claim 1, wherein the structured information rectangular frame regression method specifically comprises the following steps:

when the resolution of the input image is 640 × 640, the three Decoupled Heads respectively output feature maps at different down-sampling scales, and the feature map sizes W × H are 20 × 20, 40 × 40 and 80 × 80 respectively;

for each cell of one of the feature maps, a corresponding anchor frame (anchor) is provided; when the feature map size W × H is 20 × 20, the anchor frame size is 32 × 32, consistent with the down-sampling factor; given the rectangular frame coordinate information [x_n y_n w_n h_n] of structured information n output by the improved Yolox model at a certain cell (U_w, U_h), the rectangular frame coordinates at the actual resolution are calculated by combining the anchor frame information of the feature map in which the cell is located; wherein x_n is the x offset of the center point of the rectangular frame relative to the current cell, and y_n is the y offset of the center point of the rectangular frame relative to the current cell;

in the above calculation formula, only [x_n y_n w_n h_n] is obtained by network prediction, and the rest is preset information, so that the pedestrian structural information rectangular frame is obtained directly by the improved Yolox model, without the pedestrian rectangular frame information as an intermediate result, and accumulation of intermediate errors is avoided.

6. The end-to-end detection method for the pedestrian structured information and the dependency relationship thereof according to claim 1, characterized in that, for the loss function loss during model training, a structured information foreground probability loss attr_obj_loss and a pedestrian structural information rectangular frame regression loss attr_reg_loss are added on the basis of Yolox, wherein the structured information foreground probability loss attr_obj_loss uses the binary cross-entropy loss function BCEWithLogitsLoss, and the pedestrian structural information rectangular frame regression loss attr_reg_loss uses the intersection-over-union loss function IOULoss; and the newly added structured information foreground probability loss attr_obj_loss and pedestrian structural information rectangular frame regression loss attr_reg_loss are added to the original loss function to obtain the final loss function.

7. The end-to-end pedestrian structural information and dependency detection method according to claim 1, characterized in that an activation function of the Yolox model is set to relu, and a decoupling head channel coefficient is set to 0.5.

8. A computer-readable medium characterized by: the computer readable medium stores a computer program, which is loaded and executed by a processing module to implement the method for detecting end-to-end pedestrian structural information and its dependency relationship according to any one of claims 1 to 7.