CN114067440A

CN114067440A - Pedestrian detection method, device, equipment and medium of cascade neural network model

Info

Publication number: CN114067440A
Application number: CN202210034371.6A
Authority: CN
Inventors: 尹赞朗; 郑伟; 刘国清; 敖争光
Original assignee: Shenzhen Minieye Innovation Technology Co Ltd
Current assignee: Shenzhen Youjia Innovation Technology Co.,Ltd.
Priority date: 2022-01-13
Filing date: 2022-01-13
Publication date: 2022-02-18
Anticipated expiration: 2042-01-13
Also published as: CN114067440B

Abstract

The application relates to a pedestrian detection method, device, equipment and medium based on a cascade neural network model. The method comprises: obtaining an original frame of a pedestrian target; performing supplementary labeling of additional attributes on the original frame, and generating a patch-type small picture centered on the pedestrian target; expanding the patch-type small picture to a preset height along the head-to-foot direction of the pedestrian target, and expanding it to a preset width according to a preset first threshold, to obtain a training sample; constructing a lightweight multi-head classification network based on a preset re-parameterizable residual unit; inputting the training samples into the multi-head classification network and training to obtain a secondary re-parameterized model; and connecting the secondary re-parameterized model in series with a primary detector, and inputting an original image containing the pedestrian target for prediction. This solves the problem that existing primary detectors cannot perceive the fine-grained attributes of a pedestrian target and are prone to false detections. The application has the effects of reducing the false detection rate of pedestrian targets and providing rich behavior information for the downstream vehicle control stage.

Description

Pedestrian detection method, device, equipment and medium of cascade neural network model

Technical Field

The application relates to the technical field of deep learning, in particular to a pedestrian detection method, a pedestrian detection device, pedestrian detection equipment and a pedestrian detection medium of a cascade neural network model.

Background

Vehicles and pedestrians are the most common movable objects on the road surface. Pedestrians place stricter demands on the perception and control system of an autonomous vehicle because of the high collision risk and the serious consequences of a collision. At present, a computation-heavy end-to-end detection model, namely a primary detector, is generally used to treat all pedestrian targets as a single class for full-image detection.

Although this approach is easy to train, it has several problems in practical application. To guarantee real-time detection results, the computation budget of the primary detector is heavily constrained, so the perception of important information such as the posture, moving speed and direction of a pedestrian target is abandoned. Because pedestrian targets are scattered across the road surface and appear under varying degrees of occlusion, trees, lamp posts, roadside traffic-diversion facilities and the like are easily mistaken for pedestrians, which triggers braking by mistake. Pedestrian targets are far more complex than vehicle targets, and the safety risks posed by pedestrians with different attribute combinations far exceed those of vehicle targets, so the existing primary detector can hardly support a downstream planning and control system in making safe decisions.

In view of the above related art, the inventors consider that the existing primary detector has the defects that it cannot perceive the fine-grained attributes of a pedestrian target and is prone to falsely detecting pedestrian targets.

Disclosure of Invention

In order to reduce the false detection rate of the pedestrian target, the application provides a pedestrian detection method, a device, equipment and a medium of a cascade neural network model.

In a first aspect, the application provides a pedestrian detection method of a cascade neural network model, which has the characteristics of reducing the false detection rate of a pedestrian target and providing more behavior information of the pedestrian target.

The application is realized by the following technical scheme:

a pedestrian detection method of a cascade neural network model comprises the following steps:

acquiring an original frame of a pedestrian target;

performing supplementary labeling of additional attributes on the original frame, and generating a patch-type small picture centered on the pedestrian target;

expanding the patch-type small picture to a preset height along the head-to-foot direction of the pedestrian target, and expanding the patch-type small picture to a preset width according to a preset first threshold, to obtain a training sample;

constructing a lightweight multi-head classification network based on a preset re-parameterizable residual unit;

inputting the training samples into the multi-head classification network, and training to obtain a secondary re-parameterized model;

and connecting the secondary re-parameterized model in series with a primary detector, and inputting an original image containing a pedestrian target for prediction.

By adopting the above technical scheme, the original frame of the pedestrian target is obtained, that is, a rectangular frame that encloses a pedestrian target from head to foot and from left to right and is labeled only with the attribute 'person'. Supplementary labeling of additional attributes is carried out on the original frame, so that the rectangular frame carries at least two kinds of attribute features, 'person' and the additional attributes, and a patch-type small picture centered on the pedestrian target is generated; this converts the original frame into a target sample frame that is centered on the pedestrian target and carries the 'person' and additional attribute labels, while the original frame is retained to facilitate data tracing. The patch-type small picture is expanded to a preset height along the head-to-foot direction of the pedestrian target, so that the person is centered in the training sample and the foot and head regions of the pedestrian are exposed, which facilitates subsequent prediction of the head and foot positions of the pedestrian target. The patch-type small picture is expanded to a preset width according to a preset first threshold, so that the frame widths of the training samples are consistent, the training samples are preprocessed, and subsequent training time is reduced. A lightweight multi-head classification network is constructed from a preset re-parameterizable residual unit, and the training samples are input into the multi-head classification network for training to obtain a secondary re-parameterized model; this lightweight training framework reduces the amount of computation during training. Further, the pedestrian detection method of the cascade neural network model can make full use of the richer image semantic information carried by the additional attributes, guide the model to learn the fine-grained features of the pedestrian target, and reduce the false detection rate of the pedestrian target.

The present application may be further configured in a preferred example to: the re-parameterizable residual unit comprises a Conv 3x3, a Conv 1x1, a plurality of normalization layers for extracting semantic information, and a SiLU activation function, wherein the Conv 3x3, the Conv 1x1 and one of the normalization layers are arranged in parallel, the remaining normalization layers are respectively connected to the output ends of the Conv 3x3 and the Conv 1x1, and the output of the Conv 3x3 branch, the output of the Conv 1x1 branch and the output of the parallel normalization layer are added together and then input into the SiLU activation function.

By adopting the above technical scheme, owing to the structural design of the re-parameterizable residual unit (Conv 3x3, Conv 1x1, normalization layers and the SiLU activation function), with the same number of input and output channels and taking into account the hardware characteristics of existing GPU computing units, the amount of computation is reduced to 1/2 of that of a traditional residual unit. The model computation is thus greatly reduced and the operating efficiency is higher, so that real-time prediction can be achieved in the actual deployment stage of the model and system latency is reduced.

The present application may be further configured in a preferred example to: inputting the training samples into the multi-head classification network, and training to obtain a secondary re-parameterized model, wherein the step of training further comprises the following steps:

in the multi-head classification network, training the training samples using a label-combination multinomial distribution sampling method.

By adopting the above technical scheme, the training samples are trained in the multi-head classification network using the label-combination multinomial distribution sampling method. This avoids the problem that individual branches of the model are insufficiently trained or over-fitted when a simple random sampling strategy is used, so that the trained multi-head classification network has higher classification precision.

The present application may be further configured in a preferred example to: in the multi-head classification network, the step of training the training samples using the label-combination multinomial distribution sampling method comprises the following steps:

permuting and combining the additional attributes of the original frame to obtain combined labels;

calculating a weight value of each combined label;

and sampling and training the training samples based on the weight values.

By adopting the above technical scheme, the additional attributes of the original frame are permuted and combined; each permutation and combination of the additional attributes forms one combined label, and the set of all such combined labels is taken as the obtained combined labels. The weight value of each combined label is calculated, and the training samples are sampled and trained based on the weight values, so that each branch of the model is trained more fully, overfitting during training is reduced, and the training precision of the model is improved.
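The sampling steps can be illustrated with a minimal sketch. The text here does not give the weight formula, so the inverse-frequency weighting below is an assumption, as are the function names and the dictionary layout of a sample's attributes. Each sample's attribute values are joined into one combined label, each combined label receives a normalized weight, and a batch is drawn by first sampling a combined label from that multinomial distribution and then picking a sample that carries it.

```python
import random
from collections import Counter

def combined_label(attrs):
    """Join one sample's additional attributes into a single combined label."""
    return tuple(attrs[key] for key in sorted(attrs))

def label_weights(labels):
    """Weight each combined label (assumption: inverse frequency, normalized to 1)."""
    counts = Counter(labels)
    inverse = {label: 1.0 / n for label, n in counts.items()}
    total = sum(inverse.values())
    return {label: w / total for label, w in inverse.items()}

def sample_batch(samples, batch_size, rng):
    """Draw a batch: sample a combined label by weight, then a sample with that label."""
    labels = [combined_label(s) for s in samples]
    weights = label_weights(labels)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    names = list(weights)
    picked = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [rng.choice(by_label[label]) for label in picked]
```

With this weighting, a combined label that occurs once among ten samples is drawn far more often per sample than one that occurs nine times, which is one way to keep every classification head supplied with examples.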

The present application may be further configured in a preferred example to: in the multi-head classification network, the step of training the training samples using the label-combination multinomial distribution sampling method further includes:

presetting a second threshold, and taking the combined labels whose number of samples is lower than the second threshold as rare combined labels;

and presetting a third threshold, and sampling the rare combined labels according to the third threshold.

By adopting the above technical scheme, a second threshold is preset to screen the combined labels and obtain the rare combined labels, and the rare combined labels are sampled according to the third threshold, so that their frequency of occurrence is kept constant and the number of times they are sampled is guaranteed in every batch of training data. The model thus has the opportunity to learn the features of the rare combined labels during training, which improves the overall generalization performance of the model and enhances its applicability.
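A minimal sketch of the rare-label handling follows. It assumes that the second threshold is a sample count and that the third threshold is a fixed number of rare-label samples per batch; the text only says the rare labels are sampled "according to the third threshold", so this reading and all function names are illustrative assumptions.

```python
import random
from collections import Counter

def rare_labels(labels, second_threshold):
    """Return the combined labels whose sample count is below the second threshold."""
    counts = Counter(labels)
    return {label for label, n in counts.items() if n < second_threshold}

def batch_with_rare_quota(samples, labels, batch_size,
                          second_threshold, third_threshold, rng):
    """Build a batch that contains a fixed quota of rare-label samples."""
    if not samples:
        raise ValueError("no samples to draw from")
    rare = rare_labels(labels, second_threshold)
    rare_pool = [s for s, label in zip(samples, labels) if label in rare]
    common_pool = [s for s, label in zip(samples, labels) if label not in rare]
    # Assumption: the third threshold is the number of rare samples per batch.
    n_rare = min(third_threshold, batch_size) if rare_pool else 0
    fill_pool = common_pool or rare_pool
    batch = [rng.choice(rare_pool) for _ in range(n_rare)]
    batch += [rng.choice(fill_pool) for _ in range(batch_size - n_rare)]
    rng.shuffle(batch)
    return batch
```

Fixing the quota per batch keeps the occurrence frequency of rare combined labels constant across batches, regardless of how few such samples the dataset holds.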

The present application may be further configured in a preferred example to: before the step of expanding the patch-type small picture to a preset height along the head-to-foot direction of the pedestrian target and expanding the scaled patch-type small picture to a preset width according to a preset first threshold, the method further comprises the following step:

scaling the patch-type small picture to a preset size.

By adopting the above technical scheme, the patch-type small picture is scaled to a preset size before it is expanded, which reduces its size and thus the amount of computation it requires during subsequent model training.

The present application may be further configured in a preferred example to: before the step of obtaining the training sample, the method further comprises the following steps:

adaptively moving the coordinates of the four vertices of the patch-type small picture within the original image of the pedestrian target.

By adopting the above technical scheme, the coordinates of the four vertices of the patch-type small picture are adaptively moved within the original image of the pedestrian target to obtain the training sample. This adjusts the image size, adapts to the operations supported by the platform, avoids the black-border padding operation, and facilitates subsequent image processing.

In a second aspect, the application provides a pedestrian detection device of a cascade neural network model, which has the characteristics of reducing the false detection rate of a pedestrian target and providing more behavior information of the pedestrian target.

The application is realized by the following technical scheme:

a pedestrian detection apparatus of a cascaded neural network model, comprising:

the acquisition module is used for acquiring an original frame of the pedestrian target;

the supplementary module is used for performing supplementary labeling of additional attributes on the original frame and generating a patch-type small picture centered on the pedestrian target;

the expansion module is used for expanding the patch-type small picture to a preset height along the head-to-foot direction of the pedestrian target, and expanding the scaled patch-type small picture to a preset width according to a preset first threshold, to obtain a training sample;

the building module is used for building a lightweight multi-head classification network based on a preset re-parameterizable residual unit;

the training module is used for inputting the training samples into the multi-head classification network and training to obtain a secondary re-parameterized model;

and the detection module is used for connecting the secondary re-parameterized model in series with the primary detector and inputting an original image containing a pedestrian target for detection.

In a third aspect, the present application provides a computer device, which has the characteristics of reducing the false detection rate of a pedestrian target and providing more behavior information of the pedestrian target.

The application is realized by the following technical scheme:

a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the pedestrian detection method of a cascaded neural network model as described above when executing the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium, which has the characteristics of reducing the false detection rate of the pedestrian target and providing more behavior information of the pedestrian target.

The application is realized by the following technical scheme:

a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned pedestrian detection method of a cascaded neural network model.

In a fifth aspect, the present application provides a computer program product having features of reducing false detection rate of a pedestrian target and providing more behavior information of the pedestrian target.

The application is realized by the following technical scheme:

a computer program product comprising a computer program which, when executed by a processor, carries out the steps of a pedestrian detection method of the cascaded neural network model described above.

In summary, the present application includes at least one of the following beneficial technical effects:

1. the pedestrian detection method of the cascade neural network model can make full use of the richer image semantic information carried by the additional attributes to guide the model to learn the fine-grained features of a pedestrian target, reducing the false detection rate of the pedestrian target;

2. based on the structural design of the re-parameterizable residual unit (Conv 3x3, Conv 1x1, normalization layers and the SiLU activation function), the amount of model computation can be greatly reduced for the same number of input and output channels;

3. in the multi-head classification network, the training samples are trained using the label-combination multinomial distribution sampling method, so that the trained multi-head classification network has higher classification precision;

4. the number of times the rare combined labels are sampled is guaranteed in every batch of training data, so that the model has the opportunity to learn the features of the rare combined labels during training, which improves the overall generalization performance of the model and enhances its applicability;

5. before the patch-type small picture is expanded, it is scaled to a preset size to reduce its size and thus the amount of computation it requires during model training;

6. the coordinates of the four vertices of the patch-type small picture are adaptively moved within the original image of the pedestrian target to adjust the image size and adapt to the operations supported by the platform, which avoids the black-border padding operation.

Drawings

Fig. 1 is a flowchart illustrating a pedestrian detection method using a cascaded neural network model according to an embodiment of the present disclosure.

Fig. 2 is a flowchart of training the training samples using the label-combination multinomial distribution sampling method.

Fig. 3 is a schematic structural diagram of the re-parameterizable residual unit.

Fig. 4 is a graph of the trend curves of the ReLU function and the SiLU function.

Fig. 5 is a training diagram of the label-combination multinomial distribution sampling method.

Fig. 6 shows sample frames of pedestrian targets with additional attributes output by the secondary re-parameterized model connected in series with the primary detector.

Fig. 7 is a schematic diagram of correcting false-positive samples using the pedestrian detection method of the cascade neural network model.

Fig. 8 is a block diagram of a pedestrian detection apparatus based on a cascaded neural network model according to an embodiment of the present disclosure.

Detailed Description

This embodiment only explains the present application and does not limit it. After reading this specification, those skilled in the art may modify this embodiment as needed without making an inventive contribution, and all such modifications are protected by patent law within the scope of the claims of the present application.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship, unless otherwise specified.

The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.

With reference to fig. 1, an embodiment of the present application provides a pedestrian detection method based on a cascaded neural network model, and main steps of the method are described as follows.

S1: acquiring an original frame of a pedestrian target;

S2: performing supplementary labeling of additional attributes on the original frame, and generating a patch-type small picture centered on the pedestrian target;

S3: expanding the patch-type small picture to a preset height along the head-to-foot direction of the pedestrian target, and expanding the patch-type small picture to a preset width according to a preset first threshold, to obtain a training sample;

S4: constructing a lightweight multi-head classification network based on a preset re-parameterizable residual unit;

S5: inputting the training samples into the multi-head classification network, and training to obtain a secondary re-parameterized model;

S6: connecting the secondary re-parameterized model in series with the primary detector, and inputting an original image containing a pedestrian target for prediction.
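Steps S1 to S6 end with the two models connected in series. The inference flow of such a cascade might look like the minimal sketch below. The callables `detector`, `classifier` and `crop_fn`, the `person_score` output and the rejection rule are all hypothetical: the application does not name them, and how the secondary model rejects a false detection is an assumption made only for illustration.

```python
def cascade_predict(image, detector, classifier, crop_fn, min_person_score=0.5):
    """Run the primary detector, then refine every box with the secondary model."""
    results = []
    for box in detector(image):
        # Crop a patch around the detected box and classify its attributes.
        attributes = classifier(crop_fn(image, box))
        # Hypothetical rule: drop boxes that the secondary model scores as non-pedestrian.
        if attributes.get("person_score", 1.0) < min_person_score:
            continue
        results.append({"box": box, "attributes": attributes})
    return results
```

Each surviving box carries the attribute predictions (for example posture or moving direction), which is the extra behavior information intended for the downstream control stage.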

Further, before step S3 of expanding the patch-type small picture to a preset height along the head-to-foot direction of the pedestrian target and expanding the scaled patch-type small picture to a preset width according to a preset first threshold, the method further comprises the following step:

S31: scaling the patch-type small picture to a preset size.

Further, in step S3, before the training sample is obtained, the method further comprises the following step:

S32: adaptively moving the coordinates of the four vertices of the patch-type small picture within the original image of the pedestrian target.

Further, referring to fig. 2, step S5 of inputting the training samples into the multi-head classification network and training to obtain a secondary re-parameterized model further comprises the following step:

in the multi-head classification network, training the training samples using a label-combination multinomial distribution sampling method.

In the multi-head classification network, the step of training the training samples using the label-combination multinomial distribution sampling method comprises the following steps:

S51: permuting and combining the additional attributes of the original frame to obtain combined labels;

S52: calculating a weight value for each combined label;

S53: sampling and training the training samples based on the weight values.

Further, in the multi-head classification network, the step of training the training samples using the label-combination multinomial distribution sampling method further includes:

S511: presetting a second threshold, and taking the combined labels whose number of samples is lower than the second threshold as rare combined labels;

S521: presetting a third threshold, and sampling the rare combined labels according to the third threshold.

Further, referring to fig. 3, in step S4 the re-parameterizable residual unit comprises a Conv 3x3, a Conv 1x1, a plurality of normalization layers for extracting semantic information, and a SiLU activation function, wherein the Conv 3x3, the Conv 1x1 and one of the normalization layers are arranged in parallel, the remaining normalization layers are respectively connected to the output ends of the Conv 3x3 and the Conv 1x1, and the output of the Conv 3x3 branch, the output of the Conv 1x1 branch and the output of the parallel normalization layer are added together and then input into the SiLU activation function.

Specifically, the specific flow steps of the above embodiments are described as follows.

S1: acquiring the original frame of the pedestrian target output by the primary detector.

S2: performing supplementary labeling of additional attributes on the original frame, and, based on the supplementarily labeled sample frame, calling a patch function and running the corresponding script for generating patch-type small pictures, to generate the patch-type small picture centered on the pedestrian target.

The additional attributes include the posture of the person, whether the feet are occluded, the moving direction, whether the person is a child, and the like. The posture of the person is sitting, standing or riding. The feet are either occluded or unoccluded. The moving direction is leftward, rightward, the same direction as the vehicle, or the opposite direction to the vehicle. The person is either an adult or a child.

S31: scaling the patch-type small picture to a preset size. Because the system reserves only a small computing budget for the secondary model, the patch-type small picture is scaled to a preset size to reduce its size and thus the amount of computation it requires in the model.

In this embodiment, the preset size may be 128 × 64 pixels. Specifically, in a scaled 128 × 64 patch picture, if the original frame height is expanded by 20% upward and downward along the head-to-foot direction of the pedestrian target, the background area of the picture becomes larger and the human body occupies a relatively small area, which easily causes false alarms in pedestrian target detection, particularly in pictures of dense pedestrian scenes. Meanwhile, statistics on the aspect ratios of the original frames of a large number of pedestrian target image samples show that the mean aspect ratio fluctuates around 2.0, that is, height/width ≈ 2.0. When a model is trained with the square image samples commonly used by general deep learning networks, for example 128x128, several pedestrians may appear in one 128x128 patch-type small picture in dense pedestrian scenes, so that the trained model can hardly distinguish foreground pedestrians from background pedestrians and performs poorly in system integration tests. Therefore, based on the observed mean of about 2.0, the aspect ratio of the pedestrian target sample frame is set to 2 and the size of the pedestrian target image sample is fixed at 128x64, which reduces the cases in which the trained model can hardly distinguish foreground from background pedestrians and improves the results of system integration tests.

S3: expanding the scaled patch-type small picture to a preset height along the head-to-foot direction of the pedestrian target, and expanding the scaled patch-type small picture to a preset width according to a preset first threshold.

Starting from the center of the pedestrian sample, the frame is expanded by 5-10% of the original height toward the head and toward the foot respectively; that is, taking the position of the existing original frame as a reference, the frame is expanded upward and downward along the head-to-foot direction of the pedestrian target by 5-10% of the original frame height each, so that the frame height of the target sample frame is finally 1.1 to 1.2 times the original frame height. The scaled patch-type small picture is then expanded to the preset width according to the preset first threshold. In this way the person is centered in the training sample and the foot and head regions of the person are exposed, which facilitates subsequent prediction of the head and foot positions of the pedestrian target.

In this embodiment, the frame height of the original frame is expanded by 5% upward and by 5% downward along the head-to-foot direction of the pedestrian target, and the result is taken as the frame height of the target sample frame.

The first threshold may be an aspect ratio of 2:1, which matches the statistical rule that height/width is approximately 2.0; given the frame height of the target sample frame, the frame width of the target sample frame is half of the frame height. In this embodiment, the original frame is expanded to the left and to the right by 2.5% of the original frame width each, and the result is taken as the frame width of the target sample frame.
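The box arithmetic of step S3 can be sketched as follows. This is a minimal illustration that assumes boxes are `(x1, y1, x2, y2)` tuples in pixels and that the width is derived from the expanded height through the 2:1 ratio; the function and parameter names are hypothetical.

```python
def expand_box(box, height_ratio=0.05, aspect=2.0):
    """Expand a box upward and downward, then set its width to height / aspect."""
    x1, y1, x2, y2 = box
    height = y2 - y1
    center_x = (x1 + x2) / 2.0
    # Grow the frame by height_ratio of the original height at the head and at the foot.
    new_y1 = y1 - height_ratio * height
    new_y2 = y2 + height_ratio * height
    # Derive the width from the first threshold (height : width = aspect : 1).
    new_width = (new_y2 - new_y1) / aspect
    return (center_x - new_width / 2.0, new_y1, center_x + new_width / 2.0, new_y2)
```

For an original frame 50 pixels wide and 100 pixels high, a 5% expansion at each end gives a target frame 110 pixels high and 55 pixels wide, still centered on the pedestrian.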

In particular, some random jitter can be added while expanding the patch-type small picture to simulate the fact that the detection results of the deployed primary model are not always stable. Perturbing the data artificially in this way enhances the generalization performance of the trained secondary model, so that its detection performance in the real environment is more stable.
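One possible form of this jitter is sketched below. The text does not specify the jitter magnitude or distribution, so the uniform perturbation, the `max_frac` value and the function name are assumptions.

```python
import random

def jitter_box(box, max_frac=0.02, rng=None):
    """Move each edge of (x1, y1, x2, y2) by up to max_frac of the box size."""
    rng = rng or random.Random()
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    # Independent offsets per edge imitate an unstable primary detector.
    dx1, dx2 = (rng.uniform(-max_frac, max_frac) * width for _ in range(2))
    dy1, dy2 = (rng.uniform(-max_frac, max_frac) * height for _ in range(2))
    return (x1 + dx1, y1 + dy1, x2 + dx2, y2 + dy2)
```

Passing a seeded `random.Random` instance makes the perturbation reproducible when generating a training set.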

S32: adaptively moving the coordinates of the four vertices of the patch-type small picture within the original image of the pedestrian target. This avoids the situation in which the boundary of the obtained sample image falls outside the original image and black borders must then be filled in, and facilitates subsequent image processing.

The adaptive moving process mainly considers the relative positions of the four vertex coordinates of the patch-type small picture and the four boundaries of the original image. If a vertex lies outside a boundary, that vertex and the other vertex on the same side are adaptively moved toward the inside of the original image until the entire content of the patch-type small picture comes from the original image.
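A minimal sketch of the adaptive move, assuming it amounts to translating the crop window back inside the image without changing its size (the function name and the `(x1, y1, x2, y2)` convention are illustrative, not taken from the application):

```python
def shift_box_inside(box, img_w, img_h):
    """Translate (x1, y1, x2, y2) so that it lies fully inside the image."""
    x1, y1, x2, y2 = box
    if x2 - x1 > img_w or y2 - y1 > img_h:
        raise ValueError("box is larger than the image and cannot be shifted inside")
    # Move right/down if the box crosses the left/top edge, left/up if it crosses the other edge.
    dx = -x1 if x1 < 0 else min(0, img_w - x2)
    dy = -y1 if y1 < 0 else min(0, img_h - y2)
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
```

For a 1280x720 image, a window at `(-10, 20, 54, 148)` becomes `(0, 20, 64, 148)`: both left-side and right-side vertices move by the same amount, so every pixel of the patch comes from the original image.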

This adaptive moving process is not required for model development; the alternative of filling black borders up to the required size may be used instead. However, considering the hardware requirements and algorithm efficiency of the subsequent model deployment stage, generating the data with the adaptive moving method is clearly more convenient and efficient.

For the black-border padding operation, the function Padding() is called. Assume that the required sample frame size is 128x64 and that the part of the acquired sample frame that falls inside the required sample frame is 96x60 and contains colored image content. The area of the required sample frame not covered by that part is filled with 0 so that it becomes black, which completes the black-border padding operation.
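The padding alternative can be sketched in plain Python. The `Padding()` function named in the text is not specified further, so `pad_to_size` below is a hypothetical stand-in; it assumes an image stored as a list of pixel rows and assumes the caller chooses where the valid region sits on the canvas.

```python
def pad_to_size(patch, target_h=128, target_w=64, top=0, left=0, fill=0):
    """Place `patch` on a target_h x target_w canvas and fill the rest with `fill`."""
    height, width = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + height > target_h or left + width > target_w:
        raise ValueError("patch does not fit in the target canvas")
    # Start from an all-black canvas (0 means black).
    canvas = [[fill] * target_w for _ in range(target_h)]
    for r, row in enumerate(patch):
        canvas[top + r][left:left + width] = row
    return canvas
```

With a 96x60 region and a 128x64 target, the 32 uncovered rows and 4 uncovered columns stay at 0 and appear as black borders.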

S4: construct a lightweight multi-head classification network based on a preset re-parameterizable residual unit.

The lightweight multi-head classification network can be built based on RepPerson.

Because the pedestrian target training samples are small images, the maximum downsampling factor of the model is considerably limited. To fully extract the features of the pedestrian target and fuse the features after each downsampling step, the re-parameterizable residual unit is used as the basic building block of the multi-head classification network. This effectively improves model accuracy, eases model optimization, and improves the runtime efficiency of the convolutional neural network in the deployment stage.

Referring to fig. 3, the re-parameterizable residual unit includes a Conv 3x3, a Conv 1x1, three normalization layers for extracting semantic information, and a SiLU activation function. The Conv 3x3, the Conv 1x1, and one of the normalization layers are arranged in parallel, and the remaining two normalization layers are connected to the outputs of the Conv 3x3 and the Conv 1x1, respectively. The output of the Conv 3x3 branch, the output of the Conv 1x1 branch, and the output of the parallel normalization layer are summed, which merges the branches of the unit; the sum is fed into the SiLU activation function, whose result is the output of the unit.
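The structure above can be written compactly as follows. The application does not spell out how the branches are merged for deployment, so the folding step shown here is an assumption borrowed from the usual RepVGG-style re-parameterization; all symbols ($W_3$, $W_1$, $\gamma$, $\beta$, $\mu$, $\sigma^2$, $\epsilon$, and the hatted quantities) are introduced only for this illustration.

```latex
% Training-time form: three parallel branches are summed, then SiLU is applied
y = \mathrm{SiLU}\bigl(\mathrm{BN}_3(W_3 * x) + \mathrm{BN}_1(W_1 * x) + \mathrm{BN}_{id}(x)\bigr),
\qquad \mathrm{SiLU}(z) = \frac{z}{1 + e^{-z}}

% Folding a normalization layer (mean \mu, variance \sigma^2, scale \gamma,
% shift \beta, per output channel) into the convolution W that precedes it
W' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\, W, \qquad
b' = \beta - \frac{\gamma\,\mu}{\sqrt{\sigma^2 + \epsilon}}

% Deployment-time form: a single 3x3 convolution followed by SiLU
y = \mathrm{SiLU}\bigl(\widehat{W} * x + \widehat{b}\bigr), \qquad
\widehat{W} = W_3' + \mathrm{pad}(W_1') + W_{id}', \qquad
\widehat{b} = b_3' + b_1' + b_{id}'
```

Here $\mathrm{pad}(\cdot)$ zero-pads a 1x1 kernel to 3x3, and $W_{id}'$ is the identity mapping written as a 3x3 kernel and folded with its normalization layer in the same way. Under these assumptions the merged unit computes the same output as the three-branch unit at inference time, because convolution is linear in its kernel.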

Compared with a conventional residual unit, an additional branch containing a normalization layer is arranged in parallel with the Conv 1x1 and Conv 3x3 branches, so that as much semantic information at different levels as possible can be extracted.

Batch normalization layers are introduced into the branches where the original Conv 1x1 and Conv 3x3 are located, so that the model needs fewer GPU (graphics processing unit) training hours to reach the same performance, model convergence is accelerated, and detection is faster.

Referring to fig. 4, compared with a conventional residual unit, the present application replaces the ReLU function with the smoother SiLU activation function. SiLU still passes negative-valued features to the downstream network structure, whereas ReLU discards them so that they cannot be passed on. During model learning, the negative-valued features discarded by ReLU can cause the model metrics to oscillate, that is, the learning curve is jagged. Because SiLU retains the information in negative-valued features, the oscillation amplitude of the model metrics narrows during training and the learning curve is smoother, that is, training stability improves. Moreover, the SiLU function is a curve through the origin rather than a straight line, which increases the nonlinearity of the model.
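The difference between the two activations on negative inputs can be checked numerically. The snippet below uses the standard definitions ReLU(x) = max(0, x) and SiLU(x) = x / (1 + exp(-x)); it is an illustration only and is not code from the application.

```python
import math

def relu(x):
    """ReLU: negative inputs are mapped to zero."""
    return max(0.0, x)

def silu(x):
    """SiLU: x * sigmoid(x); negative inputs yield small negative outputs."""
    return x / (1.0 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  silu={silu(x):.4f}")
```

For negative inputs ReLU returns exactly zero, while SiLU returns a small non-zero value, so some information about negative features is still propagated.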

Furthermore, the lightweight multi-head classification network based on the re-parameterizable residual unit can, in the model training stage, make full use of richer image semantic information and guide the model to learn the fine-grained features of the pedestrian target, yielding model weights with better generalization, so that the parameters of every layer of the model perform better. It can also achieve higher runtime efficiency given the hardware characteristics of existing GPU computing units.

S5: the training samples are input into a multi-headed classification network.

Because the attributes of the pedestrian target samples are independent of one another and highly unevenly distributed, a purely random sampling strategy easily leaves some branches of the model under-trained and others overfitted. The training samples are therefore trained with a label-combination multinomial distribution sampling method to obtain the secondary re-parameterized model.

Referring to fig. 5, the step of training the training samples with the label-combination multinomial distribution sampling method includes:

S51: permute and combine the additional attributes of the original frame to obtain combined labels. Specifically, the additional attributes acquired from each pedestrian target sample are traversed in a preset order and combined, and the set of all possible combinations obtained forms the combined labels.

Furthermore, unreasonable combinations among the obtained combined labels need to be removed, that is, the final set of combined labels excludes the unreasonable ones, so that the classification of the model is more accurate.

For example, the target sample frame of the pedestrian target contains 3 additional attributes, each additional attribute has 2 possible values, then 2x2x2=8 combinations are obtained after the additional attributes are arranged and combined, and the 8 combinations are the obtained combination labels.

Based on the combined labels, the additional attributes carried by each pedestrian target sample are traversed and counted. Each time a pedestrian target sample is found whose additional attributes match one of the combinations in the combined labels, the count of the corresponding element is incremented by 1. The final counts give the sample distribution of each label combination in the data set.
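The enumeration and counting of combined labels might look like the sketch below. The attribute names (`occluded`, `truncated`, `riding`) and their values are hypothetical placeholders chosen to reproduce the 2x2x2=8 example; the application does not name the attributes in this passage.

```python
from collections import Counter
from itertools import product

# Hypothetical attributes, each with 2 possible values (illustration only).
ATTRIBUTES = {
    "occluded": (0, 1),
    "truncated": (0, 1),
    "riding": (0, 1),
}

def build_combined_labels(attributes):
    """Enumerate every combination of attribute values in a fixed order."""
    names = list(attributes)  # the preset traversal order
    return list(product(*(attributes[n] for n in names)))

def count_combinations(samples, attributes):
    """Count how many samples carry each combined label."""
    names = list(attributes)
    counts = Counter({c: 0 for c in build_combined_labels(attributes)})
    for sample in samples:
        counts[tuple(sample[n] for n in names)] += 1
    return counts
```

Combinations considered unreasonable could be filtered out of the list returned by `build_combined_labels` before counting.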

S511: preset a second threshold, and take the combined labels whose sample count is lower than the second threshold as rare combined labels. Rare combined labels have few samples and a low probability of occurrence, so the model is less likely to learn them. In this embodiment, the second threshold may be 2.
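Selecting the rare combined labels from the per-label counts can be sketched as below. The function name `split_rare` is hypothetical, and skipping combinations with zero samples is an assumption made here (there is nothing to sample from them); the application does not say how empty combinations are treated.

```python
def split_rare(counts, second_threshold=2):
    """Split combined labels into rare and non-rare sets by sample count.

    counts maps each combined label to its number of samples.
    Labels with zero samples are skipped (assumption of this sketch).
    """
    rare = {label for label, n in counts.items() if 0 < n < second_threshold}
    common = {label for label, n in counts.items() if n >= second_threshold}
    return rare, common
```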

S52: calculate a weight value for each combined label. The specific calculation formula is as follows:

[formula not reproduced in the source text]

where one quantity in the formula is the number of labels for each combination, and the other is the total number of combined labels.

S521: preset a third threshold, and sample the rare combined labels according to the third threshold. In this embodiment, when the Batch_Size of the model is set to 128, the third threshold may be 0.1, that is, 10% of the sample size is retained.

S53: sample and train on the training samples based on the weight values and the third threshold. When a combined label is non-rare, it is sampled for training according to its weight value; when it is a rare combined label, it is sampled according to the third threshold. Sampling rare combined labels according to the third threshold increases how often they appear, gives the model the chance to learn their features, and thus improves the overall generalization of the model.
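A batch sampler in the spirit of S52 to S53 could be sketched as follows. Several points are assumptions, not statements of the application: the weight of a non-rare label is taken to be its share of the non-rare samples, since the weight formula is not reproduced in this text; the third threshold is read as the fraction of each batch reserved for rare labels; and the names `sample_batch` and `samples_by_label` are hypothetical.

```python
import random

def sample_batch(samples_by_label, counts, rare_labels,
                 batch_size=128, third_threshold=0.1, rng=random):
    """Draw one training batch.

    ASSUMPTIONS: non-rare labels are drawn multinomially with weight
    count / total; rare labels fill a fixed share (third_threshold)
    of the batch. At least one non-rare label must have samples.
    """
    n_rare = int(round(batch_size * third_threshold)) if rare_labels else 0
    common = [l for l in counts if l not in rare_labels and counts[l] > 0]
    total = sum(counts[l] for l in common)
    weights = [counts[l] / total for l in common]

    batch = []
    # Rare labels: reserved slots, drawn uniformly over the rare labels.
    rare = sorted(rare_labels)
    for _ in range(n_rare):
        label = rng.choice(rare)
        batch.append(rng.choice(samples_by_label[label]))
    # Non-rare labels: multinomial draw over labels using the weights.
    for label in rng.choices(common, weights=weights, k=batch_size - n_rare):
        batch.append(rng.choice(samples_by_label[label]))
    return batch
```

With a batch size of 128 and a threshold of 0.1, this sketch reserves 13 slots per batch for rare combined labels, so they are seen in every batch even though they are scarce in the data set.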

With the label-combination multinomial distribution sampling method, the input training samples go through a forward pass of the model followed by a backward pass that updates the parameters. This is repeated until the preset training criterion is reached, which completes the training of the secondary re-parameterized model.

S6: connect the secondary re-parameterized model in series with the output of the primary detector, and feed an original image containing a pedestrian target to the input of the primary detector, to obtain the finally predicted pedestrian target and its corresponding fine-grained additional attributes. This provides richer attribute information than the result of the primary detector alone, as shown in fig. 6. In addition, once the secondary re-parameterized model is connected in series with the primary detector, the false detections of the primary detector are corrected and the false detection rate is reduced, as shown in fig. 7.
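The series connection can be sketched as a simple two-stage loop. Everything here is hypothetical: the callables `primary_detector`, `crop_fn`, and `secondary_model`, the output keys `is_person` and `attributes`, and the 0.5 threshold are placeholders, and it is assumed that one head of the secondary model scores whether the patch really contains a pedestrian.

```python
def cascade_predict(image, primary_detector, crop_fn, secondary_model,
                    person_threshold=0.5):
    """Run the primary detector, then re-check each box with the secondary model.

    primary_detector(image) -> iterable of boxes
    crop_fn(image, box)     -> patch prepared for the secondary model
    secondary_model(patch)  -> {"is_person": float, "attributes": dict}
    """
    results = []
    for box in primary_detector(image):
        out = secondary_model(crop_fn(image, box))
        if out["is_person"] < person_threshold:
            continue  # treated as a false detection of the primary detector
        results.append({"box": box, "attributes": out["attributes"]})
    return results
```

In a real deployment `crop_fn` would apply the same expansion, scaling, and adaptive move used to build the training samples, so that the secondary model sees patches like those it was trained on.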

Of course, this embodiment may also be adjusted for different scenarios according to the computing power of the system, which is not described further here.

In the pedestrian detection method of the cascade neural network model, cascading a lightweight classification model suppresses false detection samples and corrects the false detection results of the existing primary detector. The fine-grained attributes of each sample can also be predicted, so that finer sample attributes are output and the false detection rate of pedestrian targets is reduced. More pedestrian target information is provided to the downstream planning and control system for decision making, which improves overall safety. Meanwhile, the computational cost of the model is small, so real-time prediction is achieved in actual deployment without introducing noticeable system delay, and the applicability is stronger.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Referring to fig. 8, an embodiment of the present application further provides a pedestrian detection apparatus of a cascaded neural network model, where the pedestrian detection apparatus of the cascaded neural network model corresponds to the pedestrian detection method of the cascaded neural network model in the foregoing embodiment one to one. The pedestrian detection device of the cascade neural network model comprises:

the supplementary module is used for carrying out supplementary labeling of additional attributes on the original frame and generating a patch type small picture taking a pedestrian target as a center;

the expansion module is used for expanding the patch type small picture to a preset height along the head and foot direction of the pedestrian target, expanding the scaled patch type small picture to a preset width according to a preset first threshold value, and obtaining a training sample;

and the detection module is used for connecting the secondary re-parameterized model with the primary detector in series and inputting an original image containing a pedestrian target for prediction.

Wherein, the training module includes:

the combined label unit is used for arranging and combining the additional attributes of the original frame to obtain a combined label;

a weighting unit for calculating a weight value of each combination label;

and the training unit is used for sampling and training the training samples based on the weight values and a preset third threshold value.

Further, the combination tag unit further includes:

the rare combined label subunit is used for presetting a second threshold value, and taking the combined labels whose quantity is lower than the second threshold value as rare combined labels;

the weight unit further includes:

and the threshold subunit is used for presetting a third threshold so as to sample the rare combined label according to the third threshold.

Further, a pedestrian detection apparatus of the cascade neural network model further includes:

the scaling module is used for scaling the patch type small picture to a preset size;

and the adjusting module is used for enabling the four vertex coordinates of the patch type small picture to move in a self-adaptive manner in the original image of the pedestrian target.

For the specific definition of the pedestrian detection device of the cascaded neural network model, refer to the above definition of the pedestrian detection method of the cascaded neural network model; details are not repeated here. All or part of each module in the pedestrian detection device of the cascaded neural network model can be realized by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a pedestrian detection method of a cascaded neural network model.

In one embodiment, a computer-readable storage medium is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

S1: acquiring an original frame of a pedestrian target;

In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a pedestrian detection method of a cascaded neural network model.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the system is divided into different functional units or modules to perform all or part of the above-mentioned functions.

Claims

1. A pedestrian detection method of a cascade neural network model is characterized by comprising the following steps:

acquiring an original frame of a pedestrian target;

2. The pedestrian detection method of the cascade neural network model according to claim 1, wherein the re-parameterizable residual unit includes a Conv 3x3, a Conv 1x1, a plurality of normalization layers for extracting semantic information, and an activation function SiLU, the Conv 3x3 and the Conv 1x1 are arranged in parallel with one of the normalization layers, the remaining normalization layers are respectively connected to the output ends of the Conv 3x3 and the Conv 1x1, and the output result of the branch where the Conv 3x3 is located, the output result of the branch where the Conv 1x1 is located, and the output result of the normalization layer arranged in parallel are sequentially superimposed and then input into the activation function SiLU.

3. The method of claim 1, wherein the training samples are input into the multi-head classification network, and the step of training to obtain a secondary re-parameterized model further comprises:

4. The method as claimed in claim 3, wherein the step of training the training samples in the multi-head classification network by using a label-combination multinomial distribution sampling method comprises:

calculating a weight value of each combined label;

and sampling and training the training samples based on the weight values.

5. The method as claimed in claim 4, wherein the step of training the training samples in the multi-head classification network by using a label-combination multinomial distribution sampling method further comprises:

6. The pedestrian detection method of the cascade neural network model according to any one of claims 1 to 5, wherein before the step of expanding the patch small picture to a preset height along a head-foot direction of the pedestrian target and expanding the scaled patch small picture to a preset width according to a preset first threshold, the method further comprises the following steps:

and scaling the patch type small picture to a preset size.

7. The method for pedestrian detection based on the cascaded neural network model according to any one of claims 1 to 5, wherein the step of obtaining the training samples further comprises the following steps:

8. A pedestrian detection device of a cascade neural network model, comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory, the processor executing the computer program to perform the steps of the method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

11. A computer program product, characterized in that it comprises a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.