CN115147922A - Monocular pedestrian detection method, system, equipment and medium based on embedded platform - Google Patents


Info

Publication number
CN115147922A
CN115147922A
Authority
CN
China
Prior art keywords
image
loss
frame
monocular
pedestrian detection
Prior art date
Legal status
Pending
Application number
CN202210643994.3A
Other languages
Chinese (zh)
Inventor
洪刚
陈豪
Current Assignee
Zhejiang Yiti Technology Co ltd
Original Assignee
Zhejiang Yiti Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Yiti Technology Co ltd
Priority to CN202210643994.3A
Publication of CN115147922A

Classifications

    • G06V40/23: Recognition of whole body movements, e.g. for sport training
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06V10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V10/764: Recognition using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82: Recognition using neural networks
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. of sport video content

Abstract

The invention discloses a monocular pedestrian detection method, system, equipment and medium based on an embedded platform, and relates to the technical field of image processing. The monocular pedestrian detection method comprises the following steps: S1, an image acquisition unit acquires the current color image using a monocular camera; S2, the image information obtained by the image acquisition unit is input; S3, the image information is preprocessed, including clipping it to a fixed size without distortion and carrying out normalization correction; S4, features of different scales are extracted from the preprocessed image using a convolutional neural network; S5, the features of different scales from step S4 are fused to generate preselected frames; S6, a judgment is made according to the fused features and a detection result is output. The method effectively solves the problems of low monocular pedestrian detection accuracy and large computation, and has the advantages of small model size, small computation, low requirement on device computing power, simple overall operation, and clear detection results.

Description

Monocular pedestrian detection method, system, equipment and medium based on embedded platform
Technical Field
The invention relates to the technical field of image processing, in particular to a monocular pedestrian detection method, a monocular pedestrian detection system, monocular pedestrian detection equipment and a monocular pedestrian detection medium based on an embedded platform.
Background
As chip manufacturing processes mature, embedded devices are developing toward smaller size, more functions and higher computing power, and their functions have expanded from simple computation or communication to today's smart home, mobile office and the like. In particular, with the proposal of the "Internet of Everything" concept and the popularization of Internet of Things technology, embedded devices are gradually becoming the control core of entire systems.
Traditional embedded devices usually rely on keys, display screens and various sensors for external information interaction. In recent years, as computer vision, and especially deep learning, has matured, a vision sensor represented by the camera can provide more convenient and accurate service in the interaction between user and system, and more and more embedded devices adopt cameras as the interaction medium.
The cameras on the current market can be roughly divided into monocular, binocular and RGB-D cameras. A monocular camera collects visual information with a single camera and obtains the two-dimensional information of a target; a binocular camera shoots the same target with two cameras and obtains both two-dimensional information and depth information by computing the imaging parallax between them; an RGB-D camera measures depth information by the physical means of structured light or ToF and is mainly used for three-dimensional imaging. For a pedestrian detection system with an embedded platform at its core, the monocular camera has the advantages of low cost, small size, easy integration and convenient maintenance, and is therefore widely deployed.
In pedestrian detection systems that use a monocular camera as the visual information acquisition device, three types can be distinguished by detection means: systems based on traditional algorithms, on machine learning, and on deep learning. Traditional pedestrian detection algorithms are only suitable for moving targets: a background modeling algorithm obtains the moving foreground target, and a classifier judges whether it contains a pedestrian. Machine-learning-based methods mainly adopt the scheme of hand-crafted features plus a classifier, judging whether the current image contains pedestrians from low-level features such as color, edges and texture. Deep-learning-based methods use a convolutional neural network to extract pixel information and semantic information from a larger volume of data, obtaining more accurate and reliable detection results.
At present, monocular-camera pedestrian detection systems mostly use deep learning to acquire, process and feed back image information. According to the functional requirements of the system, data containing pedestrian images in different scenes are first collected and annotated with the position and category information of the pedestrians; then a suitable network model structure is designed; finally the annotated image information is input into the network for training, and optimal or locally optimal model weights are obtained through continuous iterative optimization. At runtime the system processes the input image with the trained model and outputs a confidence score and a category label. Some pedestrian detection network models adopt a two-stage method: before the data are input into the model, a Selective Search algorithm preprocesses the image data to obtain a large number of preselected frames, and model training proceeds from these frames. The other scheme adopts an end-to-end model that generates preselected frames during training, obtaining better real-time performance. Especially as lightweight models are proposed and optimized, the latter scheme is the more feasible one on an embedded platform.
At present, some pedestrian detection systems designed for embedded platforms with deep learning use a lightweight model to extract features, but the model weights and floating-point computation used at runtime still incur large memory and time overhead, and real-time performance cannot be guaranteed.
Among camera-based image acquisition solutions, the binocular camera is easily affected by its mounting position and, constrained by comprehensive factors such as cost, manufacturing process and reliability, is difficult to popularize in the market. RGB-D cameras based on structured light, ToF and the like are sensitive to lighting changes, cannot acquire effective information under strong illumination, and are currently used mainly indoors and in other specific application scenarios.
Therefore, there is a need for further improvements in embedded platform-based monocular pedestrian detection methods, systems, devices, and media to address the various deficiencies described above.
Disclosure of Invention
The purpose of the application is: to provide a monocular pedestrian detection method, system, equipment and medium based on an embedded platform that remedy the defects of the prior art, effectively solve the problems of low monocular pedestrian detection accuracy and large computation, and have the advantages of small model size and computation, low requirement on device computing power, simple overall operation, and clear detection results.
The application aims to be completed through the following technical scheme, and the monocular pedestrian detection method based on the embedded platform comprises the following steps:
s1, an image acquisition unit acquires a current color image by using a monocular camera;
s2, inputting image information obtained by an image acquisition unit;
s3, preprocessing image information, including cutting the image information into a fixed size without distortion and carrying out normalization correction;
s4, extracting features of different scales from the preprocessed image by using a convolutional neural network;
s5, fusing the features of different scales in the step S4 to generate a preselection frame;
s6, judging according to the fusion characteristics and outputting a detection result;
wherein S5 specifically is:
s51, selecting a plurality of the extracted features and fusing them at 6 different scales to detect pedestrians of large and small imaging sizes respectively; applying a regression algorithm directly to the 6 feature maps to obtain 6 groups of data, each group containing the pedestrian confidence scores and coordinate information predicted from its corresponding feature map; and fusing the 6 groups of data to participate in the loss function calculation;
s52: fusing the feature maps to generate a plurality of preselected frames, wherein the sizes of the preselected frames are obtained by scaling relative to the original image, with the scaling formula as follows:
$$R_k = R_{\min} + \frac{R_{\max} - R_{\min}}{m - 1}(k - 1), \quad k \in [1, m]$$
wherein R_max and R_min are preset to 0.9 and 0.2 respectively; m is the number of feature maps used, here 6; substituting k into the formula gives the scaling ratio, and multiplying it by the image size gives the size of the preselected frames; for the feature maps of sizes 38 × 38, 3 × 3 and 1 × 1, 4 preselected frames are generated using aspect ratios {1, 2, 1/2}; for the sizes 19 × 19, 10 × 10 and 5 × 5, 6 preselected frames are generated using aspect ratios {1, 2, 1/2, 3, 1/3};
s53, filtering redundant candidate frames using non-maximum suppression, and matching a preselected frame with a real frame if their IoU is greater than 0.5; if a preselected frame has IoU greater than 0.5 with several real frames, it is matched with the real frame of largest IoU, that is, one real frame can match several preselected frames, but each preselected frame matches only one real frame; a preselected frame that meets neither condition is classified as background, i.e. a negative sample; and sorting the negative samples by confidence loss and letting them participate in back propagation at a positive-to-negative sample ratio of 1:3.
Preferably, the step S4 specifically includes:
s41, performing undistorted size adjustment on the input image;
s42, processing the fixed-size image by using a plurality of feature extraction basic units;
S43, obtaining a plurality of characteristic graphs with different receptive fields and different scales.
Preferably, the feature extraction basic unit is composed of depthwise separable convolution, max pooling and BN (batch normalization) layers, uses the ReLU6 function as the activation function to obtain nonlinearity and sparsity of the weights earlier, and also comprises L2-NORM (regularization) and linear residual operations.
Preferably, the feature extraction basic units of different step sizes are optimized differently, the channel attention mechanism is added to the feature extraction basic unit of step size 1, and the Zero Padding operation is added to the feature extraction unit of step size 2.
Preferably, the step S6 specifically includes:
s61: comparing the preselection frame generated in the step S5 with a preset threshold, and reserving the preselection frame higher than the threshold, otherwise, considering the preselection frame as background information and discarding the preselection frame;
s62: calculating loss between the reserved pre-selection frame and the image label;
s63: if the loss meets the termination condition, the training process is terminated and the detection result is output; if the termination condition is not met, the loss is fed back to step S4, and the network parameters are updated and optimized;
s64: repeatedly executing the step S63 until the termination condition is met;
s65: after step S64 finishes executing, the detection result is output, and the user confirms according to the detection result whether the task requirements are met; if so, the model is retained; if not, the image data are cleaned or the network structure is optimized until the task requirements are met.
S66: a model with 32-bit floating-point weights is obtained according to step S65 and then further converted and optimized: with the aid of a model conversion tool, a suitable optimization means is selected in the conversion process, the 32-bit floating-point weights are converted into 16-bit floating-point numbers or 8-bit integers, and models with different weight types can be freely selected according to the task requirements.
Preferably, the calculated loss in step S62 is composed of two parts: one part is the confidence loss, i.e. the classification loss; the other part is the regression loss, i.e. the position loss.
Wherein the classification loss calculation formula is as follows:
$$L_{conf}(x, c) = -\sum_{i \in Pos}^{S} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \quad \hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p}\exp\left(c_i^{p}\right)}$$
wherein i represents the serial number of the prediction box; j represents the serial number of the label; p is the category number, and p = 0 represents the background;
$x_{ij}^{p}$ indicates whether the ith prediction box matches the jth label of category p, taking only the values 0 and 1; $\hat{c}_i^{p}$ represents the probability that the ith prediction box predicts category p. Pos and Neg represent the positive and negative samples; S represents the number of matched candidate boxes.
The position loss is calculated as the Smooth L1 loss between the prediction box L and the tag g, and the formula is as follows:
$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{S} \sum_{m \in \{x, y, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$
wherein { x, y, w, h } represents the x, y axis coordinates of the top left corner of the prediction box and the width and height of the prediction box, respectively. Total loss:
$$L(x, c, l, g) = \frac{1}{S}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$
when S =1, α is set to 1 by cross validation.
Preferably, the TFLite model conversion tool of TensorFlow is used in step S66, and the optimization means are: adding metadata (MetaData) according to the network inputs at model conversion, and specifying the operators at conversion.
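The effect of narrowing the weight type can be illustrated with a small numpy sketch (the real conversion is performed by the TFLite converter; `quantize_weights` is a hypothetical helper, and the symmetric int8 scheme shown is one common choice, not necessarily the exact scheme TFLite applies):

```python
import numpy as np

def quantize_weights(w32, mode="float16"):
    """Shrink float32 weights: cast to float16, or symmetric int8 quantization."""
    if mode == "float16":
        return w32.astype(np.float16)               # 2 bytes per weight
    scale = float(np.abs(w32).max()) / 127.0 or 1.0  # map the largest weight to 127
    q = np.clip(np.round(w32 / scale), -127, 127).astype(np.int8)
    return q, scale                                  # 1 byte per weight plus one scale

w = np.array([0.6, -1.0, 0.25], dtype=np.float32)
w16 = quantize_weights(w)                            # half the size of w
q8, s = quantize_weights(w, mode="int8")
restored = q8.astype(np.float32) * s                 # approximate dequantization
```

The float16 model halves the weight storage with almost no accuracy loss, while the int8 model quarters it at the cost of a small quantization error, which is why the method lets the user choose the weight type per task.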
The invention also provides a monocular pedestrian detection system based on the embedded platform, which comprises:
the power supply unit provides power support for the whole system;
a central processing unit which controls other units and receives feedback of each unit;
the image acquisition unit acquires a current color image by using a monocular camera;
the image data processing unit receives the image acquired by the image acquisition unit and an instruction sent by the central processing unit, performs preprocessing, feature extraction, feature fusion and identification operations on the image according to the instruction, acquires whether the current image contains a pedestrian target result, and feeds back the result to the central processing unit;
and the communication control unit, which is the medium of communication between the central processing unit and the peripheral unit and transmits the image detection results fed back to the central processing unit on to the peripheral unit;
and the peripheral unit, which visualizes the current detection result in real time in graphic and text form, including the number of pedestrian targets, their positions in the current picture, and confidence information.
The present invention also provides an electronic device, comprising: one or more processors; storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the monocular pedestrian detection method based on the embedded platform provided by the invention.
The present invention also provides a computer-readable storage medium, which stores a computer program, where the computer program can be executed by a computer processor to implement any one of the above-mentioned monocular pedestrian detection methods based on an embedded platform.
Compared with the prior art, the application has the following obvious advantages and effects:
1. built based on an embedded platform, the integrated level and the portability are high, the cost is low, and the power consumption is low.
2. And the monocular camera is adopted to acquire image information, so that the applicable scene types are rich, and the equipment maintenance cost is low.
3. The pedestrian detection is carried out by using the light-weight convolutional neural network after the conversion optimization, the model size and the calculated amount are small, and the requirement on the calculation force of equipment is low.
4. The system has strong expansibility, the units are mutually independent, external interfaces are rich, and the iterative upgrade is convenient; and the temperature measurement and other tasks are completed by matching with other sensors.
5. The whole system is simple to operate, and the detection result is clear.
Drawings
Fig. 1 is an overall flowchart of the monocular pedestrian detection method based on the embedded platform according to the present application.
Fig. 2 is a schematic structural diagram of a feature extraction unit in the present application.
Fig. 3 is a schematic diagram of the overall network architecture framework in the present application.
Fig. 4 is a flow chart of network training in the present application.
Fig. 5 is a flowchart of step S6 in the present application.
Fig. 6 is an overall flowchart of the embedded platform-based monocular pedestrian detection system in the present application.
Fig. 7 is a schematic structural diagram of an electronic device in the present application.
Fig. 8 is a schematic illustration of the application in operation.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some structures related to the present invention are shown in the drawings, not all of them.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations (or steps) can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The embedded platform-based monocular pedestrian detection method, system, device and medium provided in the present application are described in detail by the following embodiments and alternatives thereof.
Fig. 1 is a flowchart of a monocular pedestrian detection method based on an embedded platform provided in an embodiment of the present invention. The method can comprise the following steps:
s1, an image acquisition unit acquires a current color image by using a monocular camera;
in the embodiment of the application, the image acquisition unit acquires the current color image by using the monocular camera. The camera carries the cloud platform, can be by the user to the camera send the instruction about 120 degrees, the rotation of 45 degrees from top to bottom, obtain the video image of different angles. The camera can perform self-adjustment to a certain degree, and the user can select the video image quality most suitable for the task requirement.
S2, inputting image information obtained by an image acquisition unit;
s3, preprocessing image information, including cutting the image information into a fixed size without distortion and carrying out normalization correction;
in the embodiment of the present application, clipping without distortion: the density of pixels of an image captured by a camera is often large, network calculation cost is increased, and subsequent network calculation amount is reduced while an original image structure is ensured by performing undistorted clipping (pixel number reduction) on the image. Normalization correction: the image is normalized, namely the pixels after clipping are divided by 255, and the image before and after normalization has no change.
S4, extracting features of different scales from the preprocessed image by using a convolutional neural network; step S4 specifically includes:
s41, performing undistorted size adjustment on the input image;
s42, processing the fixed-size image by using a plurality of feature extraction basic units;
S43, obtaining a plurality of characteristic graphs with different receptive fields and different scales.
In the embodiment of the present application, as shown in fig. 2, a schematic structural diagram of the feature extraction unit, step S4 is described in detail below. The first feature extraction basic unit receives an RGB image with dimensions 300 × 300 × 3 and outputs a feature map of n channels with the size reduced by half, where n is related to the number of convolutions used by the unit. Each subsequent feature extraction basic unit receives the feature map output by the previous one. Taking the first unit to describe the detailed processing: for the 300 × 300 × 3 image (after undistorted adjustment), 16 depthwise separable convolutions first scan the image with a sliding window, with convolution stride 2 and kernel size 3 × 3, obtaining a feature map of size 150 × 150 × 16; the map then passes through a max pooling layer (which further reduces the image size and enlarges the receptive field) and finally through BN (which reduces the feature map variance between training iterations and speeds up training), and is output to the next feature extraction basic unit. Each feature extraction unit also adopts a residual structure. The basic units that form the feature extraction network consist of convolution operations with different strides and depthwise separable convolution operations, and the units with different strides are optimized differently: a channel attention mechanism is added to the feature extraction basic unit with stride 1, and a Zero Padding operation is added to the feature extraction unit with stride 2.
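The core operation of such a unit can be sketched in plain numpy (depthwise 3 × 3 convolution with stride 2 followed by a 1 × 1 pointwise convolution and ReLU6; pooling, BN and the residual path are omitted, and all names here are illustrative, not the patent's implementation):

```python
import numpy as np

def relu6(x):
    """ReLU6 activation: min(max(x, 0), 6)."""
    return np.minimum(np.maximum(x, 0.0), 6.0)

def depthwise_separable_conv(x, dw_kernels, pw_kernels, stride=2):
    """x: (H, W, Cin); dw_kernels: (3, 3, Cin), one filter per channel;
    pw_kernels: (Cin, Cout). Depthwise 3x3 conv with 'same' zero padding,
    then a 1x1 pointwise conv that mixes channels, then ReLU6."""
    h, w, cin = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))           # zero padding
    oh = (h + 2 - 3) // stride + 1
    ow = (w + 2 - 3) // stride + 1
    dw = np.zeros((oh, ow, cin))
    for i in range(oh):                                 # depthwise: per-channel filtering
        for j in range(ow):
            patch = xp[i*stride:i*stride + 3, j*stride:j*stride + 3, :]
            dw[i, j] = np.sum(patch * dw_kernels, axis=(0, 1))
    return relu6(dw @ pw_kernels)                       # pointwise conv + activation
```

With a 300 × 300 × 3 input, 16 pointwise output channels and stride 2, this produces the 150 × 150 × 16 map described above; the factorization into depthwise and pointwise steps is what makes the unit cheap on an embedded platform.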
The method selects feature maps of 6 sizes in total: 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1. The large feature maps use shallow information to predict pedestrians with smaller imaging sizes, and the small feature maps use deep information to predict pedestrians with larger imaging sizes. With this multi-scale detection mode, pedestrians of both imaging sizes can be detected and the detection effect is improved. The application adopts the lightweight convolutional neural network MobileNetV2, which uses depthwise separable convolution and the ReLU6 function and can reduce memory and bandwidth consumption in an embedded system.
The feature extraction basic unit is composed of depthwise separable convolution, max pooling and BN (batch normalization) layers, uses the ReLU6 function as the activation function to obtain nonlinearity and sparsity of the weights earlier, and also comprises L2-NORM (regularization) and linear residual operations.
In the embodiment of the application, as shown in fig. 3, a schematic diagram of a network overall structure framework is shown, and S5, features of different scales in the step S4 are fused to generate a preselection frame; the step S5 specifically comprises the following steps:
s51, selecting a plurality of the extracted features and fusing them at 6 different scales to detect pedestrians of large and small imaging sizes respectively; applying a regression algorithm directly to the 6 feature maps to obtain 6 groups of data, each group containing the pedestrian confidence scores and coordinate information predicted from its corresponding feature map; and fusing the 6 groups of data to participate in the loss function calculation;
s52: fusing the feature maps to generate a plurality of preselected frames, wherein the sizes of the preselected frames are obtained by scaling relative to the original image, with the scaling formula as follows:
$$R_k = R_{\min} + \frac{R_{\max} - R_{\min}}{m - 1}(k - 1), \quad k \in [1, m]$$
wherein R_max and R_min are preset to 0.9 and 0.2 respectively; m is the number of feature maps used, here 6; substituting k into the formula gives the scaling ratio, and multiplying it by the image size gives the size of the preselected frames; for the feature maps of sizes 38 × 38, 3 × 3 and 1 × 1, 4 preselected frames are generated using aspect ratios {1, 2, 1/2}; for the sizes 19 × 19, 10 × 10 and 5 × 5, 6 preselected frames are generated using aspect ratios {1, 2, 1/2, 3, 1/3};
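Under these preset values the scaling formula can be checked with a short script (the frame sizes assume a 300 × 300 input, matching the description; the helper name is illustrative):

```python
def preselect_scales(r_min=0.2, r_max=0.9, m=6):
    """Linear interpolation of preselected-frame scales between R_min and R_max."""
    return [round(r_min + (r_max - r_min) * (k - 1) / (m - 1), 2)
            for k in range(1, m + 1)]

# scales for the 6 feature maps, and the frame side lengths on a 300-pixel image
scales = preselect_scales()
sizes = [round(300 * s) for s in scales]
```

The smallest-scale frames (60 pixels on a 300 × 300 image) belong to the largest feature map and catch small pedestrians, while the 270-pixel frames on the 1 × 1 map cover pedestrians that nearly fill the image.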
s53, filtering redundant candidate frames using non-maximum suppression, wherein a preselected frame is matched with a real frame if their IoU (intersection-over-union) is greater than 0.5; if a preselected frame has IoU greater than 0.5 with several real frames, it is matched with the real frame of largest IoU, that is, one real frame can match several preselected frames, but each preselected frame matches only one real frame; a preselected frame that meets neither condition is classified as background, i.e. a negative sample; and sorting the negative samples by confidence loss and letting them participate in back propagation at a positive-to-negative sample ratio of 1:3.
In the embodiment of the application, the network is composed of a feature extraction part and a detection and identification part. In feature extraction, the input image first undergoes undistorted size adjustment, then a plurality of feature extraction basic units process the fixed-size image, yielding several feature maps with different receptive fields and different scales. The detection and identification part selects several of the extracted features and fuses them at different scales; fusion here means stacking feature maps along the channel dimension, like combining several cubes into a larger cube. After a series of fusions the features are adjusted and merged into two parts of content: one part contains the position information, i.e. the coordinates of the detection frames; the other contains the category and score information. A reshape (resizing) operation is performed on each of the two parts, because fusing them into a larger cube involves a large amount of computation and the size needs to be adjusted again; the two parts are then spliced into the final information output. Each pedestrian detection target requires only one detection frame, so redundant candidate frames are filtered out using NMS (Non-Maximum Suppression). For each real frame, the preselected frame with the largest intersection-over-union (IoU) with it is selected as the matching frame of the current real frame, i.e. a positive sample; in this way every real frame is guaranteed a corresponding preselected frame. The fused features contain more pixel information and richer semantic information, which improves the detection precision.
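The IoU computation and greedy NMS filtering described here can be sketched as follows (boxes given as `(x1, y1, x2, y2)`; the 0.5 threshold matches the text, everything else is illustrative):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))                   # keep the best remaining box
        # drop every remaining box that overlaps it too much
        order = order[1:][[iou(boxes[i], boxes[j]) < thresh for j in order[1:]]]
    return keep
```

Because each surviving box suppresses its heavy overlaps, each pedestrian ends up with a single detection frame, as required.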
After the two parts of contents are completed, the unit outputs the position information, the category information and the confidence degree information of the pedestrian target and feeds back the position information, the category information and the confidence degree information to the detection result of the central processing unit, and the central processing unit controls the peripheral unit to output the detection result according to the feedback result so as to provide a basis for the next decision of a user.
In the embodiment of the application, fig. 4 and 5 show the network training flow chart. S6, judgment is performed according to the fused features, and the detection result is output; step S6 specifically includes:
s61: comparing the preselection frame generated in the step S5 with a preset threshold, and reserving the preselection frame higher than the threshold, otherwise, considering the preselection frame as background information and discarding the preselection frame;
s62: calculating loss between the reserved pre-selection frame and the image label;
The loss calculated in step S62 consists of two parts: one part is the confidence loss, i.e., the classification loss; the other part is the regression loss, i.e., the position loss.
Wherein the classification loss calculation formula is as follows:

$$L_{conf}(x,c) = -\sum_{i \in Pos}^{S} x_{ij}^{p}\,\log\big(\hat{c}_{i}^{p}\big) - \sum_{i \in Neg}\log\big(\hat{c}_{i}^{0}\big), \qquad \hat{c}_{i}^{p} = \frac{\exp(c_{i}^{p})}{\sum_{p}\exp(c_{i}^{p})}$$

wherein i represents the serial number of the prediction box; j represents the serial number of the label; p is a category number, with p = 0 representing the background; $x_{ij}^{p}$ is an indicator that the ith prediction box matches the jth label for category p, taking only the values 0 and 1; $\hat{c}_{i}^{p}$ represents the probability that the ith prediction box predicts class p. Pos and Neg represent the positive and negative samples; S represents the number of matched candidate boxes.
The position loss is calculated as the Smooth L1 loss between the prediction box l and the label g, and the formula is as follows:

$$L_{loc}(l,g) = \sum_{i \in Pos}^{S}\;\sum_{m \in \{x,y,w,h\}} x_{ij}^{p}\,\mathrm{smooth}_{L1}\big(l_{i}^{m} - \hat{g}_{j}^{m}\big)$$

wherein {x, y, w, h} represent the x- and y-axis coordinates of the upper-left corner of the prediction box and the width and height of the prediction box, respectively. The total loss is:

$$L = \frac{1}{S}\big(L_{conf}(x,c) + \alpha\,L_{loc}(l,g)\big)$$

When S = 0, the loss is set to 0; α is set to 1 by cross validation.
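The Smooth L1 term and the combination of confidence and position loss described above can be sketched numerically in plain Python. Function names are illustrative, and a real implementation would operate on batched tensors rather than lists:

```python
def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def total_loss(conf_losses_pos, conf_losses_neg, loc_offsets, alpha=1.0):
    """Total loss = (confidence loss of positives and negatives
    + alpha * Smooth L1 position loss) / S, where S is the number of
    matched (positive) candidate boxes; defined as 0 when S == 0."""
    s = len(conf_losses_pos)
    if s == 0:
        return 0.0
    l_conf = sum(conf_losses_pos) + sum(conf_losses_neg)
    l_loc = sum(smooth_l1(d) for offs in loc_offsets for d in offs)
    return (l_conf + alpha * l_loc) / s
```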
In the embodiment of the application, the loss function of the deep convolutional neural network influences the time required for model training, the performance of the model, and so on. According to the task requirements and the network structure, the system uses the classification loss of positive and negative samples plus a positive-sample position loss with a penalty term as the final network loss. First, the classification loss function and the position loss function are defined, and the classification loss and position loss of all positive labels are calculated. Then the number of positive and negative samples of each image in a batch is counted; if the number of negative samples over all images in the batch is 0, N (N > 1, N a natural number) candidate boxes are selected as negative samples by default. The selection criterion is a judgment on the prediction result: if a candidate box has a high probability of neither containing the target object nor belonging to the background, it is marked as a hard-to-classify sample; the probabilities of all non-background classes are summed, and a larger sum means the sample is harder to classify. Candidate boxes that do not contain the target object still remain in this process, and the N hardest-to-classify candidate boxes are selected as negative samples. The overall loss equals the classification loss of the positive samples, plus the position loss of the positive samples with the penalty term introduced, plus the classification loss of the negative samples; normalization then yields the final loss. This loss serves as the basis for back propagation to update and optimize the network parameters.
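The hard-negative selection described above can be sketched as follows. The positive-to-negative ratio of 1:3 is an assumption (a common default in SSD-style detectors); the exact ratio and the fallback count N are garbled in the source text, so both parameters are hypothetical:

```python
def hard_negative_mining(neg_conf_losses, num_pos, neg_pos_ratio=3, fallback_n=2):
    """Keep only the hardest negatives: sort background candidates by
    confidence loss (descending) and keep neg_pos_ratio * num_pos of
    them. If there are no positives at all, fall back to fallback_n
    negatives (N > 1 in the source). Returns indices of kept negatives."""
    keep = neg_pos_ratio * num_pos if num_pos > 0 else fallback_n
    order = sorted(range(len(neg_conf_losses)),
                   key=lambda i: neg_conf_losses[i], reverse=True)
    return order[:keep]
```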
S63: if the loss meets the termination condition, the training process is terminated and the detection result is output; if the termination condition is not met, the loss is fed back to step S4, and the network parameters are updated and optimized;
s64: repeatedly executing the step S63 until the termination condition is met;
s65: after step S64 finishes execution, the detection result is output, and the user confirms from the detection result whether the task requirement is met; if so, the model is retained; if not, the image data is cleaned or the network structure is optimized until the task requirement is met.
S66: and (6) obtaining a model with the weight type of 32-bit floating point number according to the step S65, further converting and optimizing the model, selecting a proper optimization means in the conversion process by means of a model conversion tool, converting the weight of the 32-bit floating point number into a 16-bit floating point number or an 8-bit integer, and freely selecting models with different weight types according to task requirements.
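The float32-to-int8 weight conversion mentioned in S66 can be illustrated with simple affine quantization arithmetic. This is a sketch of the general scale/zero-point scheme used by converters such as TFLite, not the actual tool internals; parameter choices here are illustrative:

```python
def quantize_int8(weights):
    """Affine quantization of float32 weights to int8 using a scale and
    zero point (illustrative only; real converters choose parameters
    per tensor or per channel)."""
    lo, hi = min(weights), max(weights)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # representable range must include 0
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the int8 representation."""
    return [(v - zero_point) * scale for v in q]
```

Storing one byte per weight instead of four is what shrinks the model volume by roughly 4x, at the cost of the small reconstruction error visible on dequantization.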
In the training process of the network, the images and their annotation labels are fed into the model together, and the network parameters are iteratively optimized according to how the loss decreases during training, until the training termination condition is met. Before training, a certain number of images containing single or multiple pedestrian targets must be acquired, and the pedestrian targets should show some variety, including the position of the pedestrian in the image, the posture presented, the clothing color, the presence of occlusion, and so on. The images are labeled as accurately as possible; the labeled content includes, but is not limited to, the pedestrian target position frame, the category attribute, and the detection difficulty. In the embodiment of the application, the TFLite model conversion tool of TensorFlow is used, and the optimization means are as follows: metadata is added at model conversion time according to the network input, and operators are specified at conversion time. To reduce training time and overfitting, an early-stopping function is introduced: when the absolute value of the decrease in validation loss is less than 0.001, network training stops.
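The early-stopping criterion above (stop once the absolute decrease in validation loss falls below 0.001) can be sketched as a small helper; the function name is illustrative:

```python
def should_stop(val_losses, min_delta=0.001):
    """Early stopping: return True once the absolute drop in validation
    loss between the last two epochs falls below min_delta."""
    if len(val_losses) < 2:
        return False
    return abs(val_losses[-2] - val_losses[-1]) < min_delta
```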
Fig. 6 is a flowchart of a monocular pedestrian detection system based on an embedded platform provided in an embodiment of the present invention. The system comprises the following units:
a central processing unit, which controls the other units and receives feedback from each unit; the central processing unit is the core of the overall system. When another unit behaves abnormally during operation, the central processing unit sends a warning signal to the user and, if necessary, can terminate the operation of the whole system. Meanwhile, the computing power of the central processing unit determines the real-time performance and overall performance of the system, so a suitable device must be selected for it according to the task requirements.
The image acquisition unit acquires the current color image by using a monocular camera. The camera is mounted on a pan-tilt head, and the user can send instructions to rotate the camera 120 degrees left and right and 45 degrees up and down to obtain video images from different angles. The camera can adjust itself to a certain degree, and the user can select the video image quality best suited to the task requirements.
The image data processing unit receives the image acquired by the image acquisition unit and the instruction sent by the central processing unit, performs preprocessing, feature extraction, feature fusion and identification operations on the image according to the instruction, determines whether the current image contains a pedestrian target, and feeds the result back to the central processing unit. Embedded devices are limited in computing power, so real-time performance cannot otherwise be guaranteed. For this reason, the data processing module adopts a lightweight convolutional neural network, which greatly reduces the parameter count and computational cost, and is provided with a linear bottleneck residual structure, which reduces the possibility of gradient vanishing during feature extraction and allows the network to extract richer semantic information. In the detection stage, multi-scale information and global average pooling serve as the judgment basis, making full use of pixel and semantic information, accommodating both large and small targets, and improving detection precision.
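The parameter savings of the depthwise separable convolutions used in such lightweight networks can be illustrated with a simple count. The 3 × 3, 32-to-64-channel example is hypothetical, and bias terms are ignored:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out
```

For k = 3, 32 input channels and 64 output channels, the standard convolution needs 18432 parameters while the separable version needs 2336, roughly an 8x reduction, which is the main source of the reduced parameter count and computational cost mentioned above.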
And the communication control unit, which is the medium of communication between the central processing unit and the peripheral unit and transmits the image data detection result fed back to the central processing unit on to the peripheral unit. The hardware of the communication unit may use wired communication modes such as twisted pair, USB and serial communication, or wireless transmission modes such as Bluetooth and local area network.
And the peripheral unit visualizes the current detection result in real time in a visual and text mode, and comprises the number of the pedestrian targets, the position in the current picture and confidence information. The peripheral unit mainly uses a display screen as main equipment and can also be combined with other types of sensors to act.
When the system is in operation, the power supply unit supplies power to the central processing unit, and other units are supplied with power by the central processing unit or independently supplied with power. Firstly, loading a trained neural network model into an image data processing unit for subsequent use. When the image acquisition unit receives the instruction of the central processing unit, the camera is opened to acquire image information, the image information is transmitted into the image data processing unit to be processed, and a processing result is fed back to the central processing unit. And after receiving the processing result, the central processing unit outputs the result to the user through the peripheral unit.
The present invention further provides an electronic device, as shown in fig. 7, which is a schematic structural diagram of an electronic device in the present application, and includes one or more processors 101 and a storage device 102; the processor 101 in the electronic device may be one or more, and fig. 7 illustrates one processor 101 as an example; storage 102 is used to store one or more programs; the one or more programs are executed by the one or more processors 101, so that the one or more processors 101 implement the embedded platform-based monocular pedestrian detection method according to any one of the embodiments of the present invention.
The electronic device may further include: an input device 103 and an output device 104. The processor 101, the storage device 102, the input device 103, and the output device 104 in the electronic apparatus may be connected by a bus 105 or other means, and fig. 7 illustrates an example in which the processor, the storage device 102, the input device 103, and the output device are connected by the bus 105.
The storage device 102 in the electronic device is used as a computer-readable storage medium for storing one or more programs, which may be software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the embedded platform-based monocular pedestrian detection method provided in the embodiment of the present invention. The processor 101 executes various functional applications and data processing of the electronic device by running software programs, instructions and modules stored in the storage device 102, that is, the embedded platform-based monocular pedestrian detection method in the above method embodiment is realized.
The storage device 102 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. In addition, the storage device 102 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage 102 may further include memory located remotely from the processor 101, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 103 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 104 may include a display device such as a display screen.
And when the one or more programs included in the electronic device are executed by the one or more processors 101, the programs perform the following operations:
s1, an image acquisition unit acquires a current color image by using a monocular camera;
s2, inputting image information obtained by an image acquisition unit;
s3, preprocessing image information, including cutting the image information into a fixed size without distortion and carrying out normalization correction;
s4, extracting features of different scales from the preprocessed image by using a convolutional neural network;
s5, fusing the features of different scales in the step S4 to generate a preselection frame;
s6, judging according to the fusion characteristics and outputting a detection result;
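The undistorted resize of step S3 can be sketched as a letterbox computation. The 300 × 300 target size is an assumption (consistent with the 38 × 38 and 19 × 19 feature maps mentioned in the claims) and is not stated explicitly in the source:

```python
def letterbox_dims(src_w, src_h, dst=300):
    """Undistorted resize: scale the image so its longer side equals dst
    while preserving the aspect ratio; the remainder is meant to be
    padded. Returns (new_w, new_h, pad_x, pad_y)."""
    scale = dst / max(src_w, src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    return new_w, new_h, (dst - new_w) // 2, (dst - new_h) // 2
```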
of course, it can be understood by those skilled in the art that when one or more programs included in the electronic device are executed by the one or more processors 101, the programs may also perform related operations in the method for detecting a monocular pedestrian based on an embedded platform provided in any embodiment of the present invention.
It should be further noted that the present invention also provides a computer-readable storage medium, where a computer program is stored, where the computer program can be executed by a computer processor, and implements the above-mentioned embodiment of the monocular pedestrian detection method based on an embedded platform. The computer program may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The principle of the invention is as follows: firstly, optimizing the feature extraction network to a certain extent according to the task type and the related theory. For deep convolutional neural networks, as the depth of the network increases, a part of detail information is lost by continuous down-sampling. Therefore, the convolution step size is modified for some of the feature extraction units, and the feature extraction units output feature maps with fixed size, and only the channels of the feature maps are increased or decreased. The purpose of doing so is firstly to make up the problems of resolution reduction and detail information loss caused by continuous down-sampling, and secondly to obtain richer semantic information. The fused features contain more pixel information and rich semantic information, and detection precision is improved.
Secondly, the basic units forming the feature extraction network are composed of convolution operations with different step sizes and depth separable convolution operations, and different optimization is carried out on the feature extraction basic units with different step sizes. A channel attention mechanism is added to the feature extraction basic unit with step size 1, and a Zeropadding operation is added to the feature extraction unit with step size 2.
Deep convolutional neural networks are prone to gradient vanishing and overfitting problems. The feature extraction network adopts a linear bottleneck residual error structure to make up the problem of gradient disappearance to a certain extent, and a certain linear relation also exists among a plurality of features which need to be fused. Therefore, qualitative and quantitative analysis is carried out on the selected characteristic parameters, and the L2-NORM regularization option is added to the characteristic extraction unit with larger parameter distribution difference, so that the distribution state of data of the characteristic extraction unit is improved, the generalization capability of the model is enhanced, and the condition that the detection performance of the model is reduced due to overfitting is reduced.
The loss function of the deep convolutional neural network influences the time required for model training, the model performance, and so on. According to the task requirements and the network structure, the system uses the classification loss of positive and negative samples plus a positive-sample position loss with a penalty term as the final network loss. First, the classification loss function and the position loss function are defined, and the classification loss and position loss of all positive labels are calculated. Then the number of positive and negative samples of each image in the batch is counted; if the number of negative samples over all images in the batch is 0, N (N > 1) candidate boxes are selected as negative samples by default. The selection criterion is a judgment on the prediction result: if a candidate box has a high probability of neither containing the target object nor belonging to the background, it is marked as a hard-to-classify sample; the probabilities of all non-background classes are summed, and a larger sum indicates a harder sample. Candidate boxes that do not contain the target object still remain in this process, and the N hardest-to-classify candidate boxes are selected as negative samples. The overall loss equals the classification loss of the positive samples, plus the position loss of the positive samples with the penalty term introduced, plus the classification loss of the negative samples; normalization then yields the final loss. This loss serves as the basis for back propagation to update and optimize the network parameters.
In the model training process, to reduce training time and overfitting, an early-stopping function is introduced: when the absolute value of the decrease in validation loss is less than 0.001, network training stops.
For the limited computing power of the embedded platform, the network model weights are optimized. Through a model conversion tool, the 32-bit floating-point weights in the model are converted into 16-bit floating-point and 8-bit integer weights respectively, which greatly reduces the model volume with little loss of precision and further improves real-time performance. Meanwhile, the user can freely select a network model according to the computing power of the embedded device, balancing the precision and real-time requirements and taking both precision and speed metrics into account.
Each unit of the system is independent, so that the user can conveniently and repeatedly upgrade according to task requirements, and meanwhile, a communication interface is reserved in the system and can be jointly operated with other related follow-up tasks. The method effectively solves the problems of low accuracy and large calculation amount of monocular pedestrian detection, and has the advantages of small model volume and calculation amount, low requirement on equipment calculation force, simple integral operation, clear detection result and the like.
The following describes the application of this embodiment in an actual scene: a marking tool for football players in a sports match. The tool consists of a camera and a tablet device. The camera is responsible for capturing footage of the football match, and the central processing unit and image data processing unit are integrated in the tablet device. The system operates as shown in fig. 8. The left side shows a segment of the football match, simulating an attack by a forward player, with assisting teammates and defending opponents around the attacking player. The camera captures and stores this process and uploads it to a server via cloud technology. The tablet device, with its integrated image data processing unit, can download the specified video footage from the server and process it; the processing result is shown on the right side of fig. 8. All players in the picture (except the goalkeepers) are labeled accordingly: the bold solid box marks the selected player, and the bold solid and dashed lines mark the selected player's teammates and opponents. Players can be freely selected as needed, whether on one's own side or the opposing side. The marking tool has the advantages of low requirements on device computing power, simple overall operation, clear detection results, small model volume, and a small amount of computation.
Any modifications, equivalents, improvements and the like made within the spirit and principles of the application will readily occur to those skilled in the art and are intended to be included within the scope of the claims of this application.

Claims (10)

1. A monocular pedestrian detection method based on an embedded platform is characterized in that: the method comprises the following steps:
s1, an image acquisition unit acquires a current color image by using a monocular camera;
s2, inputting image information obtained by an image acquisition unit;
s3, preprocessing image information, including cutting the image information into a fixed size without distortion and carrying out normalization correction;
s4, extracting features of different scales from the preprocessed image by using a convolutional neural network;
s5, fusing the features of different scales in the step S4 to generate a preselection frame;
s6, judging according to the fusion characteristics and outputting a detection result;
wherein S5 specifically is:
s51, selecting a plurality of the extracted features for fusion at 6 different scales, so that pedestrians are detected in both large and small images; a regression algorithm is applied directly to the 6 feature maps to obtain 6 groups of data, and the 6 groups of data jointly participate in the loss function calculation according to the predicted pedestrian confidence score and the coordinate information of the feature map corresponding to each group of data;
s52: and fusing the feature maps to generate a plurality of preselected frames, wherein the size of the preselected frames is carried out relative to the scaling of the original image, and the scaling formula is as follows:
$$R_k = R_{min} + \frac{R_{max} - R_{min}}{m - 1}\,(k - 1), \qquad k \in [1, m]$$
wherein R is max 、R min Respectively is preset 0.9 and 0.2; m is the number of feature maps used, and is 6; substituting the m value into a formula to obtain a scaling ratio, and multiplying the scaling ratio by the image size to obtain the size of a preselected frame; generating 4 preselected boxes using a length to width ratio of {1, 2,1/2} for feature maps of sizes 38 x 38, 3 x 3,1 x 1; generating 6 preselected frames for sizes 19 x 19, 10 x 10, 5 x 5 according to a length-to-width ratio of {1, 2,1/3,1 };
s53, filtering redundant candidate frames by using non-maximum suppression, wherein if the IoU between a preselected frame and a real frame is greater than 0.5, the preselected frame is matched with that real frame; if a preselected frame has an IoU greater than 0.5 with several of the remaining real frames, it is matched with the real frame of largest IoU, that is, one real frame can be matched with a plurality of preselected frames, while a preselected frame can be matched with only one real frame; if a preselected frame satisfies neither condition, it is classified as background, namely a negative sample; and the negative samples are sorted by confidence loss, with a fixed proportion of positive to negative samples participating in back propagation.
2. The monocular pedestrian detection method based on an embedded platform according to claim 1, wherein: the step S4 specifically includes:
s41, adjusting the undistorted size of the input image;
s42, processing the image with fixed size by using a plurality of characteristic extraction basic units;
s43, obtaining a plurality of characteristic graphs with different receptive fields and different scales.
3. The monocular pedestrian detection method based on an embedded platform according to claim 2, wherein: the feature extraction basic unit comprises depthwise separable convolution, max pooling and batch normalization layers, uses the ReLU6 function as the activation function to obtain nonlinearity and weight sparsity earlier, and also includes a regularization algorithm and a linear residual operation.
4. The monocular pedestrian detection method based on an embedded platform according to claim 1, wherein: and carrying out different optimization on the feature extraction basic units with different step lengths, adding a channel attention mechanism to the feature extraction basic unit with the step length of 1, and adding zero padding operation to the feature extraction unit with the step length of 2.
5. The method for detecting the monocular pedestrian according to claim 1, characterized in that: the step S6 specifically includes:
s61: comparing the preselection frame generated in the step S5 with a preset threshold, and reserving the preselection frame higher than the threshold, otherwise, considering the preselection frame as background information and discarding the preselection frame;
s62: calculating loss between the reserved pre-selection frame and the image label;
s63: the loss meets the termination condition, the training process is terminated, and the detection result is output; if the loss condition is not met, feeding the loss back to the step S4, and updating and optimizing network parameters;
s64: repeatedly executing the step S63 until the termination condition is met;
s65: step S64, finishing execution, outputting a detection result, confirming whether the task requirement is met by a user according to the detection result, and if so, reserving the model; if not, cleaning the image data or optimizing the network structure until the task requirements are met;
s66: and (6) obtaining a model with the weight type of 32-bit floating point number according to the step S65, further converting and optimizing the model, selecting a proper optimization means in the conversion process by means of a model conversion tool, converting the weight of the 32-bit floating point number into a 16-bit floating point number or an 8-bit integer, and freely selecting models with different weight types according to task requirements.
6. The monocular pedestrian detection method based on an embedded platform according to claim 5, wherein: the loss calculated in step S62 consists of two parts, one part being the confidence loss, i.e., the classification loss; the other part being the regression loss, i.e., the position loss;
wherein the classification loss calculation formula is as follows:

$$L_{conf}(x,c) = -\sum_{i \in Pos}^{S} x_{ij}^{p}\,\log\big(\hat{c}_{i}^{p}\big) - \sum_{i \in Neg}\log\big(\hat{c}_{i}^{0}\big), \qquad \hat{c}_{i}^{p} = \frac{\exp(c_{i}^{p})}{\sum_{p}\exp(c_{i}^{p})}$$

wherein i represents the serial number of the prediction box; j represents the serial number of the label; p is a category number, with p = 0 representing the background; $x_{ij}^{p}$ is an indicator that the ith prediction box matches the jth label for category p, taking only the values 0 and 1; $\hat{c}_{i}^{p}$ represents the probability that the ith prediction box predicts class p; Pos and Neg represent positive and negative samples; S represents the number of matched candidate frames;
the position loss is calculated as the Smooth L1 loss between the prediction box l and the label g, and the formula is as follows:

$$L_{loc}(l,g) = \sum_{i \in Pos}^{S}\;\sum_{m \in \{x,y,w,h\}} x_{ij}^{p}\,\mathrm{smooth}_{L1}\big(l_{i}^{m} - \hat{g}_{j}^{m}\big)$$

wherein {x, y, w, h} represent the x- and y-axis coordinates of the upper-left corner of the prediction box and the width and height of the prediction box, respectively; the total loss is:

$$L = \frac{1}{S}\big(L_{conf}(x,c) + \alpha\,L_{loc}(l,g)\big)$$

when S = 0, the loss is set to 0; α is set to 1 by cross validation.
7. The monocular pedestrian detection method based on an embedded platform according to claim 5, wherein: the TFLite model conversion tool of TensorFlow is used in step S66, and the optimization means are as follows: metadata is added at model conversion according to the network input, and operators are specified at conversion.
8. A monocular pedestrian detection system based on embedded platform, characterized in that includes:
the power supply unit provides power support for the whole system;
a central processing unit which controls other units and receives feedback of each unit;
the image acquisition unit acquires a current color image by using a monocular camera;
the image data processing unit receives the image acquired by the image acquisition unit and an instruction sent by the central processing unit, performs preprocessing, feature extraction, feature fusion and identification operations on the image according to the instruction, acquires whether the current image contains a pedestrian target result, and feeds back the result to the central processing unit;
the communication control unit is the medium for communication between the central processing unit and the peripheral unit, and forwards the detection result, which the image data processing unit feeds back to the central processing unit, on to the peripheral unit;
and the peripheral unit visualizes the current detection result in real time in graphical and textual form, including the number of pedestrian targets, their positions in the current picture, and confidence information.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the monocular pedestrian detection method based on an embedded platform according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202210643994.3A 2022-06-08 2022-06-08 Monocular pedestrian detection method, system, equipment and medium based on embedded platform Pending CN115147922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210643994.3A CN115147922A (en) 2022-06-08 2022-06-08 Monocular pedestrian detection method, system, equipment and medium based on embedded platform


Publications (1)

Publication Number Publication Date
CN115147922A true CN115147922A (en) 2022-10-04

Family

ID=83408875


Country Status (1)

Country Link
CN (1) CN115147922A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination