CN109506628A - Deep-learning-based target object ranging method in a truck environment - Google Patents

Deep-learning-based target object ranging method in a truck environment

Info

Publication number
CN109506628A
CN109506628A CN201811447469.4A
Authority
CN
China
Prior art keywords
image
target object
default
frame
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811447469.4A
Other languages
Chinese (zh)
Inventor
肖冬
单丰
王宝华
刘燨文
李雪娆
孙效玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201811447469.4A priority Critical patent/CN109506628A/en
Publication of CN109506628A publication Critical patent/CN109506628A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C11/00Photogrammetry or videogrammetry, e.g. stereogrammetry; Photographic surveying
    • G01C11/04Interpretation of pictures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a deep-learning-based method for measuring the distance to a target object in a truck environment, comprising the following steps: S1, acquiring image data of the target object; S2, preprocessing the acquired image data of the target object; S3, inputting the preprocessed image data into an SSD model, and obtaining the pixel position coordinates of the target object in the image through the processing of the SSD model; S4, calculating the distance between the target object and the image acquisition device by an image ranging method according to the identified pixel position coordinates of the target object in the image. The SSD model is a compression-improved model, and the target object image data are acquired under a top-view condition. The ranging method provided by the invention recognizes the objects a mining-area truck must identify quickly and with high efficiency, and its monocular ranging method measures distance fast.

Description

Target object ranging method in truck environment based on deep learning
Technical Field
The invention belongs to the technical field of computer vision and truck collision avoidance, and particularly relates to a deep-learning-based target object ranging method in a truck environment.
Background
With the progress of science and technology, a great deal of manual labor is gradually being taken over by computers. Machine vision can complete many tasks better and faster than humans: on the one hand, people tire during long working hours and cannot guarantee consistently high detection accuracy; on the other hand, the physiological limits of the human eye make further gains in speed and accuracy difficult. Modern industry and production therefore urgently need a new machine technology to stand in for human vision. Meanwhile, as computer technology, electromechanical control, intelligent detection and digital image processing have matured, the abstract capability of human vision has been combined with the high speed, high precision and high reliability of processors, and a new discipline, machine vision, has gradually formed. Machine vision is now used in many fields such as visual inspection, object recognition, automated quality inspection, process control, parameter measurement and automated assembly, and it draws on multiple disciplines including image processing, machine learning and pattern recognition. A machine-vision system feeds images captured by acquisition equipment such as cameras into specific algorithms, so that the machine imitates the human eye and brain to recognize objects, and certain parameters of those objects can be measured as well.
In an open-pit mine production system, a fleet of trucks operates in a network of electric shovels, unloading points and the bidirectional roads connecting them, transporting ore and rock between the shovels and the unloading points; this is a dynamic circulation system. Open-pit mining is trending toward large-scale equipment, and the discontinuous process combining large excavation devices with large trucks is increasingly common. Because of their height and width, these trucks have large blind zones and long braking distances, so their accident probability is high, and truck accidents cause severe loss of life and property as well as psychological harm. To overcome the blind zones created by the truck's size and high driving position, machine vision can be used to reconstruct the visual environment in front of and behind the truck in real time; this method alone, however, does not alert the driver to the presence of other objects such as pedestrians and auxiliary vehicles. To reduce truck accidents and ensure safety, target detection and distance measurement around large ore-carrying trucks therefore become important.
Those skilled in the art are therefore eager to overcome the above drawbacks.
Disclosure of Invention
Technical problem to be solved
In order to solve the above problems in the prior art, the present invention provides a deep-learning-based method for measuring the distance to a target object in a truck environment, which can calculate the distance from the target object to the truck and support functions such as truck collision avoidance.
(II) technical scheme
In order to achieve the above purpose, the main technical scheme adopted by the invention comprises the following steps:
a target object ranging method in a truck environment based on deep learning comprises the following steps:
s1, acquiring image data of the target object;
s2, preprocessing the acquired image data of the target object;
s3, inputting the preprocessed image data of the target object into an SSD model, and obtaining the pixel position coordinates of the target object in the image through the processing of the SSD model;
s4, calculating the distance between the target object and the image acquisition device by an image ranging method according to the pixel position coordinates of the identified target object in the image to obtain the distance between the target object and the image acquisition device;
the SSD model is a compression-improved model, and the target object image data are acquired under a top-view condition.
Preferably, the SSD model comprises eleven blocks:
wherein the first Block is two 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the second Block is two 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the third Block is three 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the fourth Block is three 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the fifth Block is three 3 × 3 convolutional layers and one 3 × 3 max pooling layer with step size 1; the sixth Block is a 3 × 3 convolutional layer; the seventh Block is a 1 × 1 convolutional layer; the eighth to eleventh Blocks are each a 1 × 1 convolutional layer followed by a 3 × 3 convolutional layer, and information is transferred between the Blocks through the convolutional and pooling layers.
Preferably, the SSD model is compressed by setting the number of output channels of each convolutional layer to one quarter of the corresponding output number in the initial SSD model.
Preferably, before the step S3, the method further includes training the compression-improved SSD model with target object images acquired under the top-view condition.
Preferably, training the SSD model after compression modification comprises the steps of:
a1, inputting the preprocessed target object image into the compressed SSD model to obtain a first feature map;
a2, calculating a first default box of the first feature map from the first feature map;
the scale calculation formula of the first default boxes is:

s_k = s_min + (s_max − s_min) / (m − 1) × (k − 1), k ∈ [1, m]

where m is the number of feature maps, s_min is the scale of the first default box of the bottommost feature map, and s_max is the scale of the first default box of the topmost feature map; s_min is set to 0.2 and s_max to 0.9;

finally, the width and height of each first default box are calculated according to w_k^a = s_k × √a and h_k^a = s_k / √a, where a is the aspect-ratio value of the first default box;
a3, judging whether to encode a predetermined reference frame according to the following condition (see the sketch after step a4):

the condition compares a first threshold with the Jaccard similarity computed between the first default frame and the reference frame, where the Jaccard similarity is the degree of overlap of two sets A and B, i.e. the ratio of the intersection of A and B to their union, A being the first default frame and B the reference frame;

if the Jaccard similarity is greater than the first threshold, the reference frame is encoded; the encoded reference frame includes: a position offset g = (cx, cy, w, h), a target score p ∈ [0, 1] and a label x ∈ {0, 1};

the encoded offsets are calculated as:

ĝ_cx = (g_cx − d_cx) / d_w, ĝ_cy = (g_cy − d_cy) / d_h, ĝ_w = log(g_w / d_w), ĝ_h = log(g_h / d_h)

where (cx, cy) denotes the center of the encoded reference frame, (w, h) denotes the width and height of the reference frame, and the subscript indices g and d denote the reference frame and the first default frame, respectively;
a4, performing a first activation-free convolution operation on the first feature map to obtain the four position offsets of each first default frame, which are used for the localization prediction of the target object;

performing a second activation-free convolution operation on the first feature map to obtain three category confidences, which are used for the category prediction of the target object;

and processing all category confidences with a softmax function to obtain the probability of the predicted category of the target object.
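As a concrete illustration of steps a2 and a3, the minimal Python sketch below computes the default-box scales and sizes and performs the Jaccard-based encoding of a reference frame. The function names and box conventions are illustrative assumptions, and the formulas are the standard SSD ones reconstructed above, not a quotation of the patent's code.

```python
import math
import numpy as np

def default_box_sizes(m, s_min=0.2, s_max=0.9, ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3)):
    """Per-feature-map scales s_k and (width, height) pairs for the first
    default boxes, following the reconstructed formulas above."""
    sizes = []
    for k in range(1, m + 1):
        s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1)
        sizes.append([(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios])
    return sizes

def jaccard(a, b):
    """Jaccard similarity (intersection over union) of two boxes in
    center form (cx, cy, w, h)."""
    ax0, ay0 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax1, ay1 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx1, by1 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def encode(default_box, ref_box, threshold=0.5):
    """Encode a reference frame against a first default frame when their
    Jaccard similarity exceeds the first threshold; returns the offsets
    (g_cx, g_cy, g_w, g_h) or None if the pair does not match."""
    if jaccard(default_box, ref_box) <= threshold:
        return None
    d_cx, d_cy, d_w, d_h = default_box
    g_cx, g_cy, g_w, g_h = ref_box
    return np.array([(g_cx - d_cx) / d_w,
                     (g_cy - d_cy) / d_h,
                     math.log(g_w / d_w),
                     math.log(g_h / d_h)])
```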
Preferably, the loss function of the compressed SSD model is:

L(x, c, l, g, p) = (1/N) [ L_score(x, c) + α · L_loc(x, l, g) + β · L_prior(p) ]

where α and β are weights and N is the number of matched anchor boxes; if N = 0, the loss is set to 0;

the position offset loss function is defined as the smooth-L1 loss between the predicted offsets l and the encoded offsets g:

L_loc(x, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} x_ij · smooth_L1(l_i^m − ĝ_j^m)

where l is the position of a target marked in the image and g is the encoded offset of the target box; the target score loss function is a multi-class softmax loss:

L_score(x, c) = −Σ_{i ∈ Pos} x_ij log(ĉ_i^p) − Σ_{i ∈ Neg} log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p);

the target prior loss function is calculated by binary cross entropy:

L_prior(p) = −Σ_i [ x_i log(p_i) + (1 − x_i) log(1 − p_i) ]
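A PyTorch-style sketch of how such a three-term loss could be assembled follows, assuming, as the text above states, a smooth-L1 position loss, a multi-class softmax score loss and a binary cross-entropy prior loss; the tensor layout and the helper name compressed_ssd_loss are assumptions, not the patent's code.

```python
import torch
import torch.nn.functional as F

def compressed_ssd_loss(loc_pred, loc_target, score_pred, score_target,
                        prior_pred, prior_target, num_matched,
                        alpha=1.0, beta=1.0):
    """Three-term loss sketch: multi-class softmax score loss, smooth-L1
    position-offset loss and binary cross-entropy target-prior loss,
    averaged over the number N of matched anchor boxes.
    loc_pred / loc_target: (N, 4) offsets for the matched anchors;
    score_pred: (M, n_classes) logits, score_target: (M,) class indices;
    prior_pred / prior_target: (M,) target-score logits and 0/1 labels."""
    if num_matched == 0:
        # if N is 0, the loss term is defined to be 0
        return score_pred.sum() * 0.0
    loc_loss = F.smooth_l1_loss(loc_pred, loc_target, reduction='sum')
    score_loss = F.cross_entropy(score_pred, score_target, reduction='sum')
    prior_loss = F.binary_cross_entropy_with_logits(
        prior_pred, prior_target.float(), reduction='sum')
    return (score_loss + alpha * loc_loss + beta * prior_loss) / num_matched
```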
Preferably, the step S3 further includes the following sub-steps:

s301, inputting the preprocessed target object image data into the SSD model at verification time, and obtaining a second feature map in the SSD model;

s302, calculating a second default frame of the second feature map;

s303, decoding the encoded reference frame, and fusing the second default frame with the reference frame to obtain a third default frame (a decoding sketch follows these sub-steps);

the fusion formulas are:

x_c = loc[0] × w_ref × scaling[0] + x_ref
y_c = loc[1] × h_ref × scaling[1] + y_ref
w = w_ref × e^(loc[2] × scaling[2])
h = h_ref × e^(loc[3] × scaling[3])

where x_c, y_c are the coordinates of the center point of the third default frame, w and h are its width and height, loc holds the four position offsets of the second default frame obtained by convolution, x_ref, y_ref, w_ref and h_ref are the four position offsets of the prediction frame of the preprocessed image information calculated according to the ratio value, and scaling is a default parameter;

s304, screening the third default frames, marking the position offset coordinates of the screened third default frames on the preprocessed target object image, and then outputting the target object image annotated with coordinate information;

s305, calculating the pixel position of the point of the target object closest to the image acquisition device from the coordinate information of the identified target object produced by the trained compressed SSD model.
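The following minimal NumPy sketch of the fusion (decoding) step in s303 transcribes the formulas above; the scaling default is an assumption analogous to the usual SSD prior-scaling values, and the names are illustrative.

```python
import numpy as np

def decode_boxes(loc, refs, scaling=(0.1, 0.1, 0.2, 0.2)):
    """Fuse second default frames with reference frames to obtain third
    default frames, following the fusion formulas above.
    loc:  (K, 4) position offsets predicted by convolution;
    refs: (K, 4) reference frames as (x_ref, y_ref, w_ref, h_ref)."""
    x_ref, y_ref, w_ref, h_ref = refs.T
    xc = loc[:, 0] * w_ref * scaling[0] + x_ref
    yc = loc[:, 1] * h_ref * scaling[1] + y_ref
    w = w_ref * np.exp(loc[:, 2] * scaling[2])
    h = h_ref * np.exp(loc[:, 3] * scaling[3])
    return np.stack([xc, yc, w, h], axis=1)   # (K, 4) third default frames
```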
Preferably, the image pre-processing comprises at least an image normalization process.
Preferably, the step S4 further includes: performing image ranging between the target object and the image acquisition device by a monocular ranging method using the pixel position of the point of the target object closest to the image acquisition device, the distance being calculated with the monocular ranging method under the top-view condition;

in the distance calculation formula, H is the camera height, O3M is the distance along the y axis of the world coordinate system between the camera and the world point corresponding to the image center, O1(ucenter, vcenter) is the image of the lens center point, P1(u, 0) and Q1(u, v) are the image coordinates of the pixel points to be measured, Q is the point to be measured in the world coordinate system and P is its projection on the y axis, f is the focal length of the camera, and x_pix and y_pix are the length and width of an actual pixel; the actual pixel length, actual pixel width and focal length are obtained by calibrating the camera.
Preferably, the step S4 further includes obtaining the included angle β between the horizontal plane and the line through the camera and the point P according to the distance calculation formula, and deriving the coordinates Q(X, Y) of the target object in the real world;

and annotating and displaying, on the target object image, the acquired position coordinates of the target object in the image and the distance between the target object and the camera.
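The distance formula itself did not survive extraction; the sketch below therefore reconstructs the usual top-view monocular geometry from the quantities defined above (H, O3M, f, x_pix, y_pix) and should be read as an assumption about the intended formula, not a quotation of it.

```python
import math

def monocular_distance(u, v, u_center, v_center, H, O3M, f, x_pix, y_pix):
    """Top-view monocular ranging sketch (reconstructed geometry).
    gamma: depression angle of the optical axis below the horizontal,
           fixed by the calibration point M at ground distance O3M;
    delta: extra down-angle of pixel row v relative to the image center.
    Returns (X, Y): assumed ground-plane coordinates of target point Q."""
    gamma = math.atan2(H, O3M)                      # optical-axis depression angle
    delta = math.atan2((v - v_center) * y_pix, f)   # per-pixel angular offset
    beta = gamma + delta                            # angle of the line camera -> P
    Y = H / math.tan(beta)                          # forward ground distance to P
    slant = H / math.sin(beta)                      # camera-to-P slant range
    X = (u - u_center) * x_pix * slant / f          # lateral offset of Q from P
    return X, Y
```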
(III) advantageous effects
The invention has the following beneficial effects. The invention provides a deep-learning-based target object ranging method in a truck environment, aimed at mine trucks in open-pit mines: each image acquisition device on a truck acquires image information of its corresponding area; an image processing center connected to all the image acquisition devices on the truck processes this image information in real time, determines the pixel position of the target to be identified in the acquired image, and calculates the distance between the target and the image acquisition device by an image ranging method from the identified pixel position. This multi-target ranging method in the truck environment processes image information in real time with a compressed SSD model to obtain the category and image position of the target, and measures the distance to the target with a monocular ranging method under the top-view condition. The model recognizes the targets a mining-area truck must identify quickly and with high efficiency, and the monocular ranging method allows the system to respond rapidly in dangerous situations.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for measuring distance of a target object in a truck environment based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a method for measuring distance of a target object in a truck environment based on deep learning according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a method for measuring distance of a target object in a truck environment based on deep learning according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an SSD model in a method for measuring distance of an object in a truck environment based on deep learning according to an embodiment of the present invention;
FIG. 5 is a geometric diagram illustrating a distance measurement calculation formula in a deep learning-based method for measuring distance of an object in a truck environment according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating a display result in a method for measuring distance of a target object in a truck environment based on deep learning according to an embodiment of the present invention.
Detailed Description
For the purpose of better explaining the present invention and facilitating understanding, the present invention is described in detail below through specific embodiments with reference to the accompanying drawings.
Example 1
As shown in fig. 1: the embodiment discloses a target object ranging method in a truck environment based on deep learning, which comprises the following steps:
and S1, acquiring the image data of the target object.
It should be noted that the target object image data in this step are acquired by an image acquisition device mounted on an ore-carrying mine truck, where the target object may be other mine trucks, people or other objects; in this embodiment the target object image data are image data of other ore-carrying trucks, and there may of course be several image acquisition devices on the truck.
And S2, preprocessing the acquired image data of the target object.
The preprocessing in this step mainly comprises: randomly cropping a part of the target object image, applying distortion, randomly flipping the image left-right, randomly distorting colors (such as brightness, saturation, chroma and contrast), normalizing the image, and scaling it to the 300 × 300 specification.
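A minimal sketch of this preprocessing pipeline is given below; the crop and color-distortion parameter ranges are illustrative assumptions, and OpenCV is used purely for convenience.

```python
import random
import numpy as np
import cv2

def preprocess(image, training=True):
    """Sketch of the preprocessing described above; parameter ranges are
    illustrative assumptions rather than the patent's exact values."""
    if training:
        h, w = image.shape[:2]
        # randomly crop a part of the target object image
        x0, y0 = random.randint(0, w // 4), random.randint(0, h // 4)
        image = image[y0: y0 + 3 * h // 4, x0: x0 + 3 * w // 4]
        # randomly flip the image left-right
        if random.random() < 0.5:
            image = cv2.flip(image, 1)
        # randomly distort colors (brightness / contrast as an example)
        contrast = random.uniform(0.8, 1.2)
        brightness = random.uniform(-20, 20)
        image = np.clip(contrast * image.astype(np.float32) + brightness, 0, 255)
    # scale to the 300 x 300 input specification
    image = cv2.resize(image.astype(np.float32), (300, 300))
    # image normalization
    return (image - image.mean()) / (image.std() + 1e-8)
```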
And S3, inputting the preprocessed image data of the object into the SSD model, and obtaining the pixel position coordinates of the object in the image through the processing of the SSD model.
And S4, calculating the distance between the target object and the image acquisition device by an image ranging method according to the pixel position coordinates of the identified target object in the image, and obtaining the distance between the target object and the image acquisition device.
The SSD model is a compression-improved model, and the target object image data are acquired by the image acquisition device under a top-view condition.
What should be explained about the SSD model described in this embodiment is that it includes eleven Blocks.
Wherein the first Block is two 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the second Block is two 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the third Block is three 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the fourth Block is three 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the fifth Block is three 3 × 3 convolutional layers and one 3 × 3 max pooling layer with step size 1; the sixth Block is a 3 × 3 convolutional layer; the seventh Block is a 1 × 1 convolutional layer; the eighth to eleventh Blocks are each a 1 × 1 convolutional layer followed by a 3 × 3 convolutional layer, and information is transferred between the Blocks through the convolutional and pooling layers.
The improvement of the SSD model in this implementation is: the number of output channels of each convolutional layer is set to one quarter of the corresponding output number in the initial SSD model.
It should be noted that in this embodiment, before the step S3, the method further includes training the compression-improved SSD model with the target object image data acquired under the top-view condition.
As shown in fig. 2: in this embodiment, training the SSD model after the compression improvement includes the following steps:
and A1, inputting the preprocessed target object image into the compressed SSD model to obtain a first characteristic diagram.
A2, calculating a first default box of the first feature map through the first feature map.
Wherein the scale calculation formula of the first default boxes is:

s_k = s_min + (s_max − s_min) / (m − 1) × (k − 1), k ∈ [1, m]

where m is the number of feature maps, s_min is the scale of the first default box of the bottommost feature map, and s_max is the scale of the first default box of the topmost feature map.

Finally, the width and height of each first default box are calculated according to w_k^a = s_k × √a and h_k^a = s_k / √a, where a is the first default box aspect-ratio value.
A3, whether to encode a predetermined reference frame is determined according to the following condition.

In detail, the condition compares a first threshold with the Jaccard similarity computed between the first default frame and the reference frame, where the Jaccard similarity is the degree of overlap of two sets A and B, i.e. the ratio of the intersection of A and B to their union, A being the first default frame and B the reference frame.

If the Jaccard similarity is greater than the first threshold, the reference frame is encoded; the encoded reference frame includes a position offset g = (cx, cy, w, h), a target score p ∈ [0, 1] and a label x ∈ {0, 1}, and the encoded offsets are calculated as:

ĝ_cx = (g_cx − d_cx) / d_w, ĝ_cy = (g_cy − d_cy) / d_h, ĝ_w = log(g_w / d_w), ĝ_h = log(g_h / d_h)

where (cx, cy) denotes the center of the encoded reference frame, (w, h) denotes its width and height, and the subscript indices g and d denote the reference frame and the first default frame, respectively.
A4, performing a first activation-free convolution operation on the first feature map to obtain the four position offsets of each first default frame, which are used for the localization prediction of the target.

Performing a second activation-free convolution operation on the first feature map to obtain three class confidences, which are used for the class prediction of the target.
The target localization prediction calculates the position of the target, and the target category prediction refers to calculating the type of the target.
And processing the confidence degrees of all the categories by adopting a softmax function to obtain the probability of the prediction category of the target object.
It should be noted that in this embodiment the loss function of the compressed SSD model is:

L(x, c, l, g, p) = (1/N) [ L_score(x, c) + α · L_loc(x, l, g) + β · L_prior(p) ]

where α and β are weights and N is the number of matched anchor boxes; if N = 0, the loss is set to 0.

The position offset loss function is defined as the smooth-L1 loss between the predicted offsets l and the encoded offsets g:

L_loc(x, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} x_ij · smooth_L1(l_i^m − ĝ_j^m)

where l is the position of a target marked in the image and g is the encoded offset of the target box. The target score loss function is a multi-class softmax loss:

L_score(x, c) = −Σ_{i ∈ Pos} x_ij log(ĉ_i^p) − Σ_{i ∈ Neg} log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p).

The target prior loss function is calculated by binary cross entropy:

L_prior(p) = −Σ_i [ x_i log(p_i) + (1 − x_i) log(1 − p_i) ]
as shown in fig. 3: in this embodiment, the step S3 further includes the following sub-steps:
S301, inputting the preprocessed target object image data into the SSD model, and obtaining a second feature map in the SSD model;

S302, calculating a second default frame of the second feature map;

S303, decoding the encoded reference frame, and fusing the second default frame with the reference frame to obtain a third default frame;

the fusion formulas are:

x_c = loc[0] × w_ref × scaling[0] + x_ref
y_c = loc[1] × h_ref × scaling[1] + y_ref
w = w_ref × e^(loc[2] × scaling[2])
h = h_ref × e^(loc[3] × scaling[3])

where x_c, y_c are the coordinates of the center point of the third default frame, w and h are its width and height, loc holds the four position offsets of the second default frame obtained by convolution, x_ref, y_ref, w_ref and h_ref are the four position offsets of the prediction frame of the preprocessed image information calculated according to the ratio value, and scaling is a default parameter.
S304, screening the third default frame, marking the position offset coordinate of the screened third default frame on the preprocessed target object image, and then outputting the target object image marked with coordinate information.
S305, calculating and obtaining the pixel position of the closest point of the target object to the image acquisition device according to the coordinate information of the identified target object obtained by the trained compressed SSD model.
It should be noted that the first feature maps are the six feature maps obtained during training from Block4, Block7, Block8, Block9, Block10 and Block11; "first feature map" is their collective name.

The second feature maps are, collectively, the six feature maps obtained during detection after training.

Each of the first feature maps is processed to obtain a number of default boxes, collectively called first default boxes; the first feature maps and first default boxes are used to find the category and position of the training target.

Each of the second feature maps likewise yields a number of default boxes, collectively called second default boxes; fusing the second default boxes with the reference boxes gives the third default boxes, which are used to predict the position of the target, while the second feature maps are used to predict the category.
Step S4 described in this embodiment further includes: performing image ranging between the target object and the image acquisition device by a monocular ranging method using the pixel position of the point of the target object closest to the image acquisition device. Because the camera on an ore-carrying truck is mounted high, the distance is calculated with the monocular ranging method under the top-view condition.

In the distance calculation formula, H is the camera height, O3M is the distance along the y axis of the world coordinate system between the camera and the world point corresponding to the image center, O1(ucenter, vcenter) is the image of the lens center point, P1(u, 0) and Q1(u, v) are the image coordinates of the pixel points to be measured, Q is the point to be measured in the world coordinate system and P is its projection on the y axis, f is the focal length of the camera, and x_pix and y_pix are the length and width of an actual pixel; the actual pixel length, actual pixel width and focal length are obtained by calibrating the camera.
In this embodiment, the step S4 further includes obtaining the included angle β between the horizontal plane and the line through the camera and the point P according to the distance calculation formula, and then deriving the coordinates Q(X, Y) of the target object in the real world.

The acquired position coordinates of the target object in the image and the distance between the target object and the camera are annotated and displayed on the target object image.
Example 2
The embodiment discloses a target object ranging method in a truck environment based on deep learning, which specifically comprises the following steps:
(1) For a truck transporting ore in an open pit, each image acquisition device on the truck acquires image information of its corresponding area.

(2) An image processing center connected to all the image acquisition devices on the truck processes the image information in real time and determines the pixel position of the target to be identified in the acquired image.

(3) The distance between the target and the image acquisition device is calculated by an image ranging method from the pixel position of the identified target in the image.
Wherein, the step (2) specifically comprises the following steps:

(2.1) preprocessing the image acquired by each image acquisition device to obtain a preprocessed image;

(2.2) training the compressed SSD model with images acquired under the top-view condition, and using the trained compressed SSD model to process the preprocessed image in real time;

(2.3) obtaining the pixel position of the identified target in the image from the SSD model output, and calculating the distance between the target and the image acquisition device by a monocular ranging method.
FIG. 4 is a schematic diagram of an SSD model according to an embodiment of the present invention; where the VGG16 itself comprises a 5-block structure.
Specifically, the SSD model includes eleven blocks (blocks):
the first Block is two 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size of 2; the second Block is two 3 x 3 convolutional layers and one 2 x 2 max pooling layer with step size of 2; the third Block is three 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size of 2; the fourth Block is three 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size of 2; the fifth Block is three 3 x 3 convolutional layers and one 3 x 3 max pooling layer with step size 1; the sixth Block is a3 × 3 convolutional layer; the seventh Block is a1 × 1 convolutional layer; the eighth to eleventh blocks are each a1 × 1 convolutional layer and a3 × 3 convolutional layer, and information is transferred between the blocks through the convolutional layer and the pooling layer.
The SSD model is compressed by setting the number of output channels of each convolutional layer to one quarter of the corresponding output number in the initial SSD model.
Since there are not many target categories to detect, extracting features with a small number of convolution kernels does not particularly affect the recognition result.
The structure of the improved SSD model is as follows:
Block 1: a grayscale image of size 300 × 300 is input; the original image is convolved twice with 16 convolution kernels of size 3 × 3 (step size 1, 'SAME' padding, i.e. zero padding keeps the input and output sizes equal, ReLU activation), then pooled with a 2 × 2 pooling kernel; 16 feature maps of 150 × 150 are output.

Block 2: the Block1 output is convolved twice with 32 convolution kernels of 3 × 3, pooled with a 2 × 2 pooling kernel of step size 2, and 32 feature maps of 75 × 75 are output.

Block 3: the Block2 output is convolved three times with 64 convolution kernels of 3 × 3, pooled with a 2 × 2 pooling kernel of step size 2, and 64 feature maps of 38 × 38 are output.

Block 4: the Block3 output is convolved three times with 128 convolution kernels of 3 × 3, pooled with a 2 × 2 pooling kernel of step size 2, and 128 feature maps of 19 × 19 are output.

Block 5: the Block4 output is convolved three times with 128 convolution kernels of size 3 × 3 and pooled with a 3 × 3 pooling kernel of step size 1.

Block 6: the Block5 output is convolved once with 256 convolution kernels of 3 × 3 with a dilation rate of 6, and 256 feature maps of 19 × 19 are output; during training this layer is followed by dropout with a rate of 0.5.

Block 7: the Block6 output is convolved once with 256 convolution kernels of 1 × 1, and 256 feature maps of 19 × 19 are output; during training this layer is also followed by dropout with a rate of 0.5.

Block 8: the Block7 output is convolved once with 64 convolution kernels of 1 × 1 and then once with 128 convolution kernels of 3 × 3 in 'VALID' padding mode (no new pixels are added around the input); 128 feature maps of 10 × 10 are output.

Block 9: the Block8 output is convolved once with 32 convolution kernels of 1 × 1 and then once with 64 convolution kernels of 3 × 3 in 'VALID' mode; 64 feature maps of 5 × 5 are output.

Block 10: the Block9 output is convolved once with 32 convolution kernels of 1 × 1 and then once with 64 convolution kernels of 3 × 3 in 'VALID' mode; 64 feature maps of 3 × 3 are output.

Block 11: the Block10 output is convolved once with 32 convolution kernels of 1 × 1 and then once with 64 convolution kernels of 3 × 3 in 'VALID' mode; 64 feature maps of 1 × 1 are output.
The compressed SSD model adopted in this embodiment adopts a multi-layer feature fusion method to perform feature fusion on feature maps output by Block4, Block7, Block8, Block9, Block10, and Block 11.
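For reference, a minimal PyTorch sketch of Blocks 1-11 with the quarter-width channel counts listed above: the input is assumed to be a grayscale 300 × 300 image, the detection heads and the feature fusion itself are omitted, and the padding choices for Blocks 8-11 are assumptions made to reproduce the stated feature-map sizes.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs, pool=True):
    """n_convs 3x3 conv+ReLU layers with 'SAME' padding, optionally followed
    by a 2x2 max pool of step 2 (ceil mode keeps 75 -> 38 as stated)."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2, stride=2, ceil_mode=True))
    return nn.Sequential(*layers)

class CompressedSSDBackbone(nn.Module):
    """Sketch of Blocks 1-11 with the quarter-width channel counts above."""
    def __init__(self):
        super().__init__()
        self.block1 = conv_block(1, 16, 2)       # 300 -> 150
        self.block2 = conv_block(16, 32, 2)      # 150 -> 75
        self.block3 = conv_block(32, 64, 3)      # 75 -> 38
        self.block4 = conv_block(64, 128, 3)     # 38 -> 19
        self.block5 = nn.Sequential(conv_block(128, 128, 3, pool=False),
                                    nn.MaxPool2d(3, stride=1, padding=1))  # stays 19
        self.block6 = nn.Sequential(nn.Conv2d(128, 256, 3, padding=6, dilation=6),
                                    nn.ReLU(inplace=True), nn.Dropout(0.5))
        self.block7 = nn.Sequential(nn.Conv2d(256, 256, 1),
                                    nn.ReLU(inplace=True), nn.Dropout(0.5))
        self.block8 = self._extra(256, 64, 128, stride=2)   # 19 -> 10
        self.block9 = self._extra(128, 32, 64, stride=2)    # 10 -> 5
        self.block10 = self._extra(64, 32, 64, stride=1)    # 5 -> 3 ('VALID')
        self.block11 = self._extra(64, 32, 64, stride=1)    # 3 -> 1 ('VALID')

    @staticmethod
    def _extra(in_ch, mid_ch, out_ch, stride):
        """1x1 conv followed by a 3x3 conv; padding only when striding."""
        pad = 1 if stride == 2 else 0
        return nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(inplace=True),
                             nn.Conv2d(mid_ch, out_ch, 3, stride=stride, padding=pad),
                             nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.block3(self.block2(self.block1(x)))
        f4 = self.block4(x)                              # 128 x 19 x 19
        f7 = self.block7(self.block6(self.block5(f4)))   # 256 x 19 x 19
        f8 = self.block8(f7)                             # 128 x 10 x 10
        f9 = self.block9(f8)                             #  64 x 5 x 5
        f10 = self.block10(f9)                           #  64 x 3 x 3
        f11 = self.block11(f10)                          #  64 x 1 x 1
        return f4, f7, f8, f9, f10, f11                  # the six fused feature maps
```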
Here, it should be noted that: the embodiment also provides a method for training the compressed SSD model, and the method for training the compressed SSD model comprises the following steps:
(1) Inputting the preprocessed training images into the compressed SSD model to obtain the first feature maps. Each first feature map contains a certain number of default bounding boxes (default boxes) with given scales and aspect ratios, and two different kinds of filters (one for positions, one for scores) are applied to each first feature map to predict the position offsets and target scores of the default boxes.
The training images described herein are a large amount of pre-acquired image data of the target object in the environment of the truck.
(2) Calculating a first default box of each first feature map.

First, the first feature map is divided into a grid of cells of size 1 × 1, and the centers of the first default boxes are determined within these cells.

The scale calculation formula of the first default boxes is:

s_k = s_min + (s_max − s_min) / (m − 1) × (k − 1), k ∈ [1, m]

where m is the number of feature maps, s_min is the scale of the first default box of the bottommost feature map, and s_max is the scale of the first default box of the topmost feature map.

The first default boxes take small scales (e.g., s_min = 0.1), which makes them better able to detect small targets.

Finally, the width and height of each first default box are calculated according to w_k^a = s_k × √a and h_k^a = s_k / √a, where a is the aspect-ratio value of the first default box.
(3) Determining whether to encode a predetermined reference frame according to the following condition.

After the first default frames are obtained, the predefined reference frames need to be encoded; in this way a reference frame can be converted into a form that can be fed to the compressed SSD model for training. The condition compares a first threshold with the Jaccard similarity computed between the first default box and the reference box, where the Jaccard similarity is the degree of overlap of two sets A and B (A being the first default box and B the reference box), i.e. the ratio of the intersection of A and B to their union.

If the Jaccard similarity is greater than the first threshold, the reference frame is encoded; the encoded reference frame includes a position offset g = (cx, cy, w, h), a target score p ∈ [0, 1] and a label x ∈ {0, 1}, and the encoded offsets are calculated as:

ĝ_cx = (g_cx − d_cx) / d_w, ĝ_cy = (g_cy − d_cy) / d_h, ĝ_w = log(g_w / d_w), ĝ_h = log(g_h / d_h)

where (cx, cy) represents the center of the encoded target box and (w, h) its width and height; the subscript indices g and d denote the reference box and the first default box, respectively.
(4) A first activation-free convolution operation is performed on the first feature map to obtain the four position offsets of each first default frame, which are used for the localization prediction of the target.

A second activation-free convolution operation is performed on the first feature map to obtain three class confidences, which are used for the class prediction of the target.

The class confidences are processed with a softmax function to obtain the probability of the predicted class of each first default frame.
The loss function of the compressed SSD model is:

L(x, c, l, g, p) = (1/N) [ L_score(x, c) + α · L_loc(x, l, g) + β · L_prior(p) ]

where α and β are weights and N is the number of matched anchor boxes; if N = 0, the loss is set to 0.

The position offset loss function is defined as the smooth-L1 loss between the predicted offsets l and the encoded offsets g:

L_loc(x, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} x_ij · smooth_L1(l_i^m − ĝ_j^m)

where l is the position of a target marked in the image and g is the encoded offset of the target box. The target score loss function is a multi-class softmax loss:

L_score(x, c) = −Σ_{i ∈ Pos} x_ij log(ĉ_i^p) − Σ_{i ∈ Neg} log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p).

The target prior loss function is calculated by binary cross entropy:

L_prior(p) = −Σ_i [ x_i log(p_i) + (1 − x_i) log(1 − p_i) ]
the parameters to be determined during the training process are: moving average update parameters, learning rate, etc., batch size, etc. The trained network parameters need to be saved for later use.
In this embodiment, the deep-learning-based multi-target detection and ranging method in a truck environment can measure the distance between a large truck and a target. Because images from the truck's viewpoint are scarce and quite similar to the PASCAL VOC 2012 data set, the model is trained on the person and vehicle target images from the PASCAL VOC 2012 data set, on data collected by surveillance cameras under top-view conditions, and on image data collected at the mine site. The training images of this example were collected at the Anqian mining area.
After the compressed SSD model is trained, the target recognition detection program can be run formally, which specifically includes the following steps:
(1) Loading the trained compressed SSD model; the preprocessed image information is processed by the trained SSD model to obtain the feature maps.
The preprocessing mainly comprises: randomly cropping a part of the image, applying distortion, randomly flipping the image left-right, randomly distorting colors (such as brightness, saturation, chroma and contrast), normalizing the image, and scaling it to the 300 × 300 specification.
It should be noted that the preprocessed image is processed by the lightweight SSD model to obtain the second feature maps output by its Blocks; in this embodiment the second feature maps of six Blocks are adopted: Block4 and Block7 (19 × 19 pixels), Block8 (10 × 10 pixels), Block9 (5 × 5 pixels), Block10 (3 × 3 pixels) and Block11 (1 × 1 pixel). These six second feature maps have different scales, and generating predictions from second feature maps of different scales ensures that the network can identify objects of different sizes.
(2) The second default boxes of the second feature maps are computed.

For each second feature map of each Block, the center of a second default box is identified at each pixel; the center of the second default box is set to:

( (i + 0.5) / l, (j + 0.5) / l ), i, j ∈ [0, l)

where l is the size of the feature map, and k second default boxes (k = 6) are generated with different sizes and aspect ratios. The size of each second default box is calculated as:

s_k = s_min + (s_max − s_min) / (m − 1) × (k − 1), k ∈ [1, m]

where m is the number of second feature maps, here m = 3; s_min is 0.10 and s_max is 0.5. The aspect ratio of each second default box is calculated from the ratio value a according to w_k^a = s_k × √a and h_k^a = s_k / √a.

Taking the second feature map of Block7 as an example, a = {1, 2, 0.5, 3, 1.0/3}. For the default box with ratio 1, an additional default box is added with width and height:

w = h = √(s_k × s_{k+1})

Finally, Block7 has 6 default boxes per feature-map cell.
A 3 × 3 × (k × 4) activation-free convolution operation is performed on each of the second feature maps to obtain the four offset positions of each second default frame for target localization prediction, where 3 × 3 is the convolution kernel size, k is the number of second default boxes at each pixel of the second feature map, and '4' stands for the four offset positions of the second default frame: the horizontal and vertical coordinates of the starting point, the width, and the height.

Then a 3 × 3 × (k × n) activation-free convolution operation is performed on each second feature map to obtain the three confidences used for target classification prediction, where 3 × 3 is the convolution kernel size, k is the default box count per pixel of each feature map, and n is the number of classes a default box may belong to, i.e. the types of targets to be recognized plus the background.

Taking the second feature map of Block7 as an example, its size is 19 × 19 and k = 6, so the final output is (19 × 19) × 6 × (3 + 4). The confidence outputs of the three categories are processed by a softmax function to obtain the probability of the predicted category of each second default frame (value range 0-1), and the second default frames are then fused with the reference frames to obtain the third default frames.
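A small PyTorch sketch of these two activation-free convolution heads follows; creating the (untrained) convolutions inside a function is purely illustrative. For the Block7 feature map, prediction_heads with k = 6 and n_classes = 3 reproduces the (19 × 19) × 6 × (3 + 4) output counted above.

```python
import torch
import torch.nn as nn

def prediction_heads(feature_map, k=6, n_classes=3):
    """Sketch of the two activation-free 3x3 convolution heads applied to a
    second feature map: a 3x3x(k*4) localization head and a 3x3x(k*n)
    classification head. feature_map: (batch, channels, H, W)."""
    channels = feature_map.shape[1]
    loc_head = nn.Conv2d(channels, k * 4, kernel_size=3, padding=1)          # no activation
    cls_head = nn.Conv2d(channels, k * n_classes, kernel_size=3, padding=1)  # no activation
    loc = loc_head(feature_map)   # (batch, k*4, H, W): four offsets per box
    cls = cls_head(feature_map)   # (batch, k*n, H, W): class scores per box
    # softmax over the class dimension gives per-box class probabilities
    cls = cls.view(cls.shape[0], k, n_classes, *cls.shape[2:]).softmax(dim=2)
    return loc, cls
```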
The fusion formulas are:

x_c = loc[0] × w_ref × scaling[0] + x_ref
y_c = loc[1] × h_ref × scaling[1] + y_ref
w = w_ref × e^(loc[2] × scaling[2])
h = h_ref × e^(loc[3] × scaling[3])

where x_c, y_c are the coordinates of the center point of the third default frame, w and h are its width and height, loc holds the four position offsets of the second default frame obtained by convolution, x_ref, y_ref, w_ref and h_ref are the four position offsets of the prediction frame of the preprocessed image information calculated according to the ratio, and scaling is a default parameter of value [0.1, 0.1, 0.2, 0.2].
The third default frames are then screened, the position offsets of the screened third default frames are marked on the preprocessed image information, and the annotated image information is output.

Screening proceeds as follows. For each third default frame, if the predicted probability that it belongs to some class is greater than a second threshold (taken as 0.5), the frame is retained and its class and prediction score are stored. Each qualifying third default frame is clipped, i.e. its intersection with the given reference frame is computed; the prediction category scores of the third default frames are then sorted in descending order and the n highest-scoring frames are kept, with n = 400. The remaining third default frames are screened again by non-maximum suppression: the Jaccard value of any two third default frames is computed; if the two frames predict different classes both are retained, otherwise only the frame with the higher prediction score is kept. The qualifying third default frames are finally fine-tuned against the given reference frame, and each is marked on the image information according to its four position offsets, together with its predicted category.
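A simplified Python sketch of this screening procedure is shown below; the NMS threshold value is an assumption, and the final fine-tuning against the reference frame is omitted for brevity.

```python
import numpy as np

def iou(a, b):
    """Jaccard overlap of two corner-form boxes (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def screen_boxes(boxes, scores, classes, second_threshold=0.5,
                 top_n=400, nms_threshold=0.45):
    """Keep third default frames whose class probability exceeds the second
    threshold, keep the top_n highest-scoring frames, then apply per-class
    non-maximum suppression as described above.
    boxes: (K, 4) corner-form boxes; scores, classes: (K,) arrays."""
    keep = scores > second_threshold
    boxes, scores, classes = boxes[keep], scores[keep], classes[keep]
    order = np.argsort(-scores)[:top_n]          # descending by prediction score
    boxes, scores, classes = boxes[order], scores[order], classes[order]
    selected = []
    for i in range(len(boxes)):
        ok = True
        for j in selected:
            # same class with high overlap: keep only the higher-scoring frame
            if classes[i] == classes[j] and iou(boxes[i], boxes[j]) > nms_threshold:
                ok = False
                break
        if ok:
            selected.append(i)
    return boxes[selected], scores[selected], classes[selected]
```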
In this embodiment, after the program loads the compressed SSD model and obtains the position coordinates of the target to be identified, the monocular ranging method is used to measure the distance to the target in the image.
Specifically, since the camera on an ore-carrying mine truck is mounted high, the distance is calculated with the monocular ranging method under the top-view condition, a schematic diagram of which is shown in fig. 5.

In the distance calculation formula, H is the camera height, O3M is the distance along the y axis of the world coordinate system between the camera and the world point corresponding to the image center, O1(ucenter, vcenter) is the image of the lens center point, P1(u, 0) and Q1(u, v) are the image coordinates of the pixel points to be measured, Q is the point to be measured in the world coordinate system and P is its projection on the y axis, f is the focal length of the camera, and x_pix and y_pix are the length and width of an actual pixel; the actual pixel length, actual pixel width and focal length are obtained by calibrating the camera.
The included angle β between the horizontal plane and the line through the camera and the point P is calculated from the distance calculation formula, after which the coordinates Q(X, Y) of the target object in the real world are derived.

It should be noted that the focal length f, the actual pixel length x_pix and the actual pixel width are obtained by calibrating the camera. The coordinates of point M are obtained by photographing a calibration board with the camera, locating the position of the image center with Photoshop software, and measuring the actual distance from that position to the camera.
Since distortion in the image formed by a wide-angle camera affects the ranging accuracy, the acquired images are also distortion-corrected.
After the positions of the targets relative to the camera are measured by monocular ranging, the category information and distance of each target are marked on the image and displayed in real time. The effect is shown in fig. 6.
The model runs target detection and ranging at about 40 frames per second, meeting the real-time requirement. The following table shows the results of ranging the same target with lidar and with image ranging.
The technical principles of the present invention have been described above in connection with specific embodiments. These descriptions are intended only to explain the principles of the invention and should not be construed as limiting its scope in any way. Based on the explanations herein, those skilled in the art can conceive of other embodiments of the present invention without inventive effort, and such embodiments shall fall within the scope of the present invention.

Claims (10)

1. A target object ranging method in a truck environment based on deep learning is characterized by comprising the following steps:
s1, acquiring image data of the target object;
s2, preprocessing the acquired image data of the target object;
s3, inputting the preprocessed image data of the target object into an SSD model, and obtaining the pixel position coordinates of the target object in the image through the processing of the SSD model;
s4, calculating the distance between the target object and the image acquisition device by an image ranging method according to the pixel position coordinates of the identified target object in the image to obtain the distance between the target object and the image acquisition device;
the SSD model is a compression-improved model, and the target object image data are acquired under a top-view condition.
2. The method of claim 1,
the SSD model includes eleven blocks:
wherein the first Block is two 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the second Block is two 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the third Block is three 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the fourth Block is three 3 × 3 convolutional layers and one 2 × 2 max pooling layer with step size 2; the fifth Block is three 3 × 3 convolutional layers and one 3 × 3 max pooling layer with step size 1; the sixth Block is a 3 × 3 convolutional layer; the seventh Block is a 1 × 1 convolutional layer; the eighth to eleventh Blocks are each a 1 × 1 convolutional layer followed by a 3 × 3 convolutional layer, and information is transferred between the Blocks through the convolutional and pooling layers.
3. The method of claim 2,
the SSD model is compressed by setting the number of output channels of each convolutional layer to one quarter of the corresponding output number in the initial SSD model.
4. The method of claim 1, wherein the step S3 is preceded by training the compression-improved SSD model with target object images acquired under the top-view condition.
5. The method of claim 4, wherein training the compression-refined SSD model comprises the steps of:
a1, inputting the preprocessed target object image into the compressed SSD model to obtain a first characteristic diagram;
a2, calculating a first default box of the first feature map through the first feature map;
the scale calculation formula of the first default boxes is:

s_k = s_min + (s_max − s_min) / (m − 1) × (k − 1), k ∈ [1, m]

where m is the number of feature maps, s_min is the scale of the first default box of the bottommost feature map, and s_max is the scale of the first default box of the topmost feature map; s_min is set to 0.2 and s_max to 0.9;

finally, the width and height of each first default box are calculated according to w_k^a = s_k × √a and h_k^a = s_k / √a, where a is the aspect-ratio value of the first default box;
a3, judging whether to encode the preset reference frame according to the following condition:

the condition compares a first threshold with the Jaccard similarity computed between the first default frame and the reference frame, where the Jaccard similarity is the degree of overlap of two sets A and B, i.e. the ratio of the intersection of A and B to their union, A being the first default frame and B the reference frame;

if the Jaccard similarity is greater than the first threshold, the reference frame is encoded; the encoded reference frame includes: a position offset g = (cx, cy, w, h), a target score p ∈ [0, 1] and a label x ∈ {0, 1};

the encoded offsets are calculated as:

ĝ_cx = (g_cx − d_cx) / d_w, ĝ_cy = (g_cy − d_cy) / d_h, ĝ_w = log(g_w / d_w), ĝ_h = log(g_h / d_h)

where (cx, cy) denotes the center of the encoded reference frame, (w, h) denotes its width and height, and the subscript indices g and d denote the reference frame and the first default frame, respectively;
a4, performing a first activation-free convolution operation on the first feature map to obtain the four position offsets of each first default frame, which are used for the localization prediction of the target object;

performing a second activation-free convolution operation on the first feature map to obtain three category confidences, which are used for the category prediction of the target object;

and processing all category confidences with a softmax function to obtain the probability of the predicted category of the target object.
6. The method of claim 5,
the loss function of the compressed SSD model is:

L(x, c, l, g, p) = (1/N) [ L_score(x, c) + α · L_loc(x, l, g) + β · L_prior(p) ]

where α and β are weights and N is the number of matched anchor boxes; if N = 0, the loss is set to 0;

the position offset loss function is defined as the smooth-L1 loss between the predicted offsets l and the encoded offsets g:

L_loc(x, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} x_ij · smooth_L1(l_i^m − ĝ_j^m)

where l is the position of a target marked in the image and g is the encoded offset of the target box;

the target score loss function is a multi-class softmax loss:

L_score(x, c) = −Σ_{i ∈ Pos} x_ij log(ĉ_i^p) − Σ_{i ∈ Neg} log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p);

the target prior loss function is calculated by binary cross entropy:

L_prior(p) = −Σ_i [ x_i log(p_i) + (1 − x_i) log(1 − p_i) ]
7. the method according to claim 6, wherein the step S3 further comprises the sub-steps of:
s301, inputting the preprocessed target object image data into the SSD model at verification time, and obtaining a second feature map in the SSD model;

s302, calculating a second default frame of the second feature map;

s303, decoding the encoded reference frame, and fusing the second default frame with the reference frame to obtain a third default frame;

the fusion formulas are:

x_c = loc[0] × w_ref × scaling[0] + x_ref
y_c = loc[1] × h_ref × scaling[1] + y_ref
w = w_ref × e^(loc[2] × scaling[2])
h = h_ref × e^(loc[3] × scaling[3])

where x_c, y_c are the coordinates of the center point of the third default frame, w and h are its width and height, loc holds the four position offsets of the second default frame obtained by convolution, x_ref, y_ref, w_ref and h_ref are the four position offsets of the prediction frame of the preprocessed image information calculated according to the ratio value, and scaling is a default parameter;
s304, screening the third default frame, marking the position offset coordinate of the screened third default frame on the preprocessed target object image, and then outputting the target object image marked with coordinate information;
s305, calculating and obtaining the pixel position of the closest point of the target object to the image acquisition device according to the coordinate information of the identified target object obtained by the trained compressed SSD model.
8. The method of claim 7,
the image preprocessing at least comprises image normalization processing.
9. The method of claim 8,
the step S4 further includes: performing image ranging between the target object and the image acquisition device by a monocular ranging method using the pixel position of the point of the target object closest to the image acquisition device, the distance being calculated with the monocular ranging method under the top-view condition;

in the distance calculation formula, H is the camera height, O3M is the distance along the y axis of the world coordinate system between the camera and the world point corresponding to the image center, O1(ucenter, vcenter) is the image of the lens center point, P1(u, 0) and Q1(u, v) are the image coordinates of the pixel points to be measured, Q is the point to be measured in the world coordinate system and P is its projection on the y axis, f is the focal length of the camera, and x_pix and y_pix are the length and width of an actual pixel; the actual pixel length, actual pixel width and focal length are obtained by calibrating the camera.
10. The method of claim 9,
the step S4 further comprises: obtaining, from the distance calculation formula, the included angle β between the horizontal plane and the straight line through the camera and the point P, and deriving therefrom the coordinates Q(X, Y) of the target object in the real world;
and annotating the target object image with the acquired position coordinates of the target object in the image and the distance of the target object from the camera.
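A minimal sketch of the annotation display step using OpenCV drawing calls; the box format and label text are assumptions:

import cv2

def annotate(image, box, distance_m):
    # box: (x_c, y_c, w, h) in pixels, center format, from the SSD output.
    x_c, y_c, w, h = box
    top_left = (int(x_c - w / 2), int(y_c - h / 2))
    bottom_right = (int(x_c + w / 2), int(y_c + h / 2))
    cv2.rectangle(image, top_left, bottom_right, (0, 255, 0), 2)
    cv2.putText(image, f"{distance_m:.1f} m", top_left,
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    return image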
CN201811447469.4A 2018-11-29 2018-11-29 Object distance measuring method under a kind of truck environment based on deep learning Pending CN109506628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811447469.4A CN109506628A (en) 2018-11-29 2018-11-29 Object distance measuring method under a kind of truck environment based on deep learning

Publications (1)

Publication Number Publication Date
CN109506628A true CN109506628A (en) 2019-03-22

Family

ID=65751355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811447469.4A Pending CN109506628A (en) 2018-11-29 2018-11-29 Object distance measuring method under a kind of truck environment based on deep learning

Country Status (1)

Country Link
CN (1) CN109506628A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107134144A (en) * 2017-04-27 2017-09-05 武汉理工大学 A kind of vehicle checking method for traffic monitoring
CN107563372A (en) * 2017-07-20 2018-01-09 济南中维世纪科技有限公司 A kind of license plate locating method based on deep learning SSD frameworks
CN108009526A (en) * 2017-12-25 2018-05-08 西北工业大学 A kind of vehicle identification and detection method based on convolutional neural networks
CN108256464A (en) * 2018-01-12 2018-07-06 适普远景遥感信息技术(北京)有限公司 High-resolution remote sensing image urban road extracting method based on deep learning
CN108596058A (en) * 2018-04-11 2018-09-28 西安电子科技大学 Running disorder object distance measuring method based on computer vision
CN108596886A (en) * 2018-04-17 2018-09-28 福州大学 Aerial Images insulator based on deep learning falls piece fault rapid detecting method
CN108764115A (en) * 2018-05-24 2018-11-06 东北大学 A kind of truck danger based reminding method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110060286B (en) * 2019-04-25 2023-05-23 东北大学 Monocular depth estimation method
CN110297232A (en) * 2019-05-24 2019-10-01 合刃科技(深圳)有限公司 Monocular distance measuring method, device and electronic equipment based on computer vision
CN110210452A (en) * 2019-06-14 2019-09-06 东北大学 It is a kind of based on improve tiny-yolov3 mine truck environment under object detection method
CN110471048B (en) * 2019-07-25 2022-04-15 南京信息工程大学 Adaptive variable-scale convolution kernel calculation method based on sonar three-dimensional image confidence
CN110471048A (en) * 2019-07-25 2019-11-19 南京信息工程大学 Adaptive mutative scale convolution kernel calculation method based on sonar 3-D image confidence level
CN110751633A (en) * 2019-10-11 2020-02-04 上海眼控科技股份有限公司 Multi-axis cart braking detection method, device and system based on deep learning
CN110853019A (en) * 2019-11-13 2020-02-28 西安工程大学 Method for detecting and identifying controlled cutter through security check
CN111780716A (en) * 2020-07-03 2020-10-16 杭州电子科技大学 Monocular real-time distance measurement method based on target pixel area and aspect ratio
CN114088061A (en) * 2021-02-24 2022-02-25 上海商汤临港智能科技有限公司 Target positioning method and device, electronic equipment and storage medium
CN114088061B (en) * 2021-02-24 2024-03-22 上海商汤临港智能科技有限公司 Target positioning method and device, electronic equipment and storage medium
CN114282706A (en) * 2021-11-18 2022-04-05 郑州宇通矿用装备有限公司 Surface mine operation method and system
CN114282706B (en) * 2021-11-18 2024-07-26 郑州宇通矿用装备有限公司 Surface mine operation method and system
CN114789440A (en) * 2022-04-22 2022-07-26 深圳市正浩创新科技股份有限公司 Target docking method, device, equipment and medium based on image recognition
CN114789440B (en) * 2022-04-22 2024-02-20 深圳市正浩创新科技股份有限公司 Target docking method, device, equipment and medium based on image recognition

Similar Documents

Publication Publication Date Title
CN109506628A (en) Object distance measuring method under a kind of truck environment based on deep learning
CN108009543B (en) License plate recognition method and device
US20210081698A1 (en) Systems and methods for physical object analysis
WO2021051601A1 (en) Method and system for selecting detection box using mask r-cnn, and electronic device and storage medium
CN113158738B (en) Port environment target detection method, system, terminal and readable storage medium based on attention mechanism
CN112163477B (en) Escalator pedestrian pose target detection method and system based on Faster R-CNN
CN110175504A (en) A kind of target detection and alignment schemes based on multitask concatenated convolutional network
Kortli et al. A novel illumination-invariant lane detection system
CN111553214B (en) Method and system for detecting smoking behavior of driver
CN108764115B (en) Truck danger reminding method
CN108830131B (en) Deep learning-based traffic target detection and ranging method
CN111738336A (en) Image detection method based on multi-scale feature fusion
CN111291603A (en) Lane line detection method, device, system and storage medium
KR101483742B1 (en) Lane Detection method for Advanced Vehicle
Wang et al. Monocular based road vehicle detection with feature fusion and cascaded Adaboost algorithm
CN115995056A (en) Automatic bridge disease identification method based on deep learning
CN112052782A (en) Around-looking-based parking space identification method, device, equipment and storage medium
CN111079675A (en) Driving behavior analysis method based on target detection and target tracking
CN116993970A (en) Oil and gas pipeline excavator occupation pressure detection method and system based on yolov5
JP2010136207A (en) System for detecting and displaying pedestrian
CN117611994A (en) Remote sensing image target detection method based on attention mechanism weighting feature fusion
CN112633179A (en) Farmer market aisle object occupying channel detection method based on video analysis
CN112001336A (en) Pedestrian boundary crossing alarm method, device, equipment and system
CN112634294A (en) Method for measuring boundary performance of semantic segmentation network
KR101865958B1 (en) Method and apparatus for recognizing speed limit signs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190322)