CN111353976A - Sand grain target detection method based on convolutional neural network - Google Patents

Sand grain target detection method based on convolutional neural network

Info

Publication number
CN111353976A
Authority
CN
China
Prior art keywords
network
image
sand
convolutional
sand grain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010114804.XA
Other languages
Chinese (zh)
Other versions
CN111353976B (en)
Inventor
Wang Cong (王聪)
Gu Qing (顾庆)
Jiang Zhiwei (蒋智威)
Hao Huizhen (郝慧珍)
Dong Xiaolong (董小龙)
Hu Xiumian (胡修棉)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010114804.XA
Publication of CN111353976A
Application granted
Publication of CN111353976B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Neural networks; architecture; combinations of networks
    • G06N 3/084: Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06T 7/13: Image analysis; segmentation; edge detection
    • G06T 2207/10004: Image acquisition modality; still image; photographic image
    • G06T 2207/20081: Special algorithmic details; training; learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a sand grain target detection method based on a convolutional neural network, which comprises the following steps: 1) designing a convolutional network structure by stacking convolution modules and residual modules, and adding a double-ended input structure and a multi-scale detection structure; 2) preprocessing the sand grain images and their annotations, and constructing a training data set from the annotated images; 3) training the convolutional network on the training data set, which includes defining the objective function and optimizing the training process; 4) applying the trained network to predict target positions in sand grain images. The invention makes full use of the features of both the single-polarization and orthogonal-polarization images and applies convolutional neural network technology, improving detection accuracy and efficiency. The network trains quickly, completes sand grain target detection rapidly, is suitable for automatic detection of large volumes of sand grain images, and has good extensibility, robustness, and practicality.

Description

Sand grain target detection method based on convolutional neural network
Technical Field
The invention belongs to the field of image detection and identification, and particularly relates to a sand target detection method based on a convolutional neural network.
Background
In the geological field, classifying and counting sand grains has long been an important part of sand research, and detecting sand grain targets in sand images is one of its most basic steps. Traditional classification and statistics first require the sand targets to be marked out in the image by hand, but owing to the particular nature of sand images, manual annotation is inefficient and poorly repeatable when facing large volumes of image data. Specifically, the contrast between target and background in a sand grain image is weak, so manually annotating large amounts of image data tends to yield low accuracy; moreover, annotating sand targets requires a degree of domain expertise, making it difficult to recruit large numbers of annotators for the task.
Object detection is a deep-learning technique for locating specific objects in an image. Training a deep network on a certain amount of annotated data makes the detection process automatic. In the sand grain target detection task, therefore, training the network on a certain amount of annotated sand image data enables automatic detection with high accuracy. A detection method built on object detection technology markedly reduces the manual annotation workload while preserving detection precision.
Current mainstream target detection methods based on deep learning fall into two categories. Two-stage detectors based on candidate regions divide the detection task into two subtasks, feature-box extraction and feature-box classification; they achieve high detection accuracy but at a high time cost. Single-stage detectors based on a single scan treat the detection task as a whole; their detection accuracy is lower than that of two-stage detectors, but so is their time cost. Both families can effectively detect targets in images of natural objects, but applied directly to the sand grain detection task they do not reach satisfactory detection accuracy.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention provides a sand grain target detection method based on a convolutional neural network (CNN). It uses a fully convolutional network with a residual structure, trains the network on an annotated data set, automatically extracts features from the single-polarization and orthogonal-polarization images with the trained network, and finally predicts the sand grain targets in the image from those features.
To achieve this purpose, the invention adopts the following technical scheme:
the sand grain target detection method based on a convolutional neural network disclosed by the invention comprises the following steps:
1) designing a convolutional network structure, adding a double-ended input structure and a multi-scale detection structure;
2) preprocessing the sand grain images and their annotations;
3) training the convolutional network;
4) predicting target positions in sand grain images with the trained network.
Further, the convolutional network structure in step 1) is formed by stacking convolution modules and residual modules.
Further, the convolution module in step 1) is formed by connecting a convolutional layer, a batch normalization (BN) layer, and an activation layer in series.
The convolution strategy of the convolutional layer falls into two kinds according to use: a non-dimensionality-reduction strategy, whose kernels come in the two sizes 1 × 1 and 3 × 3, and a dimensionality-reduction (downsampling) strategy, whose kernel size is 3 × 3.
For the non-dimensionality-reduction strategy with kernel size 1 × 1, the input X is of size C_in × W × H, the kernel K is of size C_out × C_in × 1 × 1, and the output O of the convolutional layer is of size C_out × W × H; at position [c, x, y], with 1 ≤ c ≤ C_out, 1 ≤ x ≤ W, 1 ≤ y ≤ H, its value is:

$$O[c,x,y]=\sum_{c'=1}^{C_{in}}K[c,c',1,1]\cdot X[c',x,y] \qquad (1)$$

For the non-dimensionality-reduction strategy with kernel size 3 × 3, the input X is of size C_in × (W+2) × (H+2) (the W × H input padded by one pixel on each side), the kernel K is of size C_out × C_in × 3 × 3, and the output O is of size C_out × W × H; at position [c, x, y], with 1 ≤ c ≤ C_out, 1 ≤ x ≤ W, 1 ≤ y ≤ H, its value is:

$$O[c,x,y]=\sum_{c'=1}^{C_{in}}\sum_{i=0}^{2}\sum_{j=0}^{2}K[c,c',i,j]\cdot X[c',x+i,y+j] \qquad (2)$$

For the dimensionality-reduction strategy with kernel size 3 × 3 and stride 2, the input is a tensor X of size C_in × (W+1) × (H+1) (W and H even, after zero padding), the kernel K is of size C_out × C_in × 3 × 3, and the output O is of size C_out × (W/2) × (H/2); at position [c, x, y], with 1 ≤ c ≤ C_out, 1 ≤ x ≤ W/2, 1 ≤ y ≤ H/2, its value is:

$$O[c,x,y]=\sum_{c'=1}^{C_{in}}\sum_{i=0}^{2}\sum_{j=0}^{2}K[c,c',i,j]\cdot X[c',2x-1+i,2y-1+j] \qquad (3)$$
The batch normalization layer keeps the input and output identically distributed under small-batch input, preventing the slow training convergence caused by output distribution drift as the number of network layers grows. For each hidden layer, in small-batch training with batch size m and activation inputs x^(k), 1 ≤ k ≤ m, the output of the batch normalization operation is:

$$\hat{x}^{(k)}=\frac{x^{(k)}-\mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}} \qquad (4)$$

where E[·] denotes the mathematical expectation and Var[·] the variance. To preserve the expressive capacity of the network, two learnable parameters γ and β are added to apply an inverse transformation to the normalized activation:

$$y^{(k)}=\gamma\,\hat{x}^{(k)}+\beta \qquad (5)$$
The activation layer adds nonlinearity to the network using the Leaky ReLU function, whose output has a small gradient for negative inputs; it is defined as:

$$y=\mathrm{LReLU}(x)=\begin{cases}x, & x\ge 0\\ kx, & x<0\end{cases} \qquad (6)$$

where LReLU(·) denotes the Leaky ReLU function, x is the input, y is the output, and k is the negative-side gradient.
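As a concrete illustration, a minimal PyTorch-style sketch of such a convolution module follows; the class name and the negative-side gradient k = 0.1 are illustrative assumptions, not values taken from the patent:

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution module of [1] in FIG. 4: convolution -> batch
    normalization -> Leaky ReLU, following formulas (1)-(6)."""
    def __init__(self, c_in, c_out, kernel_size, stride=1, negative_slope=0.1):
        super().__init__()
        # the 3x3 non-reduction strategy pads by 1 to keep W x H (formula (2));
        # the reduction strategy relies on the external zero padding layer
        padding = 1 if kernel_size == 3 and stride == 1 else 0
        self.conv = nn.Conv2d(c_in, c_out, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(c_out)          # formulas (4) and (5)
        self.act = nn.LeakyReLU(negative_slope)  # formula (6)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```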
Further, the residual module in step 1) is formed by cascading a zero padding layer, a convolution unit, and residual structure modules.
The zero padding layer enlarges the input to fit the input of the following layer: an input of size C_in × W × H is expanded to C_in × (W+1) × (H+1). The convolution unit uses the dimensionality-reduction kernel defined by formula (3). The residual structure module connects two non-dimensionality-reduction convolution units, with kernel sizes 1 × 1 and 3 × 3 respectively, defined by formulas (1) and (2), through a residual connection. The residual connection uses a shortcut mechanism to alleviate the vanishing-gradient problem that comes with increasing depth, making the neural network easier to optimize.
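The residual module can likewise be sketched in PyTorch. The channel halving inside the residual structure module and the block count are illustrative assumptions:

```python
import torch.nn as nn

def conv_unit(c_in, c_out, k, stride=1):
    """Convolution unit (conv -> BN -> Leaky ReLU), as sketched above."""
    pad = 1 if k == 3 and stride == 1 else 0
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride, pad, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class ResidualStructure(nn.Module):
    """Residual structure module: 1x1 then 3x3 non-reduction units
    (formulas (1) and (2)) joined by a shortcut connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            conv_unit(channels, channels // 2, k=1),  # assumed channel halving
            conv_unit(channels // 2, channels, k=3),
        )

    def forward(self, x):
        return x + self.body(x)  # shortcut eases optimization

class ResidualModule(nn.Module):
    """Residual module: zero padding -> stride-2 3x3 conv (formula (3)) ->
    a stack of residual structure modules."""
    def __init__(self, c_in, c_out, n_blocks=1):
        super().__init__()
        self.pad = nn.ZeroPad2d((1, 0, 1, 0))  # C x W x H -> C x (W+1) x (H+1)
        self.down = conv_unit(c_in, c_out, k=3, stride=2)
        self.blocks = nn.Sequential(*[ResidualStructure(c_out) for _ in range(n_blocks)])

    def forward(self, x):
        return self.blocks(self.down(self.pad(x)))
```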
Further, the double-ended input structure of the convolutional network in step 1) feeds the single-polarization image and the orthogonal-polarization image into the network simultaneously, so that during training the network learns the features of both images at once. The two ends of the double-ended input structure are structurally identical but do not share parameters.
After the input structure, the network merges the two input streams and generates three branches feeding the detection networks of the three scales (large, medium, and small); a sketch of this arrangement is given below.
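One way to realize the double-ended input is sketched here; merging the two streams by channel concatenation is an assumption, since the patent leaves the merge operation to the tables of the detailed description:

```python
import torch
import torch.nn as nn

class DualInputBackbone(nn.Module):
    """Double-ended input: two structurally identical sub-networks that do
    not share parameters, merged into one feature map from which the three
    scale branches are then taken."""
    def __init__(self, make_stream):
        super().__init__()
        self.single_pol = make_stream()  # single-polarization stream
        self.cross_pol = make_stream()   # orthogonal-polarization stream

    def forward(self, x_single, x_cross):
        # assumed merge: channel concatenation of the two feature maps
        return torch.cat([self.single_pol(x_single), self.cross_pol(x_cross)], dim=1)

# example: both streams built from the modules sketched above
# backbone = DualInputBackbone(lambda: nn.Sequential(
#     conv_unit(3, 32, k=3), ResidualModule(32, 64)))
```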
Further, the multi-scale detection structure in step 1) detects sand grain targets in the image at different scales. The multi-scale detection network uses three scales, large, medium, and small, and contains 5 modules: a large-scale detection structure, a large-to-medium-scale branch structure, a medium-scale detection structure, a medium-to-small-scale branch structure, and a small-scale detection structure.
The number of channels of the tensor output at each of the three scales of the network is B × 5, where B is the number of target boxes in the original image that each vector in the tensor maps to, and 5 covers the confidence of the target box plus its offsets in the x direction, the y direction, the width, and the height relative to the corresponding preset anchor box. To obtain accurate target box information, the position and confidence entries of the output tensor are transformed. Let the position offsets of a target box be Δx, Δy, Δw, Δh, where Δx and Δy are the offsets of the box center, and let the predicted confidence be c_o. The transformed prediction is computed as:

$$x=[\mathrm{sigmoid}(\Delta x)+g_x]\cdot s$$
$$y=[\mathrm{sigmoid}(\Delta y)+g_y]\cdot s$$
$$w=p_w\cdot e^{\Delta w}\cdot s$$
$$h=p_h\cdot e^{\Delta h}\cdot s$$
$$c=\mathrm{sigmoid}(c_o)$$

where sigmoid(·) denotes the sigmoid function; g_x and g_y are the coordinates of the grid cell containing the box center; s is the scale factor (8 for the small scale, 16 for the medium scale, 32 for the large scale); and p_w and p_h are the width and height of the anchor box.
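Applied to one output tensor, the transform above can be sketched as follows (NumPy, with an assumed tensor layout of (S, S, B, 5)):

```python
import numpy as np

def decode_predictions(raw, anchors, stride):
    """Decode one output tensor of shape (S, S, B, 5) into absolute boxes.

    Implements the transform above: raw entries are (dx, dy, dw, dh, c_o);
    `anchors` is a (B, 2) array of anchor width/height in grid units and
    `stride` is the scale factor s (8, 16 or 32)."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    S = raw.shape[0]
    gy, gx = np.meshgrid(np.arange(S), np.arange(S), indexing="ij")
    grid = np.stack([gx, gy], axis=-1)[:, :, None, :]  # (S, S, 1, 2) = (g_x, g_y)

    xy = (sigmoid(raw[..., 0:2]) + grid) * stride      # box centers in pixels
    wh = anchors * np.exp(raw[..., 2:4]) * stride      # box sizes in pixels
    conf = sigmoid(raw[..., 4:5])                      # confidence in [0, 1]
    return np.concatenate([xy, wh, conf], axis=-1)
```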
Further, step 2) specifically comprises: constructing a training data set from the annotated sand grain images, consisting of two parts, an image part and an annotation part. The image part is a set of paired single-polarization and orthogonal-polarization images; the annotation part gives the position of each target sand grain in each image.
Further, the preprocessing of the sand grain images and annotations in step 2) specifically comprises (a sketch follows the list below):
preprocessing a sand image:
21) downsampling or upsampling the image so that its longer side matches the input size of the convolutional network;
22) extending the shorter side to the network input size, filling the extended pixels with the value 125;
23) normalizing the image;
preprocessing the annotation data:
24) converting the position information of the annotations according to the scaling applied to the image data;
25) comparing the intersection-over-union of each position entry with the anchor boxes, mapping entries above a given threshold into a tensor of the same size as the output tensor, and storing the remaining entries separately;
26) obtaining tensors at the three scales, consistent with the output tensors of the convolutional network.
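A minimal sketch of the image side of this preprocessing (steps 21–23); the 608 × 608 input size is the example used later in the description, and the nearest-neighbour resize stands in for whatever interpolation a real pipeline would use:

```python
import numpy as np

def preprocess_image(img, input_size=608, pad_value=125):
    """Letterbox-style preprocessing: resize so the longer side matches the
    network input, pad the shorter side with value 125, then normalize.
    `img` is an H x W x C uint8 array; also returns the scale for step 24."""
    h, w = img.shape[:2]
    scale = input_size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbour resize via index maps
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    canvas = np.full((input_size, input_size, img.shape[2]), pad_value, dtype=img.dtype)
    canvas[:new_h, :new_w] = resized                 # extend the shorter edge
    return canvas.astype(np.float32) / 255.0, scale  # normalized image + scale
```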
Further, step 3) specifically comprises: training the convolutional network on the training data set, which includes defining the objective function and optimizing the training process.
Further, the loss function of the convolutional network in step 3) consists of two parts, a bounding-box loss L_box and a confidence loss L_conf:

$$L=L_{box}+L_{conf} \qquad (7)$$

Each part accounts simultaneously for the losses of the different prediction boxes at the different scales:

$$L_{*}=\sum_{l=1}^{3}\sum_{i=1}^{S_l}\sum_{j=1}^{S_l}\sum_{b=1}^{B}\ell_{*}(l,i,j,b) \qquad (8)$$

where S_l, 1 ≤ l ≤ 3, is the output size at the l-th scale.

The bounding-box loss uses the G-IoU loss. The G-IoU function between two bounding boxes A and B, where A is the prediction box and B is the annotated (true) box, is:

$$\mathrm{GIoU}(A,B)=\mathrm{IoU}(A,B)-\frac{|C\setminus(A\cup B)|}{|C|} \qquad (9)$$

where C is the smallest convex region enclosing A and B, and IoU(A, B) is the intersection-over-union of A and B, defined as:

$$\mathrm{IoU}(A,B)=\frac{|A\cap B|}{|A\cup B|} \qquad (10)$$

Based on the above, the bounding-box loss is:

$$L_{box}=\sum_{l=1}^{3}\sum_{i,j=1}^{S_l}\sum_{b=1}^{B}\mathbb{1}_{l,i,j,b}\cdot C_{l,i,j}\cdot\left(2-\frac{w\cdot h}{W\cdot H}\right)\cdot\bigl(1-\mathrm{GIoU}(o,\hat{o})\bigr) \qquad (11)$$

where C_{l,i,j} is the true confidence of the grid cell; the indicator

$$\mathbb{1}_{l,i,j,b}=\begin{cases}1, & \mathrm{IoU}(o,\hat{o})\ge 0.3\\ 0, & \text{otherwise}\end{cases} \qquad (12)$$

states whether the predicted bounding box is valid: a box with IoU ≥ 0.3 against the true bounding box is considered valid; W and H are the width and height of the input image; o is the true bounding box, w and h its width and height; and ô is the predicted bounding box.

The confidence loss uses the cross-entropy loss with sigmoid and logit:

$$L_{conf}=\sum_{l=1}^{3}\sum_{i,j=1}^{S_l}\sum_{b=1}^{B}\mathrm{CEL}(\hat{C}_{l,i,j},C_{l,i,j}) \qquad (13)$$

where Ĉ_{l,i,j} is the predicted confidence and CEL(·) is the cross-entropy loss with sigmoid and logit:

$$\mathrm{CEL}(X,Y)=-\bigl[\mathrm{logit}(Y)\cdot\log(\sigma(X))+(1-\mathrm{logit}(Y))\cdot\log(1-\sigma(X))\bigr] \qquad (14)$$

where the logit function is:

$$\mathrm{logit}(y)=\log\frac{y}{1-y} \qquad (15)$$
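The G-IoU term at the heart of the bounding-box loss can be computed as below; a plain-Python sketch for axis-aligned boxes assumed to be non-degenerate:

```python
def giou(box_a, box_b):
    """G-IoU between (x1, y1, x2, y2) boxes, formulas (9) and (10)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing (convex) box C
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou - (area_c - union) / area_c

def giou_loss(pred_box, true_box):
    """Per-box term 1 - GIoU used inside formula (11)."""
    return 1.0 - giou(pred_box, true_box)
```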
further, the optimizer used in the training phase of the convolutional network in the step 3) is an Adam optimizer; the learning rate scheduler adopts Linear cosine scheduling, and the learning rate lr (i) of the training stage i is:
Figure BDA0002391155310000054
wherein, W is the number of training stages of the linear stage; lr of0Setting the basic learning rate as 1 e-6; lr ofmaxThe maximum learning rate is set to 1 e-3.
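A sketch of this schedule follows; the warm-up length W = 5 and total T = 100 are illustrative assumptions, as the patent fixes only lr_0 and lr_max:

```python
import math

def linear_cosine_lr(i, warmup=5, total=100, lr0=1e-6, lr_max=1e-3):
    """Linear warm-up followed by cosine decay, formula (16)."""
    if i <= warmup:
        return lr0 + (lr_max - lr0) * i / warmup  # linear stage
    progress = (i - warmup) / (total - warmup)
    return lr0 + 0.5 * (lr_max - lr0) * (1 + math.cos(math.pi * progress))
```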
Further, the process of predicting target positions in a sand grain image in step 4) is: preprocess the sand grain image, input the pair of orthogonal-polarization and single-polarization sand grain images into the trained convolutional network to obtain the prediction tensors at the three scales, set a confidence threshold of 0.3 for all prediction boxes, and delete the boxes whose confidence falls below the threshold; finally, remove redundant prediction boxes with non-maximum suppression, output the remaining boxes as the prediction result, and mark the result on the original sand grain image.
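The post-processing can be sketched as below; the NMS IoU threshold of 0.5 is an assumed value, since the patent specifies only the confidence threshold of 0.3:

```python
import numpy as np

def iou_xyxy(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, formula (10)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def postprocess(boxes, scores, conf_thresh=0.3, nms_iou=0.5):
    """Drop boxes below the confidence threshold, then remove redundant
    boxes with non-maximum suppression; `boxes` is an (N, 4) array."""
    keep = scores >= conf_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)          # highest confidence first
    selected = []
    while order.size > 0:
        best = order[0]
        selected.append(best)
        rest = order[1:]
        ious = np.array([iou_xyxy(boxes[best], boxes[r]) for r in rest])
        order = rest[ious < nms_iou]     # suppress overlapping boxes
    return boxes[selected], scores[selected]
```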
The invention has the following beneficial effects:
the method prevents the small feature differences between different kinds of sand grains from degrading detection accuracy, makes full use of the features of both the single-polarization and orthogonal-polarization images, simplifies the network structure, and improves detection accuracy and efficiency. The network trains quickly, completes sand grain target detection rapidly, is suitable for automatic detection of large volumes of sand images, and has good extensibility, robustness, and practicality.
Drawings
FIG. 1 is an overall block diagram of the process of the present invention;
FIG. 2 is a block diagram of a sand image pre-processing process;
FIG. 3a is a schematic diagram of cross-polarization input;
FIG. 3b is a schematic diagram of a single polarization input;
FIG. 3c is a graph showing the results of cross polarization prediction;
FIG. 3d is a graph showing the single polarization prediction result;
fig. 4 is a structural diagram of a convolutional network designed by the present invention.
Detailed Description
To help those skilled in the art understand the invention, it is further described below with reference to the following examples and drawings, which do not limit the invention.
The invention discloses a sand grain target detection method based on a convolutional neural network which uses a trained CNN to automatically extract features from the paired single-polarization and orthogonal-polarization images of the sand grains and predicts the sand grain targets in the image from these features. Unlike traditional object detection methods, the method drops the classification part of the traditional detection framework and concentrates on predicting target positions; it inputs the orthogonal-polarization and single-polarization images together so as to fully exploit the features of both.
Referring to fig. 1, the invention proceeds in the following steps:
1) designing the convolutional network structure;
2) preprocessing the sand grain images and annotations; a training data set is constructed from the annotated sand images, consisting of an image part, a set of paired single-polarization and orthogonal-polarization images, and an annotation part, the position of each target sand grain in each image;
3) training the convolutional network on the training data set, including defining the objective function and optimizing the training process;
4) predicting target positions in sand grain images with the trained network.
The network structure of step 1) is shown in fig. 4. The network is a fully convolutional network with a residual structure, formed by stacking convolution units and residual modules, and includes a double-ended input structure and a multi-scale detection structure.
As shown at [1] in fig. 4, the convolution module of the network is formed by connecting a convolution (conv) layer, a batch normalization (BN) layer, and a Leaky ReLU activation layer in series.
The convolution strategy of the convolutional layer follows the dimensionality-reduction and non-dimensionality-reduction strategies defined by formulas (1)–(3) above.
The batch normalization layer keeps input and output identically distributed under small-batch input, preventing the slow training convergence caused by output distribution drift as network depth grows. For each hidden layer, the input distribution, which is gradually pushed toward the saturated ends of the nonlinear function's range, is pulled back to an approximately standard normal distribution with mean 0 and variance 1, so that the input to the nonlinear transformation lies in a region sensitive to its input and the vanishing-gradient problem is avoided. For a hidden layer trained in small batches of size m with activation inputs x^(k), 1 ≤ k ≤ m, the batch-normalized output is given by formula (4). Converting the activation input into the linear region of the nonlinear transformation strengthens the flow of back-propagated information and accelerates training convergence, but it also reduces the expressive capacity of the network; to prevent this, the two learnable parameters γ and β of formula (5) apply an inverse transformation to the normalized activation.
The activation layer uses the Leaky ReLU function of formula (6) to add nonlinearity to the network; compared with the plain ReLU, its output has a small gradient k for negative inputs.
As shown at [3] in fig. 4, a residual unit of the network is formed by cascading a zero padding (ZP) layer, a convolution unit, and several residual structure modules. The zero padding layer enlarges the input to fit the following layer: an input of size C_in × W × H is expanded to C_in × (W+1) × (H+1). The convolution unit uses the dimensionality-reduction kernel defined by formula (3). A residual structure module, shown at [2] in fig. 4, connects two non-dimensionality-reduction convolution units with 1 × 1 and 3 × 3 kernels, defined by formulas (1) and (2), through a residual connection. The residual connection uses a shortcut mechanism to alleviate the vanishing-gradient problem that comes with increasing depth: an identity mapping establishes a direct channel between input and output, letting the network concentrate on learning the residual between them and making it easier to optimize.
The double-ended input structure of the network feeds the single-polarization image and the orthogonal-polarization image into the network simultaneously, so that the network learns the features of both during training; in the test phase, the network can then predict sand grain positions in the image more accurately from the pair of images. The two ends of the structure are identical but do not share parameters. One possible input sub-network structure is shown at [4] in fig. 4 and in Table 1.
TABLE 1 [the table content is available only as an image in the original publication]
After the input structure, the network merges the two input streams and generates three branches feeding the detection networks of the large, medium, and small scales respectively. One possible merging and multi-scale generating structure is shown at [5] in fig. 4 and in Table 2.
TABLE 2 [the table content is available only as an image in the original publication]
The multi-scale detection structure of the network detects sand grain targets in the image at different scales. In particular, small targets are detected better at a smaller scale, and large targets at a larger scale. In general, the more detection scales the multi-scale structure uses, the higher the model's detection accuracy, but the longer the detection may take. To balance detection time against accuracy, the network of this embodiment uses a three-scale (large, medium, small) detection structure. The multi-scale detection network uses the FPN structure; one possible three-scale detection network is shown at [6] to [10] in fig. 4. Specifically, it contains 5 modules: a large-scale detection structure, a large-to-medium-scale branch structure, a medium-scale detection structure, a medium-to-small-scale branch structure, and a small-scale detection structure.
The large-scale detection structure is shown at [6] in fig. 4 and in Table 3.
TABLE 3 [the table content is available only as an image in the original publication]
The large-to-medium-scale branch structure is shown at [7] in fig. 4 and in Table 4.
TABLE 4 [the table content is available only as an image in the original publication]
The medium-scale detection structure is shown at [8] in fig. 4 and in Table 5.
TABLE 5 [the table content is available only as an image in the original publication]
The medium-to-small-scale branch structure is shown at [9] in fig. 4 and in Table 6.
TABLE 6 [the table content is available only as an image in the original publication]
The small-scale detection structure is shown at [10] in fig. 4 and in Table 7.
TABLE 7 [the table content is available only as an image in the original publication]
The number of channels of the tensor output at each of the three scales is B × 5, where B is the number of target boxes (regions of interest) in the original image predicted by each vector in the tensor, and 5 covers the predicted box confidence plus the offsets in the x direction, the y direction, the width, and the height relative to the corresponding anchor box.
Assuming an input image of size 608 × 608 and B = 3, the CNN outputs 3 tensors: a small-scale tensor T_s of size 76 × 76 × 15, a medium-scale tensor T_m of size 38 × 38 × 15, and a large-scale tensor T_l of size 19 × 19 × 15.
The position and confidence entries of each output tensor are then transformed: given the position offsets Δx, Δy, Δw, Δh of a target box, where Δx and Δy are the offsets of the box center, and the predicted confidence c_o, the transformed prediction is computed by the formulas given above in the disclosure.
After the output tensors of the CNN are decoded in this way, the data in them directly represent the target box information predicted by the network.
The training phase first preprocesses the data set, which contains two parts: an image part, a set of paired single-polarization and orthogonal-polarization images, and an annotation part, the target positions in the images. The purpose of preprocessing is to match the image data to the input of the CNN and the annotation data to its output.
The preprocessing of the image data in step 2) is shown in fig. 2. First, the image is downsampled or upsampled so that its longer side matches the CNN input size; second, the shorter side is extended to the input size, the extended pixels generally taking the value 125; finally, the image is normalized. Compared with the traditional scheme of resampling both sides directly to the CNN input size, this preserves the aspect ratio of the targets in the image and prevents the network from learning distorted target information.
The preprocessing of the annotation data proceeds as follows: first, the position information of the annotations is converted according to the scaling applied to the image data; second, the intersection-over-union of each position entry with the anchor boxes is compared, entries above a given threshold are mapped into a tensor of the same size as the output tensor, and the remaining entries are stored separately; finally, tensors at the three scales, matching the sizes of the CNN output tensors, are obtained.
The loss function of the convolutional network in step 3) consists of the bounding-box loss and the confidence loss as defined in formulas (7)–(15) above. The optimizer used in the training phase is the Adam optimizer, and the learning-rate scheduler uses the linear-cosine schedule of formula (16), with the base learning rate lr_0 set to 1e-6 and the maximum learning rate lr_max set to 1e-3.
The specific detection process of step 4) is: first, preprocess the sand grain image; then input the pair of orthogonal-polarization and single-polarization images to be detected into the network to obtain the three-scale prediction tensors, set a confidence threshold for all prediction boxes, generally 0.3, and delete boxes whose confidence falls below it; finally, remove redundant prediction boxes with non-maximum suppression, output the remaining boxes as the prediction result, and mark the result on the original image.
As shown in fig. 3a to fig. 3d, experiments show that the sand grain target detection method of the invention trains quickly, completes sand target detection rapidly, is suitable for automatic detection of large volumes of sand images, and has good extensibility, robustness, and practicality. Specifically, the detection rate of sand grain targets on the test set reaches 97.40%, and the AP reaches 90.98%, fully meeting the requirements of the sand grain target detection task.
In summary, addressing the sand grain target detection task, the invention has the following characteristics:
(1) to prevent the small feature differences between different kinds of sand grains from degrading detection accuracy, the multi-class classification part of the traditional detection framework is dropped, simplifying the framework while improving detection accuracy;
(2) to make full use of the image features in the data set, a double-ended input network structure taking the single-polarization and orthogonal-polarization images as input is designed; compared with a single-input framework, it has more image features available in the training and test phases and hence higher detection accuracy.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1. A sand grain target detection method based on a convolutional neural network, characterized by comprising the following steps:
1) designing a convolutional network structure, adding a double-ended input structure and a multi-scale detection structure;
2) preprocessing the sand grain images and their annotations;
3) training the convolutional network;
4) predicting target positions in sand grain images with the trained network.
2. The sand grain target detection method based on a convolutional neural network according to claim 1, characterized in that the convolutional network structure in step 1) is formed by stacking convolution modules and residual modules.
3. The sand grain target detection method based on a convolutional neural network according to claim 2, characterized in that the convolution module in step 1) is formed by connecting a convolutional layer, a batch normalization layer, and an activation layer in series.
4. The sand grain target detection method based on a convolutional neural network according to claim 2, characterized in that the residual module in step 1) is formed by cascading a zero padding layer, a convolution unit, and a residual structure module.
5. The sand grain target detection method based on a convolutional neural network according to claim 1, characterized in that the double-ended input structure of the convolutional network in step 1) feeds the single-polarization and orthogonal-polarization images into the network simultaneously, so that during training the network learns the features of both images at once; the two ends of the double-ended input structure are structurally identical but do not share parameters.
6. The sand grain target detection method based on a convolutional neural network according to claim 1, characterized in that the multi-scale detection structure in step 1) detects sand grain targets in the image at different scales; the multi-scale detection network uses three scales, large, medium, and small, and contains 5 modules: a large-scale detection structure, a large-to-medium-scale branch structure, a medium-scale detection structure, a medium-to-small-scale branch structure, and a small-scale detection structure.
7. The sand grain target detection method based on a convolutional neural network according to claim 1, characterized in that step 2) specifically comprises: constructing a training data set from the annotated sand grain images, the training data set comprising two parts, an image part and an annotation part; the image part is a set of paired single-polarization and orthogonal-polarization images; the annotation part is the position of each target sand grain in each image.
8. The sand grain target detection method based on a convolutional neural network according to claim 7, characterized in that the preprocessing of the sand grain images and annotations in step 2) specifically comprises:
preprocessing a sand image:
21) downsampling or upsampling the image so that its longer side matches the input size of the convolutional network;
22) extending the shorter side to the network input size, filling the extended pixels with the value 125;
23) normalizing the image;
preprocessing the annotation data:
24) converting the position information of the annotations according to the scaling applied to the image data;
25) comparing the intersection-over-union of each position entry with the anchor boxes, mapping entries above a given threshold into a tensor of the same size as the output tensor, and storing the remaining entries separately;
26) obtaining tensors at the three scales, consistent with the output tensors of the convolutional network.
9. The sand grain target detection method based on a convolutional neural network according to claim 1, characterized in that step 3) specifically comprises: training the convolutional network on the training data set, including defining an objective function and optimizing the training process.
10. The sand grain target detection method based on a convolutional neural network according to claim 1, characterized in that the process of predicting target positions in a sand grain image in step 4) is: preprocessing the sand grain image, inputting the pair of orthogonal-polarization and single-polarization sand grain images into the trained convolutional network to obtain prediction tensors at the three scales, setting a confidence threshold of 0.3 for all prediction boxes and deleting boxes whose confidence falls below it; finally, removing redundant prediction boxes with non-maximum suppression, outputting the remaining boxes as the prediction result, and marking the result on the original sand grain image.
CN202010114804.XA 2020-02-25 2020-02-25 Sand grain target detection method based on convolutional neural network Active CN111353976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010114804.XA CN111353976B (en) 2020-02-25 2020-02-25 Sand grain target detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010114804.XA CN111353976B (en) 2020-02-25 2020-02-25 Sand grain target detection method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN111353976A true CN111353976A (en) 2020-06-30
CN111353976B CN111353976B (en) 2023-07-25

Family

ID=71197181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010114804.XA Active CN111353976B (en) 2020-02-25 2020-02-25 Sand grain target detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN111353976B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328838A1 (en) * 2015-05-01 2016-11-10 Applied Research LLC. Automatic target recognition system with online machine learning capability
CN106557758A (en) * 2016-11-25 2017-04-05 南京大学 A kind of multiple target automatic identification method of grains of sand micro-image
CN107146233A (en) * 2017-04-24 2017-09-08 四川大学 Granulometry Segmentation based on petrographic thin section polarisation sequence chart
CN109668909A (en) * 2017-10-13 2019-04-23 南京敏光视觉智能科技有限公司 A kind of glass defect detection method
CN109523566A (en) * 2018-09-18 2019-03-26 姜枫 A kind of automatic division method of Sandstone Slice micro-image
CN110188720A (en) * 2019-06-05 2019-08-30 上海云绅智能科技有限公司 A kind of object detection method and system based on convolutional neural networks
CN110348376A (en) * 2019-07-09 2019-10-18 华南理工大学 A kind of pedestrian's real-time detection method neural network based

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113256704A (en) * 2021-03-26 2021-08-13 上海师范大学 Grain length and width measuring method
CN113256704B (en) * 2021-03-26 2024-04-05 上海师范大学 Grain length and width measuring method

Also Published As

Publication number Publication date
CN111353976B (en) 2023-07-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant