CN110610210A - Multi-target detection method - Google Patents

Multi-target detection method

Info

Publication number
CN110610210A
CN110610210A (application CN201910881579.XA)
Authority
CN
China
Prior art keywords
frame
layer
convolution
positioning information
activation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910881579.XA
Other languages
Chinese (zh)
Other versions
CN110610210B (en)
Inventor
吕乔
叶茂
窦强
李鑫鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910881579.XA priority Critical patent/CN110610210B/en
Publication of CN110610210A publication Critical patent/CN110610210A/en
Application granted granted Critical
Publication of CN110610210B publication Critical patent/CN110610210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/245Classification techniques relating to the decision surface
    • G06F18/2451Classification techniques relating to the decision surface linear, e.g. hyperplane
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target detection method, which comprises the following steps: S1, extracting a basic feature map and a context feature map; S2, capturing fuzzy activation regions in the real-time image and taking their coordinate information as the first batch of positioning information; S3, setting the cycle counter n = 1; S4, taking the coordinate pairs of the n-th batch of positioning information as centers, acquiring local feature matrices of fixed areas around those centers on the basic feature map; S5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module; S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached, then outputting all positioning information; S7, inputting all positioning information into a region suggestion network; S8, looping over steps S1 to S7 and summing all errors. The invention outputs positioning information through the predefined double-layer cyclic convolution emission module, thereby obtaining the approximate position of each target object in the image and greatly reducing the amount of computation per feature point.

Description

Multi-target detection method
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a method for detecting an image target in the field of computers.
Background
Nowadays, high-speed parallel computing architectures, represented by the NVIDIA series, are developing rapidly, and their products have evolved from DirectX-era graphics platforms into widely accessible parallel computing devices such as the GTX 1080 Ti. Under this trend, fields that require abundant computing resources have advanced quickly, with image processing technology leading the way and driving progress in intelligent systems, monitoring, security and many other areas. In addition, related hardware in the field of real-time image perception is also developing, such as infrared cameras and monocular and binocular cameras among peripheral devices; this perception hardware is gradually converging toward structures that better match the human visual system, which facilitates image processing in software algorithms. With the dual support of image perception modules and embedded computing systems, how to apply more innovative and ergonomic image analysis techniques to mobile intelligent machines has become a challenging frontier topic spanning software, hardware and multiple disciplines.
In recent years, thanks to the rapid development of hardware systems, many higher-performance real-time image analysis and processing methods have emerged, and an important problem among them is the real-time detection of multiple targets in an image. Many mature multi-target detection methods already exist in industry. In traditional machine learning, target detection is generally divided into three steps: brute-force extraction of candidate regions, hand-crafted feature extraction, and classification with the fast AdaBoost algorithm or an SVM with strong generalization ability.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-target detection method which outputs positioning information through a predefined double-layer cyclic convolution emission module, thereby obtaining the approximate position of each target object in the image, greatly reducing the amount of computation per feature point, avoiding the anchoring and computation performed at every position in the Faster R-CNN method, and allowing detection to better meet the speed requirements of real-time operation.
The purpose of the invention is realized by the following technical scheme: a multi-target detection method comprises the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image, and inputting the real-time image into a context integration network to obtain a context feature map;
s2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all the activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
s3, setting the cycle number n to 1;
s4, taking the coordinate pair of the nth batch of positioning information as a center, acquiring a local feature matrix of a fixed area near the center position on the basic feature map, and filtering and pooling the local feature matrix to obtain a focusing feature;
S5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module, and outputting two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)-th batch of positioning information, which is written into the positioning cache pool; the positioning cache pool is maintained globally, and the positioning information output by the current loop body is injected into it after each loop iteration finishes;
S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output together with two error values: the first error value is the error between the prediction category near each positioning and the label category, and the second error value is the error between the positioning information and the coordinates of the labelled target frame;
S7, inputting all positioning information in the positioning cache pool into the region suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frames, and the second error value is the error between the prediction categories of the first suggestion candidate boxes and the real categories of the target frames;
inputting the first suggestion candidate boxes into a suggestion-target module, screening and refining the first suggestion candidate box set, and outputting the second suggestion candidate boxes, the category label corresponding to each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method and outputting, through the pooling operation, final interest features of consistent size; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the prediction category and predicted frame coordinates for the candidate frame corresponding to each final interest feature, and generating two error values: the first error is the error between the prediction category and the label category, and the second error is the error between the coordinates of the predicted frame and the label coordinates;
and S8, looping steps S1 to S7, summing all errors, performing a back propagation algorithm, and iteratively updating each weight parameter in the network.
Further, the context synthesis network is formed by stacking basic convolution operation units, where a single convolution operation unit is expressed as:
$$x_j^l = f\Big(\sum_{m \in M_j} x_m^{l-1} * k_{mj}^l + b_j^l\Big)$$
where $l$ denotes the $l$-th convolutional layer; $j$ denotes the $j$-th feature map of the current convolutional layer; $x_m^{l-1}$ denotes a feature map of the $(l-1)$-th convolutional layer; $k_{mj}^l$ denotes the $m$-th convolution kernel of the $j$-th feature map of the $l$-th convolutional layer; $M_j$ denotes the set of all convolution kernels corresponding to the $j$-th feature map; the symbol $*$ denotes the convolution operation; $b_j^l$ denotes the bias vector parameter of the $j$-th feature map of the $l$-th convolutional layer; and $f(\cdot)$ denotes the activation function.
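A minimal sketch (assumed PyTorch implementation, not taken from the patent) of the basic convolution operation unit described by the formula above: one convolution holding the kernels and bias, followed by an activation f(·).

```python
import torch
import torch.nn as nn

class BasicConvUnit(nn.Module):
    """One basic convolution operation unit: x^l = f(conv(x^{l-1}) + b)."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # The convolution holds the kernels k^l and bias b^l of the formula.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.act = nn.ReLU(inplace=True)  # f(·): activation function

    def forward(self, x):
        # x: feature maps of layer l-1; returns feature maps of layer l.
        return self.act(self.conv(x))
```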
Further, the step S2 includes the following sub-steps:
S21, inputting the original image into stacked basic convolution operation units, where two basic convolution operation units and one basic pooling unit form a convolutional block unit; five convolutional block units with the same structure are cascaded, and after the cascade a feature map of the original image is output;
S22, inputting the feature map into the GAP layer and outputting a one-dimensional vector, where each element of the one-dimensional vector is the mean of the feature matrix of one channel of the feature map; computing the weighted sum of all values in the one-dimensional vector and passing it through an activation function layer to obtain the class probabilities;
S23, performing a weighted summation over the output features of the last convolutional block to obtain the class-based activation map, with the formula:
$$M_c(x, y) = \sum_k w_k^c f_k(x, y)$$
where $f_k(x, y)$ denotes the activation value at coordinates $(x, y)$ of the $k$-th unit (channel) of the output features of the last convolutional block; and $w_k^c$ denotes the weight of unit $k$ for class $c$, i.e. the importance of unit $k$ for class $c$;
S24, scaling the class-based activation map obtained above to the same size as the original image, comparing the correlation between each activation region and the class, and computing the coordinate point with the highest local correlation:
$$c_i = \max\big(g(x_0, y_0), g(x_1, y_1), \ldots, g(x_N, y_N)\big)$$
where $g(\cdot)$ denotes the pixel value at a position, $(x_i, y_i)$ is a coordinate point within the local activation region, and $c_i$ denotes the correlation of the $i$-th local activation region;
and outputting the obtained coordinate points with the highest local correlation as the coordinate point set of the first batch of positioning information.
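A hedged sketch of extracting the first batch of positioning coordinates from a class-based activation map. The function name, the peak-picking scheme (global top-k activations rather than strictly region-wise maxima), and the number of points are illustrative assumptions, not the patent's exact procedure.

```python
import torch
import torch.nn.functional as F

def first_batch_locations(cam, image_size, num_points=4):
    """cam: (H, W) class activation map; returns num_points (x, y) coordinate pairs."""
    cam = cam.unsqueeze(0).unsqueeze(0)                       # (1, 1, H, W)
    cam = F.interpolate(cam, size=image_size, mode="bilinear",
                        align_corners=False).squeeze()        # scale to original image size
    flat = cam.flatten()
    idx = torch.topk(flat, num_points).indices                # highest activation values
    ys, xs = idx // cam.shape[1], idx % cam.shape[1]
    return list(zip(xs.tolist(), ys.tolist()))                # first-batch coordinate set
```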
Further, the double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context features as input, continuously explores the optimal positioning under the optimization of the back-propagation algorithm through a double-layer recurrent emission network, and finally outputs a fixed amount of positioning information; the specific steps are as follows:
S51, taking each image as the processing unit and the $t$-th batch of positioning information $L_t = ((x_0, y_0), (x_1, y_1), \ldots, (x_m, y_m))$, extracting the high-dimensional vectors within a fixed range (2 × 2) around each coordinate on the basic feature map, and processing them through vector operations into a fixed-dimension localization feature tensor $P_t$;
S52, inputting the localization feature tensor into a convolution layer, a regularization layer and an excitation function layer, and outputting the activated localization feature tensor:
$$P_{t\_active} = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv2d}(P_t)))$$
where $\mathrm{RELU}(x) = x$ when $x > 0$ and $\mathrm{RELU}(x) = 0$ when $x \le 0$; $\mathrm{BN}(\cdot)$ is the Batch-Normalization layer from deep learning, whose main function here is to prevent the network from overfitting; and $\mathrm{Conv2d}(\cdot)$ is a deep-learning network layer whose main function is to extract image features using convolution operations;
the positioning information $L_t$ is likewise input into a convolution layer, a regularization layer and an excitation function layer, and the activated positioning information tensor is output:
$$L_{t\_active} = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv2d}(L_t)))$$
the two tensors are then multiplied to obtain the focusing feature tensor:
$$G_t = P_{t\_active} \otimes L_{t\_active}$$
S53, in the loop operation, one loop unit corresponds to one time step, implemented as follows:
S531, if this is the first time step of the loop operation, the hidden state of the first-layer convolutional LSTM structure is initialized with a zero vector; otherwise, the focusing feature tensor and the hidden state of the previous time step are input into the first-layer convolutional LSTM encoder $e(\cdot)$, and the new hidden state of the encoder is output:
$$h_t^e,\; c_t^e = e\big(G_t,\; h_{t-1}^e,\; c_{t-1}^e\big)$$
where $h_t^e$ denotes the new hidden state of the encoder $e$ at time $t$; $c_t^e$ denotes the new cell state of the encoder $e$ at time $t$, which comes from a step defined in the existing LSTM network structure and stores hidden information that is valid in long-term memory; and $G_t$ denotes the focusing feature tensor at time $t$;
the hidden state $h_t^e$ of the convolutional LSTM encoder is input into a cascaded convolution network $ec(\cdot)$ and a linear classifier, and the classification probability of the focusing region is output:
$$V = W_2\big(W_1 \cdot ec(h_t^e) + b_1\big) + b_2, \qquad Prob_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$
where $V$ denotes the output classification score vector; $W_1$ and $W_2$ denote the first and second weight parameters; $b_1$ and $b_2$ denote the first and second bias parameters; $Prob_i$ is the probability of class $i$; $V_i$ is the output of the $i$-th unit at the front stage of the classifier; and $C$ denotes the total number of categories;
the output definition of the above first layer ends;
S532, in the second-layer convolutional LSTM decoder $d(\cdot)$: if this is the initial time step of the loop, the decoder takes the context feature map as its initialization value; otherwise the decoder takes $h_t^e$ and the hidden state of this layer at the previous time step as input and outputs the new hidden state of the decoder:
$$h_t^d,\; c_t^d = d\big(h_t^e,\; h_{t-1}^d,\; c_{t-1}^d\big)$$
the hidden state $h_t^d$ of the convolutional LSTM decoder is input into a linear regressor $el(\cdot)$, the two-dimensional coordinates of the attention position at the next time step are output, and the coordinates are stored into the positioning cache pool:
$$l_{t+1} = el(h_t^d)$$
the definition of the output of the above second layer is finished;
S533, in the current time step, the double-layer network combines the hidden information from past time steps with the current information and computes the local classification error of the image using the cross-entropy method; within the same time step it likewise computes the localization error for the next time step using the mean-squared-error method:
$$loss_{cls}^t = -\log\big(Prob_{gt}\big), \qquad loss_{loc}^t = \frac{1}{m}\sum_i \big(y_i - \hat{y}_i\big)^2$$
where $gt$ denotes the label category, $y_i$ denotes the annotated coordinates, and $\hat{y}_i$ denotes the predicted output for the corresponding annotated coordinate; when computing the loss function, the losses of each image at every time step are summed and their average is taken as the final loss.
S534, repeating steps S531 to S533, and taking the final loss and the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
Further, the regional suggestion network takes the basic feature map, the mark value of the target frame and the positioning information of the positioning cache pool as input, improves the RPN method according to the positioning information, which is abbreviated as LRPN, and then outputs the coordinates of a fixed number of suggestion candidate frames and the intra-frame prediction result, and outputs two loss functions;
the specific implementation mode is as follows:
S71, inputting the basic feature map into the convolution network and the activation network, and outputting an activated feature map;
S72, introducing anchor frame rules and setting A anchor frames for each spatial position on the activation map; performing a convolution with stride 1 and a 1 × 1 kernel on the activated feature map and outputting a score tensor with 2 × A channels, where the channels represent the class prediction probability scores of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map; performing another convolution with stride 1 and a 1 × 1 kernel on the LRPN activated feature map and outputting a coordinate offset tensor with 4 × A channels, where the channels represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map, used to solve for the optimal predicted coordinates;
S73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, with the following steps:
S731, screening out the valid anchor frames in the tensors according to the positioning information, and discarding valid anchor frames that extend beyond the image boundary;
S732, sorting the anchor frames and their corresponding scores by score and keeping the first N, where N is a hyper-parameter;
S733, screening the remaining candidates with non-maximum suppression, and taking the first M of the remaining entries sorted by score as the first suggestion candidate frame set;
S734, setting the label of every anchor frame that does not meet the following conditions to -1:
a. anchor frames corresponding to the positioning information;
b. anchor frames that do not exceed the image boundary;
modifying each anchor frame according to the predicted coordinate offset, comparing the modified anchor frame coordinates with the labelled target frames in the image, selecting the F anchor frames with the largest overlap ratio greater than the positive threshold and setting their labels to 1; selecting the B anchor frames with the largest overlap ratio smaller than the negative threshold and setting their labels to 0; where the F value and the B value are set according to the hyper-parameters of the Faster R-CNN method;
S735, removing the anchor frames labelled -1 and evaluating the loss functions over the remaining anchor frames to obtain the loss lrpn_cls between the anchor frame category predictions and the category labels and the loss lrpn_bbox between the anchor frame coordinate predictions and the coordinate labels, and outputting the first suggestion candidate boxes;
s736, screening and refining the first suggestion candidate box set, i.e., inputting the first suggestion candidate box set into a suggestion-target module, specifically operating as follows:
s7361, traversing the frame coordinates of all the labels for any frame coordinate in the first suggested candidate frame set, selecting the frame coordinate with the largest overlapping rate as a corresponding label frame, if the overlapping rate of the label frame and the candidate frame is greater than a threshold value, considering the candidate frame as a foreground, and if the overlapping rate of the label frame and the candidate frame is less than the threshold value, considering the candidate frame as a background;
s7362, setting a fixed number of foreground and background for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as a second candidate frame set;
s7363, calculating the offset between the coordinates of the second candidate frame and the coordinates of the corresponding label frame, and using the offset and the second candidate frame set as the output of the module.
The invention has the beneficial effects that: a number of cascaded modules are set up; a convolution activation network extracts activation positions based on discriminative features in the image as the initial values input into the double-layer cyclic convolution emission module; features used for training are then extracted by the convolutional deep network and the context synthesis network; the double-layer cyclic convolution emission module obtains a positioning information set based on visual attention; the region suggestion network obtains the second suggestion candidate boxes based on the positioning information; and finally the ROI pooling module and the RCNN module predict the category and coordinates of the features in the suggestion candidate boxes. The advantage of the algorithm is that positioning information can be output by the predefined double-layer cyclic convolution emission module, so that the approximate position of each target object in the image is obtained, the amount of computation per feature point is greatly reduced, the anchoring and computation at every position in the Faster R-CNN method are avoided, and detection better meets the speed requirements of real-time operation.
Drawings
FIG. 1 is a flow chart of a multi-target detection method of the present invention.
Detailed Description
When a visual device such as a monocular camera on a mobile machine acquires real-time images from its surroundings, the embedded computing system needs to perform target detection on the images promptly, so as to judge the positions and sizes of targets in the current environment and take corresponding actions. Based on this requirement, an accurate and fast multi-target detection method is crucial. In this process, mainstream methods need to process all regions of the image, and each processed region may overlap with other processed regions. In a hierarchical deep-learning structure, the huge number of region proposals correspondingly increases the number of weight coefficients in the feature expression function; the invention therefore designs a scheme that improves region processing efficiency and reduces the load on the computing system by incorporating the focusing mechanism of human vision.
The method sets up a number of cascaded modules: a convolution activation network extracts activation positions based on discriminative features in the image as the initial values input into the double-layer cyclic convolution emission module; features used for training are then extracted by the convolutional deep network and the context synthesis network; the double-layer cyclic convolution emission module obtains a positioning information set based on visual attention; the region suggestion network obtains the second suggestion candidate boxes based on the positioning information; and finally the ROI pooling module and the RCNN module predict the category and coordinates of the features in the suggestion candidate boxes. The advantage of the algorithm is that positioning information can be output by the predefined double-layer cyclic convolution emission module, so that the approximate position of each target object in the image is obtained, the amount of computation per feature point is greatly reduced, the anchoring and computation at every position in the Faster R-CNN method are avoided, and detection better meets the speed requirements of real-time operation. The technical scheme of the invention is further explained below with reference to the accompanying drawings.
As shown in fig. 1, a multi-target detection method includes the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image through a relevant technology such as ResNet series, and simultaneously inputting the real-time image into a context integration network to obtain a context feature map as the initial input of a subsequent module;
when hardware such as a camera acquires an original image with larger spatial resolution, a context synthesis network acts on the original image, the context synthesis network is formed by overlapping basic convolution operation units, wherein the expression of a single convolution operation unit formula is as follows:
wherein, l represents the first layer of the convolution layer; j represents the jth feature map of the current convolutional layer;a jth feature diagram representing the l-1 th roll base;an mth convolution kernel representing a jth feature map of the ith layer volume base layer; mjRepresenting all convolution kernel sets corresponding to the jth feature map; symbol denotes convolution operation;an offset vector parameter representing a jth characteristic diagram of the ith layer volume base layer; f (-) represents the activation function.
In the context synthesis network, the first 9 convolution operation units of the VGG16 network are selected as the architecture; the input channel number, output channel number, convolution kernel size, convolution stride and padding parameter of each convolution operation unit are fixed; the original image with 3 channels is input into the first convolution operation unit, and finally a context feature map with 128 channels is output.
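A hedged sketch of the context synthesis network: a stack of nine basic convolution operation units patterned after the first convolution stages of VGG16. The exact channel schedule below (3 → … → 128) is an assumption, chosen only to be consistent with the stated 3-channel input and 128-channel context feature map.

```python
import torch.nn as nn

def conv_unit(cin, cout):
    """One basic convolution operation unit (3x3 conv + ReLU)."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.ReLU(inplace=True))

class ContextSynthesisNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Assumed channel schedule: 9 units, ending at 128 channels.
        channels = [3, 64, 64, 128, 128, 128, 128, 128, 128, 128]
        self.units = nn.Sequential(*[conv_unit(channels[i], channels[i + 1])
                                     for i in range(9)])

    def forward(self, image):
        return self.units(image)  # context feature map with 128 channels
```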
In this scheme, the context synthesis network serves as the initialization basis of the double-layer cyclic convolution emission module, so that the emission module obtains the extracted fuzzy features in advance, acquires the global information of the image, and accelerates the process of accurately localizing the fuzzy position of the target.
S2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all the activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
based on an existing unsupervised algorithm scheme CAM, the convolution activation network mainly realizes a process of generating an activation map based on a category unsupervised by a GAP algorithm and outputting target fuzzy positioning information. The method specifically comprises the following substeps:
s21, inputting the original image into a superposed basic convolution operation unit, wherein two basic convolution operation units and one basic pooling unit are used as a convolution block unit, five convolution block units with the same structure are used for cascading, and after the cascading, a characteristic map of the original image is output;
s22, inputting the feature map into the GAP layer, and outputting a one-dimensional vector, wherein elements in the one-dimensional vector are the feature matrix average value of each channel in the feature map; calculating the weighted sum of all values in the one-dimensional vector, and solving an activation function layer of the class probability;
S23, based on the output of the class activation function, the important regions in the original image are marked and visualized by mapping the weights output by the GAP layer back onto the output features of the last convolutional block; specifically, a weighted summation is performed over the output features of the last convolutional block to obtain the class-based activation map:
$$M_c(x, y) = \sum_k w_k^c f_k(x, y)$$
where $f_k(x, y)$ denotes the activation value at coordinates $(x, y)$ of the $k$-th unit (channel) of the output features of the last convolutional block; and $w_k^c$ denotes the weight of unit $k$ for class $c$, i.e. the importance of unit $k$ for class $c$; after the GAP layer, the activation values of each unit at all coordinate positions are solved and summed.
S24, in the convolutional deep activation network, weight mapping is applied to the activation regions to highlight their importance in the original image. The class-based activation map obtained above is scaled to the same size as the original image, the correlation between each activation region and the class is compared, and the coordinate point with the highest local correlation is computed:
$$c_i = \max\big(g(x_0, y_0), g(x_1, y_1), \ldots, g(x_N, y_N)\big)$$
where $g(\cdot)$ denotes the pixel value at a position, $(x_i, y_i)$ is a coordinate point within the local activation region, and $c_i$ denotes the correlation of the $i$-th local activation region;
the obtained coordinate points with the highest local correlation are output as the coordinate point set of the first batch of positioning information.
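A minimal sketch of the class-based activation map M_c(x, y) = Σ_k w_k^c · f_k(x, y), assuming a backbone whose last convolutional block is followed by a GAP layer and a linear classifier whose weight matrix supplies w^c.

```python
import torch

def class_activation_map(features, fc_weight, class_idx):
    """features:  (K, H, W) output of the last convolutional block (f_k);
       fc_weight: (C, K) weights of the classifier that follows the GAP layer;
       returns the (H, W) activation map for class class_idx."""
    w_c = fc_weight[class_idx]                     # (K,) importance of each unit k for class c
    # weighted sum over channels k of f_k(x, y)
    return torch.einsum("k,khw->hw", w_c, features)
```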
S3, setting the cycle number n to 1;
S4, this step is the entry of the loop body, whose input comes from the output generated in the previous cycle: the n-th batch of positioning information; on the first entry into the loop body, the first batch of positioning information from step S2 is used as the centers; otherwise, the coordinate pairs of the n-th batch of positioning information are used as the centers, local feature matrices of fixed areas around those center positions are acquired on the basic feature map, and the local feature matrices are filtered and pooled to obtain the focusing feature;
S5, inputting the focusing feature and the context feature into the double-layer cyclic convolution emission module, and outputting two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)-th batch of positioning information, which is written into the positioning cache pool; the positioning cache pool is maintained globally, and the positioning information output by the current loop body is injected into it after each loop iteration finishes; at this point the output at the exit of the loop body is the (n+1)-th batch of positioning information, which is the input for the next entry into the loop body;
The double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context features as input, continuously explores the optimal positioning under the optimization of the back-propagation algorithm through a double-layer recurrent emission network, and finally outputs a fixed amount of positioning information; the specific steps are as follows:
S51, taking each image as the processing unit and the $t$-th batch of positioning information $L_t = ((x_0, y_0), (x_1, y_1), \ldots, (x_m, y_m))$, extracting the high-dimensional vectors within a fixed range (2 × 2) around each coordinate on the basic feature map, and processing them through vector operations into a fixed-dimension localization feature tensor $P_t$;
S52, inputting the localization feature tensor into a convolution layer, a regularization layer and an excitation function layer, and outputting the activated localization feature tensor:
$$P_{t\_active} = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv2d}(P_t)))$$
where $\mathrm{RELU}(x) = x$ when $x > 0$ and $\mathrm{RELU}(x) = 0$ when $x \le 0$; $\mathrm{BN}(\cdot)$ is the Batch-Normalization layer from deep learning, whose main function here is to prevent the network from overfitting; and $\mathrm{Conv2d}(\cdot)$ is a deep-learning network layer whose main function is to extract image features using convolution operations;
the positioning information $L_t$ is likewise input into a convolution layer, a regularization layer and an excitation function layer, and the activated positioning information tensor is output:
$$L_{t\_active} = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv2d}(L_t)))$$
the two tensors are then multiplied to obtain the focusing feature tensor:
$$G_t = P_{t\_active} \otimes L_{t\_active}$$
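A hedged sketch of step S52: activating the localization feature tensor P_t and the positioning information tensor L_t, then combining them into the focusing feature tensor G_t by tensor multiplication (taken here as an element-wise product). Channel sizes and the 1×1 kernels are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FocusFeature(nn.Module):
    def __init__(self, feat_channels, loc_channels, out_channels):
        super().__init__()
        self.p_branch = nn.Sequential(nn.Conv2d(feat_channels, out_channels, 1),
                                      nn.BatchNorm2d(out_channels), nn.ReLU())
        self.l_branch = nn.Sequential(nn.Conv2d(loc_channels, out_channels, 1),
                                      nn.BatchNorm2d(out_channels), nn.ReLU())

    def forward(self, p_t, l_t):
        p_active = self.p_branch(p_t)   # RELU(BN(Conv2d(P_t)))
        l_active = self.l_branch(l_t)   # RELU(BN(Conv2d(L_t)))
        return p_active * l_active      # focusing feature tensor G_t
```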
S53, in the loop operation, one loop unit corresponds to one time step, implemented as follows:
S531, if this is the first time step of the loop operation, the hidden state of the first-layer convolutional LSTM structure is initialized with a zero vector; otherwise, the focusing feature tensor and the hidden state of the previous time step are input into the first-layer convolutional LSTM encoder $e(\cdot)$, and the new hidden state of the encoder is output:
$$h_t^e,\; c_t^e = e\big(G_t,\; h_{t-1}^e,\; c_{t-1}^e\big)$$
where $h_t^e$ denotes the new hidden state of the encoder $e$ at time $t$; $c_t^e$ denotes the new cell state of the encoder $e$ at time $t$, which comes from a step defined in the existing LSTM network structure and stores hidden information that is valid in long-term memory; and $G_t$ denotes the focusing feature tensor at time $t$;
the hidden state $h_t^e$ of the convolutional LSTM encoder is input into a cascaded convolution network $ec(\cdot)$ and a linear classifier, and the classification probability of the focusing region is output:
$$V = W_2\big(W_1 \cdot ec(h_t^e) + b_1\big) + b_2, \qquad Prob_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$
where $V$ denotes the output classification score vector; $W_1$ and $W_2$ denote the first and second weight parameters; $b_1$ and $b_2$ denote the first and second bias parameters; the first formula computes the feature vector obtained from the classification operation on the current focus region;
$Prob_i$ is the probability of class $i$; $V_i$ is the output of the $i$-th unit at the front stage of the classifier; and $C$ denotes the total number of categories; the second formula maps the feature vector of the first formula to a classification probability value for each class;
the output definition of the above first layer ends;
S532, in the second-layer convolutional LSTM decoder $d(\cdot)$: if this is the initial time step of the loop, the decoder takes the context feature map as its initialization value; otherwise the decoder takes $h_t^e$ and the hidden state of this layer at the previous time step as input and outputs the new hidden state of the decoder:
$$h_t^d,\; c_t^d = d\big(h_t^e,\; h_{t-1}^d,\; c_{t-1}^d\big)$$
the hidden state $h_t^d$ of the convolutional LSTM decoder is input into a linear regressor $el(\cdot)$, the two-dimensional coordinates of the attention position at the next time step are output, and the coordinates are stored into the positioning cache pool:
$$l_{t+1} = el(h_t^d)$$
the definition of the output of the above second layer is finished;
S533, in the current time step, the double-layer network combines the hidden information from past time steps with the current information and computes the local classification error of the image using the cross-entropy method; within the same time step it likewise computes the localization error for the next time step using the mean-squared-error method:
$$loss_{cls}^t = -\log\big(Prob_{gt}\big), \qquad loss_{loc}^t = \frac{1}{m}\sum_i \big(y_i - \hat{y}_i\big)^2$$
where $gt$ denotes the label category, $y_i$ denotes the annotated coordinates, and $\hat{y}_i$ denotes the predicted output for the corresponding annotated coordinate; when computing the loss function, the losses of each image at every time step are summed and their average is taken as the final loss.
S534, repeating steps S531 to S533, and taking the final loss and the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
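A hedged sketch of the per-step losses of S533: cross-entropy for the local classification at each time step and mean-squared error for the next-step localization, averaged over time steps (and, during training, over images). Function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def emission_losses(step_logits, step_locs, gt_class, gt_locs):
    """step_logits: list of (C,) score vectors, one per time step;
       step_locs:   list of (2,) predicted coordinates, one per time step;
       gt_class:    scalar long tensor with the label category;
       gt_locs:     list of (2,) annotated coordinates, one per time step."""
    cls_loss = torch.stack([F.cross_entropy(v.unsqueeze(0), gt_class.unsqueeze(0))
                            for v in step_logits]).mean()       # averaged cross-entropy
    loc_loss = torch.stack([F.mse_loss(p, y)
                            for p, y in zip(step_locs, gt_locs)]).mean()  # averaged MSE
    return cls_loss, loc_loss
```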
S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output together with two error values: the first error value is the error between the prediction category near each positioning and the label category, and the second error value is the error between the positioning information and the coordinates of the labelled target frame;
S7, inputting all positioning information in the positioning cache pool into the region suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frames, and the second error value is the error between the prediction categories of the first suggestion candidate boxes and the real categories of the target frames;
inputting the first suggestion candidate boxes into a suggestion-target module, screening and refining the first suggestion candidate box set, and outputting the second suggestion candidate boxes, the category label corresponding to each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
The region suggestion network takes the basic feature map, the target frame label values and the positioning information from the positioning cache pool as input; it improves the RPN method according to the positioning information (abbreviated LRPN), then outputs the coordinates of a fixed number of suggestion candidate frames together with the in-frame prediction results, and outputs two loss functions;
the specific implementation mode is as follows:
S71, inputting the basic feature map into the convolution network and the activation network, and outputting an activated feature map;
S72, introducing anchor frame rules and setting A anchor frames for each spatial position on the activation map; performing a convolution with stride 1 and a 1 × 1 kernel on the activated feature map and outputting a score tensor with 2 × A channels, where the channels represent the class prediction probability scores of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map; performing another convolution with stride 1 and a 1 × 1 kernel on the LRPN activated feature map and outputting a coordinate offset tensor with 4 × A channels, where the channels represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map, used to solve for the optimal predicted coordinates;
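A hedged sketch of the two LRPN prediction heads of S72: 1×1 convolutions with stride 1 over the activated feature map, producing a 2A-channel score tensor and a 4A-channel coordinate-offset tensor (A anchors per spatial position).

```python
import torch.nn as nn

class LRPNHead(nn.Module):
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        # 2 x A channels: class prediction probability scores per anchor.
        self.score = nn.Conv2d(in_channels, 2 * num_anchors, kernel_size=1, stride=1)
        # 4 x A channels: predicted coordinate offsets per anchor.
        self.offset = nn.Conv2d(in_channels, 4 * num_anchors, kernel_size=1, stride=1)

    def forward(self, activated_feat):
        return self.score(activated_feat), self.offset(activated_feat)
```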
S73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, with the following steps:
S731, screening out the valid anchor frames in the tensors according to the positioning information, and discarding valid anchor frames that extend beyond the image boundary;
S732, sorting the anchor frames and their corresponding scores by score and keeping the first N, where N is a hyper-parameter;
S733, screening the remaining candidates with non-maximum suppression, and taking the first M of the remaining entries sorted by score as the first suggestion candidate frame set (see the sketch after step S735);
S734, setting the label of every anchor frame that does not meet the following conditions to -1:
a. anchor frames corresponding to the positioning information;
b. anchor frames that do not exceed the image boundary;
modifying each anchor frame according to the predicted coordinate offset, comparing the modified anchor frame coordinates with the labelled target frames in the image, selecting the F anchor frames with the largest overlap ratio greater than the positive threshold and setting their labels to 1; selecting the B anchor frames with the largest overlap ratio smaller than the negative threshold and setting their labels to 0; where the F value is set according to the hyper-parameters of the Faster R-CNN method: the ratio of selected anchor frames above the positive threshold to anchor frames below the negative threshold is 1:2 and the total number of anchor frames is 300, so F is set to 100; the B value is likewise set to 200 according to the hyper-parameters of the Faster R-CNN method;
S735, removing the anchor frames labelled -1 and evaluating the loss functions over the remaining anchor frames to obtain the loss lrpn_cls between the anchor frame category predictions and the category labels and the loss lrpn_bbox between the anchor frame coordinate predictions and the coordinate labels, and outputting the first suggestion candidate boxes;
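A hedged sketch of the LRPN suggestion filtering in S731–S733: keep anchors near the positioning information, drop anchors crossing the image boundary, take the top-N by score, then apply non-maximum suppression and keep the top-M as the first suggestion candidate set. The distance-based "near positioning information" test and its radius are assumptions for illustration.

```python
import torch
from torchvision.ops import nms

def lrpn_suggest(anchors, scores, locations, image_size,
                 top_n=2000, top_m=300, nms_thresh=0.7, radius=32.0):
    """anchors: (K, 4) float boxes (x1, y1, x2, y2); scores: (K,);
       locations: (P, 2) positioning coordinates from the cache pool."""
    W, H = image_size
    inside = (anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) & \
             (anchors[:, 2] <= W) & (anchors[:, 3] <= H)            # S731: inside image
    centers = (anchors[:, :2] + anchors[:, 2:]) / 2
    near = torch.cdist(centers, locations).min(dim=1).values < radius  # near positioning info
    keep = inside & near
    anchors, scores = anchors[keep], scores[keep]
    order = scores.argsort(descending=True)[:top_n]                  # S732: top-N by score
    anchors, scores = anchors[order], scores[order]
    kept = nms(anchors, scores, nms_thresh)[:top_m]                  # S733: NMS, keep top-M
    return anchors[kept]
```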
s736, screening and refining the first suggestion candidate box set, i.e., inputting the first suggestion candidate box set into a suggestion-target module, specifically operating as follows:
s7361, traversing the frame coordinates of all the labels for any frame coordinate in the first suggested candidate frame set, selecting the frame coordinate with the largest overlapping rate as a corresponding label frame, if the overlapping rate of the label frame and the candidate frame is greater than a threshold value, considering the candidate frame as a foreground, and if the overlapping rate of the label frame and the candidate frame is less than the threshold value, considering the candidate frame as a background;
s7362, setting a fixed number of foreground and background for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as a second candidate frame set;
s7363, calculating the offset of the coordinates of the second candidate frame and the coordinates of the corresponding label frame, and taking the offset and the second candidate frame set as the output of the module;
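A hedged sketch of S7363: computing the offsets between the second candidate boxes and their matched label boxes. The (dx, dy, dw, dh) parameterization of Faster R-CNN is assumed here; the patent only states that an offset is computed.

```python
import torch

def box_offsets(candidates, labels):
    """candidates, labels: (N, 4) tensors in (x1, y1, x2, y2) form."""
    cw, ch = candidates[:, 2] - candidates[:, 0], candidates[:, 3] - candidates[:, 1]
    cx, cy = candidates[:, 0] + 0.5 * cw, candidates[:, 1] + 0.5 * ch
    gw, gh = labels[:, 2] - labels[:, 0], labels[:, 3] - labels[:, 1]
    gx, gy = labels[:, 0] + 0.5 * gw, labels[:, 1] + 0.5 * gh
    dx, dy = (gx - cx) / cw, (gy - cy) / ch          # center offsets, normalized
    dw, dh = torch.log(gw / cw), torch.log(gh / ch)  # log-scale size offsets
    return torch.stack([dx, dy, dw, dh], dim=1)
```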
Inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method and outputting, through the pooling operation, final interest features of consistent size; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the prediction category and predicted frame coordinates for the candidate frame corresponding to each final interest feature, and generating two error values: the first error is the error between the prediction category and the label category, and the second error is the error between the coordinates of the predicted frame and the label coordinates;
and S8, looping steps S1 to S7, summing all errors, performing a back propagation algorithm, and iteratively updating each weight parameter in the network.
Details of other module implementations
1) Inputting the second candidate box set into the pooling module, and outputting final interest features of uniform size using the ROI pooling method of Faster R-CNN;
2) Using the RCNN module method of Faster R-CNN, inputting the final interest features into a cascaded convolution network and outputting the coordinate prediction values of the second candidate frames; inputting the final interest features into another cascaded convolution network and outputting the category prediction values of the final interest features; computing the loss rcnn_bbox between the coordinate prediction values and the label values, and the loss rcnn_cls between the category prediction values and the label values;
3) The total loss is calculated as:
$$L = loss_{cls} + loss_{loc} + lrpn_{cls} + lrpn_{bbox} + rcnn_{cls} + rcnn_{bbox}$$
According to the total loss formula, the invention uses an end-to-end method and adjusts the weight matrices in parallel according to the total loss L with the supervised SGD training algorithm, where the weight matrices include those of all supervised modules other than the convolutional deep activation module.
4) If in the testing stage, the coordinate prediction value in 2) is output as the detection result of the frame coordinate, and the category prediction value in 2) is output as the detection result of the frame category.
This scheme can be implemented as an independent, complete technical solution in the form of a computer product: a medium storing the program code serves as the basic hardware of the scheme, a real-time camera is typically used as the external device that receives high-resolution images, a GTX 1080 Ti is used as the image computing device, and terminal platforms such as personal computers and tablets serve as the output devices for the prediction results.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (5)

1. A multi-target detection method is characterized by comprising the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image, and inputting the real-time image into a context integration network to obtain a context feature map;
s2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all the activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
s3, setting the cycle number n to 1;
s4, taking the coordinate pair of the nth batch of positioning information as a center, acquiring a local feature matrix of a fixed area near the center position on the basic feature map, and filtering and pooling the local feature matrix to obtain a focusing feature;
S5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module, and outputting two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)-th batch of positioning information, which is written into the positioning cache pool; the positioning cache pool is maintained globally, and the positioning information output by the current loop body is injected into it after each loop iteration finishes;
S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output together with two error values: the first error value is the error between the prediction category near each positioning and the label category, and the second error value is the error between the positioning information and the coordinates of the labelled target frame;
S7, inputting all positioning information in the positioning cache pool into the region suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frames, and the second error value is the error between the prediction categories of the first suggestion candidate boxes and the real categories of the target frames;
inputting the first suggestion candidate boxes into a suggestion-target module, screening and refining the first suggestion candidate box set, and outputting the second suggestion candidate boxes, the category label corresponding to each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method and outputting, through the pooling operation, final interest features of consistent size; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the prediction category and predicted frame coordinates for the candidate frame corresponding to each final interest feature, and generating two error values: the first error is the error between the prediction category and the label category, and the second error is the error between the coordinates of the predicted frame and the label coordinates;
and S8, looping steps S1 to S7, summing all errors, performing a back propagation algorithm, and iteratively updating each weight parameter in the network.
2. The multi-target detection method of claim 1, wherein the context synthesis network is formed by stacking basic convolution operation units, where a single convolution operation unit is expressed as:
$$x_j^l = f\Big(\sum_{m \in M_j} x_m^{l-1} * k_{mj}^l + b_j^l\Big)$$
where $l$ denotes the $l$-th convolutional layer; $j$ denotes the $j$-th feature map of the current convolutional layer; $x_m^{l-1}$ denotes a feature map of the $(l-1)$-th convolutional layer; $k_{mj}^l$ denotes the $m$-th convolution kernel of the $j$-th feature map of the $l$-th convolutional layer; $M_j$ denotes the set of all convolution kernels corresponding to the $j$-th feature map; the symbol $*$ denotes the convolution operation; $b_j^l$ denotes the bias vector parameter of the $j$-th feature map of the $l$-th convolutional layer; and $f(\cdot)$ denotes the activation function.
3. The multi-target detection method according to claim 1, wherein the step S2 includes the following sub-steps:
s21, inputting the original image into a superposed basic convolution operation unit, wherein two basic convolution operation units and one basic pooling unit are used as a convolution block unit, five convolution block units with the same structure are used for cascading, and after the cascading, a characteristic map of the original image is output;
s22, inputting the feature map into the GAP layer, and outputting a one-dimensional vector, wherein elements in the one-dimensional vector are the feature matrix average value of each channel in the feature map; calculating the weighted sum of all values in the one-dimensional vector, and solving an activation function layer of the class probability;
s23, carrying out weighted summation on the output characteristics of the last layer of convolution lumps, and solving an activation map based on the category, wherein the formula is as follows:
wherein f isk(x, y) represents the activation value of the last layer of convolution blob output features at its coordinates (x, y) for the kth cell in the feature vector;representing the weight corresponding to each unit k for each class c, namely the importance of the unit k for the class c;
S24, the class-based activation map obtained in the above steps is scaled to the same size as the original image, the correlation between the activation regions and the class is compared, and the coordinate point with the highest local correlation is calculated:

c_i = max( g(x_0, y_0), g(x_1, y_1), ..., g(x_N, y_N) )

wherein g(·) represents the pixel value at a location, (x_i, y_i) is a coordinate point within the local activation region, and c_i represents the correlation of the i-th local activation region;
and outputting the obtained coordinate point with the highest local correlation as a coordinate point set of the first batch of positioning information.
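A hedged sketch of S24, assuming the class activation map is rescaled with bilinear interpolation and that the local activation regions are approximated by simple thresholding; the threshold value and the single-peak simplification are assumptions:

```python
# Hedged sketch of S24: rescale the activation map to image size and pick the
# highest-valued coordinate of the thresholded map as a stand-in for the per-region maxima.
import torch
import torch.nn.functional as F

def first_batch_locations(cam, image_size, threshold=0.6):
    # cam: (H, W) activation map for one class; image_size: (H_img, W_img)
    cam = F.interpolate(cam[None, None], size=image_size,
                        mode='bilinear', align_corners=False)[0, 0]
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)
    mask = cam > threshold                 # crude stand-in for the local activation regions
    points = mask.nonzero()                # candidate coordinate points as (y, x)
    if points.numel() == 0:
        return []
    best = points[cam[mask].argmax()]      # c_i = max g(x_i, y_i) over the region
    return [(int(best[1]), int(best[0]))]  # returned as (x, y)
```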
4. The multi-target detection method according to claim 1, wherein the double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context feature as input, continuously explores the optimal positioning through the double-layer cyclic emission network under the optimization of the back propagation algorithm, and finally outputs a fixed amount of positioning information; the specific steps are as follows:
S51, using each image as the processing unit, taking the t-th batch of positioning information L_t = ((x_0, y_0), (x_1, y_1), ..., (x_m, y_m)), extracting the high-dimensional vectors within the corresponding fixed range (2 × 2) on the basic feature map, and processing the high-dimensional vectors into a fixed-dimension positioning feature tensor P_t through vector operations;
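A minimal sketch of S51 under the assumption that the positioning feature tensor P_t is built by cropping and flattening 2 × 2 windows of the basic feature map at the given points:

```python
# Hedged sketch of S51: crop a fixed 2x2 window per positioning point and flatten into P_t.
import torch

def positioning_feature_tensor(feature_map, points, window=2):
    # feature_map: (C, H, W); points: list of (x, y) coordinates on the feature map grid
    C, H, W = feature_map.shape
    crops = []
    for x, y in points:
        x0 = max(0, min(int(x), W - window))
        y0 = max(0, min(int(y), H - window))
        crops.append(feature_map[:, y0:y0 + window, x0:x0 + window].reshape(-1))
    return torch.stack(crops)    # P_t: (m+1, C * window * window)
```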
S52, inputting the positioning feature tensor into a convolution layer, a regularization layer and an excitation function layer, and outputting an activated positioning feature tensor, wherein the formula is as follows:

P_t_active = ReLU(BN(Conv2d(P_t)))

wherein ReLU(x) = x when x > 0, and ReLU(x) = 0 when x ≤ 0;
inputting the positioning information L_t into a convolution layer, a regularization layer and an excitation function layer, and outputting an activated positioning information tensor, wherein the formula is as follows:

L_t_active = ReLU(BN(Conv2d(L_t)))

carrying out tensor multiplication on the two tensors to obtain the focusing feature tensor, wherein the formula is as follows:

G_t = P_t_active ⊗ L_t_active
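A hedged PyTorch sketch of S52, assuming 1 × 1 convolutions and element-wise multiplication for the tensor product; the channel sizes are illustrative:

```python
# Hedged sketch of S52: Conv2d -> BatchNorm -> ReLU on both inputs, then element-wise product.
import torch.nn as nn

class FocusFusion(nn.Module):
    def __init__(self, feat_ch, loc_ch, out_ch=64):
        super().__init__()
        self.feat_branch = nn.Sequential(nn.Conv2d(feat_ch, out_ch, 1),
                                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.loc_branch = nn.Sequential(nn.Conv2d(loc_ch, out_ch, 1),
                                        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, p_t, l_t):
        # p_t: (N, feat_ch, h, w) positioning feature tensor; l_t: (N, loc_ch, h, w) positioning info
        p_active = self.feat_branch(p_t)   # P_t_active = ReLU(BN(Conv2d(P_t)))
        l_active = self.loc_branch(l_t)    # L_t_active = ReLU(BN(Conv2d(L_t)))
        return p_active * l_active         # element-wise tensor multiplication -> G_t
```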
S53, in the loop operation, one loop unit corresponds to one time step; the specific implementation is as follows:
S531, if it is the first time step of the loop operation, initializing the hidden state of the first-layer convolutional LSTM structure with a zero vector; otherwise, inputting the focusing feature tensor and the hidden state of the previous time step into the first-layer convolutional LSTM structure encoder e(·), and outputting the new hidden state of the encoder, wherein the formula is as follows:

(h_t^e, c_t^e) = e(G_t, h_{t-1}^e, c_{t-1}^e)

wherein h_t^e represents the new hidden state of the encoder e at time t; c_t^e represents the new cell state of the encoder e at time t, a quantity defined in the LSTM network structure itself that stores the hidden information valid in long-term memory; G_t represents the focusing feature tensor at time t;
inputting the hidden state h_t^e of the convolutional LSTM structure encoder into a cascaded convolution network ec(·) and a linear classifier, and outputting the classification probability of the focused region:

V = W_2 · (W_1 · ec(h_t^e) + b_1) + b_2
Prob_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)

wherein V represents the output classification score; W_1 denotes the first weight parameter and W_2 the second weight parameter; b_1 denotes the first bias parameter and b_2 the second bias parameter; Prob_i is the probability of class i; V_i is the output of the i-th unit at the front stage of the classifier; and C represents the total number of classes;
This completes the definition of the output of the first layer;
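A hedged sketch of the S531 encoder step; for brevity the convolutional LSTM is approximated here by a plain nn.LSTMCell over flattened focusing features, and all dimensions and the depth of the ec network are assumptions:

```python
# Hedged sketch of S531: encoder step plus classifier over the focused region.
import torch
import torch.nn as nn

class EncoderStep(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256, num_classes=20):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)   # stands in for the ConvLSTM encoder e(.)
        self.ec = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())  # cascaded network ec(.)
        self.classifier = nn.Linear(hidden_dim, num_classes)                   # linear classifier

    def forward(self, g_t, state=None):
        # g_t: (N, feat_dim) flattened focusing feature tensor at time t
        if state is None:                               # first time step: zero-vector initialisation
            h = g_t.new_zeros(g_t.size(0), self.cell.hidden_size)
            state = (h, h.clone())
        h_t, c_t = self.cell(g_t, state)                # (h_t^e, c_t^e) = e(G_t, h_{t-1}^e, c_{t-1}^e)
        v = self.classifier(self.ec(h_t))               # V
        probs = torch.softmax(v, dim=1)                 # Prob_i = exp(V_i) / sum_j exp(V_j)
        return probs, (h_t, c_t)
```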
S532, for the second-layer convolutional LSTM structure decoder d(·): if it is the initial time step of the loop, the decoder takes the context feature map as the initialization value; otherwise, the decoder takes h_t^e and the hidden state of this layer at the previous time step as input, and outputs the new hidden state of the decoder, wherein the formula is as follows:

(h_t^d, c_t^d) = d(h_t^e, h_{t-1}^d, c_{t-1}^d)

inputting the hidden state h_t^d of the convolutional LSTM structure decoder into a linear regressor el(·), outputting the two-dimensional coordinates of the attention position at the next time step, and storing the coordinates into the positioning cache pool, wherein the formula is as follows:

(x_{t+1}, y_{t+1}) = el(h_t^d)
This completes the definition of the output of the second layer;
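A matching hedged sketch of the S532 decoder step, again approximating the convolutional LSTM with nn.LSTMCell; the cache pool is modelled as a plain Python list and the dimensions are assumptions:

```python
# Hedged sketch of S532: decoder step plus linear regressor for the next attention position.
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, enc_dim=256, hidden_dim=256):
        super().__init__()
        self.cell = nn.LSTMCell(enc_dim, hidden_dim)   # stands in for the ConvLSTM decoder d(.)
        self.el = nn.Linear(hidden_dim, 2)             # linear regressor el(.) -> (x_{t+1}, y_{t+1})

    def forward(self, h_enc, state, cache_pool):
        # h_enc: (N, enc_dim) encoder hidden state h_t^e
        # state: (h_{t-1}^d, c_{t-1}^d); at the first step it is built from the context feature
        h_t, c_t = self.cell(h_enc, state)             # (h_t^d, c_t^d) = d(h_t^e, h_{t-1}^d, c_{t-1}^d)
        coords = self.el(h_t)                          # two-dimensional attention coordinates
        cache_pool.append(coords)                      # store into the positioning cache pool
        return coords, (h_t, c_t)
```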
S533, within the current time step, the double-layer network combines the hidden information of past time steps with the current information: the local classification error of the image is calculated with the cross-entropy method, and the positioning error of the next time step is calculated with the mean-square-error method, wherein the formulas are as follows:

loss_cls = − Σ_c 1[c = gt] · log(Prob_c)
loss_loc = (1/m) Σ_i (y_i − ŷ_i)²

wherein gt represents the labeled class, y_i represents the labeled coordinates, and ŷ_i represents the predicted value output for the current labeled coordinate; in the process of calculating the loss function, the loss of each image at each time step is summed and the average value is taken as the final loss.
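A minimal sketch of the S533 loss terms, assuming per-time-step lists of softmax outputs and predicted coordinates; the function and argument names are assumptions and the averaging follows the text above:

```python
# Hedged sketch of S533: cross entropy for classification, mean square error for positioning.
import torch
import torch.nn.functional as F

def emission_losses(probs_per_step, gt_class, coords_per_step, gt_coords):
    # probs_per_step / coords_per_step: lists of per-time-step predictions, each (N, C) / (N, 2)
    cls_losses = [F.nll_loss(torch.log(p + 1e-8), gt_class) for p in probs_per_step]
    loc_losses = [F.mse_loss(c, gt_coords) for c in coords_per_step]
    # sum the loss of each image at each time step and take the average as the final loss
    return torch.stack(cls_losses).mean() + torch.stack(loc_losses).mean()
```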
And S534, looping steps S531 to S533, and taking the final loss and the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
5. The multi-target detection method according to claim 1, wherein the regional suggestion network takes the basic feature map, the target frame label values and the positioning information in the positioning cache pool as input, improves the RPN method according to the positioning information (abbreviated as LRPN), and then outputs the coordinates of a fixed number of suggestion candidate frames and the intra-frame prediction results, together with two loss functions;
the specific implementation mode is as follows:
S71, inputting the basic feature map into the convolution network and the activation network, and outputting an activation feature map;
S72, introducing anchor frame rules and setting A anchor frames for each spatial position on the activation feature map; performing a convolution operation with stride 1 and a 1×1 convolution kernel on the activation feature map, and outputting a score tensor with 2×A channels, wherein the channels of the tensor represent the class prediction probability scores of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activation feature map; performing a convolution operation with stride 1 and a 1×1 convolution kernel on the LRPN activation feature map, and outputting a coordinate offset tensor with 4×A channels, wherein the channels of the tensor represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activation feature map, used for solving the optimal predicted coordinates;
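A minimal PyTorch sketch of the S71 and S72 heads; the shared 3×3 convolution, the channel counts and the anchor count A = 9 are assumptions:

```python
# Hedged sketch of S71-S72: shared conv + ReLU, then 1x1 convs for the 2A-channel score
# tensor and the 4A-channel coordinate-offset tensor (A anchors per spatial position).
import torch.nn as nn

class LRPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(in_channels, mid_channels, 3, padding=1),
                                    nn.ReLU(inplace=True))        # activation feature map
        self.score = nn.Conv2d(mid_channels, 2 * num_anchors, 1)  # class scores per anchor
        self.offset = nn.Conv2d(mid_channels, 4 * num_anchors, 1) # predicted coordinate offsets

    def forward(self, base_feature_map):
        x = self.shared(base_feature_map)
        return self.score(x), self.offset(x)
```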
S73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, specifically comprising the following steps:
S731, screening out the corresponding effective anchor frames in the tensor according to the positioning information, and clipping the effective anchor frames that exceed the image boundary;
S732, sorting the anchor frames and the corresponding score tensors by score, and taking the first N, wherein N is a hyper-parameter;
S733, screening the score tensors with the non-maximum suppression method, and taking the first M score tensors, sorted by size, from the remaining ones as the first suggestion candidate frame set;
S734, setting the labels of all anchor frames that do not meet the following conditions to −1:
a. an anchor frame corresponding to the positioning information;
b. anchor frames that do not exceed the image boundaries;
modifying each anchor frame based on the predicted coordinate offset, comparing the modified anchor frame coordinates with the labels of the target frames in the image, selecting the F anchor frames whose largest overlapping rate is greater than a positive threshold and setting their labels to 1; selecting the B anchor frames whose largest overlapping rate is smaller than a negative threshold and setting their labels to 0; wherein the F value and the B value are set according to the hyper-parameters in the Faster RCNN method;
S735, removing the anchor frames labeled −1, and computing the loss functions of the remaining anchor frames to obtain the loss lrpn_cls between the anchor frame class prediction and the class label and the loss lrpn_bbox between the anchor frame coordinate prediction and the coordinate label, and outputting the first suggestion candidate frames;
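A hedged sketch of the S731 to S733 filtering, assuming torchvision NMS and a boolean validity mask derived from the positioning information; N, M and the IoU threshold are treated as hyper-parameters:

```python
# Hedged sketch of S731-S733: keep anchors selected by the positioning information and inside
# the image, sort by score, keep top N, apply NMS, keep top M as first suggestion candidates.
import torch
from torchvision.ops import nms

def lrpn_propose(anchors, scores, valid_mask, image_size, top_n=2000, top_m=300, iou_thr=0.7):
    # anchors: (K, 4) boxes; scores: (K,) foreground scores; valid_mask: (K,) bool from positioning info
    h, w = image_size
    inside = (anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) & \
             (anchors[:, 2] <= w) & (anchors[:, 3] <= h)     # drop anchors beyond the image boundary
    keep = valid_mask & inside
    boxes, scr = anchors[keep], scores[keep]
    order = scr.argsort(descending=True)[:top_n]             # top N by score (N is a hyper-parameter)
    boxes, scr = boxes[order], scr[order]
    kept = nms(boxes, scr, iou_thr)[:top_m]                  # non-maximum suppression, then top M
    return boxes[kept], scr[kept]                            # first suggestion candidate frames
```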
S736, screening and refining the first suggestion candidate frame set, i.e., inputting the first suggestion candidate frame set into the suggestion-target module, which specifically operates as follows:
S7361, for any frame coordinate in the first suggestion candidate frame set, traversing the frame coordinates of all labels and selecting the one with the largest overlapping rate as the corresponding label frame; if the overlapping rate between the label frame and the candidate frame is greater than a threshold, the candidate frame is regarded as foreground, and if it is smaller than the threshold, the candidate frame is regarded as background;
S7362, setting a fixed number of foreground and background candidates for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as the second candidate frame set;
S7363, calculating the offsets between the coordinates of the second candidate frames and the coordinates of the corresponding label frames, and taking the offsets and the second candidate frame set as the output of the module.
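A hedged sketch of the S7361 to S7363 suggestion-target logic, assuming torchvision IoU, random sampling for the fixed foreground/background quota, and a standard (dx, dy, dw, dh) offset parameterisation; the thresholds and sample sizes are assumptions:

```python
# Hedged sketch of S7361-S7363: IoU matching, fg/bg sampling, and offsets to the label frames.
import torch
from torchvision.ops import box_iou

def proposal_target(candidates, gt_boxes, fg_thr=0.5, num_fg=32, num_bg=96):
    iou = box_iou(candidates, gt_boxes)                  # (num_candidates, num_gt)
    best_iou, best_gt = iou.max(dim=1)                   # label frame with the largest overlap
    fg_idx = torch.nonzero(best_iou >= fg_thr).flatten()
    bg_idx = torch.nonzero(best_iou < fg_thr).flatten()
    fg_idx = fg_idx[torch.randperm(fg_idx.numel())[:num_fg]]   # sample a fixed number of foreground
    bg_idx = bg_idx[torch.randperm(bg_idx.numel())[:num_bg]]   # and background candidates
    keep = torch.cat([fg_idx, bg_idx])
    second = candidates[keep]                            # second suggestion candidate frame set
    matched = gt_boxes[best_gt[keep]]
    # coordinate offsets between the second candidate frames and their label frames
    cw, ch = second[:, 2] - second[:, 0], second[:, 3] - second[:, 1]
    gw, gh = matched[:, 2] - matched[:, 0], matched[:, 3] - matched[:, 1]
    dx = ((matched[:, 0] + gw / 2) - (second[:, 0] + cw / 2)) / cw
    dy = ((matched[:, 1] + gh / 2) - (second[:, 1] + ch / 2)) / ch
    dw, dh = torch.log(gw / cw), torch.log(gh / ch)
    return second, torch.stack([dx, dy, dw, dh], dim=1)
```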
CN201910881579.XA 2019-09-18 2019-09-18 Multi-target detection method Active CN110610210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910881579.XA CN110610210B (en) 2019-09-18 2019-09-18 Multi-target detection method

Publications (2)

Publication Number Publication Date
CN110610210A true CN110610210A (en) 2019-12-24
CN110610210B CN110610210B (en) 2022-03-25

Family

ID=68891598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910881579.XA Active CN110610210B (en) 2019-09-18 2019-09-18 Multi-target detection method

Country Status (1)

Country Link
CN (1) CN110610210B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250812A (en) * 2016-07-15 2016-12-21 汤平 A kind of model recognizing method based on quick R CNN deep neural network
US20190228276A1 (en) * 2018-01-19 2019-07-25 Arcules Inc. License plate reader using optical character recognition on plural detected regions
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN108717693A (en) * 2018-04-24 2018-10-30 浙江工业大学 A kind of optic disk localization method based on RPN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109359684A (en) * 2018-10-17 2019-02-19 苏州大学 Fine granularity model recognizing method based on Weakly supervised positioning and subclass similarity measurement
CN109523015A (en) * 2018-11-09 2019-03-26 上海海事大学 Image processing method in a kind of neural network
CN109344815A (en) * 2018-12-13 2019-02-15 深源恒际科技有限公司 A kind of file and picture classification method
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
CN110097136A (en) * 2019-05-09 2019-08-06 杭州筑象数字科技有限公司 Image classification method neural network based

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REN SHAOQING ET AL: "Faster R-CNN: towards real-time object detection with region proposal networks", Proc. of Advances in Neural Information Processing Systems *
LI XUDONG ET AL: "A survey of object detection research based on convolutional neural networks" (基于卷积神经网络的目标检测研究综述), Application Research of Computers *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583204A (en) * 2020-04-27 2020-08-25 天津大学 Organ positioning method of two-dimensional sequence magnetic resonance image based on network model
CN111583204B (en) * 2020-04-27 2022-10-14 天津大学 Organ positioning method of two-dimensional sequence magnetic resonance image based on network model
CN111723852A (en) * 2020-05-30 2020-09-29 杭州迪英加科技有限公司 Robust training method for target detection network
CN111723852B (en) * 2020-05-30 2022-07-22 杭州迪英加科技有限公司 Robust training method for target detection network
CN111986126A (en) * 2020-07-17 2020-11-24 浙江工业大学 Multi-target detection method based on improved VGG16 network
CN113065650A (en) * 2021-04-02 2021-07-02 中山大学 Multichannel neural network method for long-term memory learning
CN113065650B (en) * 2021-04-02 2023-11-17 中山大学 Multichannel neural network instance separation method based on long-term memory learning
CN113298094A (en) * 2021-06-10 2021-08-24 安徽大学 RGB-T significance target detection method based on modal association and double-perception decoder
CN113298094B (en) * 2021-06-10 2022-11-04 安徽大学 RGB-T significance target detection method based on modal association and double-perception decoder
CN113822172A (en) * 2021-08-30 2021-12-21 中国科学院上海微系统与信息技术研究所 Video spatiotemporal behavior detection method

Also Published As

Publication number Publication date
CN110610210B (en) 2022-03-25

Similar Documents

Publication Publication Date Title
CN110610210B (en) Multi-target detection method
Zhu et al. Online multi-object tracking with dual matching attention networks
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN114202672A (en) Small target detection method based on attention mechanism
Francies et al. A robust multiclass 3D object recognition based on modern YOLO deep learning algorithms
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN108764019A (en) A kind of Video Events detection method based on multi-source deep learning
Nandhini et al. Object Detection Algorithm Based on Multi-Scaled Convolutional Neural Networks
CN115829991A (en) Steel surface defect detection method based on improved YOLOv5s
CN110008844A (en) A kind of long-term gesture tracking method of KCF merging SLIC algorithm
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN112149665A (en) High-performance multi-scale target detection method based on deep learning
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
Zeng et al. A novel tensor decomposition-based efficient detector for low-altitude aerial objects with knowledge distillation scheme
Kiruba et al. Hexagonal volume local binary pattern (H-VLBP) with deep stacked autoencoder for human action recognition
Sun et al. FBoT-Net: Focal bottleneck transformer network for small green apple detection
Wang et al. Based on the improved YOLOV3 small target detection algorithm
Wang et al. Non-local attention association scheme for online multi-object tracking
Hu et al. Automatic detection of pecan fruits based on Faster RCNN with FPN in orchard
Li et al. Research on YOLOv3 pedestrian detection algorithm based on channel attention mechanism
Chang et al. Deep Learning Approaches for Dynamic Object Understanding and Defect Detection
Ajith et al. Pedestrian detection: performance comparison using multiple convolutional neural networks
Wu et al. Real-time visual tracking via incremental covariance model update on Log-Euclidean Riemannian manifold
Wen et al. A Lightweight ST-YOLO Based Model for Detection of Tea Bud in Unstructured Natural Environments.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant