CN110610210A - Multi-target detection method - Google Patents

Multi-target detection method

Info

Publication number
CN110610210A
CN110610210A (application CN201910881579.XA)
Authority
CN
China
Prior art keywords
frame
layer
convolution
positioning information
activation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910881579.XA
Other languages
Chinese (zh)
Other versions
CN110610210B (en)
Inventor
吕乔
叶茂
窦强
李鑫鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910881579.XA priority Critical patent/CN110610210B/en
Publication of CN110610210A publication Critical patent/CN110610210A/en
Application granted granted Critical
Publication of CN110610210B publication Critical patent/CN110610210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/245Classification techniques relating to the decision surface
    • G06F18/2451Classification techniques relating to the decision surface linear, e.g. hyperplane
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target detection method, which comprises the following steps: S1, extracting a basic feature map and a context feature map; S2, capturing fuzzy activation regions in the real-time image and taking their coordinate information as the first batch of positioning information; S3, setting the cycle counter n = 1; S4, taking the coordinate pairs of the n-th batch of positioning information as centers, acquiring local feature matrices of fixed areas around those centers on the basic feature map; S5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module; S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached, then outputting all positioning information; S7, inputting all positioning information into a region suggestion network; S8, looping over steps S1 to S7 and summing all errors. The invention outputs positioning information through the predefined double-layer cyclic convolution emission module, thereby obtaining the approximate position of each target object in the image and greatly reducing the amount of computation per feature point.

Description

Multi-target detection method
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a method for detecting an image target in the field of computers.
Background
Nowadays, high-speed parallel computing architectures, represented by the NVIDIA series, are developing rapidly, and their products have evolved from DirectX-era graphics platforms into widely accessible parallel computing devices such as the GTX 1080 Ti. Under this trend, fields that require abundant computing resources have advanced quickly, with image processing technology leading the way and driving progress in intelligent systems, monitoring, security and many other areas. In addition, related hardware in the field of real-time image perception is also developing, such as infrared cameras and monocular and binocular cameras among peripheral devices; this perception hardware is gradually converging toward structures that better match the human visual system, which facilitates image processing in software algorithms. With the dual support of image perception modules and embedded computing systems, how to apply more innovative and ergonomic image analysis techniques to mobile intelligent machines has become a challenging frontier topic spanning software, hardware and multiple disciplines.
In recent years, thanks to the rapid development of hardware systems, many higher-performance real-time image analysis and processing methods have emerged, and an important problem among them is the real-time detection of multiple targets in an image. Many mature multi-target detection methods already exist in industry. In traditional machine learning, target detection is generally divided into three steps: brute-force extraction of candidate regions, hand-crafted feature extraction, and classification with the fast AdaBoost algorithm or an SVM with strong generalization ability.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-target detection method which outputs positioning information through a predefined double-layer cyclic convolution emission module, thereby obtaining the approximate position of each target object in the image, greatly reducing the amount of computation per feature point, avoiding the anchoring and computation performed at every position in the Faster R-CNN method, and allowing detection to better meet the speed requirements of real-time operation.
The purpose of the invention is realized by the following technical scheme: a multi-target detection method comprises the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image, and inputting the real-time image into a context integration network to obtain a context feature map;
s2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all the activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
s3, setting the cycle number n to 1;
s4, taking the coordinate pair of the nth batch of positioning information as a center, acquiring a local feature matrix of a fixed area near the center position on the basic feature map, and filtering and pooling the local feature matrix to obtain a focusing feature;
S5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module, and outputting two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)-th batch of positioning information, which is written into the positioning cache pool; the positioning cache pool is maintained globally, and the positioning information output by the current loop body is injected into it after each loop iteration finishes;
S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output together with two error values: the first error value is the error between the prediction category near each positioning and the label category, and the second error value is the error between the positioning information and the coordinates of the labelled target frame;
S7, inputting all positioning information in the positioning cache pool into the region suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frames, and the second error value is the error between the prediction categories of the first suggestion candidate boxes and the real categories of the target frames;
inputting the first suggestion candidate boxes into a suggestion-target module, screening and refining the first suggestion candidate box set, and outputting the second suggestion candidate boxes, the category label corresponding to each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method and outputting, through the pooling operation, final interest features of consistent size; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the prediction category and predicted frame coordinates for the candidate frame corresponding to each final interest feature, and generating two error values: the first error is the error between the prediction category and the label category, and the second error is the error between the coordinates of the predicted frame and the label coordinates;
and S8, looping steps S1 to S7, summing all errors, performing a back propagation algorithm, and iteratively updating each weight parameter in the network.
Further, the context synthesis network is formed by stacking basic convolution operation units, where a single convolution operation unit is expressed as:
$$x_j^l = f\Big(\sum_{m \in M_j} x_m^{l-1} * k_{mj}^l + b_j^l\Big)$$
where $l$ denotes the $l$-th convolutional layer; $j$ denotes the $j$-th feature map of the current convolutional layer; $x_m^{l-1}$ denotes a feature map of the $(l-1)$-th convolutional layer; $k_{mj}^l$ denotes the $m$-th convolution kernel of the $j$-th feature map of the $l$-th convolutional layer; $M_j$ denotes the set of all convolution kernels corresponding to the $j$-th feature map; the symbol $*$ denotes the convolution operation; $b_j^l$ denotes the bias vector parameter of the $j$-th feature map of the $l$-th convolutional layer; and $f(\cdot)$ denotes the activation function.
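A minimal sketch (assumed PyTorch implementation, not taken from the patent) of the basic convolution operation unit described by the formula above: one convolution holding the kernels and bias, followed by an activation f(·).

```python
import torch
import torch.nn as nn

class BasicConvUnit(nn.Module):
    """One basic convolution operation unit: x^l = f(conv(x^{l-1}) + b)."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # The convolution holds the kernels k^l and bias b^l of the formula.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.act = nn.ReLU(inplace=True)  # f(·): activation function

    def forward(self, x):
        # x: feature maps of layer l-1; returns feature maps of layer l.
        return self.act(self.conv(x))
```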
Further, the step S2 includes the following sub-steps:
S21, inputting the original image into stacked basic convolution operation units, where two basic convolution operation units and one basic pooling unit form a convolutional block unit; five convolutional block units with the same structure are cascaded, and after the cascade a feature map of the original image is output;
S22, inputting the feature map into the GAP layer and outputting a one-dimensional vector, where each element of the one-dimensional vector is the mean of the feature matrix of one channel of the feature map; computing the weighted sum of all values in the one-dimensional vector and passing it through an activation function layer to obtain the class probabilities;
S23, performing a weighted summation over the output features of the last convolutional block to obtain the class-based activation map, with the formula:
$$M_c(x, y) = \sum_k w_k^c f_k(x, y)$$
where $f_k(x, y)$ denotes the activation value at coordinates $(x, y)$ of the $k$-th unit (channel) of the output features of the last convolutional block; and $w_k^c$ denotes the weight of unit $k$ for class $c$, i.e. the importance of unit $k$ for class $c$;
S24, scaling the class-based activation map obtained above to the same size as the original image, comparing the correlation between each activation region and the class, and computing the coordinate point with the highest local correlation:
$$c_i = \max\big(g(x_0, y_0), g(x_1, y_1), \ldots, g(x_N, y_N)\big)$$
where $g(\cdot)$ denotes the pixel value at a position, $(x_i, y_i)$ is a coordinate point within the local activation region, and $c_i$ denotes the correlation of the $i$-th local activation region;
and outputting the obtained coordinate points with the highest local correlation as the coordinate point set of the first batch of positioning information.
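A hedged sketch of extracting the first batch of positioning coordinates from a class-based activation map. The function name, the peak-picking scheme (global top-k activations rather than strictly region-wise maxima), and the number of points are illustrative assumptions, not the patent's exact procedure.

```python
import torch
import torch.nn.functional as F

def first_batch_locations(cam, image_size, num_points=4):
    """cam: (H, W) class activation map; returns num_points (x, y) coordinate pairs."""
    cam = cam.unsqueeze(0).unsqueeze(0)                       # (1, 1, H, W)
    cam = F.interpolate(cam, size=image_size, mode="bilinear",
                        align_corners=False).squeeze()        # scale to original image size
    flat = cam.flatten()
    idx = torch.topk(flat, num_points).indices                # highest activation values
    ys, xs = idx // cam.shape[1], idx % cam.shape[1]
    return list(zip(xs.tolist(), ys.tolist()))                # first-batch coordinate set
```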
Further, the double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context features as input, continuously explores the optimal positioning under the optimization of the back-propagation algorithm through a double-layer recurrent emission network, and finally outputs a fixed amount of positioning information; the specific steps are as follows:
S51, taking each image as the processing unit and the $t$-th batch of positioning information $L_t = ((x_0, y_0), (x_1, y_1), \ldots, (x_m, y_m))$, extracting the high-dimensional vectors within a fixed range (2 × 2) around each coordinate on the basic feature map, and processing them through vector operations into a fixed-dimension localization feature tensor $P_t$;
S52, inputting the localization feature tensor into a convolution layer, a regularization layer and an excitation function layer, and outputting the activated localization feature tensor:
$$P_{t\_active} = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv2d}(P_t)))$$
where $\mathrm{RELU}(x) = x$ when $x > 0$ and $\mathrm{RELU}(x) = 0$ when $x \le 0$; $\mathrm{BN}(\cdot)$ is the Batch-Normalization layer from deep learning, whose main function here is to prevent the network from overfitting; and $\mathrm{Conv2d}(\cdot)$ is a deep-learning network layer whose main function is to extract image features using convolution operations;
the positioning information $L_t$ is likewise input into a convolution layer, a regularization layer and an excitation function layer, and the activated positioning information tensor is output:
$$L_{t\_active} = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv2d}(L_t)))$$
the two tensors are then multiplied to obtain the focusing feature tensor:
$$G_t = P_{t\_active} \otimes L_{t\_active}$$
S53, in the loop operation, one loop unit corresponds to one time step, implemented as follows:
S531, if this is the first time step of the loop operation, the hidden state of the first-layer convolutional LSTM structure is initialized with a zero vector; otherwise, the focusing feature tensor and the hidden state of the previous time step are input into the first-layer convolutional LSTM encoder $e(\cdot)$, and the new hidden state of the encoder is output:
$$h_t^e,\; c_t^e = e\big(G_t,\; h_{t-1}^e,\; c_{t-1}^e\big)$$
where $h_t^e$ denotes the new hidden state of the encoder $e$ at time $t$; $c_t^e$ denotes the new cell state of the encoder $e$ at time $t$, which comes from a step defined in the existing LSTM network structure and stores hidden information that is valid in long-term memory; and $G_t$ denotes the focusing feature tensor at time $t$;
the hidden state $h_t^e$ of the convolutional LSTM encoder is input into a cascaded convolution network $ec(\cdot)$ and a linear classifier, and the classification probability of the focusing region is output:
$$V = W_2\big(W_1 \cdot ec(h_t^e) + b_1\big) + b_2, \qquad Prob_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$
where $V$ denotes the output classification score vector; $W_1$ and $W_2$ denote the first and second weight parameters; $b_1$ and $b_2$ denote the first and second bias parameters; $Prob_i$ is the probability of class $i$; $V_i$ is the output of the $i$-th unit at the front stage of the classifier; and $C$ denotes the total number of categories;
the output definition of the above first layer ends;
S532, in the second-layer convolutional LSTM decoder $d(\cdot)$: if this is the initial time step of the loop, the decoder takes the context feature map as its initialization value; otherwise the decoder takes $h_t^e$ and the hidden state of this layer at the previous time step as input and outputs the new hidden state of the decoder:
$$h_t^d,\; c_t^d = d\big(h_t^e,\; h_{t-1}^d,\; c_{t-1}^d\big)$$
the hidden state $h_t^d$ of the convolutional LSTM decoder is input into a linear regressor $el(\cdot)$, the two-dimensional coordinates of the attention position at the next time step are output, and the coordinates are stored into the positioning cache pool:
$$l_{t+1} = el(h_t^d)$$
the definition of the output of the above second layer is finished;
S533, in the current time step, the double-layer network combines the hidden information from past time steps with the current information and computes the local classification error of the image using the cross-entropy method; within the same time step it likewise computes the localization error for the next time step using the mean-squared-error method:
$$loss_{cls}^t = -\log\big(Prob_{gt}\big), \qquad loss_{loc}^t = \frac{1}{m}\sum_i \big(y_i - \hat{y}_i\big)^2$$
where $gt$ denotes the label category, $y_i$ denotes the annotated coordinates, and $\hat{y}_i$ denotes the predicted output for the corresponding annotated coordinate; when computing the loss function, the losses of each image at every time step are summed and their average is taken as the final loss.
S534, repeating steps S531 to S533, and taking the final loss and the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
Further, the regional suggestion network takes the basic feature map, the mark value of the target frame and the positioning information of the positioning cache pool as input, improves the RPN method according to the positioning information, which is abbreviated as LRPN, and then outputs the coordinates of a fixed number of suggestion candidate frames and the intra-frame prediction result, and outputs two loss functions;
the specific implementation mode is as follows:
S71, inputting the basic feature map into the convolution network and the activation network, and outputting an activated feature map;
S72, introducing anchor frame rules and setting A anchor frames for each spatial position on the activation map; performing a convolution with stride 1 and a 1 × 1 kernel on the activated feature map and outputting a score tensor with 2 × A channels, where the channels represent the class prediction probability scores of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map; performing another convolution with stride 1 and a 1 × 1 kernel on the LRPN activated feature map and outputting a coordinate offset tensor with 4 × A channels, where the channels represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map, used to solve for the optimal predicted coordinates;
S73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, with the following steps:
S731, screening out the valid anchor frames in the tensors according to the positioning information, and discarding valid anchor frames that extend beyond the image boundary;
S732, sorting the anchor frames and their corresponding scores by score and keeping the first N, where N is a hyper-parameter;
S733, screening the remaining candidates with non-maximum suppression, and taking the first M of the remaining entries sorted by score as the first suggestion candidate frame set;
S734, setting the label of every anchor frame that does not meet the following conditions to -1:
a. anchor frames corresponding to the positioning information;
b. anchor frames that do not exceed the image boundary;
modifying each anchor frame according to the predicted coordinate offset, comparing the modified anchor frame coordinates with the labelled target frames in the image, selecting the F anchor frames with the largest overlap ratio greater than the positive threshold and setting their labels to 1; selecting the B anchor frames with the largest overlap ratio smaller than the negative threshold and setting their labels to 0; where the F value and the B value are set according to the hyper-parameters of the Faster R-CNN method;
S735, removing the anchor frames labelled -1 and evaluating the loss functions over the remaining anchor frames to obtain the loss lrpn_cls between the anchor frame category predictions and the category labels and the loss lrpn_bbox between the anchor frame coordinate predictions and the coordinate labels, and outputting the first suggestion candidate boxes;
s736, screening and refining the first suggestion candidate box set, i.e., inputting the first suggestion candidate box set into a suggestion-target module, specifically operating as follows:
s7361, traversing the frame coordinates of all the labels for any frame coordinate in the first suggested candidate frame set, selecting the frame coordinate with the largest overlapping rate as a corresponding label frame, if the overlapping rate of the label frame and the candidate frame is greater than a threshold value, considering the candidate frame as a foreground, and if the overlapping rate of the label frame and the candidate frame is less than the threshold value, considering the candidate frame as a background;
s7362, setting a fixed number of foreground and background for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as a second candidate frame set;
s7363, calculating the offset between the coordinates of the second candidate frame and the coordinates of the corresponding label frame, and using the offset and the second candidate frame set as the output of the module.
The invention has the beneficial effects that: a number of cascaded modules are set up; a convolution activation network extracts activation positions based on discriminative features in the image as the initial values input into the double-layer cyclic convolution emission module; features used for training are then extracted by the convolutional deep network and the context synthesis network; the double-layer cyclic convolution emission module obtains a positioning information set based on visual attention; the region suggestion network obtains the second suggestion candidate boxes based on the positioning information; and finally the ROI pooling module and the RCNN module predict the category and coordinates of the features in the suggestion candidate boxes. The advantage of the algorithm is that positioning information can be output by the predefined double-layer cyclic convolution emission module, so that the approximate position of each target object in the image is obtained, the amount of computation per feature point is greatly reduced, the anchoring and computation at every position in the Faster R-CNN method are avoided, and detection better meets the speed requirements of real-time operation.
Drawings
FIG. 1 is a flow chart of a multi-target detection method of the present invention.
Detailed Description
When a visual device such as a monocular camera on a mobile machine acquires real-time images from its surroundings, the embedded computing system needs to perform target detection on the images promptly, so as to judge the positions and sizes of targets in the current environment and take corresponding actions. Based on this requirement, an accurate and fast multi-target detection method is crucial. In this process, mainstream methods need to process all regions of the image, and each processed region may overlap with other processed regions. In a hierarchical deep-learning structure, the huge number of region proposals correspondingly increases the number of weight coefficients in the feature expression function; the invention therefore designs a scheme that improves region processing efficiency and reduces the load on the computing system by incorporating the focusing mechanism of human vision.
The method sets up a number of cascaded modules: a convolution activation network extracts activation positions based on discriminative features in the image as the initial values input into the double-layer cyclic convolution emission module; features used for training are then extracted by the convolutional deep network and the context synthesis network; the double-layer cyclic convolution emission module obtains a positioning information set based on visual attention; the region suggestion network obtains the second suggestion candidate boxes based on the positioning information; and finally the ROI pooling module and the RCNN module predict the category and coordinates of the features in the suggestion candidate boxes. The advantage of the algorithm is that positioning information can be output by the predefined double-layer cyclic convolution emission module, so that the approximate position of each target object in the image is obtained, the amount of computation per feature point is greatly reduced, the anchoring and computation at every position in the Faster R-CNN method are avoided, and detection better meets the speed requirements of real-time operation. The technical scheme of the invention is further explained below with reference to the accompanying drawings.
As shown in fig. 1, a multi-target detection method includes the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image through a relevant technology such as ResNet series, and simultaneously inputting the real-time image into a context integration network to obtain a context feature map as the initial input of a subsequent module;
when hardware such as a camera acquires an original image with larger spatial resolution, a context synthesis network acts on the original image, the context synthesis network is formed by overlapping basic convolution operation units, wherein the expression of a single convolution operation unit formula is as follows:
wherein, l represents the first layer of the convolution layer; j represents the jth feature map of the current convolutional layer;a jth feature diagram representing the l-1 th roll base;an mth convolution kernel representing a jth feature map of the ith layer volume base layer; mjRepresenting all convolution kernel sets corresponding to the jth feature map; symbol denotes convolution operation;an offset vector parameter representing a jth characteristic diagram of the ith layer volume base layer; f (-) represents the activation function.
In the context synthesis network, the first 9 convolution operation units of the VGG16 network are selected as the architecture; the input channel number, output channel number, convolution kernel size, convolution stride and padding parameter of each convolution operation unit are fixed; the original image with 3 channels is input into the first convolution operation unit, and finally a context feature map with 128 channels is output.
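A hedged sketch of the context synthesis network: a stack of nine basic convolution operation units patterned after the first convolution stages of VGG16. The exact channel schedule below (3 → … → 128) is an assumption, chosen only to be consistent with the stated 3-channel input and 128-channel context feature map.

```python
import torch.nn as nn

def conv_unit(cin, cout):
    """One basic convolution operation unit (3x3 conv + ReLU)."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.ReLU(inplace=True))

class ContextSynthesisNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Assumed channel schedule: 9 units, ending at 128 channels.
        channels = [3, 64, 64, 128, 128, 128, 128, 128, 128, 128]
        self.units = nn.Sequential(*[conv_unit(channels[i], channels[i + 1])
                                     for i in range(9)])

    def forward(self, image):
        return self.units(image)  # context feature map with 128 channels
```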
In this scheme, the context synthesis network serves as the initialization basis of the double-layer cyclic convolution emission module, so that the emission module obtains the extracted fuzzy features in advance, acquires the global information of the image, and accelerates the process of accurately localizing the fuzzy position of the target.
S2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all the activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
based on an existing unsupervised algorithm scheme CAM, the convolution activation network mainly realizes a process of generating an activation map based on a category unsupervised by a GAP algorithm and outputting target fuzzy positioning information. The method specifically comprises the following substeps:
s21, inputting the original image into a superposed basic convolution operation unit, wherein two basic convolution operation units and one basic pooling unit are used as a convolution block unit, five convolution block units with the same structure are used for cascading, and after the cascading, a characteristic map of the original image is output;
s22, inputting the feature map into the GAP layer, and outputting a one-dimensional vector, wherein elements in the one-dimensional vector are the feature matrix average value of each channel in the feature map; calculating the weighted sum of all values in the one-dimensional vector, and solving an activation function layer of the class probability;
S23, based on the output of the class activation function, the important regions in the original image are marked and visualized by mapping the weights output by the GAP layer back onto the output features of the last convolutional block; specifically, a weighted summation is performed over the output features of the last convolutional block to obtain the class-based activation map:
$$M_c(x, y) = \sum_k w_k^c f_k(x, y)$$
where $f_k(x, y)$ denotes the activation value at coordinates $(x, y)$ of the $k$-th unit (channel) of the output features of the last convolutional block; and $w_k^c$ denotes the weight of unit $k$ for class $c$, i.e. the importance of unit $k$ for class $c$; after the GAP layer, the activation values of each unit at all coordinate positions are solved and summed.
S24, in the convolutional deep activation network, weight mapping is applied to the activation regions to highlight their importance in the original image. The class-based activation map obtained above is scaled to the same size as the original image, the correlation between each activation region and the class is compared, and the coordinate point with the highest local correlation is computed:
$$c_i = \max\big(g(x_0, y_0), g(x_1, y_1), \ldots, g(x_N, y_N)\big)$$
where $g(\cdot)$ denotes the pixel value at a position, $(x_i, y_i)$ is a coordinate point within the local activation region, and $c_i$ denotes the correlation of the $i$-th local activation region;
the obtained coordinate points with the highest local correlation are output as the coordinate point set of the first batch of positioning information.
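A minimal sketch of the class-based activation map M_c(x, y) = Σ_k w_k^c · f_k(x, y), assuming a backbone whose last convolutional block is followed by a GAP layer and a linear classifier whose weight matrix supplies w^c.

```python
import torch

def class_activation_map(features, fc_weight, class_idx):
    """features:  (K, H, W) output of the last convolutional block (f_k);
       fc_weight: (C, K) weights of the classifier that follows the GAP layer;
       returns the (H, W) activation map for class class_idx."""
    w_c = fc_weight[class_idx]                     # (K,) importance of each unit k for class c
    # weighted sum over channels k of f_k(x, y)
    return torch.einsum("k,khw->hw", w_c, features)
```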
S3, setting the cycle number n to 1;
S4, this step is the entry of the loop body, whose input comes from the output generated in the previous cycle: the n-th batch of positioning information; on the first entry into the loop body, the first batch of positioning information from step S2 is used as the centers; otherwise, the coordinate pairs of the n-th batch of positioning information are used as the centers, local feature matrices of fixed areas around those center positions are acquired on the basic feature map, and the local feature matrices are filtered and pooled to obtain the focusing feature;
S5, inputting the focusing feature and the context feature into the double-layer cyclic convolution emission module, and outputting two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)-th batch of positioning information, which is written into the positioning cache pool; the positioning cache pool is maintained globally, and the positioning information output by the current loop body is injected into it after each loop iteration finishes; at this point the output at the exit of the loop body is the (n+1)-th batch of positioning information, which is the input for the next entry into the loop body;
The double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context features as input, continuously explores the optimal positioning under the optimization of the back-propagation algorithm through a double-layer recurrent emission network, and finally outputs a fixed amount of positioning information; the specific steps are as follows:
S51, taking each image as the processing unit and the $t$-th batch of positioning information $L_t = ((x_0, y_0), (x_1, y_1), \ldots, (x_m, y_m))$, extracting the high-dimensional vectors within a fixed range (2 × 2) around each coordinate on the basic feature map, and processing them through vector operations into a fixed-dimension localization feature tensor $P_t$;
S52, inputting the localization feature tensor into a convolution layer, a regularization layer and an excitation function layer, and outputting the activated localization feature tensor:
$$P_{t\_active} = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv2d}(P_t)))$$
where $\mathrm{RELU}(x) = x$ when $x > 0$ and $\mathrm{RELU}(x) = 0$ when $x \le 0$; $\mathrm{BN}(\cdot)$ is the Batch-Normalization layer from deep learning, whose main function here is to prevent the network from overfitting; and $\mathrm{Conv2d}(\cdot)$ is a deep-learning network layer whose main function is to extract image features using convolution operations;
the positioning information $L_t$ is likewise input into a convolution layer, a regularization layer and an excitation function layer, and the activated positioning information tensor is output:
$$L_{t\_active} = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv2d}(L_t)))$$
the two tensors are then multiplied to obtain the focusing feature tensor:
$$G_t = P_{t\_active} \otimes L_{t\_active}$$
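A hedged sketch of step S52: activating the localization feature tensor P_t and the positioning information tensor L_t, then combining them into the focusing feature tensor G_t by tensor multiplication (taken here as an element-wise product). Channel sizes and the 1×1 kernels are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FocusFeature(nn.Module):
    def __init__(self, feat_channels, loc_channels, out_channels):
        super().__init__()
        self.p_branch = nn.Sequential(nn.Conv2d(feat_channels, out_channels, 1),
                                      nn.BatchNorm2d(out_channels), nn.ReLU())
        self.l_branch = nn.Sequential(nn.Conv2d(loc_channels, out_channels, 1),
                                      nn.BatchNorm2d(out_channels), nn.ReLU())

    def forward(self, p_t, l_t):
        p_active = self.p_branch(p_t)   # RELU(BN(Conv2d(P_t)))
        l_active = self.l_branch(l_t)   # RELU(BN(Conv2d(L_t)))
        return p_active * l_active      # focusing feature tensor G_t
```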
S53, in the loop operation, one loop unit corresponds to one time step, implemented as follows:
S531, if this is the first time step of the loop operation, the hidden state of the first-layer convolutional LSTM structure is initialized with a zero vector; otherwise, the focusing feature tensor and the hidden state of the previous time step are input into the first-layer convolutional LSTM encoder $e(\cdot)$, and the new hidden state of the encoder is output:
$$h_t^e,\; c_t^e = e\big(G_t,\; h_{t-1}^e,\; c_{t-1}^e\big)$$
where $h_t^e$ denotes the new hidden state of the encoder $e$ at time $t$; $c_t^e$ denotes the new cell state of the encoder $e$ at time $t$, which comes from a step defined in the existing LSTM network structure and stores hidden information that is valid in long-term memory; and $G_t$ denotes the focusing feature tensor at time $t$;
the hidden state $h_t^e$ of the convolutional LSTM encoder is input into a cascaded convolution network $ec(\cdot)$ and a linear classifier, and the classification probability of the focusing region is output:
$$V = W_2\big(W_1 \cdot ec(h_t^e) + b_1\big) + b_2, \qquad Prob_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$
where $V$ denotes the output classification score vector; $W_1$ and $W_2$ denote the first and second weight parameters; $b_1$ and $b_2$ denote the first and second bias parameters; the first formula computes the feature vector obtained from the classification operation on the current focus region;
$Prob_i$ is the probability of class $i$; $V_i$ is the output of the $i$-th unit at the front stage of the classifier; and $C$ denotes the total number of categories; the second formula maps the feature vector of the first formula to a classification probability value for each class;
the output definition of the above first layer ends;
S532, in the second-layer convolutional LSTM decoder $d(\cdot)$: if this is the initial time step of the loop, the decoder takes the context feature map as its initialization value; otherwise the decoder takes $h_t^e$ and the hidden state of this layer at the previous time step as input and outputs the new hidden state of the decoder:
$$h_t^d,\; c_t^d = d\big(h_t^e,\; h_{t-1}^d,\; c_{t-1}^d\big)$$
the hidden state $h_t^d$ of the convolutional LSTM decoder is input into a linear regressor $el(\cdot)$, the two-dimensional coordinates of the attention position at the next time step are output, and the coordinates are stored into the positioning cache pool:
$$l_{t+1} = el(h_t^d)$$
the definition of the output of the above second layer is finished;
S533, in the current time step, the double-layer network combines the hidden information from past time steps with the current information and computes the local classification error of the image using the cross-entropy method; within the same time step it likewise computes the localization error for the next time step using the mean-squared-error method:
$$loss_{cls}^t = -\log\big(Prob_{gt}\big), \qquad loss_{loc}^t = \frac{1}{m}\sum_i \big(y_i - \hat{y}_i\big)^2$$
where $gt$ denotes the label category, $y_i$ denotes the annotated coordinates, and $\hat{y}_i$ denotes the predicted output for the corresponding annotated coordinate; when computing the loss function, the losses of each image at every time step are summed and their average is taken as the final loss.
S534, repeating steps S531 to S533, and taking the final loss and the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
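A hedged sketch of the per-step losses of S533: cross-entropy for the local classification at each time step and mean-squared error for the next-step localization, averaged over time steps (and, during training, over images). Function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def emission_losses(step_logits, step_locs, gt_class, gt_locs):
    """step_logits: list of (C,) score vectors, one per time step;
       step_locs:   list of (2,) predicted coordinates, one per time step;
       gt_class:    scalar long tensor with the label category;
       gt_locs:     list of (2,) annotated coordinates, one per time step."""
    cls_loss = torch.stack([F.cross_entropy(v.unsqueeze(0), gt_class.unsqueeze(0))
                            for v in step_logits]).mean()       # averaged cross-entropy
    loc_loss = torch.stack([F.mse_loss(p, y)
                            for p, y in zip(step_locs, gt_locs)]).mean()  # averaged MSE
    return cls_loss, loc_loss
```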
S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output together with two error values: the first error value is the error between the prediction category near each positioning and the label category, and the second error value is the error between the positioning information and the coordinates of the labelled target frame;
S7, inputting all positioning information in the positioning cache pool into the region suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frames, and the second error value is the error between the prediction categories of the first suggestion candidate boxes and the real categories of the target frames;
inputting the first suggestion candidate boxes into a suggestion-target module, screening and refining the first suggestion candidate box set, and outputting the second suggestion candidate boxes, the category label corresponding to each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
The region suggestion network takes the basic feature map, the target frame label values and the positioning information from the positioning cache pool as input; it improves the RPN method according to the positioning information (abbreviated LRPN), then outputs the coordinates of a fixed number of suggestion candidate frames together with the in-frame prediction results, and outputs two loss functions;
the specific implementation mode is as follows:
S71, inputting the basic feature map into the convolution network and the activation network, and outputting an activated feature map;
S72, introducing anchor frame rules and setting A anchor frames for each spatial position on the activation map; performing a convolution with stride 1 and a 1 × 1 kernel on the activated feature map and outputting a score tensor with 2 × A channels, where the channels represent the class prediction probability scores of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map; performing another convolution with stride 1 and a 1 × 1 kernel on the LRPN activated feature map and outputting a coordinate offset tensor with 4 × A channels, where the channels represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map, used to solve for the optimal predicted coordinates;
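A hedged sketch of the two LRPN prediction heads of S72: 1×1 convolutions with stride 1 over the activated feature map, producing a 2A-channel score tensor and a 4A-channel coordinate-offset tensor (A anchors per spatial position).

```python
import torch.nn as nn

class LRPNHead(nn.Module):
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        # 2 x A channels: class prediction probability scores per anchor.
        self.score = nn.Conv2d(in_channels, 2 * num_anchors, kernel_size=1, stride=1)
        # 4 x A channels: predicted coordinate offsets per anchor.
        self.offset = nn.Conv2d(in_channels, 4 * num_anchors, kernel_size=1, stride=1)

    def forward(self, activated_feat):
        return self.score(activated_feat), self.offset(activated_feat)
```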
S73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, with the following steps:
S731, screening out the valid anchor frames in the tensors according to the positioning information, and discarding valid anchor frames that extend beyond the image boundary;
S732, sorting the anchor frames and their corresponding scores by score and keeping the first N, where N is a hyper-parameter;
S733, screening the remaining candidates with non-maximum suppression, and taking the first M of the remaining entries sorted by score as the first suggestion candidate frame set (see the sketch after step S735);
S734, setting the label of every anchor frame that does not meet the following conditions to -1:
a. anchor frames corresponding to the positioning information;
b. anchor frames that do not exceed the image boundary;
modifying each anchor frame according to the predicted coordinate offset, comparing the modified anchor frame coordinates with the labelled target frames in the image, selecting the F anchor frames with the largest overlap ratio greater than the positive threshold and setting their labels to 1; selecting the B anchor frames with the largest overlap ratio smaller than the negative threshold and setting their labels to 0; where the F value is set according to the hyper-parameters of the Faster R-CNN method: the ratio of selected anchor frames above the positive threshold to anchor frames below the negative threshold is 1:2 and the total number of anchor frames is 300, so F is set to 100; the B value is likewise set to 200 according to the hyper-parameters of the Faster R-CNN method;
S735, removing the anchor frames labelled -1 and evaluating the loss functions over the remaining anchor frames to obtain the loss lrpn_cls between the anchor frame category predictions and the category labels and the loss lrpn_bbox between the anchor frame coordinate predictions and the coordinate labels, and outputting the first suggestion candidate boxes;
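A hedged sketch of the LRPN suggestion filtering in S731–S733: keep anchors near the positioning information, drop anchors crossing the image boundary, take the top-N by score, then apply non-maximum suppression and keep the top-M as the first suggestion candidate set. The distance-based "near positioning information" test and its radius are assumptions for illustration.

```python
import torch
from torchvision.ops import nms

def lrpn_suggest(anchors, scores, locations, image_size,
                 top_n=2000, top_m=300, nms_thresh=0.7, radius=32.0):
    """anchors: (K, 4) float boxes (x1, y1, x2, y2); scores: (K,);
       locations: (P, 2) positioning coordinates from the cache pool."""
    W, H = image_size
    inside = (anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) & \
             (anchors[:, 2] <= W) & (anchors[:, 3] <= H)            # S731: inside image
    centers = (anchors[:, :2] + anchors[:, 2:]) / 2
    near = torch.cdist(centers, locations).min(dim=1).values < radius  # near positioning info
    keep = inside & near
    anchors, scores = anchors[keep], scores[keep]
    order = scores.argsort(descending=True)[:top_n]                  # S732: top-N by score
    anchors, scores = anchors[order], scores[order]
    kept = nms(anchors, scores, nms_thresh)[:top_m]                  # S733: NMS, keep top-M
    return anchors[kept]
```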
s736, screening and refining the first suggestion candidate box set, i.e., inputting the first suggestion candidate box set into a suggestion-target module, specifically operating as follows:
s7361, traversing the frame coordinates of all the labels for any frame coordinate in the first suggested candidate frame set, selecting the frame coordinate with the largest overlapping rate as a corresponding label frame, if the overlapping rate of the label frame and the candidate frame is greater than a threshold value, considering the candidate frame as a foreground, and if the overlapping rate of the label frame and the candidate frame is less than the threshold value, considering the candidate frame as a background;
s7362, setting a fixed number of foreground and background for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as a second candidate frame set;
s7363, calculating the offset of the coordinates of the second candidate frame and the coordinates of the corresponding label frame, and taking the offset and the second candidate frame set as the output of the module;
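A hedged sketch of S7363: computing the offsets between the second candidate boxes and their matched label boxes. The (dx, dy, dw, dh) parameterization of Faster R-CNN is assumed here; the patent only states that an offset is computed.

```python
import torch

def box_offsets(candidates, labels):
    """candidates, labels: (N, 4) tensors in (x1, y1, x2, y2) form."""
    cw, ch = candidates[:, 2] - candidates[:, 0], candidates[:, 3] - candidates[:, 1]
    cx, cy = candidates[:, 0] + 0.5 * cw, candidates[:, 1] + 0.5 * ch
    gw, gh = labels[:, 2] - labels[:, 0], labels[:, 3] - labels[:, 1]
    gx, gy = labels[:, 0] + 0.5 * gw, labels[:, 1] + 0.5 * gh
    dx, dy = (gx - cx) / cw, (gy - cy) / ch          # center offsets, normalized
    dw, dh = torch.log(gw / cw), torch.log(gh / ch)  # log-scale size offsets
    return torch.stack([dx, dy, dw, dh], dim=1)
```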
Inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method and outputting, through the pooling operation, final interest features of consistent size; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the prediction category and predicted frame coordinates for the candidate frame corresponding to each final interest feature, and generating two error values: the first error is the error between the prediction category and the label category, and the second error is the error between the coordinates of the predicted frame and the label coordinates;
and S8, looping steps S1 to S7, summing all errors, performing a back propagation algorithm, and iteratively updating each weight parameter in the network.
Details of other module implementations
1) Inputting the second candidate box set into the pooling module, and outputting final interest features of uniform size using the ROI pooling method of Faster R-CNN;
2) Using the RCNN module method of Faster R-CNN, inputting the final interest features into a cascaded convolution network and outputting the coordinate prediction values of the second candidate frames; inputting the final interest features into another cascaded convolution network and outputting the category prediction values of the final interest features; computing the loss rcnn_bbox between the coordinate prediction values and the label values, and the loss rcnn_cls between the category prediction values and the label values;
3) The total loss is calculated as:
$$L = loss_{cls} + loss_{loc} + lrpn_{cls} + lrpn_{bbox} + rcnn_{cls} + rcnn_{bbox}$$
According to the total loss formula, the invention uses an end-to-end method and adjusts the weight matrices in parallel according to the total loss L with the supervised SGD training algorithm, where the weight matrices include those of all supervised modules other than the convolutional deep activation module.
4) If in the testing stage, the coordinate prediction value in 2) is output as the detection result of the frame coordinate, and the category prediction value in 2) is output as the detection result of the frame category.
This scheme can be implemented as an independent, complete technical solution in the form of a computer product: a medium storing the program code serves as the basic hardware of the scheme, a real-time camera is typically used as the external device that receives high-resolution images, a GTX 1080 Ti is used as the image computing device, and terminal platforms such as personal computers and tablets serve as the output devices for the prediction results.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (5)

1. A multi-target detection method is characterized by comprising the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image, and inputting the real-time image into a context integration network to obtain a context feature map;
s2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all the activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
s3, setting the cycle number n to 1;
s4, taking the coordinate pair of the nth batch of positioning information as a center, acquiring a local feature matrix of a fixed area near the center position on the basic feature map, and filtering and pooling the local feature matrix to obtain a focusing feature;
S5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module, and outputting two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)-th batch of positioning information, which is written into the positioning cache pool; the positioning cache pool is maintained globally, and the positioning information output by the current loop body is injected into it after each loop iteration finishes;
S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output together with two error values: the first error value is the error between the prediction category near each positioning and the label category, and the second error value is the error between the positioning information and the coordinates of the labelled target frame;
S7, inputting all positioning information in the positioning cache pool into the region suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frames, and the second error value is the error between the prediction categories of the first suggestion candidate boxes and the real categories of the target frames;
inputting the first suggestion candidate boxes into a suggestion-target module, screening and refining the first suggestion candidate box set, and outputting the second suggestion candidate boxes, the category label corresponding to each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method and outputting, through the pooling operation, final interest features of consistent size; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the prediction category and predicted frame coordinates for the candidate frame corresponding to each final interest feature, and generating two error values: the first error is the error between the prediction category and the label category, and the second error is the error between the coordinates of the predicted frame and the label coordinates;
and S8, looping steps S1 to S7, summing all errors, performing a back propagation algorithm, and iteratively updating each weight parameter in the network.
2. The multi-target detection method of claim 1, wherein the context synthesis network is formed by stacking basic convolution operation units, where a single convolution operation unit is expressed as:
$$x_j^l = f\Big(\sum_{m \in M_j} x_m^{l-1} * k_{mj}^l + b_j^l\Big)$$
where $l$ denotes the $l$-th convolutional layer; $j$ denotes the $j$-th feature map of the current convolutional layer; $x_m^{l-1}$ denotes a feature map of the $(l-1)$-th convolutional layer; $k_{mj}^l$ denotes the $m$-th convolution kernel of the $j$-th feature map of the $l$-th convolutional layer; $M_j$ denotes the set of all convolution kernels corresponding to the $j$-th feature map; the symbol $*$ denotes the convolution operation; $b_j^l$ denotes the bias vector parameter of the $j$-th feature map of the $l$-th convolutional layer; and $f(\cdot)$ denotes the activation function.
3. The multi-target detection method according to claim 1, wherein the step S2 includes the following sub-steps:
s21, inputting the original image into a superposed basic convolution operation unit, wherein two basic convolution operation units and one basic pooling unit are used as a convolution block unit, five convolution block units with the same structure are used for cascading, and after the cascading, a characteristic map of the original image is output;
s22, inputting the feature map into the GAP layer, and outputting a one-dimensional vector, wherein elements in the one-dimensional vector are the feature matrix average value of each channel in the feature map; calculating the weighted sum of all values in the one-dimensional vector, and solving an activation function layer of the class probability;
s23, carrying out weighted summation on the output characteristics of the last layer of convolution lumps, and solving an activation map based on the category, wherein the formula is as follows:
wherein f isk(x, y) represents the activation value of the last layer of convolution blob output features at its coordinates (x, y) for the kth cell in the feature vector;representing the weight corresponding to each unit k for each class c, namely the importance of the unit k for the class c;
S24, the class-based activation map obtained in the above steps is scaled to the same size as the original image, the correlation between the activation regions and the class is compared, and the coordinate point with the highest local correlation is calculated:

c_i = max( g(x_0, y_0), g(x_1, y_1), ..., g(x_N, y_N) )

wherein g(·) represents the pixel value at a location, (x_i, y_i) is a coordinate point within the local activation region, and c_i represents the correlation of the i-th local activation region;
and outputting the obtained coordinate point with the highest local correlation as a coordinate point set of the first batch of positioning information.
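A hedged sketch of S24, assuming the class activation map is rescaled with bilinear interpolation and that the local activation regions are approximated by simple thresholding; the threshold value and the single-peak simplification are assumptions:

```python
# Hedged sketch of S24: rescale the activation map to image size and pick the
# highest-valued coordinate of the thresholded map as a stand-in for the per-region maxima.
import torch
import torch.nn.functional as F

def first_batch_locations(cam, image_size, threshold=0.6):
    # cam: (H, W) activation map for one class; image_size: (H_img, W_img)
    cam = F.interpolate(cam[None, None], size=image_size,
                        mode='bilinear', align_corners=False)[0, 0]
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)
    mask = cam > threshold                 # crude stand-in for the local activation regions
    points = mask.nonzero()                # candidate coordinate points as (y, x)
    if points.numel() == 0:
        return []
    best = points[cam[mask].argmax()]      # c_i = max g(x_i, y_i) over the region
    return [(int(best[1]), int(best[0]))]  # returned as (x, y)
```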
4. The multi-target detection method according to claim 1, wherein the double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context feature as input, continuously explores the optimal positioning through the double-layer cyclic emission network under the optimization of the back propagation algorithm, and finally outputs a fixed amount of positioning information; the specific steps are as follows:
S51, using each image as the processing unit, taking the t-th batch of positioning information L_t = ((x_0, y_0), (x_1, y_1), ..., (x_m, y_m)), extracting the high-dimensional vectors within the corresponding fixed range (2 × 2) on the basic feature map, and processing the high-dimensional vectors into a fixed-dimension positioning feature tensor P_t through vector operations;
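A minimal sketch of S51 under the assumption that the positioning feature tensor P_t is built by cropping and flattening 2 × 2 windows of the basic feature map at the given points:

```python
# Hedged sketch of S51: crop a fixed 2x2 window per positioning point and flatten into P_t.
import torch

def positioning_feature_tensor(feature_map, points, window=2):
    # feature_map: (C, H, W); points: list of (x, y) coordinates on the feature map grid
    C, H, W = feature_map.shape
    crops = []
    for x, y in points:
        x0 = max(0, min(int(x), W - window))
        y0 = max(0, min(int(y), H - window))
        crops.append(feature_map[:, y0:y0 + window, x0:x0 + window].reshape(-1))
    return torch.stack(crops)    # P_t: (m+1, C * window * window)
```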
S52, inputting the positioning feature tensor into a convolution layer, a regularization layer and an excitation function layer, and outputting an activated positioning feature tensor, wherein the formula is as follows:

P_t_active = ReLU(BN(Conv2d(P_t)))

wherein ReLU(x) = x when x > 0, and ReLU(x) = 0 when x ≤ 0;
inputting the positioning information L_t into a convolution layer, a regularization layer and an excitation function layer, and outputting an activated positioning information tensor, wherein the formula is as follows:

L_t_active = ReLU(BN(Conv2d(L_t)))

carrying out tensor multiplication on the two tensors to obtain the focusing feature tensor, wherein the formula is as follows:

G_t = P_t_active ⊗ L_t_active
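A hedged PyTorch sketch of S52, assuming 1 × 1 convolutions and element-wise multiplication for the tensor product; the channel sizes are illustrative:

```python
# Hedged sketch of S52: Conv2d -> BatchNorm -> ReLU on both inputs, then element-wise product.
import torch.nn as nn

class FocusFusion(nn.Module):
    def __init__(self, feat_ch, loc_ch, out_ch=64):
        super().__init__()
        self.feat_branch = nn.Sequential(nn.Conv2d(feat_ch, out_ch, 1),
                                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.loc_branch = nn.Sequential(nn.Conv2d(loc_ch, out_ch, 1),
                                        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, p_t, l_t):
        # p_t: (N, feat_ch, h, w) positioning feature tensor; l_t: (N, loc_ch, h, w) positioning info
        p_active = self.feat_branch(p_t)   # P_t_active = ReLU(BN(Conv2d(P_t)))
        l_active = self.loc_branch(l_t)    # L_t_active = ReLU(BN(Conv2d(L_t)))
        return p_active * l_active         # element-wise tensor multiplication -> G_t
```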
S53, in the loop operation, one loop unit corresponds to one time step; the specific implementation is as follows:
S531, if it is the first time step of the loop operation, initializing the hidden state of the first-layer convolutional LSTM structure with a zero vector; otherwise, inputting the focusing feature tensor and the hidden state of the previous time step into the first-layer convolutional LSTM structure encoder e(·), and outputting the new hidden state of the encoder, wherein the formula is as follows:

(h_t^e, c_t^e) = e(G_t, h_{t-1}^e, c_{t-1}^e)

wherein h_t^e represents the new hidden state of the encoder e at time t; c_t^e represents the new cell state of the encoder e at time t, a quantity defined in the LSTM network structure itself that stores the hidden information valid in long-term memory; G_t represents the focusing feature tensor at time t;
inputting the hidden state h_t^e of the convolutional LSTM structure encoder into a cascaded convolution network ec(·) and a linear classifier, and outputting the classification probability of the focused region:

V = W_2 · (W_1 · ec(h_t^e) + b_1) + b_2
Prob_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)

wherein V represents the output classification score; W_1 denotes the first weight parameter and W_2 the second weight parameter; b_1 denotes the first bias parameter and b_2 the second bias parameter; Prob_i is the probability of class i; V_i is the output of the i-th unit at the front stage of the classifier; and C represents the total number of classes;
This completes the definition of the output of the first layer;
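A hedged sketch of the S531 encoder step; for brevity the convolutional LSTM is approximated here by a plain nn.LSTMCell over flattened focusing features, and all dimensions and the depth of the ec network are assumptions:

```python
# Hedged sketch of S531: encoder step plus classifier over the focused region.
import torch
import torch.nn as nn

class EncoderStep(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256, num_classes=20):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)   # stands in for the ConvLSTM encoder e(.)
        self.ec = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())  # cascaded network ec(.)
        self.classifier = nn.Linear(hidden_dim, num_classes)                   # linear classifier

    def forward(self, g_t, state=None):
        # g_t: (N, feat_dim) flattened focusing feature tensor at time t
        if state is None:                               # first time step: zero-vector initialisation
            h = g_t.new_zeros(g_t.size(0), self.cell.hidden_size)
            state = (h, h.clone())
        h_t, c_t = self.cell(g_t, state)                # (h_t^e, c_t^e) = e(G_t, h_{t-1}^e, c_{t-1}^e)
        v = self.classifier(self.ec(h_t))               # V
        probs = torch.softmax(v, dim=1)                 # Prob_i = exp(V_i) / sum_j exp(V_j)
        return probs, (h_t, c_t)
```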
S532, for the second-layer convolutional LSTM structure decoder d(·): if it is the initial time step of the loop, the decoder takes the context feature map as the initialization value; otherwise, the decoder takes h_t^e and the hidden state of this layer at the previous time step as input, and outputs the new hidden state of the decoder, wherein the formula is as follows:

(h_t^d, c_t^d) = d(h_t^e, h_{t-1}^d, c_{t-1}^d)

inputting the hidden state h_t^d of the convolutional LSTM structure decoder into a linear regressor el(·), outputting the two-dimensional coordinates of the attention position at the next time step, and storing the coordinates into the positioning cache pool, wherein the formula is as follows:

(x_{t+1}, y_{t+1}) = el(h_t^d)
This completes the definition of the output of the second layer;
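A matching hedged sketch of the S532 decoder step, again approximating the convolutional LSTM with nn.LSTMCell; the cache pool is modelled as a plain Python list and the dimensions are assumptions:

```python
# Hedged sketch of S532: decoder step plus linear regressor for the next attention position.
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, enc_dim=256, hidden_dim=256):
        super().__init__()
        self.cell = nn.LSTMCell(enc_dim, hidden_dim)   # stands in for the ConvLSTM decoder d(.)
        self.el = nn.Linear(hidden_dim, 2)             # linear regressor el(.) -> (x_{t+1}, y_{t+1})

    def forward(self, h_enc, state, cache_pool):
        # h_enc: (N, enc_dim) encoder hidden state h_t^e
        # state: (h_{t-1}^d, c_{t-1}^d); at the first step it is built from the context feature
        h_t, c_t = self.cell(h_enc, state)             # (h_t^d, c_t^d) = d(h_t^e, h_{t-1}^d, c_{t-1}^d)
        coords = self.el(h_t)                          # two-dimensional attention coordinates
        cache_pool.append(coords)                      # store into the positioning cache pool
        return coords, (h_t, c_t)
```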
S533, within the current time step, the double-layer network combines the hidden information of past time steps with the current information: the local classification error of the image is calculated with the cross-entropy method, and the positioning error of the next time step is calculated with the mean-square-error method, wherein the formulas are as follows:

loss_cls = − Σ_c 1[c = gt] · log(Prob_c)
loss_loc = (1/m) Σ_i (y_i − ŷ_i)²

wherein gt represents the labeled class, y_i represents the labeled coordinates, and ŷ_i represents the predicted value output for the current labeled coordinate; in the process of calculating the loss function, the loss of each image at each time step is summed and the average value is taken as the final loss.
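A minimal sketch of the S533 loss terms, assuming per-time-step lists of softmax outputs and predicted coordinates; the function and argument names are assumptions and the averaging follows the text above:

```python
# Hedged sketch of S533: cross entropy for classification, mean square error for positioning.
import torch
import torch.nn.functional as F

def emission_losses(probs_per_step, gt_class, coords_per_step, gt_coords):
    # probs_per_step / coords_per_step: lists of per-time-step predictions, each (N, C) / (N, 2)
    cls_losses = [F.nll_loss(torch.log(p + 1e-8), gt_class) for p in probs_per_step]
    loc_losses = [F.mse_loss(c, gt_coords) for c in coords_per_step]
    # sum the loss of each image at each time step and take the average as the final loss
    return torch.stack(cls_losses).mean() + torch.stack(loc_losses).mean()
```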
And S534, looping steps S531 to S533, and taking the final loss and the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
5. The multi-target detection method according to claim 1, wherein the regional suggestion network takes the basic feature map, the target frame label values and the positioning information in the positioning cache pool as input, improves the RPN method according to the positioning information (abbreviated as LRPN), and then outputs the coordinates of a fixed number of suggestion candidate frames and the intra-frame prediction results, together with two loss functions;
the specific implementation mode is as follows:
S71, inputting the basic feature map into the convolution network and the activation network, and outputting an activation feature map;
S72, introducing anchor frame rules and setting A anchor frames for each spatial position on the activation feature map; performing a convolution operation with stride 1 and a 1×1 convolution kernel on the activation feature map, and outputting a score tensor with 2×A channels, wherein the channels of the tensor represent the class prediction probability scores of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activation feature map; performing a convolution operation with stride 1 and a 1×1 convolution kernel on the LRPN activation feature map, and outputting a coordinate offset tensor with 4×A channels, wherein the channels of the tensor represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activation feature map, used for solving the optimal predicted coordinates;
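A minimal PyTorch sketch of the S71 and S72 heads; the shared 3×3 convolution, the channel counts and the anchor count A = 9 are assumptions:

```python
# Hedged sketch of S71-S72: shared conv + ReLU, then 1x1 convs for the 2A-channel score
# tensor and the 4A-channel coordinate-offset tensor (A anchors per spatial position).
import torch.nn as nn

class LRPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(in_channels, mid_channels, 3, padding=1),
                                    nn.ReLU(inplace=True))        # activation feature map
        self.score = nn.Conv2d(mid_channels, 2 * num_anchors, 1)  # class scores per anchor
        self.offset = nn.Conv2d(mid_channels, 4 * num_anchors, 1) # predicted coordinate offsets

    def forward(self, base_feature_map):
        x = self.shared(base_feature_map)
        return self.score(x), self.offset(x)
```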
S73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, specifically comprising the following steps:
S731, screening out the corresponding effective anchor frames in the tensor according to the positioning information, and clipping the effective anchor frames that exceed the image boundary;
S732, sorting the anchor frames and the corresponding score tensors by score, and taking the first N, wherein N is a hyper-parameter;
S733, screening the score tensors with the non-maximum suppression method, and taking the first M score tensors, sorted by size, from the remaining ones as the first suggestion candidate frame set;
S734, setting the labels of all anchor frames that do not meet the following conditions to −1:
a. an anchor frame corresponding to the positioning information;
b. anchor frames that do not exceed the image boundaries;
modifying each anchor frame based on the predicted coordinate offset, comparing the modified anchor frame coordinates with the labels of the target frames in the image, selecting the F anchor frames whose largest overlapping rate is greater than a positive threshold and setting their labels to 1; selecting the B anchor frames whose largest overlapping rate is smaller than a negative threshold and setting their labels to 0; wherein the F value and the B value are set according to the hyper-parameters in the Faster RCNN method;
S735, removing the anchor frames labeled −1, and computing the loss functions of the remaining anchor frames to obtain the loss lrpn_cls between the anchor frame class prediction and the class label and the loss lrpn_bbox between the anchor frame coordinate prediction and the coordinate label, and outputting the first suggestion candidate frames;
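A hedged sketch of the S731 to S733 filtering, assuming torchvision NMS and a boolean validity mask derived from the positioning information; N, M and the IoU threshold are treated as hyper-parameters:

```python
# Hedged sketch of S731-S733: keep anchors selected by the positioning information and inside
# the image, sort by score, keep top N, apply NMS, keep top M as first suggestion candidates.
import torch
from torchvision.ops import nms

def lrpn_propose(anchors, scores, valid_mask, image_size, top_n=2000, top_m=300, iou_thr=0.7):
    # anchors: (K, 4) boxes; scores: (K,) foreground scores; valid_mask: (K,) bool from positioning info
    h, w = image_size
    inside = (anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) & \
             (anchors[:, 2] <= w) & (anchors[:, 3] <= h)     # drop anchors beyond the image boundary
    keep = valid_mask & inside
    boxes, scr = anchors[keep], scores[keep]
    order = scr.argsort(descending=True)[:top_n]             # top N by score (N is a hyper-parameter)
    boxes, scr = boxes[order], scr[order]
    kept = nms(boxes, scr, iou_thr)[:top_m]                  # non-maximum suppression, then top M
    return boxes[kept], scr[kept]                            # first suggestion candidate frames
```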
S736, screening and refining the first suggestion candidate frame set, i.e., inputting the first suggestion candidate frame set into the suggestion-target module, which specifically operates as follows:
S7361, for any frame coordinate in the first suggestion candidate frame set, traversing the frame coordinates of all labels and selecting the one with the largest overlapping rate as the corresponding label frame; if the overlapping rate between the label frame and the candidate frame is greater than a threshold, the candidate frame is regarded as foreground, and if it is smaller than the threshold, the candidate frame is regarded as background;
S7362, setting a fixed number of foreground and background candidates for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as the second candidate frame set;
S7363, calculating the offsets between the coordinates of the second candidate frames and the coordinates of the corresponding label frames, and taking the offsets and the second candidate frame set as the output of the module.
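A hedged sketch of the S7361 to S7363 suggestion-target logic, assuming torchvision IoU, random sampling for the fixed foreground/background quota, and a standard (dx, dy, dw, dh) offset parameterisation; the thresholds and sample sizes are assumptions:

```python
# Hedged sketch of S7361-S7363: IoU matching, fg/bg sampling, and offsets to the label frames.
import torch
from torchvision.ops import box_iou

def proposal_target(candidates, gt_boxes, fg_thr=0.5, num_fg=32, num_bg=96):
    iou = box_iou(candidates, gt_boxes)                  # (num_candidates, num_gt)
    best_iou, best_gt = iou.max(dim=1)                   # label frame with the largest overlap
    fg_idx = torch.nonzero(best_iou >= fg_thr).flatten()
    bg_idx = torch.nonzero(best_iou < fg_thr).flatten()
    fg_idx = fg_idx[torch.randperm(fg_idx.numel())[:num_fg]]   # sample a fixed number of foreground
    bg_idx = bg_idx[torch.randperm(bg_idx.numel())[:num_bg]]   # and background candidates
    keep = torch.cat([fg_idx, bg_idx])
    second = candidates[keep]                            # second suggestion candidate frame set
    matched = gt_boxes[best_gt[keep]]
    # coordinate offsets between the second candidate frames and their label frames
    cw, ch = second[:, 2] - second[:, 0], second[:, 3] - second[:, 1]
    gw, gh = matched[:, 2] - matched[:, 0], matched[:, 3] - matched[:, 1]
    dx = ((matched[:, 0] + gw / 2) - (second[:, 0] + cw / 2)) / cw
    dy = ((matched[:, 1] + gh / 2) - (second[:, 1] + ch / 2)) / ch
    dw, dh = torch.log(gw / cw), torch.log(gh / ch)
    return second, torch.stack([dx, dy, dw, dh], dim=1)
```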
CN201910881579.XA 2019-09-18 2019-09-18 Multi-target detection method Active CN110610210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910881579.XA CN110610210B (en) 2019-09-18 2019-09-18 Multi-target detection method

Publications (2)

Publication Number Publication Date
CN110610210A true CN110610210A (en) 2019-12-24
CN110610210B CN110610210B (en) 2022-03-25

Family

ID=68891598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910881579.XA Active CN110610210B (en) 2019-09-18 2019-09-18 Multi-target detection method

Country Status (1)

Country Link
CN (1) CN110610210B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250812A (en) * 2016-07-15 2016-12-21 汤平 A kind of model recognizing method based on quick R CNN deep neural network
US20190228276A1 (en) * 2018-01-19 2019-07-25 Arcules Inc. License plate reader using optical character recognition on plural detected regions
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN108717693A (en) * 2018-04-24 2018-10-30 浙江工业大学 A kind of optic disk localization method based on RPN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109359684A (en) * 2018-10-17 2019-02-19 苏州大学 Fine granularity model recognizing method based on Weakly supervised positioning and subclass similarity measurement
CN109523015A (en) * 2018-11-09 2019-03-26 上海海事大学 Image processing method in a kind of neural network
CN109344815A (en) * 2018-12-13 2019-02-15 深源恒际科技有限公司 A kind of file and picture classification method
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
CN110097136A (en) * 2019-05-09 2019-08-06 杭州筑象数字科技有限公司 Image classification method neural network based

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REN SHAOQING ET AL: "Faster R-CNN: towards real-time object detection with region proposal networks", Proc. of Advances in Neural Information Processing Systems *
LI XUDONG ET AL: "A survey of object detection research based on convolutional neural networks" (基于卷积神经网络的目标检测研究综述), Application Research of Computers *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583204A (en) * 2020-04-27 2020-08-25 天津大学 Organ positioning method of two-dimensional sequence magnetic resonance image based on network model
CN111583204B (en) * 2020-04-27 2022-10-14 天津大学 Organ positioning method of two-dimensional sequence magnetic resonance image based on network model
CN111723852A (en) * 2020-05-30 2020-09-29 杭州迪英加科技有限公司 Robust training method for target detection network
CN111723852B (en) * 2020-05-30 2022-07-22 杭州迪英加科技有限公司 Robust training method for target detection network
CN111986126A (en) * 2020-07-17 2020-11-24 浙江工业大学 Multi-target detection method based on improved VGG16 network
CN113065650A (en) * 2021-04-02 2021-07-02 中山大学 Multichannel neural network method for long-term memory learning
CN113065650B (en) * 2021-04-02 2023-11-17 中山大学 Multichannel neural network instance separation method based on long-term memory learning
CN113298094A (en) * 2021-06-10 2021-08-24 安徽大学 RGB-T significance target detection method based on modal association and double-perception decoder
CN113298094B (en) * 2021-06-10 2022-11-04 安徽大学 RGB-T significance target detection method based on modal association and double-perception decoder
CN113822172A (en) * 2021-08-30 2021-12-21 中国科学院上海微系统与信息技术研究所 Video spatiotemporal behavior detection method

Also Published As

Publication number Publication date
CN110610210B (en) 2022-03-25

Similar Documents

Publication Publication Date Title
CN110610210B (en) Multi-target detection method
Zhu et al. Online multi-object tracking with dual matching attention networks
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN114202672A (en) Small target detection method based on attention mechanism
Francies et al. A robust multiclass 3D object recognition based on modern YOLO deep learning algorithms
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN108764019A (en) A kind of Video Events detection method based on multi-source deep learning
Nandhini et al. Object Detection Algorithm Based on Multi-Scaled Convolutional Neural Networks
CN115829991A (en) Steel surface defect detection method based on improved YOLOv5s
CN110008844A (en) A kind of long-term gesture tracking method of KCF merging SLIC algorithm
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN112149665A (en) High-performance multi-scale target detection method based on deep learning
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
Zeng et al. A novel tensor decomposition-based efficient detector for low-altitude aerial objects with knowledge distillation scheme
Kiruba et al. Hexagonal volume local binary pattern (H-VLBP) with deep stacked autoencoder for human action recognition
Sun et al. FBoT-Net: Focal bottleneck transformer network for small green apple detection
Wang et al. Based on the improved YOLOV3 small target detection algorithm
Wang et al. Non-local attention association scheme for online multi-object tracking
Hu et al. Automatic detection of pecan fruits based on Faster RCNN with FPN in orchard
Li et al. Research on YOLOv3 pedestrian detection algorithm based on channel attention mechanism
Chang et al. Deep Learning Approaches for Dynamic Object Understanding and Defect Detection
Ajith et al. Pedestrian detection: performance comparison using multiple convolutional neural networks
Wu et al. Real-time visual tracking via incremental covariance model update on Log-Euclidean Riemannian manifold
Wen et al. A Lightweight ST-YOLO Based Model for Detection of Tea Bud in Unstructured Natural Environments.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant