CN113128408B - Article detection method, device, terminal and storage medium - Google Patents

Info

Publication number
CN113128408B
Authority
CN
China
Prior art keywords
feature
pyramid
feature map
region
interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110429795.8A
Other languages
Chinese (zh)
Other versions
CN113128408A (en
Inventor
黄惠
沈定国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202110429795.8A priority Critical patent/CN113128408B/en
Publication of CN113128408A publication Critical patent/CN113128408A/en
Application granted granted Critical
Publication of CN113128408B publication Critical patent/CN113128408B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/05 Recognition of patterns representing particular kinds of hidden objects, e.g. weapons, explosives, drugs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an article detection method, a device, a terminal and a storage medium. The method comprises the following steps: receiving an image to be detected, and acquiring initial pyramid features of the image to be detected; obtaining deformable convolution parameters according to the initial pyramid features, and obtaining first pyramid features according to the initial pyramid features and the deformable convolution parameters; extracting a region of interest according to the first pyramid features, and supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid features and the region of interest, to obtain an output feature map of the region of interest; and acquiring an article detection result according to the output feature map. The invention can improve the accuracy of article detection.

Description

Article detection method, device, terminal and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an article detection method, an apparatus, a terminal, and a storage medium.
Background
Monitoring restricted articles in X-ray images of parcels is an important part of daily operations in the parcel logistics and security industries. With the popularization and rapid development of online shopping, the volume of parcels in online logistics far exceeds what can be processed manually. At present, restricted articles are monitored by performing target detection on X-ray images with a neural network. However, existing target detection networks detect large targets far better than small targets, and their detection precision is not high.
Accordingly, there is a need for improvement and advancement in the art.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an article detection method, an article detection device, a terminal and a storage medium, and aims to solve the problem that the detection precision of a target detection network in the prior art is not high.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
In a first aspect of the present invention, there is provided an article detection method, the method comprising:
receiving an image to be detected, and acquiring initial pyramid features of the image to be detected;
obtaining deformable convolution parameters according to the initial pyramid features, and obtaining first pyramid features according to the initial pyramid features and the deformable convolution parameters;
extracting a region of interest according to the first pyramid features, and supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid features and the region of interest, to obtain an output feature map of the region of interest;
and acquiring an article detection result according to the output feature map.
The article detection method, wherein the obtaining the deformable convolution parameters according to the initial pyramid features includes:
applying at least one joint convolution to the feature map of a target layer of the initial pyramid features to obtain at least one intermediate feature map;
fusing the at least one intermediate feature map to obtain the feature map of the target layer of the intermediate pyramid;
acquiring the deformable convolution parameters according to the intermediate pyramid features;
wherein the joint convolution is implemented by two convolution kernels of sizes 1×j and j×1, respectively, j being a positive integer.
The article detection method, wherein the deformable convolution parameters comprise adaptive offset coordinates of a deformable convolution; the obtaining a first pyramid feature according to the initial pyramid feature and the deformable convolution parameter includes:
carrying out deformable convolution on each layer of feature map of the intermediate pyramid feature according to the deformable convolution parameters and a first preset formula to obtain the first pyramid feature;
the first preset formula is as follows:
$$E_l(p) = \sum_{n} w_{p_n} \, D_l(p + p_n + \Delta p_n)$$
where $E_l(p)$ is the pixel value of the pixel with coordinate $p$ on the $l$-th layer feature map of the first pyramid feature, $w_{p_n}$ is the weight of the deformable convolution, $D_l(p + p_n + \Delta p_n)$ is the pixel value of the pixel with coordinate $p + p_n + \Delta p_n$ on the $l$-th layer feature map of the intermediate pyramid feature, $p_n$ is a fixed initial offset coordinate of the deformable convolution, $\Delta p_n$ is an adaptive offset coordinate of the deformable convolution, and the sum runs over the sampling positions $n$ of the convolution kernel.
The article detection method, wherein the supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid feature and the region of interest, to obtain an output feature map of the region of interest, includes:
acquiring a correlation feature map of each layer of feature map of the first pyramid feature and the region of interest;
and acquiring the output feature map according to the correlation feature map and the first pyramid feature.
The article detection method, wherein the acquiring the correlation feature map of each layer of feature map of the first pyramid feature and the region of interest includes:
acquiring the correlation feature map through a second preset formula;
the second preset formula is:
where $G_i$ is the correlation feature map corresponding to the $i$-th region of interest, $F_i$ is the $i$-th region of interest, $E_l$ is the $l$-th layer feature map of the first pyramid feature, $\mathrm{Pool}(E_l)$ denotes downsampling $E_l$ to the same resolution as $F_i$, and $\mathrm{AvgPool}$ denotes global pooling.
The article detection method, wherein the obtaining the output feature map according to the correlation feature map and the first pyramid feature includes:
acquiring the output feature map through a third preset formula;
the third preset formula is:
where the quantities involved are: the pixel value of the $i$-th pixel in the output feature map corresponding to the $k$-th region of interest; $H \times W$, the resolution of the $l$-th layer feature map of the first pyramid feature; the pixel value of the $i$-th pixel in the correlation feature map corresponding to the $k$-th region of interest; and the pixel value of the $j$-th pixel in the $l$-th layer feature map of the first pyramid feature.
The article detection method is realized through a trained article detection network, wherein the article detection network is obtained through training of a plurality of groups of training data, and each group of training data comprises a sample image to be detected and an article detection result corresponding to the sample image to be detected.
In a second aspect of the present invention, there is provided an article detection apparatus comprising:
the first feature extraction module is used for receiving an image to be detected and acquiring initial pyramid features of the image to be detected;
The second feature extraction module is used for acquiring deformable convolution parameters according to the initial pyramid features and acquiring first pyramid features according to the initial pyramid features and the deformable convolution parameters;
the third feature extraction module is used for extracting a region of interest according to the first pyramid features, and supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid features and the region of interest to obtain an output feature map of the region of interest;
and the output module is used for acquiring an article detection result according to the output feature map.
In a third aspect of the present invention, there is provided a terminal comprising a processor, a computer readable storage medium communicatively coupled to the processor, the computer readable storage medium adapted to store a plurality of instructions, the processor adapted to invoke the instructions in the computer readable storage medium to perform the steps of implementing the method of item detection as described in any of the preceding claims.
In a fourth aspect of the present invention, there is provided a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps of the article detection method of any one of the above.
Compared with the prior art, the article detection method, device, terminal and storage medium provided by the invention extract pyramid features after receiving an image to be detected and obtain first pyramid features by means of deformable convolution, thereby collecting more local context information. Feature information of the region of interest is then supplemented according to the correlation between each layer of feature map of the first pyramid features and the region of interest, so that global context information is injected into the pixels of the region of interest. As a result, the article detection method provided by the invention achieves higher detection precision than existing detection network models.
Drawings
FIG. 1 is a flow chart of an embodiment of the article detection method provided by the present invention;
FIG. 2 is a schematic structural diagram of the article detection network in an embodiment of the article detection method provided by the present invention;
FIG. 3 is a schematic diagram of joint convolution in an embodiment of the article detection method provided by the present invention;
FIG. 4 is a schematic diagram of the process of acquiring the first pyramid feature in an embodiment of the article detection method provided by the present invention;
FIG. 5 is a schematic diagram of the process of acquiring the output feature map in an embodiment of the article detection method provided by the present invention;
FIG. 6 is a training loss diagram of the article detection network in the experimental verification of the article detection method provided by the present invention;
FIG. 7 is a diagram of the article detection network during training in the experimental verification of the article detection method provided by the present invention;
FIG. 8 is a schematic diagram of a detection result in the experimental verification of the article detection method provided by the present invention;
FIG. 9 is a comparison with the results of other existing methods in the experimental verification of the article detection method provided by the present invention;
FIG. 10 is a schematic diagram of an embodiment of the article detection apparatus provided by the present invention;
FIG. 11 is a schematic diagram of an embodiment of the terminal provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer and more specific, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
As shown in fig. 1, in one embodiment of the method for detecting an article, the method includes the steps of:
S100, receiving an image to be detected, and acquiring initial pyramid features of the image to be detected.
In the article detection method provided by this embodiment, after the image to be detected is received, a target article in the image is detected. The image to be detected may be a security inspection X-ray image, and the target article may be a restricted article. The article detection method provided by this embodiment may be implemented through a trained article detection network. The article detection network is a neural network obtained through training on a plurality of sets of training data, where each set of training data comprises a sample image to be detected and the article detection result corresponding to that sample image. Specifically, the training data may be obtained by manual labeling.
The structure of the article detection network may be as shown in fig. 2, and a process of detecting an article in the image to be detected by using the article detection network will be specifically described below.
The image to be detected is input to the article detection network, and the initial pyramid features of the image are obtained through the network. Specifically, the initial pyramid features may be obtained through an existing network structure, for example ResNet101 (a residual network) + FPN (feature pyramid network). Of course, a person skilled in the art will understand that the initial pyramid features may also be extracted through other network structures; through the training process of the article detection network, the parameters of the various network structures in it are made to serve the goal of outputting the final article detection result.
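To make the notion of a four-layer feature pyramid concrete, the following minimal sketch computes the spatial size of each pyramid level for a given input. The strides (4, 8, 16, 32) are an assumption taken from the common ResNet + FPN configuration and are not stated in this document; the function name `pyramid_shapes` is hypothetical.

```python
import math
from typing import List, Tuple


def pyramid_shapes(height: int, width: int,
                   strides: Tuple[int, ...] = (4, 8, 16, 32)) -> List[Tuple[int, int]]:
    """Return the (height, width) of each pyramid level.

    Assumption: each level downsamples the input by its stride,
    rounding up when the size is not divisible.
    """
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


# A 640 x 480 image (width x height), the resolution used in the experiments.
shapes = pyramid_shapes(height=480, width=640)
for level, (h, w) in enumerate(shapes, start=1):
    print(f"level {level}: {h} x {w}")
```

Higher-resolution levels keep more spatial detail, while lower-resolution levels cover more of the image per pixel.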
As shown in fig. 2, the article detection network may include a local adaptive module (LCB), through which features are further extracted after the initial pyramid features of the image to be detected are acquired. Specifically, after the initial pyramid features are acquired, the article detection method provided in this embodiment further includes the step of:
S200, obtaining deformable convolution parameters according to the initial pyramid features, and obtaining first pyramid features according to the initial pyramid features and the deformable convolution parameters.
The obtaining the deformable convolution parameters according to the initial pyramid features comprises the following steps:
S210, applying at least one joint convolution to the feature map of a target layer of the initial pyramid features to obtain at least one intermediate feature map;
S220, fusing the at least one intermediate feature map to obtain the feature map of the target layer of the intermediate pyramid;
S230, acquiring the deformable convolution parameters according to the intermediate pyramid features.
Taking an initial pyramid feature with four layers as an example, the feature map of each layer of the initial pyramid feature may be expressed as $A \in \mathbb{R}^{C \times H \times W}$, where H and W represent the height and width of the feature map, respectively, and C represents the number of channels. In order for each pixel of the feature map to obtain spatial information from its neighborhood, this embodiment applies joint convolution to the feature map A to obtain the feature map in the intermediate pyramid. Specifically, the joint convolution is implemented by two convolution kernels of sizes 1×j and j×1, respectively, j being a positive integer. The joint convolution can obtain a large field of view with very few parameters. As shown in fig. 3, for example, an ordinary 7×7 convolution can be replaced by a combination of a 1×7 convolution and a 7×1 convolution. With one input channel and one output channel, both have the same 7×7 field of view, but the 7×7 convolution needs 49 parameters, whereas the joint convolution of the 1×7 and 7×1 convolutions needs only 14. This saving in parameters makes the network easier to converge.
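The parameter comparison above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical, biases are not counted, and the multi-channel case assumes the intermediate map between the two kernels has as many channels as the output.

```python
def full_conv_params(j: int, c_in: int = 1, c_out: int = 1) -> int:
    """Number of weights in a single j x j convolution (biases not counted)."""
    return j * j * c_in * c_out


def joint_conv_params(j: int, c_in: int = 1, c_out: int = 1) -> int:
    """Number of weights in a 1 x j convolution followed by a j x 1 convolution.

    Assumption: the intermediate map between the two kernels has c_out channels.
    """
    return j * c_in * c_out + j * c_out * c_out


for j in (3, 5, 7):
    print(f"j={j}: full={full_conv_params(j)}, joint={joint_conv_params(j)}")
```

For one input and one output channel the count grows as j² for the full kernel but only as 2j for the joint convolution, which is where the 49 versus 14 figure for j = 7 comes from.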
In the present embodiment, in order to further extract richer context information from the local field of view, a plurality of joint convolutions are provided, each with a different value of j. In particular, for the feature map $A_l$ of the $l$-th layer in the initial pyramid feature, the formula for performing the joint convolution may be as follows:
where $w_{1 \times j}$ and $w_{j \times 1}$ are the weights of the two convolution kernels of sizes 1×j and j×1, respectively, $b_{1 \times j}$ and $b_{j \times 1}$ are the biases of the two convolution kernels of sizes 1×j and j×1, respectively, and the result is the intermediate feature map obtained by applying the $i$-th joint convolution to the feature map $A_l$.
Taking 4 joint convolutions as an example, each layer of feature map in the initial pyramid is convolved with the 4 joint convolutions to obtain 4 intermediate feature maps. The value of j in the formula may then be $j \in \{1, 3, 5, 7\}$, although it is understood that j may be set to other values and is not limited to this example.
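As an illustration of applying several joint convolutions to one feature map, here is a minimal single-channel sketch in plain Python. The order of the two kernels (1×j first, then j×1), the zero padding, and the uniform placeholder weights are all assumptions made for the example; in the network the weights and biases are learned, and the names `conv2d_same` and `joint_conv` are hypothetical.

```python
from typing import Dict, List

Grid = List[List[float]]


def conv2d_same(x: Grid, kernel: Grid, bias: float = 0.0) -> Grid:
    """Single-channel 2D cross-correlation with zero padding (output keeps input size)."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[bias] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for i in range(kh):
                for k in range(kw):
                    rr, cc = r + i - ph, c + k - pw
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += kernel[i][k] * x[rr][cc]
    return out


def joint_conv(x: Grid, j: int) -> Grid:
    """Apply a 1 x j kernel, then a j x 1 kernel (assumed order, placeholder weights)."""
    row_kernel = [[1.0 / j] * j]                 # shape 1 x j
    col_kernel = [[1.0 / j] for _ in range(j)]   # shape j x 1
    return conv2d_same(conv2d_same(x, row_kernel), col_kernel)


# One intermediate feature map per value of j, all with the input's resolution.
a_l: Grid = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
intermediate_maps: Dict[int, Grid] = {j: joint_conv(a_l, j) for j in (1, 3, 5, 7)}
```

Feeding a single non-zero pixel through `joint_conv(x, 3)` spreads it over a 3×3 block, which shows that the pair of thin kernels covers the same field of view as a full 3×3 kernel.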
The feature map of the target layer in the initial pyramid feature is processed with at least one joint convolution, each yielding a corresponding intermediate feature map, so that at least one intermediate feature map is obtained for the feature map of the target layer. The intermediate feature maps are then fused to obtain the feature map of the target layer of the intermediate pyramid. Specifically, the fusion of the at least one intermediate feature map may be implemented by dilated (hole) convolution, and the formula can be expressed as follows:
where the input is the intermediate feature map obtained by applying the $i$-th joint convolution to the $l$-th layer feature map of the initial pyramid, $dw_i$ and $db_i$ are the weight and bias of the dilated convolution corresponding to that intermediate feature map, and $D_l$ is the $l$-th layer feature map of the intermediate pyramid feature.
As shown in fig. 4, after the intermediate pyramid is obtained, deformable convolution is performed on the feature map of each layer of the intermediate pyramid, so that more surrounding information is obtained adaptively. Specifically, the deformable convolution parameters, which comprise the adaptive offset coordinates of the deformable convolution, are first obtained from the intermediate pyramid features. The obtaining the first pyramid feature according to the initial pyramid feature and the deformable convolution parameter then includes:
carrying out deformable convolution on each layer of feature map of the intermediate pyramid feature according to the deformable convolution parameters and a first preset formula to obtain the first pyramid feature;
the first preset formula is as follows:
$$E_l(p) = \sum_{n} w_{p_n} \, D_l(p + p_n + \Delta p_n)$$
where $E_l(p)$ is the pixel value of the pixel with coordinate $p$ on the $l$-th layer feature map of the first pyramid feature, $w_{p_n}$ is the weight of the deformable convolution, $D_l(p + p_n + \Delta p_n)$ is the pixel value of the pixel with coordinate $p + p_n + \Delta p_n$ on the $l$-th layer feature map of the intermediate pyramid feature, $p_n$ is a fixed initial offset coordinate of the deformable convolution, $\Delta p_n$ is an adaptive offset coordinate of the deformable convolution, and the sum runs over the sampling positions $n$ of the convolution kernel.
The deformable convolution parameters may be obtained from the intermediate pyramid features through one or more convolutions. Specifically, the article detection network may be provided with an adaptive offset coordinate prediction module comprising one or more convolutions. The feature map of each layer of the intermediate pyramid features is used as the input of this module, which, after convolution, outputs the corresponding adaptive offset coordinates for each pixel of that feature map. For a pixel p on the feature map of a given layer of the intermediate pyramid features, a 3×3 convolution may output a feature vector with 18 channels. Each of the 18 channels holds one adaptive offset coordinate value of the pixel, and the 18 coordinate values form 9 pairs of (x, y) coordinates, which correspond exactly to the sampling positions of a 3×3 convolution kernel. By adding to the coordinate p the fixed initial offset coordinates of the convolution and the adaptive offset coordinates obtained from the intermediate pyramid features, the positions of the convolution kernel can be adjusted adaptively and dynamically.
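The sampling scheme described above can be sketched for a single output pixel as follows. This is an illustrative sketch, not the network's implementation: the interleaved (x, y) layout of the 18 offset channels, the use of bilinear interpolation for fractional sampling positions, and the treatment of positions outside the map as zero are assumptions, and the names `bilinear` and `deformable_conv_at` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[float]]

# Fixed initial offsets p_n of a 3x3 kernel, stored as (dx, dy) pairs.
FIXED_OFFSETS: List[Tuple[int, int]] = [
    (dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
]


def bilinear(d: Grid, x: float, y: float) -> float:
    """Bilinearly interpolate d at a real-valued (x, y); outside the map counts as 0."""
    h, w = len(d), len(d[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += (1 - abs(x - xx)) * (1 - abs(y - yy)) * d[yy][xx]
    return value


def deformable_conv_at(d: Grid, p: Tuple[int, int],
                       weights: Sequence[float],
                       offsets18: Sequence[float]) -> float:
    """Compute sum_n w_n * D(p + p_n + delta_p_n) for one pixel p = (x, y)."""
    px, py = p
    total = 0.0
    for n, (dx, dy) in enumerate(FIXED_OFFSETS):
        # Assumed layout: channels 2n and 2n+1 hold the adaptive (x, y) offset
        # of sampling position n.
        ax, ay = offsets18[2 * n], offsets18[2 * n + 1]
        total += weights[n] * bilinear(d, px + dx + ax, py + dy + ay)
    return total


d_l: Grid = [[float(5 * y + x) for x in range(5)] for y in range(5)]
ones = [1.0] * 9
plain = deformable_conv_at(d_l, (2, 2), ones, [0.0] * 18)       # ordinary 3x3 sum
shifted = deformable_conv_at(d_l, (2, 2), ones, [0.5, 0.0] * 9)  # all samples moved right
```

With all adaptive offsets set to zero the result reduces to an ordinary 3×3 convolution at p; non-zero offsets move each of the nine sampling points independently.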
As is clear from the above description, in the article detection method provided in this embodiment, reliable context information is gradually collected for each pixel in the process of obtaining the first pyramid feature, through joint convolution, dilated (hole) convolution and deformable convolution. The whole process moves from a preset small range to an adaptive large range, so that the important information of the current area can be aggregated for each pixel, and not too much information is missed even when the kernels of the dilated convolution expand. To overcome the fixed, predetermined manner in which multi-scale information is collected, deformable convolution is added, which gives the network the ability to dynamically adjust the convolution positions according to the image content while the dilation rate stays fixed. This improves the accuracy of feature extraction and the generalization ability of the article detection network.
As shown in fig. 2, after the first pyramid feature is obtained, the output feature map that is finally used to output the article detection result may be obtained through a global collecting module (GCB) in the article detection network. As shown in fig. 1, after the first pyramid feature is obtained, the article detection method provided in this embodiment further includes the step of:
S300, extracting a region of interest according to the first pyramid features, and supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid features and the region of interest, to obtain an output feature map of the region of interest.
As shown in fig. 5, after the first pyramid feature is obtained, a region of interest (ROI) is extracted according to the first pyramid feature. Specifically, the region of interest may be obtained using an existing region-of-interest extraction network structure such as RPN and ROIAlign. The supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid feature and the region of interest, to obtain an output feature map of the region of interest, includes:
S310, obtaining a correlation feature map of each layer of feature map of the first pyramid feature and the region of interest;
S320, acquiring the output feature map according to the correlation feature map and the first pyramid feature.
Feature maps of different resolutions in the pyramid features differ in their sensitivity to object size. For example, a feature map with a larger resolution retains the most information, which favors locating and identifying small objects, while a feature map with a smaller resolution has more refined semantics and defines the category of large objects more accurately. In this embodiment, therefore, a unified correlation calculation is performed for each region of interest using each layer of feature map of the first pyramid feature. Specifically, the acquiring the correlation feature map of each layer of feature map of the first pyramid feature and the region of interest includes:
acquiring the correlation feature map through a second preset formula;
the second preset formula is:
where $G_i$ is the correlation feature map corresponding to the $i$-th region of interest, $F_i$ is the $i$-th region of interest, $E_l$ is the $l$-th layer feature map of the first pyramid feature, $\mathrm{Pool}(E_l)$ denotes downsampling $E_l$ to the same resolution as $F_i$, and $\mathrm{AvgPool}$ denotes global pooling, i.e., compressing the resolution of the feature map to 1×1. In this embodiment, the usefulness of the unified global feature for the feature of the region of interest is determined by calculating the correlation between the pooled region of interest and the pooled global feature. The correlation is used as a weight that multiplies the unified global feature, and the product is then added to the original feature to supplement the feature information of the region of interest.
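The weighting scheme described in the paragraph above (correlate the pooled features, use the correlation as a weight on the global feature, add the result to the original region-of-interest feature) can be sketched as follows. The formula itself is not reproduced in this text, so the sketch rests on loudly stated assumptions: the correlation is taken to be the sigmoid of the dot product of the two globally pooled channel vectors, each pyramid level is assumed to be already downsampled to the resolution of the region of interest, and the names `global_avg_pool` and `supplement_roi` are hypothetical.

```python
import math
from typing import List

Feature = List[List[List[float]]]  # channels x height x width


def global_avg_pool(f: Feature) -> List[float]:
    """Compress each channel to a single value (1 x 1 resolution)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in f]


def supplement_roi(roi: Feature, pooled_levels: List[Feature]) -> Feature:
    """Add correlation-weighted global features to a region-of-interest feature.

    pooled_levels: pyramid feature maps already downsampled to the ROI resolution.
    """
    out = [[row[:] for row in ch] for ch in roi]  # start from the original feature
    roi_vec = global_avg_pool(roi)
    for level in pooled_levels:
        level_vec = global_avg_pool(level)
        # Assumed correlation measure: sigmoid of the dot product of pooled vectors.
        score = sum(a * b for a, b in zip(roi_vec, level_vec))
        weight = 1.0 / (1.0 + math.exp(-score))
        for c, ch in enumerate(level):
            for r, row in enumerate(ch):
                for k, v in enumerate(row):
                    out[c][r][k] += weight * v
    return out


roi_feature: Feature = [[[0.0, 0.0], [0.0, 0.0]]]
level_feature: Feature = [[[1.0, 1.0], [1.0, 1.0]]]
supplemented = supplement_roi(roi_feature, [level_feature])
```

Any other correlation measure could be substituted for the dot product without changing the overall structure of weighting and residual addition.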
After the correlation feature map G, supplemented with unified global information, is obtained, a globally adaptive region of interest is obtained by supplementing pixel-level correlation information, and is used as the feature map from which the article detection result is finally output. Specifically, the obtaining the output feature map according to the correlation feature map and the first pyramid feature includes:
acquiring the output feature map through a third preset formula;
the third preset formula is:
where the quantities involved are: the pixel value of the $i$-th pixel in the output feature map corresponding to the $k$-th region of interest; $H \times W$, the resolution of the $l$-th layer feature map of the first pyramid feature; the pixel value of the $i$-th pixel in the correlation feature map corresponding to the $k$-th region of interest; and the pixel value of the $j$-th pixel in the $l$-th layer feature map of the first pyramid feature.
As can be seen from the above description, the article detection method provided in this embodiment compensates for the detection accuracy of some categories (such as large objects) while also improving the accuracy across all categories. In addition, adding global context information to the region of interest provides more identifiable clues for occluded objects, so that the finally extracted output feature map allows occluded objects to be identified better.
S400, acquiring an article detection result according to the output feature map.
Specifically, acquiring the article detection result according to the output feature map may be implemented by a classifier in the article detection network (for example, a Cascade RCNN classifier); that is, the output feature map is used as the input of the classifier, and the classifier outputs the article detection result.
In this embodiment, parameters (weight, bias, etc. of the convolution kernel) and classifier parameters of each module in the article detection network are obtained in the training process of the article detection network. The training process of the article detection network may be guided by a plurality of L1 and cross entropy loss functions, and the article detection network uses rpn+roialign to extract the region of interest, uses cascades RCNN as a classifier, and in the training process of the article detection network, the total loss function may be defined as:
wherein the first term is the cross-entropy loss in the RPN for classifying candidate regions, and the second term is the smooth L1 loss in the RPN for regressing the position coordinates of candidate regions; p_i and its ground-truth counterpart denote the predicted value and the true value of the object class, and t_i and its ground-truth counterpart denote the predicted value and the true value of the object position coordinates. Because Cascade RCNN is a three-stage cascade classifier, it contributes six loss functions (each stage has its own classification loss and regression loss). As with the two loss functions in the RPN, the first pair of Cascade RCNN terms are the classification and regression losses of the first stage, and the remaining two pairs likewise belong to the second and third stages. λ_c, λ_r, λ_c1, λ_r1, λ_c2, λ_r2, λ_c3 and λ_r3 are the coefficients used to balance the loss terms. Empirically, default values for the coefficients may be set to 1.0, 0.5, 0.25, and 0.25, respectively. Of course, it will be appreciated by those skilled in the art that when different network structures are employed, different loss functions may be set accordingly to achieve the effect of training the article detection network.
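The structure of such a total loss, a weighted sum of classification and regression terms, can be illustrated as follows. This is a minimal sketch rather than the patent's formula: the smooth L1 threshold of 1.0 is an assumed default, and the function names and the term names passed to `total_loss` are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy for one sample: negative log of the true-class probability."""
    return -math.log(probs[target])


def smooth_l1(pred, true):
    """Smooth L1 loss summed over coordinates (threshold 1.0 is an assumption)."""
    total = 0.0
    for p, t in zip(pred, true):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def total_loss(terms, coeffs):
    """Weighted sum of named loss terms; coeffs holds one balancing coefficient per term."""
    return sum(coeffs[name] * value for name, value in terms.items())
```

For a network with an RPN and a three-stage cascade, `terms` would hold eight entries (two for the RPN and two per stage), each paired with its own coefficient in `coeffs`.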
The article detection method provided by this embodiment was verified experimentally. Specifically, it was implemented in the Python programming language on the PyTorch platform; CUDA parallel computing on the GPU was used to accelerate network computation, and shell scripts were used to execute the various training and inference commands. The hardware environment was a computer with a 24-core 2.20 GHz Xeon central processing unit, 256 GB of memory, and an Nvidia Quadro P6000 graphics card. Table 1 shows the additional resource consumption, relative to the existing Cascade RCNN article detection model, for training on two pictures with a resolution of 640×480.
TABLE 1
During training, for picture data with a batch size of 2 on one graphics card, the proposed network uses a reasonable amount of extra computing resources compared with the base network Cascade RCNN and can be trained on a 12 GB graphics card. The network was trained with 8 graphics cards, a batch size of 16, 20,000 iterations, and a learning rate of 0.1. As can be seen from the final training result graph shown in fig. 6, as the number of iterations increases to 20,000, the loss gradually converges to a stable value, showing that the network converges quickly under limited training resources.
The models saved at 10000, 15000 and 20000 iterations of the article detection network were each used to test a single RGB picture, and the model predictions were visualized, as shown in fig. 7. The boxes in fig. 7 represent the positions of restricted articles, and the numbers above the boxes represent their categories. As can be seen from fig. 7, at 10000 iterations the article detection network can already detect some of the more obvious and simple restricted articles, and the network gradually learns the feature distribution of restricted articles as the number of training iterations increases. At 20000 iterations, the article detection network can detect occluded restricted articles, and the visualized predictions basically match the ground-truth labels, which indicates the feasibility of the article detection method of this embodiment for detecting restricted articles during security inspection.
Table 2 provides the time required by each part for training and inference on two pictures with a resolution of 640×480. As can be seen from Table 2, the inference time of the LCB module and the GCB module in the article detection network provided by the invention is about 100 ms, roughly 1/5 of the total time consumed by the network, which does not affect the application of the network in actual monitoring. According to statistics, the article detection network needs 150 ms to infer one picture, i.e., detection results for 6 frames can be returned within each second of an item passing through the security inspection machine.
Module | Training time (ms) | Inference time (ms)
ResNet+FPN | 126 | 62
LCB | 86 | 32
RPN+ROIAlign | 311 | 125
GCB | 98 | 42
Cascade RCNN | 116 | 62
Total | 737 | 323
TABLE 2
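The figures in Table 2 and the stated frame rate can be checked with a few lines of arithmetic. The snippet below only restates the numbers given above; the variable and function names are illustrative.

```python
# Per-module (training, inference) times from Table 2, in milliseconds.
TABLE_2 = {
    "ResNet+FPN": (126, 62),
    "LCB": (86, 32),
    "RPN+ROIAlign": (311, 125),
    "GCB": (98, 42),
    "Cascade RCNN": (116, 62),
}

train_total = sum(train for train, _ in TABLE_2.values())
infer_total = sum(infer for _, infer in TABLE_2.values())

# Share of inference time spent in the two context modules (LCB and GCB).
context_share = (TABLE_2["LCB"][1] + TABLE_2["GCB"][1]) / infer_total


def frames_per_second(ms_per_frame):
    """Number of whole frames that complete within one second."""
    return 1000 // ms_per_frame
```

With the tabulated values, the column sums reproduce the totals of 737 ms and 323 ms, LCB and GCB together take 74 ms of inference (about 23% of the total), and 150 ms per picture corresponds to 6 whole frames per second.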
Table 3 evaluates the effectiveness of each module, using mAP as the quantitative index for comparison, i.e., the mean over all classes of the area under the precision-recall curve. As shown in Table 3, adding the previously proposed FPN and Cascade RCNN modules achieves 46.8% mAP. Adding the LCB and GCB modules, which supply the network with spatially adaptive context information, finally achieves 59.2% mAP. This shows that the LCB and GCB modules in the article detection network provided by the invention are practical and effective.
Backbone network FPN Cascade RCNN LCB GCB mAP(%)
ResNet-101 28.5
ResNet-101 32.1
ResNet-101 46.8
ResNet-101 54.6
ResNet-101 59.2
TABLE 3
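The mAP index used in Table 3 can be illustrated with a short sketch. It assumes the uninterpolated form of average precision (the mean of the precision values at each true positive, taken in descending score order); detection benchmarks differ in their interpolation rules, and the text does not say which variant was used. The function names are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Uninterpolated area under the precision-recall curve for one class."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if is_true_positive[i]:
            hits += 1
            area += hits / rank  # precision at this recall step
    return area / num_ground_truth if num_ground_truth else 0.0


def mean_average_precision(per_class):
    """Mean of the per-class average precisions; per_class is a list of argument tuples."""
    aps = [average_precision(*args) for args in per_class]
    return sum(aps) / len(aps)
```

For example, a class with detections scored 0.9, 0.8 and 0.7, of which the first and third are correct and which has two ground-truth objects, gets an average precision of (1/1 + 2/3) / 2.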
The visualized detection boxes, categories and classification scores obtained during prediction are shown in fig. 8, where the boxes represent the positions of restricted articles and the numbers above the boxes represent their categories. The detection results basically predict all the restricted articles contained in the labels, and the predicted categories are basically consistent with the labeled categories. The classification score is shown after each category; scores approaching 1 indicate the high accuracy of the model.
To verify the robustness and advancement of the network, Table 4 provides a quantitative comparison with current state-of-the-art restricted-article detection networks, using FPS and mAP as the quantitative indicators. For fairness, all networks use the same backbone network, ResNet-101. As can be seen from Table 4, the proposed network achieves a higher mAP than the listed existing methods and an inference speed of 6 FPS, which can basically cope with the restricted-article detection task in actual scenarios.
Method | Backbone network | Inference speed (FPS) | mAP (%)
Ren et al. | ResNet-101 | 9 | 32.1
Cai et al. | ResNet-101 | 7 | 46.8
Akcay et al. | ResNet-101 | 10 | 42.4
Miao et al. | ResNet-101 | 6 | 55.6
The present invention | ResNet-101 | 6 | 59.2
TABLE 4
To show more clearly the advantages of the article detection method provided by the invention over other methods, the original pictures, the labels, and the prediction results of the method proposed by Miao et al. are visualized. As shown in fig. 9, the method proposed by Miao et al. suffers from inaccurate and redundant detection boxes. The first and third rows of visualizations show that its detection boxes are much larger than the labels. The second row shows that it cannot accurately detect overlapping small objects and misses detections, which is a critical failure for the security industry. The fourth row shows that it detects an ordinary object as a restricted article. The article detection method provided by the invention therefore performs better than the method proposed by Miao et al.
In summary, this embodiment provides an article detection method in which pyramid features are extracted after the image to be detected is received, and a first pyramid feature is obtained by deformable convolution so that more local context information is collected. The feature information of the region of interest is then supplemented according to the correlation between each layer of feature map of the first pyramid feature and the region of interest, so that global context information is injected into the pixels of the region of interest. The method therefore achieves higher detection accuracy than existing detection network models.
It should be understood that, although the steps in the flowcharts shown in the drawings of the present specification are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
Embodiment two
Based on the above embodiment, the present invention further provides an article detection device, as shown in fig. 10, where the article detection device includes:
a first feature extraction module, configured to receive an image to be detected and acquire initial pyramid features of the image to be detected, as described in embodiment one;
a second feature extraction module, configured to obtain deformable convolution parameters according to the initial pyramid features, and obtain first pyramid features according to the initial pyramid features and the deformable convolution parameters, as described in embodiment one;
a third feature extraction module, configured to extract a region of interest according to the first pyramid features, and supplement feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid features and the region of interest, so as to obtain an output feature map of the region of interest, as described in embodiment one;
and an output module, configured to acquire an article detection result according to the output feature map, as described in embodiment one.
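The four modules listed above form a simple chain in which each module consumes the output of the previous one. The sketch below shows that wiring only; the class name `ArticleDetector`, its attribute names, and the use of plain callables for the modules are illustrative assumptions rather than the device's actual implementation.

```python
class ArticleDetector:
    """Chains four modules: three feature extraction stages and an output stage."""

    def __init__(self, first_module, second_module, third_module, output_module):
        self.first_module = first_module      # image -> initial pyramid features
        self.second_module = second_module    # initial pyramid -> first pyramid features
        self.third_module = third_module      # first pyramid -> output feature map of the ROI
        self.output_module = output_module    # output feature map -> detection result

    def detect(self, image):
        initial_pyramid = self.first_module(image)
        first_pyramid = self.second_module(initial_pyramid)
        output_feature_map = self.third_module(first_pyramid)
        return self.output_module(output_feature_map)
```

Keeping each stage behind a callable makes it possible to swap one module (for instance, the classifier in the output stage) without touching the others.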
Embodiment three
Based on the above embodiment, the present application also provides a terminal correspondingly, as shown in fig. 11, where the terminal includes a processor 10 and a memory 20. Fig. 11 shows only some of the components of the terminal, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead.
The memory 20 may in some embodiments be an internal storage unit of the terminal, such as a hard disk or a memory of the terminal. The memory 20 may in other embodiments also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal. Further, the memory 20 may also include both an internal storage unit and an external storage device of the terminal. The memory 20 is used for storing application software and various data installed in the terminal. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores an item detection program 30, and the item detection program 30 is executable by the processor 10 to implement the item detection method of the present application.
The processor 10 may in some embodiments be a central processing unit (Central Processing Unit, CPU), microprocessor or other chip for executing program code or processing data stored in the memory 20, for example for performing the item detection method or the like.
In one embodiment, the following steps are implemented when the processor 10 executes the item detection program 30 in the memory 20:
receiving an image to be detected, and acquiring initial pyramid features of the image to be detected;
obtaining deformable convolution parameters according to the initial pyramid features, and obtaining first pyramid features according to the initial pyramid features and the deformable convolution parameters;
extracting a region of interest according to the first pyramid features, and supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid features and the region of interest to obtain an output feature map of the region of interest;
and acquiring an article detection result according to the output feature map.
The obtaining the deformable convolution parameters according to the initial pyramid features comprises the following steps:
applying at least one joint convolution to the feature map of a target layer of the initial pyramid feature to obtain at least one intermediate feature map;
fusing the at least one intermediate feature map to obtain the feature map of the target layer of the intermediate pyramid feature;
acquiring the deformable convolution parameters according to the intermediate pyramid feature;
wherein the joint convolution is implemented by two convolution kernels of sizes 1×j and j×1, respectively, j being a positive integer.
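A joint convolution built from a 1×j kernel and a j×1 kernel can be sketched as follows. The sketch assumes that the two kernels are applied one after the other on a single-channel map with no padding; the text does not state the order or padding, and the function names are hypothetical.

```python
def conv_1xj(fmap, kernel):
    """Slide a 1 x j kernel along each row (no padding)."""
    j = len(kernel)
    return [[sum(row[x + t] * kernel[t] for t in range(j))
             for x in range(len(row) - j + 1)]
            for row in fmap]


def conv_jx1(fmap, kernel):
    """Slide a j x 1 kernel down each column (no padding)."""
    j = len(kernel)
    return [[sum(fmap[y + t][x] * kernel[t] for t in range(j))
             for x in range(len(fmap[0]))]
            for y in range(len(fmap) - j + 1)]


def joint_conv(fmap, row_kernel, col_kernel):
    """Assumed composition: the 1 x j pass followed by the j x 1 pass."""
    return conv_jx1(conv_1xj(fmap, row_kernel), col_kernel)
```

Applied in sequence, the pair covers a j×j neighbourhood with 2j weights instead of j², which is one common reason for choosing this kind of factorization.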
Wherein the deformable convolution parameters include adaptive offset coordinates of a deformable convolution; the obtaining a first pyramid feature according to the initial pyramid feature and the deformable convolution parameters includes:
performing deformable convolution on each layer of feature map of the intermediate pyramid feature according to the deformable convolution parameters and a first preset formula to obtain the first pyramid feature;
the first preset formula is as follows:
wherein the first term denotes the pixel value of the pixel with coordinate p on the l-th layer feature map of the first pyramid feature; the second term denotes the weight of the deformable convolution; D_l(p+p_n+Δp_n) is the pixel value of the pixel with coordinate p+p_n+Δp_n on the l-th layer feature map of the intermediate pyramid feature; p_n is the fixed initial offset coordinate of the deformable convolution; and Δp_n is the adaptive offset coordinate of the deformable convolution.
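The sampling rule described above, a weighted sum of values read at p+p_n+Δp_n, can be sketched for a single output pixel of a single-channel map. The sketch assumes bilinear interpolation for fractional sampling positions and treats reads outside the map as zero, which is the usual practice for deformable convolution but is not stated in this text; the function names are hypothetical.

```python
import math


def bilinear(fmap, y, x):
    """Sample a 2-D map at a fractional (y, x); positions outside the map count as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += fmap[yy][xx] * (1 - abs(y - yy)) * (1 - abs(x - xx))
    return value


def deform_conv_at(fmap, p, weights, grid, offsets):
    """Deformable convolution output at pixel p.

    weights: one weight per sampling point
    grid:    fixed initial offsets p_n as (dy, dx) pairs
    offsets: adaptive offsets delta p_n as (dy, dx) pairs
    """
    py, px = p
    return sum(w * bilinear(fmap, py + gy + dy, px + gx + dx)
               for w, (gy, gx), (dy, dx) in zip(weights, grid, offsets))
```

With all adaptive offsets set to zero, the function reduces to an ordinary convolution over the fixed grid, which makes the role of Δp_n easy to isolate when experimenting.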
The step of supplementing the feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid feature and the region of interest to obtain an output feature map of the region of interest includes:
Acquiring a correlation feature map of each layer of feature map of the first pyramid feature and the region of interest;
and acquiring the output feature map according to the correlation feature map and the first pyramid feature.
Wherein the obtaining the correlation feature map of each layer of feature map of the first pyramid feature and the region of interest includes:
acquiring the correlation feature map through a second preset formula;
the second preset formula is:
wherein G_i is the correlation feature map corresponding to the i-th region of interest, F_i is the i-th region of interest, E_l is the l-th layer feature map of the first pyramid feature, pool(E_l) denotes downsampling E_l to the same resolution as F_i, and avgPool denotes global pooling.
The obtaining the output feature map according to the correlation feature map and the first pyramid feature includes:
acquiring the output feature map through a third preset formula;
the third preset formula is:
wherein the terms of the formula denote, respectively: the pixel value of the i-th pixel in the output feature map corresponding to the k-th region of interest; H×W, the resolution of the l-th layer feature map of the first pyramid feature; the pixel value of the i-th pixel in the correlation feature map corresponding to the k-th region of interest; and the pixel value of the j-th pixel in the l-th layer feature map of the first pyramid feature.
Embodiment four
The present invention also provides a computer-readable storage medium in which one or more programs are stored, the one or more programs being executable by one or more processors to implement the steps of the item detection method as described above.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method of detecting an article, the method comprising:
receiving an image to be detected, and acquiring initial pyramid features of the image to be detected;
obtaining deformable convolution parameters according to the initial pyramid features, and obtaining first pyramid features according to the initial pyramid features and the deformable convolution parameters;
extracting a region of interest according to the first pyramid features, and supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid features and the region of interest to obtain an output feature map of the region of interest;
the step of supplementing the feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid feature and the region of interest to obtain an output feature map of the region of interest, includes:
acquiring a correlation feature map of each layer of feature map of the first pyramid feature and the region of interest;
acquiring the output feature map according to the correlation feature map and the first pyramid feature;
the obtaining the correlation feature map of each layer of feature map of the first pyramid feature and the region of interest includes:
acquiring the correlation feature map through a second preset formula;
the second preset formula is:
wherein G_i is the correlation feature map corresponding to the i-th region of interest, F_i is the i-th region of interest, E_l is the l-th layer feature map of the first pyramid feature, pool(E_l) denotes downsampling E_l to the same resolution as F_i, and avgPool denotes global pooling;
the obtaining the output feature map according to the correlation feature map and the first pyramid feature includes:
acquiring the output feature map through a third preset formula;
the third preset formula is:
wherein the terms of the formula denote, respectively: the pixel value of the i-th pixel in the output feature map corresponding to the k-th region of interest; H×W, the resolution of the l-th layer feature map of the first pyramid feature; the pixel value of the i-th pixel in the correlation feature map corresponding to the k-th region of interest; and the pixel value of the j-th pixel in the l-th layer feature map of the first pyramid feature;
and acquiring an article detection result according to the output feature map.
2. The method of claim 1, wherein the obtaining deformable convolution parameters from the initial pyramid features comprises:
applying at least one joint convolution to the feature map of a target layer of the initial pyramid feature to obtain at least one intermediate feature map;
fusing the at least one intermediate feature map to obtain the feature map of the target layer of the intermediate pyramid feature;
acquiring the deformable convolution parameters according to the intermediate pyramid feature;
wherein the joint convolution is implemented by two convolution kernels of sizes 1×j and j×1, respectively, j being a positive integer.
3. The method of claim 2, wherein the deformable convolution parameters comprise adaptive offset coordinates of a deformable convolution; the obtaining a first pyramid feature according to the initial pyramid feature and the deformable convolution parameter includes:
performing deformable convolution on each layer of feature map of the intermediate pyramid feature according to the deformable convolution parameters and a first preset formula to obtain the first pyramid feature;
the first preset formula is as follows:
wherein the first term denotes the pixel value of the pixel with coordinate p on the l-th layer feature map of the first pyramid feature; the second term denotes the weight of the deformable convolution; D_l(p+p_n+Δp_n) is the pixel value of the pixel with coordinate p+p_n+Δp_n on the l-th layer feature map of the intermediate pyramid feature; p_n is the fixed initial offset coordinate of the deformable convolution; and Δp_n is the adaptive offset coordinate of the deformable convolution.
4. A method of detecting an article according to any one of claims 1 to 3, wherein the method of detecting an article is carried out by a trained article detection network, wherein the article detection network is trained by a plurality of sets of training data, each set of training data comprising an image to be detected of a sample and an article detection result corresponding to the image to be detected of the sample.
5. An article detection device, comprising:
the first feature extraction module is used for receiving an image to be detected and acquiring initial pyramid features of the image to be detected;
the second feature extraction module is used for acquiring deformable convolution parameters according to the initial pyramid features and acquiring first pyramid features according to the initial pyramid features and the deformable convolution parameters;
the third feature extraction module is used for extracting a region of interest according to the first pyramid features, and supplementing feature information of the region of interest according to the correlation between each layer of feature map of the first pyramid features and the region of interest to obtain an output feature map of the region of interest;
The third feature extraction module is specifically configured to:
acquiring a correlation feature map of each layer of feature map of the first pyramid feature and the region of interest;
acquiring the output feature map according to the correlation feature map and the first pyramid feature;
the obtaining the correlation feature map of each layer of feature map of the first pyramid feature and the region of interest includes:
acquiring the correlation feature map through a second preset formula;
the second preset formula is:
wherein G_i is the correlation feature map corresponding to the i-th region of interest, F_i is the i-th region of interest, E_l is the l-th layer feature map of the first pyramid feature, pool(E_l) denotes downsampling E_l to the same resolution as F_i, and avgPool denotes global pooling;
the obtaining the output feature map according to the correlation feature map and the first pyramid feature includes:
acquiring the output feature map through a third preset formula;
the third preset formula is:
wherein the terms of the formula denote, respectively: the pixel value of the i-th pixel in the output feature map corresponding to the k-th region of interest; H×W, the resolution of the l-th layer feature map of the first pyramid feature; the pixel value of the i-th pixel in the correlation feature map corresponding to the k-th region of interest; and the pixel value of the j-th pixel in the l-th layer feature map of the first pyramid feature;
and the output module is used for acquiring an article detection result according to the output feature map.
6. A terminal, the terminal comprising: a processor, a computer readable storage medium communicatively coupled to the processor, the computer readable storage medium adapted to store a plurality of instructions, the processor adapted to invoke the instructions in the computer readable storage medium to perform the steps of implementing the method of item detection of any of the above claims 1-4.
7. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps of the item detection method of any one of claims 1-4.
CN202110429795.8A 2021-04-21 2021-04-21 Article detection method, device, terminal and storage medium Active CN113128408B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110429795.8A CN113128408B (en) 2021-04-21 2021-04-21 Article detection method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN113128408A CN113128408A (en) 2021-07-16
CN113128408B true CN113128408B (en) 2023-09-22

Family

ID=76778758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110429795.8A Active CN113128408B (en) 2021-04-21 2021-04-21 Article detection method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113128408B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348445A (en) * 2019-06-06 2019-10-18 华中科技大学 A kind of example dividing method merging empty convolution sum marginal information
CN110516732A (en) * 2019-08-22 2019-11-29 北京地平线机器人技术研发有限公司 The training method of feature pyramid network, the method and apparatus for extracting characteristics of image
WO2021027135A1 (en) * 2019-08-15 2021-02-18 平安科技(深圳)有限公司 Cell detection model training method and apparatus, computer device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965719B2 (en) * 2015-11-04 2018-05-08 Nec Corporation Subcategory-aware convolutional neural networks for object detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
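Application of an improved YOLOv3-based model in pedestrian detection; Huang Tongyuan; Yang Xuejiao; Xiang Guohui; Chen Liao; Journal of Chongqing University of Technology (Natural Science) (08); pp. 163-172 *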

Also Published As

Publication number Publication date
CN113128408A (en) 2021-07-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant