CN112101373A - Object detection method and device based on deep learning network and electronic equipment - Google Patents

Object detection method and device based on deep learning network and electronic equipment

Info

Publication number
CN112101373A
Authority
CN
China
Prior art keywords
feature
unit
size
generation unit
feature map
Prior art date
Legal status
Pending
Application number
CN201910525931.6A
Other languages
Chinese (zh)
Inventor
陶轩
谭志明
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201910525931.6A priority Critical patent/CN112101373A/en
Priority to JP2020100215A priority patent/JP2020205048A/en
Publication of CN112101373A publication Critical patent/CN112101373A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The embodiments of the present application provide an object detection method and apparatus based on a deep learning network, and an electronic device. The object detection apparatus based on a deep learning network includes: a feature extraction unit having a plurality of feature extraction means for extracting features of different sizes from an input image; a multi-size feature generation unit having a plurality of cascade-connected feature generation units that, based on the features of different sizes extracted by the feature extraction unit, respectively generate feature maps corresponding to the respective sizes by using deformed convolution (Deformable Convolution) processing; and an object position detection unit that detects bounding box information of objects of corresponding sizes from the feature maps of different sizes generated by the multi-size feature generation unit, respectively, using a candidate region generation network (RPN).

Description

Object detection method and device based on deep learning network and electronic equipment
Technical Field
The present application relates to the field of electronic information technology.
Background
In recent years, due to the close relationship with image analysis, image-based object detection techniques have received much attention. With the rapid development of deep learning, especially the development of Convolutional Neural Network (CNN), the performance of the object detection technology has been greatly improved. At present, advanced object detection technology has reached very high detection accuracy and Recall Rate (Recall Rate).
Despite the tremendous advances in object detection technology, many challenges remain in this field. One challenge is that it is difficult to recognize objects whose sizes differ greatly; to address this, researchers developed the Faster R-CNN classifier.
Another challenge is the influence of geometric transformations of the object shape on the recognition result: how to adapt to geometric transformations such as the size, posture, observation angle and deformation of objects in an image is a key problem in visual recognition. Generally, there are two approaches to mitigating the impact of geometric transformations of the object shape on recognition results: the first approach is to maintain a data set that covers all variations; the second approach is to use handcrafted features and specific algorithms that are invariant to the geometric transformation.
It should be noted that the above background description is provided only for the sake of a clear and complete description of the technical solutions of the present application, and to facilitate the understanding of those skilled in the art. These solutions are not to be regarded as known to a person skilled in the art merely because they are set forth in the background section of the present application.
Disclosure of Invention
The inventors of the present application have found that the existing object detection techniques have some limitations when faced with the above two challenges. For example: the Faster R-CNN classifier, while performing well in recognizing large-sized objects, has difficulty recognizing objects of smaller size; for the approach of maintaining a data set that covers all variations, it is difficult for a data set to cover all situations in real life, and cost and effort increase dramatically as the data grow; and the approach of using handcrafted features and specific algorithms requires a great deal of a priori knowledge and experience to manually design a suitable feature for a specific geometric transformation, and a new feature must be designed manually whenever a new geometric transformation is involved.
The embodiments of the present application provide an object detection method and apparatus based on a deep learning network, and an electronic device.
According to a first aspect of embodiments of the present application, there is provided an object detection apparatus based on a deep learning network, including:
a feature extraction unit having a plurality of feature extraction means for extracting features of different sizes from an input image;
a multi-size feature generation unit having a plurality of cascade-connected feature generation units that, based on the features of different sizes extracted by the feature extraction unit, respectively generate feature maps corresponding to the respective sizes by using deformed convolution (Deformable Convolution) processing; and
an object position detection unit that detects bounding box information of objects of corresponding sizes from the feature maps of different sizes generated by the multi-size feature generation unit, respectively, using a candidate region generation network (RPN).
According to a second aspect of the embodiments of the present application, there is provided an object detection method based on a deep learning network, including:
a plurality of feature extraction units respectively extracting features of different sizes from an input image;
a plurality of cascaded feature generation units respectively generate feature maps corresponding to the sizes by using deformation Convolution (Deformable Convolution) processing according to the features of different sizes extracted by the plurality of feature extraction units; and
detecting frame information of objects of corresponding sizes from the generated feature maps of different sizes, respectively, using a candidate region generation network (RPN).
According to a third aspect of the embodiments of the present application, there is provided an electronic device including the object detection apparatus based on a deep learning network as described in the first aspect of the embodiments.
One of the beneficial effects of the embodiments of the present application is that small-sized objects in an image can be accurately recognized, and the influence of geometric transformations of the object shape in the image on the detection result can be reduced.
Specific embodiments of the present application are disclosed in detail with reference to the following description and drawings, indicating the manner in which the principles of the application may be employed. It should be understood that the embodiments of the present application are not so limited in scope. The embodiments of the present application include many variations, modifications, and equivalents within the spirit and scope of the appended claims.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments, in combination with or instead of the features of the other embodiments.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps or components.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the application, are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
fig. 1 is a schematic diagram of an object detection apparatus based on a deep learning network according to a first aspect of an embodiment of the present application;
FIG. 2 is a schematic diagram of a specific structure of the deep learning network-based object detection apparatus of FIG. 1;
FIG. 3 is a schematic diagram of the feature generation unit 102n;
FIG. 4 is a schematic diagram of an object detection method based on a deep learning network according to a second aspect of an embodiment of the present application;
fig. 5 is a schematic configuration diagram of an electronic device according to the third aspect of the embodiment of the present application.
Detailed Description
The foregoing and other features of the present application will become apparent from the following description, taken in conjunction with the accompanying drawings. In the description and drawings, particular embodiments of the application are disclosed in detail as being indicative of some of the embodiments in which the principles of the application may be employed, it being understood that the application is not limited to the embodiments described, but, on the contrary, is intended to cover all modifications, variations, and equivalents falling within the scope of the appended claims.
In the embodiments of the present application, the terms "first", "second", and the like are used for distinguishing different elements by reference, but do not denote a spatial arrangement, a temporal order, or the like of the elements, and the elements should not be limited by the terms. The term "and/or" includes any and all combinations of one or more of the associated listed terms. The terms "comprising," "including," "having," and the like, refer to the presence of stated features, elements, components, and do not preclude the presence or addition of one or more other features, elements, components, and elements.
In the embodiments of the present application, the singular forms "a", "an", and the like include the plural forms and are to be construed broadly as "a" or "an" and not limited to the meaning of "a" or "an"; furthermore, the term "the" should be understood to include both the singular and the plural, unless the context clearly dictates otherwise. Further, the term "according to" should be understood as "at least partially according to … …," and the term "based on" should be understood as "based at least partially on … …," unless the context clearly dictates otherwise.
First aspect of the embodiments
A first aspect of an embodiment of the present application provides an object detection apparatus based on a deep learning network.
Fig. 1 is a schematic diagram of an object detection apparatus based on a deep learning network according to the first aspect of an embodiment of the present application. As shown in fig. 1, the deep learning network-based object detection apparatus 100 includes: a feature extraction unit 101, a multi-size feature generation unit 102, and an object position detection unit 103.
The feature extraction unit 101 includes a plurality of feature extraction units for extracting features of different sizes from an input image; the multi-size feature generation unit 102 includes a plurality of cascade-connected feature generation units that generate feature maps (feature maps) corresponding to respective sizes, respectively, by using a deformed Convolution (Deformable Convolution) process, based on the features of different sizes extracted by the feature extraction unit 101; the object position detection unit 103 detects bounding box information of objects of corresponding sizes from feature maps (feature maps) of different sizes generated by the multi-size feature generation unit 102, respectively, using a candidate region generation Network (RPN).
According to the first aspect of the embodiments of the present application, the multi-size feature generation section 102 generates the feature maps corresponding to the respective sizes, thereby accurately detecting both large-sized and small-sized objects in the image, and the multi-size feature generation section 102 generates the feature maps using the deformed convolution process, thereby making it possible to reduce the influence of the geometric transformation of the object shape in the image on the detection result.
In the first aspect of the embodiment of the present application, as shown in fig. 1, the deep learning network-based object detection apparatus 100 may further include: a pooling unit 104.
The pooling processing unit 104 performs a deformed pooling (Deformable Pooling) process on the portions of the feature maps corresponding to the frames of the objects detected by the object position detection unit 103, so that the objects in the frames have the same size.
In the first aspect of the embodiment of the present application, as shown in fig. 1, the deep learning network-based object detection apparatus 100 may further include: a merging unit 105 and a detection unit 106.
The merging unit 105 merges (concat) the image features in the frames after the pooling processing unit 104 performs the deformed pooling process; the detection unit 106 classifies the result of the merging unit 105 using a plurality of fully connected layers (FC), and outputs the class of each object and the frame information of each object.
In the first aspect of the embodiments of the present application, the deformed convolution process, in which the sampling points of the convolution kernel in the input image can be deformed, is also referred to as deformable convolution.
In the first aspect of the embodiments of the present application, the deformed pooling process, in which the region of interest targeted by the pooling process may be deformed in the input image, is also referred to as deformable RoI pooling (Deformable RoI Pooling).
As for the principle and specific processing manner of the deformed convolution process and the deformed pooling process, reference can be made to non-patent document 1 (Dai J, Qi H, Xiong Y, et al. Deformable Convolutional Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017).
Fig. 2 is a schematic diagram of a specific structure of the deep learning network-based object detection apparatus of fig. 1.
In at least one embodiment, the feature extraction section 101 may extract features of different sizes based on, for example, a Residual Neural Network (ResNet).
As shown in fig. 2, the number of feature extraction units in the feature extraction portion 101 may be 2 or more, for example, 4, that is, 1012, 1013, 1014, and 1015. Each feature extraction unit extracts features of different sizes from the input image 200, and each feature may be, for example, a two-dimensional matrix.
As shown in fig. 2, the feature extraction units 1012, 1013, 1014, and 1015 are arranged in series in order from the input side of the input image 200. The size of the feature output by each feature extraction unit is half the size of the feature output by the feature extraction unit of the previous stage; for example, the sizes of the features output by the feature extraction units 1012, 1013, 1014, and 1015 are 1/4, 1/8, 1/16, and 1/32 of the size of the input image 200, respectively.
As shown in fig. 2, the feature extraction units 1012, 1013, 1014, and 1015 may have 2 or more feature extraction modules ResBlock_2, ResBlock_3, ResBlock_4, and ResBlock_5, respectively.
As shown in fig. 2, the feature extraction modules ResBlock_2 are 3 in number, and are ResBlock_2a, ResBlock_2b, and ResBlock_2c, respectively. The feature extraction modules ResBlock_3 are 4 in number, and are ResBlock_3a, ResBlock_3b, ResBlock_3c and ResBlock_3d, respectively. The feature extraction modules ResBlock_4 are 6 in number, and are ResBlock_4a, ResBlock_4b, ResBlock_4c, ResBlock_4d, ResBlock_4e and ResBlock_4f, respectively. The feature extraction modules ResBlock_5 are 3 in number, and are ResBlock_5a, ResBlock_5b, and ResBlock_5c, respectively. The number of each feature extraction module shown in fig. 2 is merely an example, and the present application is not limited to the above example.
In the same feature extraction unit, the features extracted by different feature extraction modules have different shapes but the same size. For example, in the feature extraction unit 1012, the feature extraction modules ResBlock_2a, ResBlock_2b, and ResBlock_2c are used to extract rectangular features, circular features, elliptical features, and the like, respectively, and the size of the features extracted by the respective modules ResBlock_2a, ResBlock_2b, and ResBlock_2c is 1/4 of the size of the input image 200.
As shown in fig. 2, the feature extraction unit 101 may further include a first convolution unit 1011, where the first convolution unit 1011 may perform convolution processing on the input image 200 and input the result of the convolution processing to the feature extraction unit 1012.
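For illustration only, the multi-scale feature extraction described above can be sketched in Python with a torchvision ResNet-50 backbone, whose layer1 to layer4 stages take the role of the ResBlock_2 to ResBlock_5 modules; the function name, the 512 × 512 example input and the choice of ResNet-50 are assumptions of this sketch rather than part of the application.

import torch
import torchvision

backbone = torchvision.models.resnet50()  # randomly initialized; load weights as needed

def extract_multiscale_features(image: torch.Tensor):
    # image: (N, 3, H, W) -> features at strides 4, 8, 16 and 32
    x = backbone.conv1(image)   # stem convolution, cf. first convolution unit 1011
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)     # 1/4 of the input size, cf. feature extraction unit 1012
    c3 = backbone.layer2(c2)    # 1/8,  cf. unit 1013
    c4 = backbone.layer3(c3)    # 1/16, cf. unit 1014
    c5 = backbone.layer4(c4)    # 1/32, cf. unit 1015
    return c2, c3, c4, c5

# A 3-channel 512 x 512 image yields maps of spatial size 128, 64, 32 and 16.
c2, c3, c4, c5 = extract_multiscale_features(torch.randn(1, 3, 512, 512))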
In at least one embodiment, the multi-size Feature generation section 102 may generate Feature maps (Feature maps) of different sizes based on, for example, the structure of a Feature Pyramid Network (FPN).
As shown in fig. 2, the number of feature generation units in the multi-size feature generation section 102 is 2 or more, for example, 5, namely 1025, 1024, 1023, 1022, and 1021.
Among them, the Feature generation units 1025, 1024, 1023, and 1022 are arranged in series in order from the output side of the Feature extraction unit 1015 of the Feature extraction section 101, and output Feature maps (Feature maps) of different sizes, i.e., P5, P4, P3, and P2, respectively, where the size of the Feature Map P5 is 1/2 of P4, the size of the Feature Map P4 is 1/2 of P3, and the size of the Feature Map P3 is 1/2 of P2.
Fig. 3 is a schematic diagram of the feature generation unit 102n. The feature generation unit 102n may be the feature generation unit 1024, 1023, or 1022.
As shown in fig. 3, P _ pre represents a feature map generated by the previous-stage feature generation unit, and may be P5, P4, or P3, for example. P _ next represents a feature map generated by the present-level feature generation unit, and may be P4, P3, or P2, for example.
As shown in fig. 3, the feature generation unit 102n may include: an interpolation unit 301, a fusion unit 302, and a deformed convolution processing unit 303.
In at least one embodiment, the interpolation unit 301 performs interpolation (interpolation) on the feature map P _ pre output by the previous feature generation unit to obtain an enlarged feature map. The interpolation process may be a bilinear (bilinear) interpolation process by which the feature map P _ pre is enlarged by a predetermined multiple, for example, 2 times.
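As a small illustrative sketch (variable names and sizes are assumptions, not taken from the application), the 2× bilinear enlargement performed by the interpolation unit 301 corresponds to the following call:

import torch
import torch.nn.functional as F

p_pre = torch.randn(1, 256, 16, 16)  # feature map output by the previous-stage feature generation unit
enlarged = F.interpolate(p_pre, scale_factor=2, mode="bilinear", align_corners=False)
# enlarged has shape (1, 256, 32, 32), i.e. the spatial size is doubled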
In at least one embodiment, the fusion unit 302 performs convolution on the feature of the corresponding size extracted by the feature extraction unit 101n (the unit in the feature extraction section 101 that corresponds to the current feature generation unit 102n), and fuses it with the enlarged feature map obtained by the interpolation unit 301.
The features extracted by the feature extraction unit 101n corresponding to the feature generation unit 102n have the same matrix size as the feature map enlarged by the interpolation unit 301. For example, when the feature generation unit 102n is 1022, the corresponding feature extraction unit 101n is 1012; when the feature generation unit 102n is 1023, the corresponding feature extraction unit 101n is 1013; when the feature generation unit 102n is 1024, the corresponding feature extraction unit 101n is 1014.
As shown in fig. 3, the fusion unit 302 may have a second convolution module 3021 and a synthesis module 3022. Here, the second convolution module 3021 performs convolution processing, for example 1 × 1 × 256 convolution processing, on the feature of the corresponding size extracted by the feature extraction unit 101n. The synthesis module 3022 may add the matrix obtained from the convolution processing of the second convolution module 3021 to the feature map enlarged by the interpolation unit 301, that is, splice the two in the depth direction, to obtain a three-dimensional matrix.
As shown in fig. 3, the deformed convolution processing unit 303 performs deformed convolution (Deformable Convolution) processing on the matrix obtained by the fusion performed by the fusion unit 302 to form the feature map P_next output by the current feature generation unit 102n. The deformed convolution processing performed by the deformed convolution processing unit 303 may be, for example, 3 × 3 × 256 deformed convolution processing.
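Putting the three sub-units together, the following sketch shows one possible form of a feature generation unit 102n, assuming torchvision's DeformConv2d for the 3 × 3 × 256 deformed convolution, additive fusion of the lateral and enlarged maps, and an auxiliary 3 × 3 convolution that predicts the sampling offsets; the class name, variable names and channel numbers are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class FeatureGenerationUnit(nn.Module):
    def __init__(self, lateral_channels: int, out_channels: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(lateral_channels, out_channels, kernel_size=1)           # cf. second convolution module 3021 (1 x 1 x 256)
        self.offset = nn.Conv2d(out_channels, 2 * 3 * 3, kernel_size=3, padding=1)        # x/y offset for each of the 3 x 3 kernel taps
        self.deform = DeformConv2d(out_channels, out_channels, kernel_size=3, padding=1)  # cf. deformed convolution unit 303

    def forward(self, p_pre: torch.Tensor, c_n: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(p_pre, scale_factor=2, mode="bilinear", align_corners=False)   # cf. interpolation unit 301
        fused = self.lateral(c_n) + up                                                    # fusion, cf. unit 302
        offsets = self.offset(fused)
        return self.deform(fused, offsets)                                                # feature map P_next

# Example: generate P4 from P5 and the corresponding backbone feature C4.
unit = FeatureGenerationUnit(lateral_channels=1024)
p5 = torch.randn(1, 256, 16, 16)
c4 = torch.randn(1, 1024, 32, 32)
p4 = unit(p5, c4)  # shape (1, 256, 32, 32)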
As shown in fig. 3, both the feature map P _ pre and the feature map P _ next may be input to the object position detection section 103 for detecting frame information of objects of different sizes.
As shown in fig. 2, the feature generation unit 1025 corresponds to the feature extraction unit 1015 in the feature extraction unit 101, wherein the feature output from the feature extraction unit 1015 is the smallest-sized feature output from the feature extraction unit 101.
As shown in fig. 2, the feature generation unit 1025 performs a deformed Convolution (Deformable Convolution) process on the features of the minimum size extracted by the feature extraction unit 1015 to form a feature map P5 output by the feature generation unit 1025.
As shown in fig. 2, the feature generation unit 1021 in the multi-size feature generation section 102 may perform pooling (pooling) on the feature map output by the feature generation unit 1025 so that the size of the feature map is halved to obtain the feature map P6, that is, the matrix size of the feature map P6 is half of the matrix size of the feature map P5.
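A small sketch of deriving P6 from P5 by halving the spatial size follows; the use of 2 × 2 max pooling is an assumption, the description above only requires that the size be halved.

import torch
import torch.nn.functional as F

p5 = torch.randn(1, 256, 16, 16)
p6 = F.max_pool2d(p5, kernel_size=2, stride=2)  # shape (1, 256, 8, 8), half the size of P5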
By obtaining feature maps P6, P5, P4, P3, and P2 of different sizes through the multi-size feature generation unit 102, it is possible to facilitate detection of frame information of objects of different sizes in the object position detection unit 103; further, since the multi-size feature generation unit 102 employs the deformed convolution process, it is possible to reduce the influence of the geometric transformation of the object shape in the image on the detection result.
In at least one embodiment, the object position detection section 103 may detect bounding box information of objects of corresponding sizes from the feature maps of different sizes generated by the multi-size feature generation section 102 using a candidate region generation network (RPN).
As shown in fig. 2, the object position detection section 103 may have a plurality of candidate area generation networks, for example, 5, i.e., 1031, 1032, 1033, 1034, and 1035.
Each candidate area generation network corresponds to each size of feature map generated by the multi-size feature generation unit 102, and for example, the candidate area generation networks 1031, 1032, 1033, 1034, and 1035 receive the feature maps P6, P2, P3, P4, and P5, respectively.
Each candidate area generation network is capable of detecting information of a border of an object from the feature map of the corresponding size, the information of the border of the object including, for example, a position of the border of the object, and/or a shape of the border, and/or a size of the border. As for the operation principle of each candidate area generation network, reference may be made to the related art.
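The following sketch shows a conventional region-proposal head of the kind such a candidate area generation network commonly uses: a shared 3 × 3 convolution followed by two 1 × 1 convolutions that output an objectness score and four box deltas per anchor at every position. It is a generic illustration, not necessarily the exact network of the application; the channel count and anchor number are assumptions.

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_anchors: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)       # object/background score per anchor
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # box regression per anchor

    def forward(self, feature_map: torch.Tensor):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.bbox_deltas(x)

# One such head per feature level (cf. 1031-1035) can be applied to P2..P6.
head = RPNHead()
scores, deltas = head(torch.randn(1, 256, 32, 32))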
As shown in fig. 2, the Pooling processing unit 104 performs deformed Pooling (Deformable Pooling) processing on the basis of feature map (feature maps) portions corresponding to the frames of the objects detected by the candidate area generation network so that the objects in the detected frames have the same size.
Among them, the pooling part 104 may have a plurality of pooling processing units, for example, 5, i.e., 1041, 1042, 1043, 1044, and 1045.
Each pooling processing unit obtains the feature image from the corresponding feature generating unit and obtains information of the frame of the object from the corresponding candidate area generating network, thereby performing deformation pooling on the part of the image feature located in the frame of the object.
For example, the pooling processing unit 1043 obtains the feature image P3 from the corresponding feature generating unit 1023, and obtains the information of the bounding box of the object in the feature image P3 from the corresponding candidate area generating network 1033, so as to perform deformed pooling processing on the part of the feature image P3 located in the bounding box of the object, and obtain the part of the feature image in each bounding box after the pooling processing, where the part of the feature image in each bounding box after the pooling processing may be in the form of a matrix, such as a pixel matrix.
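The cropping of a fixed-size feature block for every detected frame can be sketched as follows. The application uses deformed (deformable) RoI pooling; since that operator is not available in torchvision, standard roi_align is used here as a stand-in, so the sketch only illustrates the fixed-size cropping per box, not the learned offsets. The box coordinates and the 7 × 7 output size are assumptions.

import torch
from torchvision.ops import roi_align

p3 = torch.randn(1, 256, 64, 64)                         # feature map P3, stride 8 relative to the image
boxes = torch.tensor([[0.0, 40.0, 40.0, 120.0, 160.0]])  # (batch_index, x1, y1, x2, y2) in image coordinates
pooled = roi_align(p3, boxes, output_size=(7, 7), spatial_scale=1.0 / 8)
# pooled: (num_boxes, 256, 7, 7) -- every frame is mapped to the same spatial size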
In at least one embodiment, the matrices output from the respective pooling processing units 1041, 1042, 1043, 1044 and 1045 have the same scale in the two-dimensional direction of the image.
As shown in fig. 2, the merging unit 105 may merge (concat) the parts of the feature images in each frame after the pooling process by the pooling process unit 104, and since the parts of the feature images in each frame have the same scale in the two-dimensional direction, the merging by the merging unit 105 corresponds to combining the matrices output by the pooling process units 1041, 1042, 1043, 1044, and 1045 in the depth direction perpendicular to the two-dimensional direction.
As shown in fig. 2, the detection unit 106 classifies the results merged by the merging unit 105 by using a plurality of Fully Connected layers (FCs), and outputs the types (classes) of the objects and the frame information of the objects.
In fig. 2, the number of fully connected layers in the detection section 106 is 4, that is, 1061, 1062, 1063, and 1064. However, the present application is not limited thereto, and the number of fully connected layers in the detection section 106 may be more than 4, thereby improving the accuracy of detection.
Regarding the operation principle of each fully connected layer in the detection unit 106, the related art can be referred to, and the description thereof is omitted.
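A sketch of the merging and detection stages described above: the fixed-size per-frame features from the different levels are combined along the depth (channel) direction and then classified by a stack of fully connected layers. The layer widths, the class count and the joint class/box output are illustrative assumptions.

import torch
import torch.nn as nn

num_levels, channels, pool, num_classes = 5, 256, 7, 21
per_level = [torch.randn(4, channels, pool, pool) for _ in range(num_levels)]  # 4 frames, one tensor per level

merged = torch.cat(per_level, dim=1)      # concat along the depth direction, cf. merging unit 105
flat = merged.flatten(start_dim=1)        # (4, 5 * 256 * 7 * 7)

detector = nn.Sequential(                 # cf. fully connected layers 1061-1064
    nn.Linear(num_levels * channels * pool * pool, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, num_classes + 4),     # class scores plus frame refinement
)
outputs = detector(flat)                  # (4, num_classes + 4)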
According to the first aspect of the embodiment of the present application, the multi-size feature generation section 102 generates the feature maps corresponding to the respective sizes, whereby both large-size and small-size objects in the image can be accurately detected; the multi-size feature generation unit 102 generates the feature map by using the deformed convolution process, and thus can reduce the influence of the geometric transformation of the object shape in the image on the detection result; since the pooling processing unit 104 uses the deformed pooling processing, it is possible to further reduce the influence of the geometric transformation of the object shape in the image on the detection result.
Second aspect of the embodiments
A second aspect of the embodiments of the present application provides an object detection method based on a deep learning network, which corresponds to the object detection apparatus based on a deep learning network of the first aspect of the embodiments.
Fig. 4 is a schematic diagram of an object detection method based on a deep learning network according to a second aspect of the embodiment of the present application, and as shown in fig. 4, the method 400 includes:
in operation 401, a plurality of feature extraction units respectively extract features of different sizes from an input image;
in operation 402, a plurality of cascaded feature generation units respectively generate feature maps corresponding to the respective sizes by using a deformed Convolution (Deformable Convolution) process according to the features of the different sizes extracted by the plurality of feature extraction units; and
in operation 403, bounding box information of objects of corresponding sizes is detected from the generated feature maps (feature maps) of different sizes respectively using a candidate region generation Network (RPN).
As shown in fig. 4, the method 400 further includes:
in operation 404, a deformed Pooling (Deformable Pooling) process is performed on the basis of feature map (feature maps) portions corresponding to the frames of the objects detected by the candidate area generation network, so that the objects in the detected frames have the same size.
In operation 405, the feature images in the frames after the deformed pooling process are merged (concat); and
in operation 406, the merged result is classified using a plurality of fully connected layers (FC), and the class of each object and the frame information of each object are output.
In at least one embodiment, operation 402 may include the following operations:
operation 4021, performing interpolation (interpolation) on the feature map output by the previous feature generation unit to obtain an enlarged feature map;
operation 4022, performing convolution processing (for example, 1 × 1 × 256) on the feature of the corresponding size extracted by the feature extraction unit corresponding to the current feature generation unit, and fusing it with the enlarged feature map; and
in operation 4023, a deformation Convolution (Deformable Convolution) process is performed on the fused matrix to form a feature map output by the current feature generation unit.
In at least one embodiment, operation 402 may further include the operations of:
in operation 4024, a deformation Convolution (Deformable Convolution) process is performed on the minimum-sized features extracted by the feature extraction unit to form a feature map P5 output by the feature generation unit corresponding to the minimum-sized features.
In at least one embodiment, operation 402 may further include the operations of:
in operation 4025, a pooling (Pooling) process is performed on the feature map output from the feature generating unit corresponding to the feature having the smallest size to form a feature map P6.
According to the second aspect of the embodiments of the present application, it is possible to generate the feature maps corresponding to the respective sizes, whereby both large-sized and small-sized objects in the image can be accurately detected; moreover, the feature map is generated by using the deformation convolution processing, so that the influence of the geometric transformation of the object shape in the image on the detection result can be reduced; further, using the deformed pooling process, the influence of the geometric transformation of the object shape in the image on the detection result can be further reduced.
Third aspect of the embodiments
A third aspect of an embodiment of the present application provides an electronic device, including: the deep learning network-based object detection apparatus according to the first aspect of the embodiments.
Fig. 5 is a schematic configuration diagram of an electronic device according to the third aspect of the embodiment of the present application. As shown in fig. 5, the electronic device 500 may include: a Central Processing Unit (CPU) 501 and a memory 502, the memory 502 being coupled to the central processor 501. The memory 502 can store various data; in addition, it stores a program for control, which is executed under the control of the central processing unit 501.
In one embodiment, the functions of the deep learning network based object detection apparatus 100 may be integrated into the central processor 501.
The central processor 501 may be configured to execute the deep learning network-based object detection method according to the second aspect of the embodiment.
In another embodiment, the deep learning network based object detecting apparatus 100 may be configured separately from the processor 501, for example, the deep learning network based object detecting apparatus 100 may be configured as a chip connected to the processor 501, and the function of the deep learning network based object detecting apparatus 100 is realized by the control of the processor 501.
Further, as shown in fig. 5, the electronic device 500 may further include: an input/output unit 503, a display unit 504, and the like; the functions of the above components are similar to those of the prior art, and are not described in detail here. It is noted that the electronic device 500 does not necessarily include all of the components shown in FIG. 5; furthermore, the electronic device 500 may also comprise components not shown in fig. 5, which may be referred to in the prior art.
The present application also provides a computer readable program, wherein when the program is executed in an object detection apparatus or an electronic device based on a deep learning network, the program causes the object detection apparatus or the electronic device based on the deep learning network to execute the object detection method based on the deep learning network according to the second aspect of the embodiment.
The present invention further provides a storage medium storing a computer readable program, where the storage medium stores the computer readable program, and the computer readable program enables an object detection apparatus or an electronic device based on a deep learning network to execute the object detection method based on the deep learning network according to the second aspect of the embodiments.
The apparatuses and methods described in connection with the embodiments of the present application may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. For example, one or more of the functional block diagrams and/or one or more combinations of the functional block diagrams illustrated in fig. 1 may correspond to individual software modules of a computer program flow or may correspond to individual hardware modules. These software modules may respectively correspond to the respective operations shown in the first aspect of the embodiment. These hardware modules may be implemented, for example, by solidifying these software modules using a Field Programmable Gate Array (FPGA).
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium; or the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The software module may be stored in the memory of the mobile terminal or in a memory card that is insertable into the mobile terminal. For example, if the electronic device employs a MEGA-SIM card with a larger capacity or a flash memory device with a larger capacity, the software module may be stored in the MEGA-SIM card or the flash memory device with a larger capacity.
One or more of the functional block diagrams and/or one or more combinations of the functional block diagrams described with respect to fig. 1 may be implemented as a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any suitable combination thereof designed to perform the functions described herein. One or more of the functional block diagrams and/or one or more combinations of the functional block diagrams described with respect to fig. 1 may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in communication with a DSP, or any other such configuration.
The present application has been described in conjunction with specific embodiments, but it should be understood by those skilled in the art that these descriptions are intended to be illustrative, and not limiting. Various modifications and adaptations of the present application may occur to those skilled in the art based on the teachings herein and are within the scope of the present application.
With respect to the embodiments described above, the following supplementary notes are also disclosed:
1. an object detection apparatus based on a deep learning network, comprising:
a feature extraction unit having a plurality of feature extraction means for extracting features of different sizes from an input image;
a multi-size feature generation unit having a plurality of cascade-connected feature generation units that generate feature maps (feature maps) corresponding to respective sizes, respectively, by using a deformed Convolution (Deformable Convolution) process, based on the features of different sizes extracted by the feature extraction unit; and
an object position detection unit that detects bounding box information of objects of corresponding sizes from the feature maps of different sizes generated by the multi-size feature generation unit, respectively, using a candidate region generation network (RPN).
2. The apparatus according to supplementary note 1, wherein the apparatus further comprises:
a pooling processing unit that performs a deformed pooling (Deformable Pooling) process on the portions of the feature maps corresponding to the frames of the objects detected by the candidate area generation network, so that the objects in the detected frames have the same size.
3. The apparatus according to supplementary note 1, wherein the feature generation unit includes:
an interpolation unit that performs interpolation (interpolation) on the feature map output by the previous feature generation unit to obtain an enlarged feature map;
a fusion unit that performs convolution processing (for example, 1 × 1 × 256) on the feature of the corresponding size extracted by the feature extraction unit corresponding to the current feature generation unit and fuses it with the enlarged feature map; and
a deformed convolution processing unit that performs deformed convolution (Deformable Convolution) processing on the matrix obtained after the fusion to form the feature map output by the current feature generation unit.
4. The apparatus according to supplementary note 3, wherein,
the multi-size feature generating unit further performs a deformed Convolution (Deformable Convolution) process on the features of the minimum size extracted by the feature extracting unit to form a feature map output by the feature generating unit corresponding to the features of the minimum size.
5. The apparatus according to supplementary note 4, wherein,
the multi-size feature generating unit may further perform pooling (Pooling) processing on the feature map output from the feature generating unit corresponding to the feature having the smallest size to generate the feature map output from the multi-size feature generating unit.
6. The apparatus according to supplementary note 2, wherein the apparatus further comprises:
a merging unit that merges (concat) the feature images in the plurality of frames after the deformation pooling process; and
a detection unit that classifies the merged result using a plurality of fully connected layers (FC) and outputs the class of each object and the frame information of each object.
7. An electronic device having the deep learning network-based object detection apparatus according to any one of supplementary notes 1 to 6.
8. An object detection method based on a deep learning network comprises the following steps:
a plurality of feature extraction units respectively extracting features of different sizes from an input image;
a plurality of cascaded feature generation units respectively generate feature maps corresponding to the sizes by using deformation Convolution (Deformable Convolution) processing according to the features of different sizes extracted by the plurality of feature extraction units; and
frame information of objects of corresponding sizes is detected from feature maps (feature maps) of different sizes generated by a multi-size feature generation unit using a candidate region generation Network (RPN).
9. The method according to supplementary note 8, wherein the method further comprises:
a deformed Pooling process (Deformable Pooling) is performed on the basis of feature map portions corresponding to frames of the objects detected by the candidate area generating network, so that the detected objects in the frames have the same size.
10. The method according to supplementary note 8, wherein generating the feature map corresponding to each size includes:
performing interpolation (interpolation) on the feature map output by the previous feature generation unit to obtain an amplified feature map;
performing convolution processing (for example, 1 × 1 × 256) on the feature of the corresponding size extracted by the feature extraction unit corresponding to the current feature generation unit, and fusing it with the enlarged feature map; and
performing deformed convolution (Deformable Convolution) processing on the matrix obtained after the fusion to form the feature map output by the current feature generation unit.
11. The method according to supplementary note 10, wherein generating the feature map corresponding to each size further includes:
the feature extraction unit performs a deformed Convolution (Deformable Convolution) process on the features of the minimum size extracted by the feature extraction unit to form a feature map output by the feature generation unit corresponding to the features of the minimum size.
12. The method according to supplementary note 11, wherein generating the feature map corresponding to each size further includes:
performing pooling (Pooling) processing on the feature map output by the feature generation unit corresponding to the feature with the minimum size to form the feature map.
13. The method according to supplementary note 9, wherein the method further comprises:
merging (concat) the feature images in the frames after the deformation pooling; and
classifying the merged result using a plurality of fully connected layers (FC), and outputting the class of each object and the frame information of each object.

Claims (10)

1. An apparatus for detecting an object based on a deep learning network, the apparatus comprising:
a feature extraction unit having a plurality of feature extraction means for extracting features of different sizes from an input image;
a multi-size feature generation unit having a plurality of cascade-connected feature generation means for generating feature maps corresponding to respective sizes by using a deformed convolution process based on the features of different sizes extracted by the feature extraction unit; and
an object position detection unit which detects the frame information of the object of the corresponding size from the feature maps of different sizes generated by the multi-size feature generation unit, respectively, using the candidate area generation network.
2. The apparatus of claim 1, wherein the apparatus further comprises:
a pooling processing unit that performs a deformed pooling process on the basis of the feature map portion corresponding to the frame of the object detected by each candidate area generation network so that the objects in each detected frame have the same size.
3. The apparatus of claim 1, wherein the feature generation unit comprises:
an interpolation unit that performs interpolation processing on the feature map output by the previous feature generation unit to obtain an enlarged feature map;
a fusion unit which convolutes the feature of the size extracted by the feature extraction unit corresponding to the current feature generation unit and fuses the feature with the feature map after amplification; and
a deformation convolution processing unit that performs deformation convolution processing on the matrix obtained after fusion to form the feature map output by the current feature generation unit.
4. The apparatus of claim 3, wherein,
the multi-size feature generation unit further performs a deformed convolution process on the minimum-size features extracted by the feature extraction unit to form a feature map output by the feature generation unit corresponding to the minimum-size features.
5. The apparatus of claim 4, wherein,
the multi-size feature generating unit may further pool the feature map output from the feature generating unit corresponding to the feature having the smallest size to form the feature map output from the multi-size feature generating unit.
6. The apparatus of claim 2, wherein the apparatus further comprises:
a merging unit that merges the feature images in the plurality of frames after the deformation pooling process; and
a detection unit that classifies the merged result using a plurality of fully connected layers and outputs the type of each object and frame information of each object.
7. An electronic device, characterized in that the electronic device is provided with the deep learning network-based object detection device of any one of claims 1 to 6.
8. An object detection method based on a deep learning network, which is characterized by comprising the following steps:
a plurality of feature extraction units respectively extracting features of different sizes from an input image;
a plurality of cascaded feature generation units respectively generate feature maps corresponding to the sizes by using deformation convolution processing according to the features of different sizes extracted by the feature extraction units; and
detecting the frame information of the object with the corresponding size from the generated feature maps with different sizes by using the candidate area generation network.
9. The method of claim 8, wherein the method further comprises:
a deformed pooling process is performed on the basis of the feature map portion corresponding to the frame of the object detected by each candidate area generation network so that the objects in each detected frame have the same size.
10. The method of claim 8, wherein generating a feature map corresponding to each dimension comprises:
performing interpolation processing on the feature map output by the previous feature generation unit to obtain an amplified feature map;
performing convolution processing on the feature with the size extracted by the feature extraction unit corresponding to the current feature generation unit, and fusing the feature with the amplified feature map; and
performing deformation convolution processing on the matrix obtained after fusion to form a feature map output by the current feature generation unit.
CN201910525931.6A 2019-06-18 2019-06-18 Object detection method and device based on deep learning network and electronic equipment Pending CN112101373A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910525931.6A CN112101373A (en) 2019-06-18 2019-06-18 Object detection method and device based on deep learning network and electronic equipment
JP2020100215A JP2020205048A (en) 2019-06-18 2020-06-09 Object detection method based on deep learning network, apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910525931.6A CN112101373A (en) 2019-06-18 2019-06-18 Object detection method and device based on deep learning network and electronic equipment

Publications (1)

Publication Number Publication Date
CN112101373A true CN112101373A (en) 2020-12-18

Family

ID=73749060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910525931.6A Pending CN112101373A (en) 2019-06-18 2019-06-18 Object detection method and device based on deep learning network and electronic equipment

Country Status (2)

Country Link
JP (1) JP2020205048A (en)
CN (1) CN112101373A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966556A (en) * 2021-02-02 2021-06-15 豪威芯仑传感器(上海)有限公司 Moving object detection method and system
CN113255589A (en) * 2021-06-25 2021-08-13 北京电信易通信息技术股份有限公司 Target detection method and system based on multi-convolution fusion network

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733672A (en) * 2020-12-31 2021-04-30 深圳一清创新科技有限公司 Monocular camera-based three-dimensional target detection method and device and computer equipment
CN113255434B (en) * 2021-04-08 2023-12-19 淮阴工学院 Apple identification method integrating fruit characteristics and deep convolutional neural network
CN116524293B (en) * 2023-04-10 2024-01-30 哈尔滨市科佳通用机电股份有限公司 Brake adjuster pull rod head loss fault identification method and system based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017139927A1 (en) * 2016-02-17 2017-08-24 Intel Corporation Region proposal for image regions that include objects of interest using feature maps from multiple layers of a convolutional neural network model
CN109191455A (en) * 2018-09-18 2019-01-11 西京学院 A kind of field crop pest and disease disasters detection method based on SSD convolutional network
CN109448307A (en) * 2018-11-12 2019-03-08 哈工大机器人(岳阳)军民融合研究院 A kind of recognition methods of fire disaster target and device
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109740686A (en) * 2019-01-09 2019-05-10 中南大学 A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features
CN109816012A (en) * 2019-01-22 2019-05-28 南京邮电大学 A kind of multiscale target detection method of integrating context information

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017139927A1 (en) * 2016-02-17 2017-08-24 Intel Corporation Region proposal for image regions that include objects of interest using feature maps from multiple layers of a convolutional neural network model
CN109191455A (en) * 2018-09-18 2019-01-11 西京学院 A kind of field crop pest and disease disasters detection method based on SSD convolutional network
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109448307A (en) * 2018-11-12 2019-03-08 哈工大机器人(岳阳)军民融合研究院 A kind of recognition methods of fire disaster target and device
CN109740686A (en) * 2019-01-09 2019-05-10 中南大学 A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features
CN109816012A (en) * 2019-01-22 2019-05-28 南京邮电大学 A kind of multiscale target detection method of integrating context information

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966556A (en) * 2021-02-02 2021-06-15 豪威芯仑传感器(上海)有限公司 Moving object detection method and system
CN112966556B (en) * 2021-02-02 2022-06-10 豪威芯仑传感器(上海)有限公司 Moving object detection method and system
CN113255589A (en) * 2021-06-25 2021-08-13 北京电信易通信息技术股份有限公司 Target detection method and system based on multi-convolution fusion network
CN113255589B (en) * 2021-06-25 2021-10-15 北京电信易通信息技术股份有限公司 Target detection method and system based on multi-convolution fusion network

Also Published As

Publication number Publication date
JP2020205048A (en) 2020-12-24

Similar Documents

Publication Publication Date Title
CN112101373A (en) Object detection method and device based on deep learning network and electronic equipment
CN109255352B (en) Target detection method, device and system
US10210415B2 (en) Method and system for recognizing information on a card
CN109522874B (en) Human body action recognition method and device, terminal equipment and storage medium
CN109214366B (en) Local target re-identification method, device and system
EP3417425B1 (en) Leveraging multi cues for fine-grained object classification
CN109960742B (en) Local information searching method and device
KR20180104609A (en) Method, system, apparatus and readable storage medium for realizing insurance claims fraud prevention based on a plurality of image correspondence
CN108986152B (en) Foreign matter detection method and device based on difference image
WO2022000862A1 (en) Method and apparatus for detecting object in fisheye image, and storage medium
WO2021217924A1 (en) Method and apparatus for identifying vehicle type at traffic checkpoint, and device and storage medium
JP5936561B2 (en) Object classification based on appearance and context in images
Farag A lightweight vehicle detection and tracking technique for advanced driving assistance systems
CN114511865A (en) Method and device for generating structured information and computer readable storage medium
US9392146B2 (en) Apparatus and method for extracting object
CN112926511A (en) Seal text recognition method, device and equipment and computer readable storage medium
KR101733288B1 (en) Object Detecter Generation Method Using Direction Information, Object Detection Method and Apparatus using the same
CN111079585B (en) Pedestrian re-identification method combining image enhancement with pseudo-twin convolutional neural network
JP6016242B2 (en) Viewpoint estimation apparatus and classifier learning method thereof
CN114758145A (en) Image desensitization method and device, electronic equipment and storage medium
Bandyopadhyay et al. RectiNet-v2: A stacked network architecture for document image dewarping
CN112883973A (en) License plate recognition method and device, electronic equipment and computer storage medium
Chang et al. Re-Attention is all you need: Memory-efficient scene text detection via re-attention on uncertain regions
CN111753766A (en) Image processing method, device, equipment and medium
CN116894972B (en) Wetland information classification method and system integrating airborne camera image and SAR image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination