CN112101373A - Object detection method and device based on deep learning network and electronic equipment - Google Patents

Object detection method and device based on deep learning network and electronic equipment

Info

Publication number
CN112101373A
Authority
CN
China
Prior art keywords
feature
unit
size
generation unit
feature map
Prior art date
Legal status
Pending
Application number
CN201910525931.6A
Other languages
Chinese (zh)
Inventor
陶轩
谭志明
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201910525931.6A priority Critical patent/CN112101373A/en
Priority to JP2020100215A priority patent/JP2020205048A/en
Publication of CN112101373A publication Critical patent/CN112101373A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The embodiments of the present application provide an object detection method and apparatus based on a deep learning network, and an electronic device. The object detection apparatus based on a deep learning network includes: a feature extraction unit having a plurality of feature extraction means for extracting features of different sizes from an input image; a multi-size feature generation unit having a plurality of cascade-connected feature generation units that, based on the features of different sizes extracted by the feature extraction unit, respectively generate feature maps corresponding to the respective sizes by using deformed convolution (Deformable Convolution) processing; and an object position detection unit that detects bounding box information of objects of corresponding sizes from the feature maps of different sizes generated by the multi-size feature generation unit, respectively, using a candidate region generation network (RPN).

Description

Object detection method and device based on deep learning network and electronic equipment
Technical Field
The present application relates to the field of electronic information technology.
Background
In recent years, due to the close relationship with image analysis, image-based object detection techniques have received much attention. With the rapid development of deep learning, especially the development of Convolutional Neural Network (CNN), the performance of the object detection technology has been greatly improved. At present, advanced object detection technology has reached very high detection accuracy and Recall Rate (Recall Rate).
Despite the tremendous advances in object detection technology, many challenges remain in this field. One challenge is that it is difficult to recognize objects whose sizes differ greatly; to address this, researchers developed the Faster R-CNN classifier.
Another challenge is the influence of geometric transformations of the object shape on the recognition result: how to adapt to geometric transformations such as the size, posture, observation angle and deformation of objects in an image is a key problem in visual recognition. Generally, there are two approaches to mitigating the impact of geometric transformations of the object shape on recognition results: the first approach is to maintain a data set that covers all variations; the second approach is to use handcrafted features and specific algorithms that are invariant to the geometric transformation.
It should be noted that the above background description is provided only for the sake of a clear and complete description of the technical solutions of the present application, and to facilitate the understanding of those skilled in the art. These solutions are not to be regarded as known to a person skilled in the art merely because they are set forth in the background section of the present application.
Disclosure of Invention
The inventors of the present application have found that the existing object detection techniques have some limitations when faced with the above two challenges. For example: the Faster R-CNN classifier, while performing well in recognizing large-sized objects, has difficulty recognizing objects of smaller size; for the approach of maintaining a data set that covers all variations, it is difficult for a data set to cover all situations in real life, and cost and effort increase dramatically as the data grow; and the approach of using handcrafted features and specific algorithms requires a great deal of a priori knowledge and experience to manually design a suitable feature for a specific geometric transformation, and a new feature must be designed manually whenever a new geometric transformation is involved.
The embodiments of the present application provide an object detection method and apparatus based on a deep learning network, and an electronic device.
According to a first aspect of embodiments of the present application, there is provided an object detection apparatus based on a deep learning network, including:
a feature extraction unit having a plurality of feature extraction means for extracting features of different sizes from an input image;
a multi-size feature generation unit having a plurality of cascade-connected feature generation units that, based on the features of different sizes extracted by the feature extraction unit, respectively generate feature maps corresponding to the respective sizes by using deformed convolution (Deformable Convolution) processing; and
an object position detection unit that detects bounding box information of objects of corresponding sizes from the feature maps of different sizes generated by the multi-size feature generation unit, respectively, using a candidate region generation network (RPN).
According to a second aspect of the embodiments of the present application, there is provided an object detection method based on a deep learning network, including:
a plurality of feature extraction units respectively extracting features of different sizes from an input image;
a plurality of cascaded feature generation units respectively generate feature maps corresponding to the sizes by using deformation Convolution (Deformable Convolution) processing according to the features of different sizes extracted by the plurality of feature extraction units; and
detecting frame information of objects of corresponding sizes from the generated feature maps of different sizes, respectively, using a candidate region generation network (RPN).
According to a third aspect of the embodiments of the present application, there is provided an electronic device including the object detection apparatus based on a deep learning network as described in the first aspect of the embodiments.
One of the beneficial effects of the embodiments of the present application is that small-sized objects in an image can be accurately recognized, and the influence of geometric transformations of the object shape in the image on the detection result can be reduced.
Specific embodiments of the present application are disclosed in detail with reference to the following description and drawings, indicating the manner in which the principles of the application may be employed. It should be understood that the embodiments of the present application are not so limited in scope. The embodiments of the present application include many variations, modifications, and equivalents within the spirit and scope of the appended claims.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments, in combination with or instead of the features of the other embodiments.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps or components.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the application, are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
fig. 1 is a schematic diagram of an object detection apparatus based on a deep learning network according to a first aspect of an embodiment of the present application;
FIG. 2 is a schematic diagram of a specific structure of the deep learning network-based object detection apparatus of FIG. 1;
FIG. 3 is a schematic diagram of the feature generation unit 102n;
FIG. 4 is a schematic diagram of an object detection method based on a deep learning network according to a second aspect of an embodiment of the present application;
fig. 5 is a schematic configuration diagram of an electronic device according to the third aspect of the embodiment of the present application.
Detailed Description
The foregoing and other features of the present application will become apparent from the following description, taken in conjunction with the accompanying drawings. In the description and drawings, particular embodiments of the application are disclosed in detail as being indicative of some of the embodiments in which the principles of the application may be employed, it being understood that the application is not limited to the embodiments described, but, on the contrary, is intended to cover all modifications, variations, and equivalents falling within the scope of the appended claims.
In the embodiments of the present application, the terms "first", "second", and the like are used for distinguishing different elements by reference, but do not denote a spatial arrangement, a temporal order, or the like of the elements, and the elements should not be limited by the terms. The term "and/or" includes any and all combinations of one or more of the associated listed terms. The terms "comprising," "including," "having," and the like, refer to the presence of stated features, elements, components, and do not preclude the presence or addition of one or more other features, elements, components, and elements.
In the embodiments of the present application, the singular forms "a", "an", and the like include the plural forms and are to be construed broadly as "a" or "an" and not limited to the meaning of "a" or "an"; furthermore, the term "the" should be understood to include both the singular and the plural, unless the context clearly dictates otherwise. Further, the term "according to" should be understood as "at least partially according to … …," and the term "based on" should be understood as "based at least partially on … …," unless the context clearly dictates otherwise.
First aspect of the embodiments
A first aspect of an embodiment of the present application provides an object detection apparatus based on a deep learning network.
Fig. 1 is a schematic diagram of an object detection apparatus based on a deep learning network according to the first aspect of an embodiment of the present application. As shown in fig. 1, the deep learning network-based object detection apparatus 100 includes: a feature extraction unit 101, a multi-size feature generation unit 102, and an object position detection unit 103.
The feature extraction unit 101 includes a plurality of feature extraction units for extracting features of different sizes from an input image; the multi-size feature generation unit 102 includes a plurality of cascade-connected feature generation units that generate feature maps (feature maps) corresponding to respective sizes, respectively, by using a deformed Convolution (Deformable Convolution) process, based on the features of different sizes extracted by the feature extraction unit 101; the object position detection unit 103 detects bounding box information of objects of corresponding sizes from feature maps (feature maps) of different sizes generated by the multi-size feature generation unit 102, respectively, using a candidate region generation Network (RPN).
According to the first aspect of the embodiments of the present application, the multi-size feature generation section 102 generates the feature maps corresponding to the respective sizes, thereby accurately detecting both large-sized and small-sized objects in the image, and the multi-size feature generation section 102 generates the feature maps using the deformed convolution process, thereby making it possible to reduce the influence of the geometric transformation of the object shape in the image on the detection result.
In the first aspect of the embodiment of the present application, as shown in fig. 1, the deep learning network-based object detection apparatus 100 may further include: a pooling unit 104.
The pooling processing unit 104 performs a deformed pooling (Deformable Pooling) process on the portions of the feature maps corresponding to the frames of the objects detected by the object position detection unit 103, so that the objects in the frames have the same size.
In the first aspect of the embodiment of the present application, as shown in fig. 1, the deep learning network-based object detection apparatus 100 may further include: a merging unit 105 and a detection unit 106.
The merging unit 105 merges (concat) the image features in the frames after the pooling processing unit 104 performs the deformed pooling process; the detection unit 106 classifies the result of the merging unit 105 using a plurality of fully connected layers (FC), and outputs the class of each object and the frame information of each object.
In the first aspect of the embodiments of the present application, the deformed convolution process, in which the sampling points of the convolution kernel in the input image can be deformed, is also referred to as deformable convolution.
In the first aspect of the embodiments of the present application, the deformed pooling process, in which the region of interest targeted by the pooling process may be deformed in the input image, is also referred to as deformable RoI pooling (Deformable RoI Pooling).
As for the principle and specific processing manner of the deformed convolution process and the deformed pooling process, reference can be made to non-patent document 1 (Dai J, Qi H, Xiong Y, et al. Deformable Convolutional Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017).
Fig. 2 is a schematic diagram of a specific structure of the deep learning network-based object detection apparatus of fig. 1.
In at least one embodiment, the feature extraction section 101 may extract features of different sizes based on, for example, a Residual Neural Network (ResNet).
As shown in fig. 2, the number of feature extraction units in the feature extraction portion 101 may be 2 or more, for example, 4, that is, 1012, 1013, 1014, and 1015. Each feature extraction unit extracts features of different sizes from the input image 200, and each feature may be, for example, a two-dimensional matrix.
As shown in fig. 2, the feature extraction units 1012, 1013, 1014, and 1015 are arranged in series in order from the input side of the input image 200. The size of the feature output by each feature extraction unit is half the size of the feature output by the feature extraction unit of the previous stage; for example, the sizes of the features output by the feature extraction units 1012, 1013, 1014, and 1015 are 1/4, 1/8, 1/16, and 1/32 of the size of the input image 200, respectively.
As shown in fig. 2, the feature extraction units 1012, 1013, 1014, and 1015 may have 2 or more feature extraction modules ResBlock_2, ResBlock_3, ResBlock_4, and ResBlock_5, respectively.
As shown in fig. 2, the feature extraction modules ResBlock_2 are 3 in number, and are ResBlock_2a, ResBlock_2b, and ResBlock_2c, respectively. The feature extraction modules ResBlock_3 are 4 in number, and are ResBlock_3a, ResBlock_3b, ResBlock_3c and ResBlock_3d, respectively. The feature extraction modules ResBlock_4 are 6 in number, and are ResBlock_4a, ResBlock_4b, ResBlock_4c, ResBlock_4d, ResBlock_4e and ResBlock_4f, respectively. The feature extraction modules ResBlock_5 are 3 in number, and are ResBlock_5a, ResBlock_5b, and ResBlock_5c, respectively. The number of each feature extraction module shown in fig. 2 is merely an example, and the present application is not limited to the above example.
In the same feature extraction unit, the features extracted by different feature extraction modules have different shapes but the same size. For example, in the feature extraction unit 1012, the feature extraction modules ResBlock_2a, ResBlock_2b, and ResBlock_2c are used to extract rectangular features, circular features, elliptical features, and the like, respectively, and the size of the features extracted by the respective modules ResBlock_2a, ResBlock_2b, and ResBlock_2c is 1/4 of the size of the input image 200.
As shown in fig. 2, the feature extraction unit 101 may further include a first convolution unit 1011, where the first convolution unit 1011 may perform convolution processing on the input image 200 and input the result of the convolution processing to the feature extraction unit 1012.
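For illustration only, the multi-scale feature extraction described above can be sketched in Python with a torchvision ResNet-50 backbone, whose layer1 to layer4 stages take the role of the ResBlock_2 to ResBlock_5 modules; the function name, the 512 × 512 example input and the choice of ResNet-50 are assumptions of this sketch rather than part of the application.

import torch
import torchvision

backbone = torchvision.models.resnet50()  # randomly initialized; load weights as needed

def extract_multiscale_features(image: torch.Tensor):
    # image: (N, 3, H, W) -> features at strides 4, 8, 16 and 32
    x = backbone.conv1(image)   # stem convolution, cf. first convolution unit 1011
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)     # 1/4 of the input size, cf. feature extraction unit 1012
    c3 = backbone.layer2(c2)    # 1/8,  cf. unit 1013
    c4 = backbone.layer3(c3)    # 1/16, cf. unit 1014
    c5 = backbone.layer4(c4)    # 1/32, cf. unit 1015
    return c2, c3, c4, c5

# A 3-channel 512 x 512 image yields maps of spatial size 128, 64, 32 and 16.
c2, c3, c4, c5 = extract_multiscale_features(torch.randn(1, 3, 512, 512))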
In at least one embodiment, the multi-size Feature generation section 102 may generate Feature maps (Feature maps) of different sizes based on, for example, the structure of a Feature Pyramid Network (FPN).
As shown in fig. 2, the number of feature generation units in the multi-size feature generation section 102 is 2 or more, for example, 5, namely 1025, 1024, 1023, 1022, and 1021.
Among them, the Feature generation units 1025, 1024, 1023, and 1022 are arranged in series in order from the output side of the Feature extraction unit 1015 of the Feature extraction section 101, and output Feature maps (Feature maps) of different sizes, i.e., P5, P4, P3, and P2, respectively, where the size of the Feature Map P5 is 1/2 of P4, the size of the Feature Map P4 is 1/2 of P3, and the size of the Feature Map P3 is 1/2 of P2.
Fig. 3 is a schematic diagram of the feature generation unit 102n. The feature generation unit 102n may be the feature generation unit 1024, 1023, or 1022.
As shown in fig. 3, P _ pre represents a feature map generated by the previous-stage feature generation unit, and may be P5, P4, or P3, for example. P _ next represents a feature map generated by the present-level feature generation unit, and may be P4, P3, or P2, for example.
As shown in fig. 3, the feature generation unit 102n may include: an interpolation unit 301, a fusion unit 302, and a deformed convolution processing unit 303.
In at least one embodiment, the interpolation unit 301 performs interpolation (interpolation) on the feature map P _ pre output by the previous feature generation unit to obtain an enlarged feature map. The interpolation process may be a bilinear (bilinear) interpolation process by which the feature map P _ pre is enlarged by a predetermined multiple, for example, 2 times.
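As a small illustrative sketch (variable names and sizes are assumptions, not taken from the application), the 2× bilinear enlargement performed by the interpolation unit 301 corresponds to the following call:

import torch
import torch.nn.functional as F

p_pre = torch.randn(1, 256, 16, 16)  # feature map output by the previous-stage feature generation unit
enlarged = F.interpolate(p_pre, scale_factor=2, mode="bilinear", align_corners=False)
# enlarged has shape (1, 256, 32, 32), i.e. the spatial size is doubled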
In at least one embodiment, the fusion unit 302 performs convolution on the feature of the corresponding size extracted by the feature extraction unit 101n (the unit in the feature extraction section 101 that corresponds to the current feature generation unit 102n), and fuses it with the enlarged feature map obtained by the interpolation unit 301.
The features extracted by the feature extraction unit 101n corresponding to the feature generation unit 102n have the same matrix size as the feature map enlarged by the interpolation unit 301. For example, when the feature generation unit 102n is 1022, the corresponding feature extraction unit 101n is 1012; when the feature generation unit 102n is 1023, the corresponding feature extraction unit 101n is 1013; when the feature generation unit 102n is 1024, the corresponding feature extraction unit 101n is 1014.
As shown in fig. 3, the fusion unit 302 may have a second convolution module 3021 and a synthesis module 3022. Here, the second convolution module 3021 performs convolution processing, for example 1 × 1 × 256 convolution processing, on the feature of the corresponding size extracted by the feature extraction unit 101n. The synthesis module 3022 may add the matrix obtained from the convolution processing of the second convolution module 3021 to the feature map enlarged by the interpolation unit 301, that is, splice the two in the depth direction, to obtain a three-dimensional matrix.
As shown in fig. 3, the deformed convolution processing unit 303 performs deformed convolution (Deformable Convolution) processing on the matrix obtained by the fusion performed by the fusion unit 302 to form the feature map P_next output by the current feature generation unit 102n. The deformed convolution processing performed by the deformed convolution processing unit 303 may be, for example, 3 × 3 × 256 deformed convolution processing.
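Putting the three sub-units together, the following sketch shows one possible form of a feature generation unit 102n, assuming torchvision's DeformConv2d for the 3 × 3 × 256 deformed convolution, additive fusion of the lateral and enlarged maps, and an auxiliary 3 × 3 convolution that predicts the sampling offsets; the class name, variable names and channel numbers are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class FeatureGenerationUnit(nn.Module):
    def __init__(self, lateral_channels: int, out_channels: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(lateral_channels, out_channels, kernel_size=1)           # cf. second convolution module 3021 (1 x 1 x 256)
        self.offset = nn.Conv2d(out_channels, 2 * 3 * 3, kernel_size=3, padding=1)        # x/y offset for each of the 3 x 3 kernel taps
        self.deform = DeformConv2d(out_channels, out_channels, kernel_size=3, padding=1)  # cf. deformed convolution unit 303

    def forward(self, p_pre: torch.Tensor, c_n: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(p_pre, scale_factor=2, mode="bilinear", align_corners=False)   # cf. interpolation unit 301
        fused = self.lateral(c_n) + up                                                    # fusion, cf. unit 302
        offsets = self.offset(fused)
        return self.deform(fused, offsets)                                                # feature map P_next

# Example: generate P4 from P5 and the corresponding backbone feature C4.
unit = FeatureGenerationUnit(lateral_channels=1024)
p5 = torch.randn(1, 256, 16, 16)
c4 = torch.randn(1, 1024, 32, 32)
p4 = unit(p5, c4)  # shape (1, 256, 32, 32)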
As shown in fig. 3, both the feature map P _ pre and the feature map P _ next may be input to the object position detection section 103 for detecting frame information of objects of different sizes.
As shown in fig. 2, the feature generation unit 1025 corresponds to the feature extraction unit 1015 in the feature extraction unit 101, wherein the feature output from the feature extraction unit 1015 is the smallest-sized feature output from the feature extraction unit 101.
As shown in fig. 2, the feature generation unit 1025 performs a deformed Convolution (Deformable Convolution) process on the features of the minimum size extracted by the feature extraction unit 1015 to form a feature map P5 output by the feature generation unit 1025.
As shown in fig. 2, the feature generation unit 1021 in the multi-size feature generation section 102 may perform pooling (pooling) on the feature map output by the feature generation unit 1025 so that the size of the feature map is halved to obtain the feature map P6, that is, the matrix size of the feature map P6 is half of the matrix size of the feature map P5.
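A small sketch of deriving P6 from P5 by halving the spatial size follows; the use of 2 × 2 max pooling is an assumption, the description above only requires that the size be halved.

import torch
import torch.nn.functional as F

p5 = torch.randn(1, 256, 16, 16)
p6 = F.max_pool2d(p5, kernel_size=2, stride=2)  # shape (1, 256, 8, 8), half the size of P5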
By obtaining feature maps P6, P5, P4, P3, and P2 of different sizes through the multi-size feature generation unit 102, it is possible to facilitate detection of frame information of objects of different sizes in the object position detection unit 103; further, since the multi-size feature generation unit 102 employs the deformed convolution process, it is possible to reduce the influence of the geometric transformation of the object shape in the image on the detection result.
In at least one embodiment, the object position detection section 103 may detect bounding box information of objects of corresponding sizes from the feature maps of different sizes generated by the multi-size feature generation section 102 using a candidate region generation network (RPN).
As shown in fig. 2, the object position detection section 103 may have a plurality of candidate area generation networks, for example, 5, i.e., 1031, 1032, 1033, 1034, and 1035.
Each candidate area generation network corresponds to each size of feature map generated by the multi-size feature generation unit 102, and for example, the candidate area generation networks 1031, 1032, 1033, 1034, and 1035 receive the feature maps P6, P2, P3, P4, and P5, respectively.
Each candidate area generation network is capable of detecting information of a border of an object from the feature map of the corresponding size, the information of the border of the object including, for example, a position of the border of the object, and/or a shape of the border, and/or a size of the border. As for the operation principle of each candidate area generation network, reference may be made to the related art.
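The following sketch shows a conventional region-proposal head of the kind such a candidate area generation network commonly uses: a shared 3 × 3 convolution followed by two 1 × 1 convolutions that output an objectness score and four box deltas per anchor at every position. It is a generic illustration, not necessarily the exact network of the application; the channel count and anchor number are assumptions.

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_anchors: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)       # object/background score per anchor
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # box regression per anchor

    def forward(self, feature_map: torch.Tensor):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.bbox_deltas(x)

# One such head per feature level (cf. 1031-1035) can be applied to P2..P6.
head = RPNHead()
scores, deltas = head(torch.randn(1, 256, 32, 32))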
As shown in fig. 2, the Pooling processing unit 104 performs deformed Pooling (Deformable Pooling) processing on the basis of feature map (feature maps) portions corresponding to the frames of the objects detected by the candidate area generation network so that the objects in the detected frames have the same size.
Among them, the pooling part 104 may have a plurality of pooling processing units, for example, 5, i.e., 1041, 1042, 1043, 1044, and 1045.
Each pooling processing unit obtains the feature image from the corresponding feature generating unit and obtains information of the frame of the object from the corresponding candidate area generating network, thereby performing deformation pooling on the part of the image feature located in the frame of the object.
For example, the pooling processing unit 1043 obtains the feature image P3 from the corresponding feature generating unit 1023, and obtains the information of the bounding box of the object in the feature image P3 from the corresponding candidate area generating network 1033, so as to perform deformed pooling processing on the part of the feature image P3 located in the bounding box of the object, and obtain the part of the feature image in each bounding box after the pooling processing, where the part of the feature image in each bounding box after the pooling processing may be in the form of a matrix, such as a pixel matrix.
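The cropping of a fixed-size feature block for every detected frame can be sketched as follows. The application uses deformed (deformable) RoI pooling; since that operator is not available in torchvision, standard roi_align is used here as a stand-in, so the sketch only illustrates the fixed-size cropping per box, not the learned offsets. The box coordinates and the 7 × 7 output size are assumptions.

import torch
from torchvision.ops import roi_align

p3 = torch.randn(1, 256, 64, 64)                         # feature map P3, stride 8 relative to the image
boxes = torch.tensor([[0.0, 40.0, 40.0, 120.0, 160.0]])  # (batch_index, x1, y1, x2, y2) in image coordinates
pooled = roi_align(p3, boxes, output_size=(7, 7), spatial_scale=1.0 / 8)
# pooled: (num_boxes, 256, 7, 7) -- every frame is mapped to the same spatial size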
In at least one embodiment, the matrices output from the respective pooling processing units 1041, 1042, 1043, 1044 and 1045 have the same scale in the two-dimensional direction of the image.
As shown in fig. 2, the merging unit 105 may merge (concat) the parts of the feature images in each frame after the pooling process by the pooling process unit 104, and since the parts of the feature images in each frame have the same scale in the two-dimensional direction, the merging by the merging unit 105 corresponds to combining the matrices output by the pooling process units 1041, 1042, 1043, 1044, and 1045 in the depth direction perpendicular to the two-dimensional direction.
As shown in fig. 2, the detection unit 106 classifies the results merged by the merging unit 105 by using a plurality of Fully Connected layers (FCs), and outputs the types (classes) of the objects and the frame information of the objects.
In fig. 2, the number of fully connected layers in the detection section 106 is 4, that is, 1061, 1062, 1063, and 1064. However, the present application is not limited thereto, and the number of fully connected layers in the detection section 106 may be more than 4, thereby improving the accuracy of detection.
Regarding the operation principle of each fully connected layer in the detection unit 106, the related art can be referred to, and the description thereof is omitted.
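A sketch of the merging and detection stages described above: the fixed-size per-frame features from the different levels are combined along the depth (channel) direction and then classified by a stack of fully connected layers. The layer widths, the class count and the joint class/box output are illustrative assumptions.

import torch
import torch.nn as nn

num_levels, channels, pool, num_classes = 5, 256, 7, 21
per_level = [torch.randn(4, channels, pool, pool) for _ in range(num_levels)]  # 4 frames, one tensor per level

merged = torch.cat(per_level, dim=1)      # concat along the depth direction, cf. merging unit 105
flat = merged.flatten(start_dim=1)        # (4, 5 * 256 * 7 * 7)

detector = nn.Sequential(                 # cf. fully connected layers 1061-1064
    nn.Linear(num_levels * channels * pool * pool, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, num_classes + 4),     # class scores plus frame refinement
)
outputs = detector(flat)                  # (4, num_classes + 4)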
According to the first aspect of the embodiment of the present application, the multi-size feature generation section 102 generates the feature maps corresponding to the respective sizes, whereby both large-size and small-size objects in the image can be accurately detected; the multi-size feature generation unit 102 generates the feature map by using the deformed convolution process, and thus can reduce the influence of the geometric transformation of the object shape in the image on the detection result; since the pooling processing unit 104 uses the deformed pooling processing, it is possible to further reduce the influence of the geometric transformation of the object shape in the image on the detection result.
Second aspect of the embodiments
A second aspect of the embodiments of the present application provides an object detection method based on a deep learning network, which corresponds to the object detection apparatus based on a deep learning network of the first aspect of the embodiments.
Fig. 4 is a schematic diagram of an object detection method based on a deep learning network according to a second aspect of the embodiment of the present application, and as shown in fig. 4, the method 400 includes:
in operation 401, a plurality of feature extraction units respectively extract features of different sizes from an input image;
in operation 402, a plurality of cascaded feature generation units respectively generate feature maps corresponding to the respective sizes by using a deformed Convolution (Deformable Convolution) process according to the features of the different sizes extracted by the plurality of feature extraction units; and
in operation 403, bounding box information of objects of corresponding sizes is detected from the generated feature maps (feature maps) of different sizes respectively using a candidate region generation Network (RPN).
As shown in fig. 4, the method 400 further includes:
in operation 404, a deformed Pooling (Deformable Pooling) process is performed on the basis of feature map (feature maps) portions corresponding to the frames of the objects detected by the candidate area generation network, so that the objects in the detected frames have the same size.
In operation 405, the feature images in the frames after the deformed pooling process are merged (concat); and
in operation 406, the merged result is classified using a plurality of fully connected layers (FC), and the class of each object and the frame information of each object are output.
In at least one embodiment, operation 402 may include the following operations:
operation 4021, performing interpolation (interpolation) on the feature map output by the previous feature generation unit to obtain an enlarged feature map;
operation 4022, performing convolution processing (for example, 1 × 1 × 256) on the feature of the corresponding size extracted by the feature extraction unit corresponding to the current feature generation unit, and fusing it with the enlarged feature map; and
in operation 4023, a deformation Convolution (Deformable Convolution) process is performed on the fused matrix to form a feature map output by the current feature generation unit.
In at least one embodiment, operation 402 may further include the operations of:
in operation 4024, a deformation Convolution (Deformable Convolution) process is performed on the minimum-sized features extracted by the feature extraction unit to form a feature map P5 output by the feature generation unit corresponding to the minimum-sized features.
In at least one embodiment, operation 402 may further include the operations of:
in operation 4025, a pooling (Pooling) process is performed on the feature map output from the feature generating unit corresponding to the feature having the smallest size to form a feature map P6.
According to the second aspect of the embodiments of the present application, it is possible to generate the feature maps corresponding to the respective sizes, whereby both large-sized and small-sized objects in the image can be accurately detected; moreover, the feature map is generated by using the deformation convolution processing, so that the influence of the geometric transformation of the object shape in the image on the detection result can be reduced; further, using the deformed pooling process, the influence of the geometric transformation of the object shape in the image on the detection result can be further reduced.
Third aspect of the embodiments
A third aspect of an embodiment of the present application provides an electronic device, including: the deep learning network-based object detection apparatus according to the first aspect of the embodiments.
Fig. 5 is a schematic configuration diagram of an electronic device according to the third aspect of the embodiment of the present application. As shown in fig. 5, the electronic device 500 may include: a Central Processing Unit (CPU) 501 and a memory 502, the memory 502 being coupled to the central processor 501. The memory 502 can store various data; in addition, it stores a program for control, which is executed under the control of the central processing unit 501.
In one embodiment, the functions of the deep learning network based object detection apparatus 100 may be integrated into the central processor 501.
The central processor 501 may be configured to execute the deep learning network-based object detection method according to the second aspect of the embodiment.
In another embodiment, the deep learning network based object detecting apparatus 100 may be configured separately from the processor 501, for example, the deep learning network based object detecting apparatus 100 may be configured as a chip connected to the processor 501, and the function of the deep learning network based object detecting apparatus 100 is realized by the control of the processor 501.
Further, as shown in fig. 5, the electronic device 500 may further include: an input/output unit 503, a display unit 504, and the like; the functions of the above components are similar to those of the prior art, and are not described in detail here. It is noted that the electronic device 500 does not necessarily include all of the components shown in FIG. 5; furthermore, the electronic device 500 may also comprise components not shown in fig. 5, which may be referred to in the prior art.
The present application also provides a computer readable program, wherein when the program is executed in an object detection apparatus or an electronic device based on a deep learning network, the program causes the object detection apparatus or the electronic device based on the deep learning network to execute the object detection method based on the deep learning network according to the second aspect of the embodiment.
The present invention further provides a storage medium storing a computer readable program, where the storage medium stores the computer readable program, and the computer readable program enables an object detection apparatus or an electronic device based on a deep learning network to execute the object detection method based on the deep learning network according to the second aspect of the embodiments.
The apparatuses and methods described in connection with the embodiments of the present application may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. For example, one or more of the functional block diagrams and/or one or more combinations of the functional block diagrams illustrated in fig. 1 may correspond to individual software modules of a computer program flow or may correspond to individual hardware modules. These software modules may respectively correspond to the respective operations shown in the first aspect of the embodiment. These hardware modules may be implemented, for example, by solidifying these software modules using a Field Programmable Gate Array (FPGA).
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium; or the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The software module may be stored in the memory of the mobile terminal or in a memory card that is insertable into the mobile terminal. For example, if the electronic device employs a MEGA-SIM card with a larger capacity or a flash memory device with a larger capacity, the software module may be stored in the MEGA-SIM card or the flash memory device with a larger capacity.
One or more of the functional block diagrams and/or one or more combinations of the functional block diagrams described with respect to fig. 1 may be implemented as a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any suitable combination thereof designed to perform the functions described herein. One or more of the functional block diagrams and/or one or more combinations of the functional block diagrams described with respect to fig. 1 may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in communication with a DSP, or any other such configuration.
The present application has been described in conjunction with specific embodiments, but it should be understood by those skilled in the art that these descriptions are intended to be illustrative, and not limiting. Various modifications and adaptations of the present application may occur to those skilled in the art based on the teachings herein and are within the scope of the present application.
With respect to the embodiments described above, the following supplementary notes are also disclosed:
1. an object detection apparatus based on a deep learning network, comprising:
a feature extraction unit having a plurality of feature extraction means for extracting features of different sizes from an input image;
a multi-size feature generation unit having a plurality of cascade-connected feature generation units that generate feature maps (feature maps) corresponding to respective sizes, respectively, by using a deformed Convolution (Deformable Convolution) process, based on the features of different sizes extracted by the feature extraction unit; and
an object position detection unit that detects bounding box information of objects of corresponding sizes from the feature maps of different sizes generated by the multi-size feature generation unit, respectively, using a candidate region generation network (RPN).
2. The apparatus according to supplementary note 1, wherein the apparatus further comprises:
a pooling processing unit that performs a deformed pooling (Deformable Pooling) process on the portions of the feature maps corresponding to the frames of the objects detected by the candidate area generation network, so that the objects in the detected frames have the same size.
3. The apparatus according to supplementary note 1, wherein the feature generation unit includes:
an interpolation unit that performs interpolation (interpolation) on the feature map output by the previous feature generation unit to obtain an enlarged feature map;
a fusion unit that performs convolution processing (for example, 1 × 1 × 256) on the feature of the corresponding size extracted by the feature extraction unit corresponding to the current feature generation unit and fuses it with the enlarged feature map; and
a deformed convolution processing unit that performs deformed convolution (Deformable Convolution) processing on the matrix obtained after the fusion to form the feature map output by the current feature generation unit.
4. The apparatus according to supplementary note 3, wherein,
the multi-size feature generating unit further performs a deformed Convolution (Deformable Convolution) process on the features of the minimum size extracted by the feature extracting unit to form a feature map output by the feature generating unit corresponding to the features of the minimum size.
5. The apparatus according to supplementary note 4, wherein,
the multi-size feature generating unit may further perform pooling (Pooling) processing on the feature map output from the feature generating unit corresponding to the feature having the smallest size to generate the feature map output from the multi-size feature generating unit.
6. The apparatus according to supplementary note 2, wherein the apparatus further comprises:
a merging unit that merges (concat) the feature images in the plurality of frames after the deformation pooling process; and
a detection unit that classifies the merged result using a plurality of fully connected layers (FC) and outputs the class of each object and the frame information of each object.
7. An electronic device having the deep learning network-based object detection apparatus according to any one of supplementary notes 1 to 6.
8. An object detection method based on a deep learning network comprises the following steps:
a plurality of feature extraction units respectively extracting features of different sizes from an input image;
a plurality of cascaded feature generation units respectively generate feature maps corresponding to the sizes by using deformation Convolution (Deformable Convolution) processing according to the features of different sizes extracted by the plurality of feature extraction units; and
frame information of objects of corresponding sizes is detected from feature maps (feature maps) of different sizes generated by a multi-size feature generation unit using a candidate region generation Network (RPN).
9. The method according to supplementary note 8, wherein the method further comprises:
a deformed Pooling process (Deformable Pooling) is performed on the basis of feature map portions corresponding to frames of the objects detected by the candidate area generating network, so that the detected objects in the frames have the same size.
10. The method according to supplementary note 8, wherein generating the feature map corresponding to each size includes:
performing interpolation (interpolation) on the feature map output by the previous feature generation unit to obtain an amplified feature map;
performing convolution processing (for example, 1 × 1 × 256) on the feature of the corresponding size extracted by the feature extraction unit corresponding to the current feature generation unit, and fusing it with the enlarged feature map; and
performing deformed convolution (Deformable Convolution) processing on the matrix obtained after the fusion to form the feature map output by the current feature generation unit.
11. The method according to supplementary note 10, wherein generating the feature map corresponding to each size further includes:
the feature extraction unit performs a deformed Convolution (Deformable Convolution) process on the features of the minimum size extracted by the feature extraction unit to form a feature map output by the feature generation unit corresponding to the features of the minimum size.
12. The method according to supplementary note 11, wherein generating the feature map corresponding to each size further includes:
performing pooling (Pooling) processing on the feature map output by the feature generation unit corresponding to the feature with the minimum size to form the feature map.
13. The method according to supplementary note 9, wherein the method further comprises:
merging (concat) the feature images in the frames after the deformation pooling; and
classifying the merged result using a plurality of fully connected layers (FC), and outputting the class of each object and the frame information of each object.

Claims (10)

1. An apparatus for detecting an object based on a deep learning network, the apparatus comprising:
a feature extraction unit having a plurality of feature extraction means for extracting features of different sizes from an input image;
a multi-size feature generation unit having a plurality of cascade-connected feature generation means for generating feature maps corresponding to respective sizes by using a deformed convolution process based on the features of different sizes extracted by the feature extraction unit; and
an object position detection unit which detects the frame information of the object of the corresponding size from the feature maps of different sizes generated by the multi-size feature generation unit, respectively, using the candidate area generation network.
2. The apparatus of claim 1, wherein the apparatus further comprises:
a pooling processing unit that performs a deformed pooling process on the basis of the feature map portion corresponding to the frame of the object detected by each candidate area generation network so that the objects in each detected frame have the same size.
3. The apparatus of claim 1, wherein the feature generation unit comprises:
an interpolation unit that performs interpolation processing on the feature map output by the previous feature generation unit to obtain an enlarged feature map;
a fusion unit which convolutes the feature of the size extracted by the feature extraction unit corresponding to the current feature generation unit and fuses the feature with the feature map after amplification; and
a deformation convolution processing unit that performs deformation convolution processing on the matrix obtained after fusion to form the feature map output by the current feature generation unit.
4. The apparatus of claim 3, wherein,
the multi-size feature generation unit further performs a deformed convolution process on the minimum-size features extracted by the feature extraction unit to form a feature map output by the feature generation unit corresponding to the minimum-size features.
5. The apparatus of claim 4, wherein,
the multi-size feature generating unit may further pool the feature map output from the feature generating unit corresponding to the feature having the smallest size to form the feature map output from the multi-size feature generating unit.
6. The apparatus of claim 2, wherein the apparatus further comprises:
a merging unit that merges the feature images in the plurality of frames after the deformation pooling process; and
a detection unit that classifies the merged result using a plurality of fully connected layers and outputs the type of each object and frame information of each object.
7. An electronic device, characterized in that the electronic device is provided with the deep learning network-based object detection device of any one of claims 1 to 6.
8. An object detection method based on a deep learning network, which is characterized by comprising the following steps:
a plurality of feature extraction units respectively extracting features of different sizes from an input image;
a plurality of cascaded feature generation units respectively generate feature maps corresponding to the sizes by using deformation convolution processing according to the features of different sizes extracted by the feature extraction units; and
detecting the frame information of the object with the corresponding size from the generated feature maps with different sizes by using the candidate area generation network.
9. The method of claim 8, wherein the method further comprises:
a deformed pooling process is performed on the basis of the feature map portion corresponding to the frame of the object detected by each candidate area generation network so that the objects in each detected frame have the same size.
10. The method of claim 8, wherein generating a feature map corresponding to each dimension comprises:
performing interpolation processing on the feature map output by the previous feature generation unit to obtain an amplified feature map;
performing convolution processing on the feature with the size extracted by the feature extraction unit corresponding to the current feature generation unit, and fusing the feature with the amplified feature map; and
performing deformation convolution processing on the matrix obtained after fusion to form a feature map output by the current feature generation unit.
CN201910525931.6A 2019-06-18 2019-06-18 Object detection method and device based on deep learning network and electronic equipment Pending CN112101373A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910525931.6A CN112101373A (en) 2019-06-18 2019-06-18 Object detection method and device based on deep learning network and electronic equipment
JP2020100215A JP2020205048A (en) 2019-06-18 2020-06-09 Object detection method based on deep learning network, apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910525931.6A CN112101373A (en) 2019-06-18 2019-06-18 Object detection method and device based on deep learning network and electronic equipment

Publications (1)

Publication Number Publication Date
CN112101373A true CN112101373A (en) 2020-12-18

Family

ID=73749060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910525931.6A Pending CN112101373A (en) 2019-06-18 2019-06-18 Object detection method and device based on deep learning network and electronic equipment

Country Status (2)

Country Link
JP (1) JP2020205048A (en)
CN (1) CN112101373A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966556A (en) * 2021-02-02 2021-06-15 豪威芯仑传感器(上海)有限公司 Moving object detection method and system
CN113255589A (en) * 2021-06-25 2021-08-13 北京电信易通信息技术股份有限公司 Target detection method and system based on multi-convolution fusion network

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733672A (en) * 2020-12-31 2021-04-30 深圳一清创新科技有限公司 Monocular camera-based three-dimensional target detection method and device and computer equipment
CN113255434B (en) * 2021-04-08 2023-12-19 淮阴工学院 Apple identification method integrating fruit characteristics and deep convolutional neural network
CN116524293B (en) * 2023-04-10 2024-01-30 哈尔滨市科佳通用机电股份有限公司 Brake adjuster pull rod head loss fault identification method and system based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017139927A1 (en) * 2016-02-17 2017-08-24 Intel Corporation Region proposal for image regions that include objects of interest using feature maps from multiple layers of a convolutional neural network model
CN109191455A (en) * 2018-09-18 2019-01-11 西京学院 A kind of field crop pest and disease disasters detection method based on SSD convolutional network
CN109448307A (en) * 2018-11-12 2019-03-08 哈工大机器人(岳阳)军民融合研究院 A kind of recognition methods of fire disaster target and device
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109740686A (en) * 2019-01-09 2019-05-10 中南大学 A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features
CN109816012A (en) * 2019-01-22 2019-05-28 南京邮电大学 A kind of multiscale target detection method of integrating context information

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017139927A1 (en) * 2016-02-17 2017-08-24 Intel Corporation Region proposal for image regions that include objects of interest using feature maps from multiple layers of a convolutional neural network model
CN109191455A (en) * 2018-09-18 2019-01-11 西京学院 A kind of field crop pest and disease disasters detection method based on SSD convolutional network
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109448307A (en) * 2018-11-12 2019-03-08 哈工大机器人(岳阳)军民融合研究院 A kind of recognition methods of fire disaster target and device
CN109740686A (en) * 2019-01-09 2019-05-10 中南大学 A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features
CN109816012A (en) * 2019-01-22 2019-05-28 南京邮电大学 A kind of multiscale target detection method of integrating context information

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966556A (en) * 2021-02-02 2021-06-15 豪威芯仑传感器(上海)有限公司 Moving object detection method and system
CN112966556B (en) * 2021-02-02 2022-06-10 豪威芯仑传感器(上海)有限公司 Moving object detection method and system
CN113255589A (en) * 2021-06-25 2021-08-13 北京电信易通信息技术股份有限公司 Target detection method and system based on multi-convolution fusion network
CN113255589B (en) * 2021-06-25 2021-10-15 北京电信易通信息技术股份有限公司 Target detection method and system based on multi-convolution fusion network

Also Published As

Publication number Publication date
JP2020205048A (en) 2020-12-24

Similar Documents

Publication Publication Date Title
CN112101373A (en) Object detection method and device based on deep learning network and electronic equipment
CN109255352B (en) Target detection method, device and system
US10210415B2 (en) Method and system for recognizing information on a card
CN109522874B (en) Human body action recognition method and device, terminal equipment and storage medium
CN109214366B (en) Local target re-identification method, device and system
EP3417425B1 (en) Leveraging multi cues for fine-grained object classification
CN109960742B (en) Local information searching method and device
KR20180104609A (en) Method, system, apparatus and readable storage medium for realizing insurance claims fraud prevention based on a plurality of image correspondence
CN108986152B (en) Foreign matter detection method and device based on difference image
WO2022000862A1 (en) Method and apparatus for detecting object in fisheye image, and storage medium
WO2021217924A1 (en) Method and apparatus for identifying vehicle type at traffic checkpoint, and device and storage medium
JP5936561B2 (en) Object classification based on appearance and context in images
Farag A lightweight vehicle detection and tracking technique for advanced driving assistance systems
CN114511865A (en) Method and device for generating structured information and computer readable storage medium
US9392146B2 (en) Apparatus and method for extracting object
CN112926511A (en) Seal text recognition method, device and equipment and computer readable storage medium
KR101733288B1 (en) Object Detecter Generation Method Using Direction Information, Object Detection Method and Apparatus using the same
CN111079585B (en) Pedestrian re-identification method combining image enhancement with pseudo-twin convolutional neural network
JP6016242B2 (en) Viewpoint estimation apparatus and classifier learning method thereof
CN114758145A (en) Image desensitization method and device, electronic equipment and storage medium
Bandyopadhyay et al. RectiNet-v2: A stacked network architecture for document image dewarping
CN112883973A (en) License plate recognition method and device, electronic equipment and computer storage medium
Chang et al. Re-Attention is all you need: Memory-efficient scene text detection via re-attention on uncertain regions
CN111753766A (en) Image processing method, device, equipment and medium
CN116894972B (en) Wetland information classification method and system integrating airborne camera image and SAR image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination