CN107221005B - Object detection method and device - Google Patents

Object detection method and device

Info

Publication number
CN107221005B
CN107221005B (application CN201710309200.9A)
Authority
CN
China
Prior art keywords
picture
layer
connected domain
feature map
convolution kernel
Prior art date
Legal status
Active
Application number
CN201710309200.9A
Other languages
Chinese (zh)
Other versions
CN107221005A (en)
Inventor
刁梁
俞大海
周均扬
Current Assignee
Midea Group Co Ltd
Guangdong Midea White Goods Technology Innovation Center Co Ltd
Original Assignee
Midea Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Midea Group Co Ltd
Priority to CN201710309200.9A
Publication of CN107221005A
Application granted
Publication of CN107221005B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10024 - Color image
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an object detection method and device, wherein the method comprises the following steps: acquiring a depth-of-field picture and an RGB picture of an object to be detected; extracting a connected domain from the depth-of-field picture; acquiring a target feature map layer at which the connected domain coordinates are to be regressed; inputting a target region in the RGB picture into a neural network for processing until the target feature map layer is reached, wherein the target region is the region in the RGB picture corresponding to the connected domain that includes the object to be detected; and performing coordinate regression on the feature map obtained at the target feature map layer to obtain a detection result of the object to be detected in the target region, wherein the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture. In this way, the object detection area is reduced by means of the connected domain, and only the RGB regions corresponding to the connected domain are input into the neural network for processing, which saves a large amount of computation; coordinate regression is performed only on the feature map obtained at the target feature map layer, which accelerates object detection and improves object detection efficiency.

Description

Object detection method and device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an object detection method and apparatus.
Background
With the rapid development of artificial intelligence and big data technology, more and more products are becoming intelligent. Image recognition is a very important part of this trend: an image is used as the input information, objects in the image are located and detected by different methods, and the categories of the objects are identified.
In the related art, object detection can be performed by a conventional image segmentation method, a deep neural network, and the like. Compared with the traditional image segmentation method, the deep neural network method has better robustness, but needs a large amount of data and computing resources for support, so that the object detection speed and accuracy are greatly reduced when the computing resources are limited.
Disclosure of Invention
The present invention has been made to solve at least one of the technical problems of the related art to some extent.
Therefore, a first objective of the present invention is to provide an object detection method, so as to reduce an object detection area through a connected domain, input only RGB pictures corresponding to the connected domain into a neural network for processing, and perform coordinate regression only on a feature map obtained in a target feature map layer, so as to solve the problem that in the prior art, the object detection speed and efficiency are greatly reduced due to insufficient computing resources.
A second object of the present invention is to provide an object detecting device.
A third object of the present invention is to provide another object detecting apparatus.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the invention is to propose a computer program product.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides an object detection method, including the following steps: acquiring a depth-of-field picture and an RGB picture of an object to be detected; extracting a connected domain from the depth-of-field picture; acquiring a target characteristic map layer where the connected domain coordinates are regressed; inputting a target region in the RGB picture into a neural network for processing until reaching the target feature map layer, wherein the target region is a region corresponding to a connected domain comprising the object to be detected in the RGB picture; performing coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; and the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
According to the object detection method, the connected domain is extracted from the depth-of-field picture, the target feature map layer where the coordinates of the connected domain are regressed is obtained, then the RGB pictures corresponding to the connected domain are input into the neural network to be processed until reaching the target feature map layer, and finally the coordinates of the feature map obtained on the target feature map layer are regressed to obtain the detection result of the object to be detected in the target region. Therefore, the object detection area is reduced through the connected domain, only the RGB images corresponding to the connected domain are input into the neural network for processing, a large amount of calculation consumption is saved, only the feature map obtained on the target feature map layer is subjected to coordinate regression, the object detection speed is accelerated, and the object detection efficiency is improved.
In order to achieve the above object, a second embodiment of the present invention provides an object detecting device, including: the image acquisition module is used for acquiring a depth-of-field image and an RGB image of the object to be detected; the extraction module is used for extracting a connected domain from the depth-of-field picture; the acquisition module is used for acquiring a target feature map layer where the connected domain coordinates are in during regression; the processing module is used for inputting a target region in the RGB picture into a neural network to be processed until reaching the target feature map layer, wherein the target region is a region corresponding to the connected domain in the RGB picture; the detection module is used for performing coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; and the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
The object detection device extracts the connected domain from the depth-of-field picture, acquires the target feature map layer where the coordinates of the connected domain are regressed, inputs the RGB pictures corresponding to the connected domain into the neural network to be processed until reaching the target feature map layer, and finally performs coordinate regression on the feature map obtained on the target feature map layer to obtain the detection result of the object to be detected in the target region. Therefore, the object detection area is reduced through the connected domain, only the RGB images corresponding to the connected domain are input into the neural network for processing, a large amount of calculation consumption is saved, only the feature map obtained on the target feature map layer is subjected to coordinate regression, the object detection speed is accelerated, and the object detection efficiency is improved.
In order to achieve the above object, a third embodiment of the present invention provides another object detecting apparatus, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to: acquiring a depth-of-field picture and an RGB picture of an object to be detected; extracting a connected domain from the depth-of-field picture; acquiring a target characteristic map layer where the connected domain coordinates are regressed; inputting a target region in the RGB picture into a neural network for processing until reaching the target feature map layer, wherein the target region is a region corresponding to a connected domain comprising the object to be detected in the RGB picture; performing coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; and the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
In order to achieve the above object, a fourth aspect of the present invention provides a non-transitory computer-readable storage medium, wherein instructions of the storage medium, when executed by a processor on a server side, enable the server side to execute an object detection method, the method including: acquiring a depth-of-field picture and an RGB picture of an object to be detected; extracting a connected domain from the depth-of-field picture; acquiring a target characteristic map layer where the connected domain coordinates are regressed; inputting a target region in the RGB picture into a neural network for processing until reaching the target feature map layer, wherein the target region is a region corresponding to a connected domain comprising the object to be detected in the RGB picture; performing coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; and the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
In order to achieve the above object, a fifth aspect of the present invention provides a computer program product, wherein when executed by an instruction processor of the computer program product, an object detection method is performed, and the method includes: acquiring a depth-of-field picture and an RGB picture of an object to be detected; extracting a connected domain from the depth-of-field picture; acquiring a target characteristic map layer where the connected domain coordinates are regressed; inputting a target region in the RGB picture into a neural network for processing until reaching the target feature map layer, wherein the target region is a region corresponding to a connected domain comprising the object to be detected in the RGB picture; performing coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; and the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow diagram of an object detection method according to one embodiment of the invention;
FIG. 2 is a schematic flow diagram of an object detection method according to another embodiment of the invention;
FIG. 3 is a schematic flow diagram of an object detection method according to yet another embodiment of the invention;
FIG. 4 is a schematic diagram of a model structure according to one embodiment of the invention;
FIG. 5 is a schematic diagram of an object detection device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an acquisition module according to one embodiment of the invention;
FIG. 7 is a schematic diagram of a first computing unit, according to one embodiment of the invention;
FIG. 8 is a schematic structural view of an object detection apparatus according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
An object detection method and apparatus according to an embodiment of the present invention will be described below with reference to the drawings.
With the continuous growth of image data, image-based object detection is being applied ever more widely. For example, in image recognition, a target image is acquired and the objects contained in the target image are detected.
At present, scenes are becoming more and more complex, and when computing resources are insufficient, the speed and accuracy of object detection with the methods in the prior art are greatly reduced.
Compared with the object detection method in the prior art, the object detection method provided by the invention can accelerate the object detection speed and has higher object detection accuracy.
Fig. 1 is a schematic flow chart of an object detection method according to an embodiment of the present invention. As shown in fig. 1, the object detection method includes the steps of:
Step 101, obtaining a depth of field picture and an RGB picture of an object to be detected.
Step 102, extracting a connected domain from the depth of field picture.
In practical application, the depth of field picture and the RGB picture of the object to be measured can be acquired through a 3D camera and other devices.
Further, different methods may be adopted to extract the connected domain from the depth-of-field picture according to the needs of the actual application scene, for example, as follows:
in a first example, the depth of field of each pixel point in a depth of field picture is obtained according to a depth of field two-dimensional distribution function, when the difference between the depth of field of two adjacent pixel points is less than or equal to a preset depth of field threshold, it is determined that the two pixel points belong to the same connected domain, and then all the continuous pixel points belonging to the same connected domain are used for constructing the connected domain for the depth of field picture.
As a possible implementation manner, the two-dimensional distribution function of the depth of field is obtained as follows:
D1 = D(x, y), 0 ≤ x ≤ W1, 0 ≤ y ≤ H1,

where W1 is the length of the RGB picture and H1 is the height of the RGB picture.

Assume a preset depth-of-field threshold d_d. According to this threshold, the depth-of-field two-dimensional distribution function can be divided into a plurality of connected domains. Let the depths of field of two adjacent pixel points be D(x1, y1) and D(x2, y2); when |D(x1, y1) - D(x2, y2)| ≤ d_d, the two pixel points belong to the same connected domain. The j-th connected domain interval may be recorded as: wmin_j ≤ x ≤ wmax_j, hmin_j ≤ y ≤ hmax_j.
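The sketch below illustrates this first example in plain Python with NumPy; the function name, the 4-neighbourhood choice and the small-region filter are illustrative assumptions rather than part of the disclosed method. It grows connected domains by merging adjacent pixel points whose depth-of-field difference does not exceed the threshold d_d, and records each domain's interval.

```python
# Illustrative sketch only: region growing over a depth-of-field map, merging
# 4-adjacent pixels whose depth difference is at most d_d (assumed threshold).
from collections import deque
import numpy as np

def extract_connected_domains(depth, d_d, min_pixels=50):
    """Return a label map and the interval (wmin, wmax, hmin, hmax) of each connected domain."""
    H1, W1 = depth.shape
    labels = np.full((H1, W1), -1, dtype=np.int32)   # -1: unvisited, -2: discarded
    domains = []
    next_label = 0
    for y0 in range(H1):
        for x0 in range(W1):
            if labels[y0, x0] != -1:
                continue
            queue = deque([(y0, x0)])                # breadth-first region growing
            labels[y0, x0] = next_label
            members = []
            while queue:
                y, x = queue.popleft()
                members.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < H1 and 0 <= nx < W1 and labels[ny, nx] == -1
                            and abs(float(depth[ny, nx]) - float(depth[y, x])) <= d_d):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
            if len(members) >= min_pixels:           # drop tiny regions (assumed heuristic)
                xs = [p[1] for p in members]
                ys = [p[0] for p in members]
                domains.append((min(xs), max(xs), min(ys), max(ys)))
                next_label += 1
            else:
                for y, x in members:
                    labels[y, x] = -2
    return labels, domains
```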
In a second example, the depth-of-field picture is processed with software such as OpenCV or MATLAB, and the connected domains of the depth-of-field picture are extracted directly.
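A possible OpenCV-based sketch of this second example is shown below. It assumes the depth-of-field picture has first been reduced to a binary foreground mask by keeping a depth band of interest (the band limits are assumptions); cv2.connectedComponentsWithStats then labels the connected domains and returns their bounding boxes.

```python
# Illustrative sketch: connected domains via OpenCV on a binarized depth picture.
import cv2
import numpy as np

def connected_domains_opencv(depth, near, far):
    mask = ((depth >= near) & (depth <= far)).astype(np.uint8)
    # Label 0 is the background; each stats row is [x, y, width, height, area].
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = [tuple(stats[k, :4]) for k in range(1, num)]
    return labels, boxes
```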
It should be noted that the above method is merely an example of extracting a connected component from a depth image, and other methods may be selected or set according to actual application needs.
In addition, it should be noted that the number of connected components extracted from the depth picture in the above manner may be N, where N represents the number of connected component intervals included in the depth picture.
Step 103, acquiring a target feature map layer where the connected domain coordinates are regressed.
Specifically, a deep convolutional neural network can be designed, and the connected domain coordinate regression problem can be solved with this network. On this basis, the target feature map layer at which the connected domain coordinates are regressed is acquired.
As an implementation manner, a first area of the object to be detected on the RGB picture is calculated first, and then a second area, on the RGB picture, of the convolution kernel used by each feature map layer is calculated. The difference between the first area and the second area corresponding to each feature map layer can thus be obtained, and the layer whose second area corresponds to the minimum difference among all the differences is the target feature map layer. In this embodiment, the target feature map layer may be labeled OL_j.
Step 104, inputting a target area in the RGB picture into the neural network for processing until reaching a target feature map layer, wherein the target area is an area corresponding to a connected domain comprising the object to be detected in the RGB picture.
In this embodiment, the target feature map layer is the last layer at which the neural network performs processing such as feature extraction and downsampling on the target region in the RGB picture.
Step 105, performing coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
Specifically, after the target feature map layer is obtained, a region corresponding to a connected domain including an object to be detected in the RGB picture may be input to the neural network as a target region and processed until the target feature map layer.
In order to further improve the accuracy of the result obtained at the target feature map layer, after the target region is input into the neural network, processing such as feature extraction, downsampling and dimension reduction is performed. For the j-th connected domain, the neural network only needs to process the target region in the RGB picture corresponding to the j-th connected domain up to the OL_j-th layer and output the feature map of that layer.
It should be noted that the neural network in this embodiment refers to a preset neural network model, and the pre-trained neural network model may use various layers, for example convolutional layers and pooling layers for downsampling, so as to shorten the length and width of the feature map.
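As a rough illustration of feeding the target region only up to the target feature map layer OL_j, the following PyTorch-style sketch treats the backbone as an ordered sequence of layers and stops the forward pass at a given index; the layer layout is an assumption and does not correspond to the patent's actual network.

```python
# Illustrative sketch: truncated forward pass up to the target feature map layer.
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # assumed stand-in for the trained model
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

def forward_to_layer(x, target_layer):
    """Propagate the target region x only up to the layer with index target_layer."""
    for idx, layer in enumerate(backbone):
        x = layer(x)
        if idx == target_layer:
            break
    return x
```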
It can be understood that the feature maps of the feature map layers can be obtained after the neural network processing, and the detection result of the object to be detected in the target region can be obtained only by performing coordinate regression on the feature maps obtained in the target feature map layer. And the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
More specifically, feature vectors are extracted from the feature map by using a convolution kernel of the target feature map layer, coordinate regression calculation is carried out on the feature vectors, candidate results of at least one object to be detected in the RGB image are obtained, and finally the actual coordinates and the frame of the object to be detected are determined from the candidate results based on a maximum suppression algorithm or a clustering algorithm.
It is understood that the candidate result includes coordinates and a frame of the object to be measured in the RGB image.
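One common way to realise the screening step described above is greedy suppression of overlapping candidates: the highest-scoring candidate is kept, and candidates that overlap it beyond a threshold are discarded. The box format and the overlap threshold below are assumptions.

```python
# Illustrative sketch: greedy suppression of overlapping candidate boxes.
import numpy as np

def suppress_candidates(boxes, scores, iou_thr=0.5):
    """boxes: (K, 4) array of [x1, y1, x2, y2]; returns indices of kept candidates."""
    order = scores.argsort()[::-1]                      # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        xx1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter + 1e-9)
        order = order[1:][iou <= iou_thr]               # drop candidates that overlap too much
    return keep
```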
According to the object detection method, the connected domain is extracted from the depth-of-field picture, the target feature map layer where the coordinates of the connected domain are regressed is obtained, then the RGB pictures corresponding to the connected domain are input into the neural network to be processed until reaching the target feature map layer, and finally the coordinates of the feature map obtained on the target feature map layer are regressed to obtain the detection result of the object to be detected in the target region. Therefore, the object detection area is reduced through the connected domain, only the RGB images corresponding to the connected domain are input into the neural network for processing, a large amount of calculation consumption is saved, only the feature map obtained on the target feature map layer is subjected to coordinate regression, the object detection speed is accelerated, and the object detection efficiency is improved.
Based on the above embodiment, in order to describe more clearly how to obtain the target feature map layer where the connected component coordinates are regressed, the following is specifically illustrated by the embodiment shown in fig. 2:
fig. 2 is a schematic flow chart of an object detection method according to another embodiment of the present invention.
In this embodiment, first, a first area of the object to be measured on the RGB picture is calculated, and then, a second area of the convolution kernel used by each feature map layer on the RGB picture is calculated, so that a difference value between the first area and the second area corresponding to each feature map layer can be obtained, and a layer where the second area corresponding to the minimum difference value among all the difference values is located is a target feature map layer.
As shown in fig. 2, that is, step S103 in the above embodiment includes: S201-S204.
Step 201, calculating a first area of the object to be measured on the RGB picture.
Specifically, the average distance of the connected domain from the camera is first acquired. As an example, the depths of field of all pixel points in the connected domain are summed, and the ratio of the summed value to the area of the connected domain gives the average distance of the connected domain. Let the average distance be d_j (1 ≤ j ≤ N), where N represents the number of connected domain intervals; the average distance between each connected domain and the camera is then calculated as:

d_j = ( Σ D(x, y) ) / S_j,

where the sum runs over all pixel points (x, y) in the j-th connected domain and S_j denotes the area (number of pixel points) of that connected domain.
further, the actual length and the actual height of the object to be measured are obtained. It can be understood that, in order to ensure the accuracy of the obtained actual length and actual height of the object to be measured, the length and height of the object to be measured may be measured multiple times, and then the length average value and the height average value of the object to be measured may be obtained in an averaging manner as the actual length and actual height of the object to be measured.
Further, the focal length of the camera is multiplied by the actual length, and the ratio of the product to the average distance gives the picture length of the object to be measured. As an example, the picture length of the object to be measured can be calculated by the formula Ow_j = f * W_r / d_j, where f is the focal length of the camera, W_r is the actual length, and d_j is the average distance of the connected domain from the camera.
Further, the focal length of the camera is multiplied by the actual height, and the ratio of the product to the average distance gives the picture height of the object to be measured. As an example, the picture height of the object to be measured can be calculated by the formula Oh_j = f * H_r / d_j, where f is the focal length of the camera, H_r is the actual height, and d_j is the average distance of the connected domain from the camera.
Further, the first area is obtained from the picture length and the picture height. As an example, the first area can be calculated by the formula Os_j = Ow_j * Oh_j, where Ow_j is the picture length and Oh_j is the picture height.
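The calculation of step 201 can be sketched as follows, combining the average distance d_j with the formulas Ow_j = f * W_r / d_j, Oh_j = f * H_r / d_j and Os_j = Ow_j * Oh_j given above; the boolean-mask representation of the connected domain is an implementation assumption.

```python
# Illustrative sketch of step 201: first area of the object on the RGB picture.
import numpy as np

def first_area(depth, domain_mask, f, W_r, H_r):
    """depth: depth-of-field map; domain_mask: boolean mask of the j-th connected domain."""
    area_j = domain_mask.sum()                 # area of the connected domain (pixel count)
    d_j = depth[domain_mask].sum() / area_j    # average distance of the domain from the camera
    Ow_j = f * W_r / d_j                       # picture length of the object
    Oh_j = f * H_r / d_j                       # picture height of the object
    return Ow_j * Oh_j                         # first area Os_j
```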
Step 202, calculating a second area of the convolution kernel used by each feature map layer on the RGB picture.
Specifically, the map length and the map height of the feature map obtained by sampling the target region at the i-th feature map layer are first acquired, where 1 ≤ i ≤ N.
Further, the convolution kernel length and the convolution kernel height of the convolution kernel used by that feature map layer are obtained.
Further, the convolution kernel length is multiplied by the map length, and the ratio of the product to the map length of the first layer gives the convolution kernel picture length, on the RGB picture, of the convolution kernel used by the i-th feature map layer. As an example, it can be calculated by the formula Bw_i = Sw_i * W_i / W_1, where Sw_i is the convolution kernel length, W_i is the map length, and W_1 is the map length of the first layer.
Further, the convolution kernel height is multiplied by the map height, and the ratio of the product to the map height of the first layer gives the convolution kernel picture height, on the RGB picture, of the convolution kernel used by the i-th feature map layer. As an example, it can be calculated by the formula Bh_i = Sh_i * H_i / H_1, where Sh_i is the convolution kernel height, H_i is the map height, and H_1 is the map height of the first layer, i.e., the height of the RGB picture.
Further, the second area is obtained from the convolution kernel picture length and the convolution kernel picture height. As an example, the second area can be calculated by the formula Ss_i = Bw_i * Bh_i, where Bw_i is the convolution kernel picture length and Bh_i is the convolution kernel picture height; that is, Bw_i and Bh_i are respectively the length and height that the i-th layer convolution kernel covers on the RGB picture.
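Step 202 then reduces to the formulas above; a minimal sketch, with the per-layer sizes passed in as placeholders, is:

```python
# Illustrative sketch of step 202: second area of the i-th layer's convolution kernel
# on the RGB picture, using Bw_i = Sw_i * W_i / W_1 and Bh_i = Sh_i * H_i / H_1.
def second_area(Sw_i, Sh_i, W_i, H_i, W_1, H_1):
    Bw_i = Sw_i * W_i / W_1    # convolution kernel picture length
    Bh_i = Sh_i * H_i / H_1    # convolution kernel picture height
    return Bw_i * Bh_i         # second area Ss_i
```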
Step 203, calculating the difference between the first area and the second area corresponding to each feature map layer.
Step 204, determining the layer where the second area corresponding to the minimum difference value in all the difference values is located as the target feature map layer.
Specifically, the area differences are calculated, the minimum difference is obtained, and the layer whose second area corresponds to the minimum difference is the target feature map layer. As an example, calculating the area differences and determining the smallest of all the differences may be expressed as:

Osmin_j = min(|Os_j - Ss_1|, ..., |Os_j - Ss_i|, ..., |Os_j - Ss_L|) = |Os_j - Ss_t|,

where L represents the number of feature map layers obtained after neural network processing and min() is the minimum function. If, for example, the second area of the t-th feature map layer is closest to the first area, the target feature map layer corresponding to the j-th connected domain is determined as OL_j = t (1 ≤ j ≤ N). That is, the target feature map layer is the t-th layer: when the neural network has processed the target region up to the t-th layer, processing goes no further, and the feature map generated by the t-th layer is used for coordinate regression.
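Steps 203 and 204 amount to an argmin over the area differences; a minimal sketch (returning a 1-based layer index) is:

```python
# Illustrative sketch of steps 203-204: OL_j = argmin over i of |Os_j - Ss_i|.
def target_feature_map_layer(Os_j, Ss):
    diffs = [abs(Os_j - Ss_i) for Ss_i in Ss]   # |Os_j - Ss_i| for i = 1..L
    return diffs.index(min(diffs)) + 1          # t, the target feature map layer (1-based)
```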
Therefore, the feature map calculation range is narrowed through the depth-of-field connected domain, the target feature map layer is calculated through the distance information, and coordinate regression is only carried out on the target feature map layer, so that the object detection efficiency is further improved.
Based on the above description of the embodiments, in order to make the above process more clear to those skilled in the art, the following takes gesture detection as an example, and is illustrated as follows in conjunction with fig. 3 and 4:
fig. 3 is a flow chart illustrating an object detection method according to still another embodiment of the present invention. As shown in fig. 3, the object detection method includes the steps of:
step 301, calculating the average value of the actual sizes of the objects to be detected.
Specifically, taking gesture detection as an example, the maximum length and width of various human gestures in a space are collected, and a mean value is calculated to obtain a mean value (including length and width) of actual sizes of the gestures to be detected.
Step 302, training an object detection deep neural network model.
Specifically, an SSD model is trained with gesture data and used as the detection deep neural network model, wherein the SSD model comprises a plurality of feature sampling layers and a coordinate regression layer.
Step 303, acquiring a depth of field picture and an RGB picture through a depth of field camera.
Step 304, extracting a connected domain of the depth-of-field picture, and calculating the number of layers of the bottom layer feature map corresponding to the connected domain.
Specifically, a depth-of-field picture is collected, and the depth-of-field connected domains and the mean distance of each connected domain are calculated. Then, according to the distance mean value and the size of the convolution kernel of the coordinate regression layer, the layer number of the bottom feature map layer, namely the target feature map layer, that is reached after the RGB image region corresponding to the connected domain is fed into the SSD model is calculated.
Step 305, sampling the RGB regions corresponding to the connected domain to the corresponding bottom layer feature map layer to obtain a bottom layer feature map.
Step 306, performing coordinate regression on the bottom layer feature map, and acquiring a detection result of the object to be detected in the target area.
Specifically, the RGB image region is fed into the model and sampled down to the bottom feature map layer to obtain the corresponding bottom feature map, where the bottom feature map is the feature map acquired at the target feature map layer. Coordinate regression is then carried out by means of convolution operations, and the actual coordinates and frame of the gesture to be detected are screened out from the candidate detection results with a maximum suppression algorithm.
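Chaining the illustrative helpers sketched earlier in this description gives a rough end-to-end view of steps 303 to 306. The depth threshold, the per-layer size tuples and the regress_head callable (assumed to return NumPy arrays of candidate boxes and scores) are all assumptions standing in for the trained SSD model and its coordinate regression layer.

```python
# Illustrative end-to-end sketch of steps 303-306 (reuses the helper sketches above).
import numpy as np
import torch

def detect_gestures(depth, rgb, f, W_r, H_r, layer_sizes, regress_head, d_d=0.05):
    _, domains = extract_connected_domains(depth, d_d)                 # step 304
    Ss = [second_area(*sizes) for sizes in layer_sizes]                # (Sw_i, Sh_i, W_i, H_i, W_1, H_1) per layer
    results = []
    for (wmin, wmax, hmin, hmax) in domains:
        mask = np.zeros(depth.shape, dtype=bool)
        mask[hmin:hmax + 1, wmin:wmax + 1] = True
        Os_j = first_area(depth, mask, f, W_r, H_r)
        t = target_feature_map_layer(Os_j, Ss)                         # bottom feature map layer
        crop = rgb[hmin:hmax + 1, wmin:wmax + 1]                       # target region in the RGB picture
        x = torch.from_numpy(np.ascontiguousarray(crop)).permute(2, 0, 1).unsqueeze(0).float()
        feat = forward_to_layer(x, t - 1)                              # step 305 (0-based layer index)
        boxes, scores = regress_head(feat)                             # step 306: coordinate regression (assumed head)
        results.extend(boxes[k] for k in suppress_candidates(boxes, scores))
    return results
```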
More specifically, as shown in the model structure diagram of FIG. 4, two connected domains are extracted from the depth-of-field picture, and the corresponding RGB regions are fed into the model and sampled until the corresponding coordinate regression layer is reached, where the regression operation is performed.
Therefore, the object detection area is reduced through the depth-of-field connected domain, only the RGB pictures corresponding to the connected domain are substituted into the neural network, and a large amount of calculation consumption is saved. The frame size of the object to be detected is determined by using the distance information, the number of layers of the feature map for coordinate regression is calculated, and only the connected domain feature map is subjected to coordinate regression on the layer, so that the target detection efficiency and the recall rate are improved.
Fig. 5 is a schematic structural diagram of an object detection device according to an embodiment of the present invention. As shown in fig. 5, the object detection device includes: the system comprises a picture acquisition module 11, an extraction module 12, an acquisition module 13, a processing module 14 and a detection module 15.
The image obtaining module 11 is configured to obtain a depth-of-field image and an RGB image of the object to be detected.
And the extracting module 12 is configured to extract a connected domain from the depth-of-field picture.
And the obtaining module 13 is configured to obtain a target feature map layer where the connected domain coordinates are regressed.
And the processing module 14 is configured to input a target region in the RGB image into the neural network to process the target region until reaching the target feature map layer, where the target region is a region corresponding to a connected domain in the RGB image.
The detection module 15 is configured to perform coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; and the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
Further, the extraction module 12 is specifically configured to obtain the depth of field of each pixel point in the depth of field picture according to the depth of field two-dimensional distribution function, determine that two pixel points belong to the same connected domain if a difference between the depth of field of two adjacent pixel points is less than or equal to a preset depth of field threshold, and construct the connected domain for the depth of field picture by using all the continuous pixel points belonging to the same connected domain.
Fig. 6 is a schematic structural diagram of an acquisition module according to an embodiment of the present invention. The acquisition module 13 includes: a first calculation unit 131, a second calculation unit 132, and a determination unit 133.
The first calculating unit 131 is configured to calculate a first area of the object to be measured on the RGB image.
And a second calculating unit 132, configured to calculate a second area of the convolution kernel used by each feature map layer on the RGB picture.
The determining unit 133 is configured to calculate a difference between the first area and the second area corresponding to each feature map layer, and determine a layer where the second area corresponding to the minimum difference in all the differences is located as the target feature map layer.
Fig. 7 is a schematic structural diagram of a first computing unit according to an embodiment of the present invention. The first calculation unit 131 includes: a first acquisition subunit 1311, a second acquisition subunit 1312, a third acquisition subunit 1313 and a fourth acquisition subunit 1314.
The first obtaining subunit 1311 is configured to obtain an average distance between the connected component and the camera.
And a second obtaining subunit 1312, configured to obtain an actual length and an actual height of the object to be measured.
A third obtaining subunit 1313, configured to multiply the focal length of the camera by the actual length and take the ratio of the product to the average distance to obtain the picture length of the object to be detected, and to multiply the focal length of the camera by the actual height and take the ratio of the product to the average distance to obtain the picture height of the object to be detected.
A fourth obtaining subunit 1314, configured to obtain the first area according to the picture length and the picture height.
Further, the second calculating unit 132 is specifically configured to: acquire the map length and map height of the feature map obtained by sampling the target region at the i-th feature map layer, where 1 ≤ i ≤ N; acquire the convolution kernel length and convolution kernel height of the convolution kernel used by that feature map layer; multiply the convolution kernel length by the map length and take the ratio of the product to the map length of the first layer to obtain the convolution kernel picture length, on the RGB picture, of the convolution kernel used by the i-th feature map layer; multiply the convolution kernel height by the map height and take the ratio of the product to the map height of the first layer to obtain the convolution kernel picture height, on the RGB picture, of the convolution kernel used by the i-th feature map layer; and obtain the second area according to the convolution kernel picture length and the convolution kernel picture height.
Further, the first obtaining subunit 1311 is specifically configured to sum the depth of field of each pixel point in the connected component, and obtain an average distance of the connected component by taking a ratio of a summed value to an area of the connected component.
Further, the detection module 15 is specifically configured to perform feature vector extraction on the feature map by using a convolution kernel of the target feature map layer, perform coordinate regression operation by using the extracted feature vector to obtain candidate detection results of the at least one object to be detected in the RGB image, and determine a detection result of the object to be detected from the candidate detection results based on a maximum suppression algorithm or a clustering algorithm.
The object detection device extracts the connected domain from the depth-of-field picture, acquires the target feature map layer where the coordinates of the connected domain are regressed, inputs the RGB pictures corresponding to the connected domain into the neural network to be processed until reaching the target feature map layer, and finally performs coordinate regression on the feature map obtained on the target feature map layer to obtain the detection result of the object to be detected in the target region. Therefore, the object detection area is reduced through the connected domain, only the RGB images corresponding to the connected domain are input into the neural network for processing, a large amount of calculation consumption is saved, only the feature map obtained on the target feature map layer is subjected to coordinate regression, the object detection speed is accelerated, and the object detection efficiency is improved.
Fig. 8 is a schematic structural view of an object detecting apparatus according to another embodiment of the present invention. The object detection device includes:
a memory 21, a processor 22 and a computer program stored on the memory 21 and executable on the processor 22.
The processor 22, when executing the program, implements the object detection method provided in the above-described embodiments.
Further, the object detection device further includes:
a communication interface 23 for communication between the memory 21 and the processor 22.
A memory 21 for storing a computer program operable on the processor 22.
The memory 21 may comprise a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory), such as at least one disk memory.
And a processor 22, configured to implement the object detection method according to the foregoing embodiment when executing the program.
If the memory 21, the processor 22 and the communication interface 23 are implemented independently, the communication interface 23, the memory 21 and the processor 22 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this does not mean that there is only one bus or only one type of bus.
Optionally, in a specific implementation, if the memory 21, the processor 22 and the communication interface 23 are integrated on a chip, the memory 21, the processor 22 and the communication interface 23 may complete mutual communication through an internal interface.
The processor 22 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (12)

1. An object detection method, comprising the steps of:
acquiring a depth-of-field picture and an RGB picture of an object to be detected;
extracting a connected domain from the depth-of-field picture;
acquiring a target characteristic map layer where the connected domain coordinates are regressed; the obtaining of the target feature map layer where the connected domain coordinates are regressed comprises: calculating a first area of the object to be detected on the RGB picture; calculating a second area of a convolution kernel used by each characteristic map layer on the RGB picture; calculating a difference between the first area and the second area corresponding to each feature map layer; determining a layer where the second area corresponding to the minimum difference value in all the difference values is located as the target feature map layer;
inputting a target region in the RGB picture into a neural network for processing until reaching the target feature map layer, wherein the target region is a region in the RGB picture corresponding to the connected domain comprising the object to be detected;
performing coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; and the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
2. The object detection method according to claim 1, wherein the extracting the connected component from the depth picture comprises:
acquiring the depth of field of each pixel point in the depth of field picture according to a depth of field two-dimensional distribution function;
if the difference value between the depth of field of two adjacent pixel points is less than or equal to a preset depth of field threshold value, determining that the two pixel points belong to the same connected domain;
and constructing the connected domain for the depth-of-field picture by utilizing all continuous pixel points belonging to the same connected domain.
3. The object detection method according to claim 1, wherein the calculating the first area of the object to be detected on the RGB picture comprises:
acquiring the average distance between the connected domain and a camera;
acquiring the actual length and the actual height of the object to be detected;
multiplying the focal length of the camera by the actual length, and taking the ratio of the multiplied result to the average distance to obtain the picture length of the object to be detected;
multiplying the focal length of the camera by the actual height, and taking the ratio of the multiplied result to the average distance to obtain the picture height of the object to be detected;
and obtaining the first area according to the picture length and the picture height.
4. The object detection method according to claim 1, wherein the calculating the second area of the convolution kernel used by each feature map layer on the RGB picture comprises:
acquiring the map length and the map height of a feature map obtained by sampling the target area in the ith layer of feature map layer; wherein i is more than or equal to 1 and less than or equal to N;
obtaining the convolution kernel length and the convolution kernel height of a convolution kernel used by the characteristic map layer;
multiplying the convolution kernel length by the map length, and then making a ratio of the product to the map length of the first layer to obtain the convolution kernel picture length of the convolution kernel used by the characteristic map layer of the ith layer on the RGB picture;
multiplying the convolution kernel height by the map height, and then making a ratio of the product to the map height of the first layer to obtain the convolution kernel picture height of the convolution kernel used by the characteristic map layer of the ith layer on the RGB picture;
and obtaining the second area according to the length of the convolution kernel picture and the height of the convolution kernel picture.
5. The object detection method of claim 3, wherein the obtaining the average distance of the connected component from the camera comprises:
summing the depth of field of each pixel point in the connected domain;
and taking the ratio of the summed value to the area of the connected domain to obtain the average distance of the connected domain.
6. The object detection method according to any one of claims 1 to 5, wherein performing coordinate regression on the feature maps obtained in the target feature map layer to identify the object to be detected in the target region includes:
extracting feature vectors of the feature map by using the convolution kernel of the target feature map layer;
performing coordinate regression operation by using the extracted feature vectors to obtain a candidate result of the object to be detected in the RGB image, wherein the candidate result comprises coordinates and a frame of the object to be detected in the RGB image;
and determining the actual coordinates and the frame of the object to be detected from the candidate result based on a maximum suppression algorithm or a clustering algorithm.
7. An object detecting device, comprising:
the image acquisition module is used for acquiring a depth-of-field image and an RGB image of the object to be detected;
the extraction module is used for extracting a connected domain from the depth-of-field picture;
the acquisition module is used for acquiring a target feature map layer where the connected domain coordinates are in during regression; wherein, the obtaining module includes: the first calculating unit is used for calculating a first area of the object to be detected on the RGB picture; the second calculation unit is used for calculating a second area of a convolution kernel used by each feature map layer on the RGB picture; the determining unit is used for calculating the difference between the first area and the second area corresponding to each feature map layer, and determining the layer where the second area corresponding to the minimum difference in all the differences is located as the target feature map layer;
the processing module is used for inputting a target area in the RGB picture into a neural network to be processed until reaching the target feature map layer, wherein the target area is an area corresponding to the connected domain comprising the object to be detected in the RGB picture;
the detection module is used for performing coordinate regression on the feature map obtained in the target feature map layer to obtain a detection result of the object to be detected in the target area; and the detection result comprises the coordinates and the frame of the object to be detected in the RGB picture.
8. The object detection device of claim 7, wherein the extraction module is specifically configured to:
acquiring the depth of field of each pixel point in the depth of field picture according to a depth of field two-dimensional distribution function;
if the difference value between the depth of field of two adjacent pixel points is less than or equal to a preset depth of field threshold value, determining that the two pixel points belong to the same connected domain;
and constructing the connected domain for the depth-of-field picture by utilizing all continuous pixel points belonging to the same connected domain.
9. The object detection device according to claim 7, wherein the first calculation unit includes:
the first acquisition subunit is used for acquiring the average distance between the connected domain and the camera;
the second acquiring subunit is used for acquiring the actual length and the actual height of the object to be detected;
a third obtaining subunit, configured to multiply the focal length of the camera by the actual length and take the ratio of the product to the average distance to obtain the picture length of the object to be detected, and to multiply the focal length of the camera by the actual height and take the ratio of the product to the average distance to obtain the picture height of the object to be detected;
and the fourth obtaining subunit is configured to obtain the first area according to the picture length and the picture height.
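Under a pinhole-camera assumption (consistent with, but not spelled out in, claim 9), the first area reduces to two similar-triangle ratios; the sketch below assumes the focal length is expressed in pixels and that the actual length, actual height and average distance share one physical unit. All names are illustrative.

# Minimal sketch of the first-area computation (claim 9).
def first_area(focal_length_px, actual_length, actual_height, avg_distance):
    picture_length = focal_length_px * actual_length / avg_distance
    picture_height = focal_length_px * actual_height / avg_distance
    return picture_length * picture_height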
10. The object detection device according to claim 7, wherein the second calculation unit is specifically used for: obtaining the map length and the map height of the feature map obtained by sampling the target region at the i-th feature map layer, wherein 1 ≤ i ≤ N; obtaining the convolution kernel length and the convolution kernel height of the convolution kernel used by the i-th feature map layer; taking the ratio of the product of the convolution kernel length and the map length to the map length of the first layer to obtain the convolution kernel picture length, on the RGB picture, of the convolution kernel used by the i-th feature map layer; taking the ratio of the product of the convolution kernel height and the map height to the map height of the first layer to obtain the convolution kernel picture height, on the RGB picture, of the convolution kernel used by the i-th feature map layer; and obtaining the second area according to the convolution kernel picture length and the convolution kernel picture height.
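The per-layer second area of claim 10 can likewise be sketched, but with one loud assumption: the claim wording is terse about the direction of the scaling ratio, and the sketch assumes the kernel's footprint on the RGB picture grows in proportion to how much the i-th feature map has been downsampled relative to the first layer. All names are illustrative.

# Minimal sketch of the second-area computation (claim 10), under the stated assumption.
def second_area(kernel_len, kernel_height, map_len_1, map_h_1, map_len_i, map_h_i):
    kernel_picture_len = kernel_len * map_len_1 / map_len_i
    kernel_picture_h = kernel_height * map_h_1 / map_h_i
    return kernel_picture_len * kernel_picture_h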
11. The object detection device according to claim 9, wherein the first obtaining subunit is specifically used for summing the depths of field of all pixel points in the connected domain, and taking the ratio of the sum to the area of the connected domain to obtain the average distance of the connected domain.
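The averaging rule of claim 11 is the simplest of the subunits: sum the per-pixel depths of the connected domain and divide by the number of pixels (its area). A one-function sketch, assuming the domain's depths are passed as a flat sequence:

# Minimal sketch of the average-distance computation (claim 11).
def average_distance(domain_depths):
    return sum(domain_depths) / len(domain_depths)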
12. The object detection device according to any one of claims 7 to 11, wherein the detection module is specifically used for extracting feature vectors from the feature map by using the convolution kernel of the target feature map layer, performing a coordinate regression operation on the extracted feature vectors to obtain candidate detection results of the at least one object to be detected in the RGB picture, and determining the detection result of the object to be detected from the candidate detection results based on a non-maximum suppression algorithm or a clustering algorithm.
CN201710309200.9A 2017-05-04 2017-05-04 Object detection method and device Active CN107221005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710309200.9A CN107221005B (en) 2017-05-04 2017-05-04 Object detection method and device

Publications (2)

Publication Number Publication Date
CN107221005A CN107221005A (en) 2017-09-29
CN107221005B true CN107221005B (en) 2020-05-08

Family

ID=59943806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710309200.9A Active CN107221005B (en) 2017-05-04 2017-05-04 Object detection method and device

Country Status (1)

Country Link
CN (1) CN107221005B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018115001A1 (en) 2018-06-21 2019-12-24 Carl Zeiss Microscopy Gmbh Procedure for calibrating a phase mask and microscope
CN109344772B (en) * 2018-09-30 2021-01-26 中国人民解放军战略支援部队信息工程大学 Ultrashort wave specific signal reconnaissance method based on spectrogram and deep convolutional network
CN109448058A (en) * 2018-11-12 2019-03-08 北京拓疆者智能科技有限公司 " loaded " position three-dimensional coordinate acquisition methods, system and image recognition apparatus
CN111127395B (en) * 2019-11-19 2023-04-07 中国人民解放军陆军军医大学第一附属医院 Blood vessel identification method based on SWI image and recurrent neural network
CN112991253B (en) * 2019-12-02 2024-05-31 合肥美亚光电技术股份有限公司 Central area determining method, foreign matter removing device and detecting equipment
CN112991280B (en) * 2021-03-03 2024-05-28 望知科技(深圳)有限公司 Visual detection method, visual detection system and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8787663B2 (en) * 2010-03-01 2014-07-22 Primesense Ltd. Tracking body parts by combined color image and depth processing
CN104143080A (en) * 2014-05-21 2014-11-12 深圳市唯特视科技有限公司 Three-dimensional face recognition device and method based on three-dimensional point cloud
CN104751559A (en) * 2015-03-25 2015-07-01 深圳怡化电脑股份有限公司 Money detector and money detecting method
CN105059190A (en) * 2015-08-17 2015-11-18 上海交通大学 Vision-based automobile door-opening bump early-warning device and method
CN105279484A (en) * 2015-10-10 2016-01-27 北京旷视科技有限公司 Method and device for object detection
CN106355573A (en) * 2016-08-24 2017-01-25 北京小米移动软件有限公司 Target object positioning method and device in pictures

Also Published As

Publication number Publication date
CN107221005A (en) 2017-09-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20201218

Address after: 528311 4 Global Innovation Center, industrial road, Beijiao Town, Shunde District, Foshan, Guangdong, China

Patentee after: GUANGDONG MEIDI WHITE HOUSEHOLD ELECTRICAL APPLIANCE TECHNOLOGY INNOVATION CENTER Co.,Ltd.

Patentee after: MIDEA GROUP Co.,Ltd.

Address before: 528311, 26-28, B District, Mei headquarters building, 6 Mei Road, Beijiao Town, Shunde District, Foshan, Guangdong.

Patentee before: MIDEA GROUP Co.,Ltd.
