CN113762003A - Target object detection method, device, equipment and storage medium - Google Patents

Target object detection method, device, equipment and storage medium

Info

Publication number
CN113762003A
CN113762003A (application CN202011196088.0A)
Authority
CN
China
Prior art keywords
module
sub
detection
characteristic diagram
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011196088.0A
Other languages
Chinese (zh)
Other versions
CN113762003B (en)
Inventor
刘浩
徐卓然
许新玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Qianshi Technology Co Ltd
Original Assignee
Beijing Jingdong Qianshi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Qianshi Technology Co Ltd filed Critical Beijing Jingdong Qianshi Technology Co Ltd
Priority to CN202011196088.0A priority Critical patent/CN113762003B/en
Publication of CN113762003A publication Critical patent/CN113762003A/en
Application granted granted Critical
Publication of CN113762003B publication Critical patent/CN113762003B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a target object detection method, device, equipment and storage medium. The method comprises the following steps: compressing the acquired three-dimensional point cloud data into an image to be detected under a bird's-eye view; inputting the image to be detected into a pre-trained target detection model, and determining the type and position of each object contained in the image to be detected according to the output result of the target detection model. The target detection model comprises a backbone network and a detector, the detector comprises at least two sub-detection modules connected in parallel, and different sub-detection modules correspond to different object sizes. By inputting the image to be detected into a target detection model comprising a backbone network and at least two parallel sub-detection modules corresponding to different sizes, the embodiment of the invention solves the problem of inaccurate target object detection results in the prior art, improves the recognition of obstacles of different sizes in the field of automatic driving, and thereby helps ensure safety during automatic driving.

Description

Target object detection method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of automatic driving, in particular to a target object detection method, a target object detection device, target object detection equipment and a storage medium.
Background
In order to ensure the safety of operation, the autonomous vehicle must detect and recognize obstacles that may obstruct the vehicle from traveling, so as to perform appropriate avoidance operations according to different types and states of the obstacles. At present, the most mature detection scheme in automatic driving is the BEV (Bird's-eye View) detection of laser radar point cloud, that is, three-dimensional point cloud data is compressed to image data of a Bird's-eye View angle, and then the image data is sent to a 2D target detection algorithm for detection.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the visual target detection algorithm based on RGB image can be used for BEV detection, such as SSD (Single Shot detection), YOLO (you Only Look one), RCNN (regions with CNN features) and its variant algorithm. However, these algorithms are not specially designed for BEV detection, in the field of RGB image detection, the size of the same type of target is different due to the difference in distance from the camera, and in BEV detection, the size of the object only depends on the actual size of the object, so that it is not appropriate to apply the multi-scale method in the RGB image directly, and the detection accuracy needs to be improved.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for detecting target objects, which are used for improving the identification capability of the target objects with different sizes and further ensuring the safety in the automatic driving process.
In a first aspect, an embodiment of the present invention provides a method for detecting a target object, where the method includes:
compressing the acquired three-dimensional point cloud data into an image to be detected under a bird's-eye view;
inputting the image to be detected into a pre-trained target detection model, and determining the type and the position of an object contained in the image to be detected according to an output result of the target detection model;
the target detection model comprises a backbone network and a detector, wherein the detector comprises at least two sub-detection modules connected in parallel; the main network is used for extracting the characteristics of the image to be detected, each sub-detection module is respectively used for detecting the type and the position of an object with the size corresponding to the sub-detection module based on the characteristic diagram matrix extracted by the main network, and the sizes corresponding to different sub-detection modules are different.
In a second aspect, an embodiment of the present invention further provides an apparatus for detecting a target object, where the apparatus includes:
the image to be detected acquisition module is used for compressing the acquired three-dimensional point cloud data into an image to be detected under a bird's-eye view;
the image to be detected input module is used for inputting the image to be detected into a pre-trained target detection model, and determining the type and the position of an object contained in the image to be detected according to an output result of the target detection model;
the target detection model comprises a backbone network and a detector, wherein the detector comprises at least two sub-detection modules connected in parallel; the backbone network is used for extracting features of the image to be detected, each sub-detection module is used for detecting the type and position of objects of the size corresponding to that sub-detection module based on the feature map matrix extracted by the backbone network, and different sub-detection modules correspond to different sizes.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement any of the above-mentioned target object detection methods.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is used to implement any of the above-mentioned target object detection methods when executed by a processor.
The embodiment of the invention has the following advantages or beneficial effects:
the method comprises the steps of inputting an image to be detected into a target detection model comprising a backbone network and a detector, wherein the detector comprises at least two sub-detection modules which are connected in parallel and correspond to different sizes, so that the problem that a multi-scale method is not suitable for BEV image detection in RGB image detection is solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the technical solutions in the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a method for detecting a target object according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a specific example of a backbone network according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a detector according to an embodiment of the present invention;
fig. 4A is a schematic structural diagram of a specific example of a sharing module according to an embodiment of the present invention;
fig. 4B is a schematic structural diagram of an embodiment of a first sub-detection module according to an embodiment of the present invention;
fig. 4C is a schematic structural diagram of a specific example of a second sub-detection module according to an embodiment of the present invention;
fig. 4D is a schematic structural diagram of a specific example of a third sub-detection module according to an embodiment of the present invention;
fig. 5 is a flowchart of a target object detection method according to a second embodiment of the present invention;
fig. 6 is a schematic diagram of a target object detection apparatus according to a third embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a method for detecting a target object according to an embodiment of the present invention, which is applicable to a case of identifying target objects with different physical sizes, and especially applicable to an automatic driving scenario. The method can be executed by a target object detection device, which can be implemented by software and/or hardware, and is integrated in a terminal device. The method specifically comprises the following steps:
and S110, compressing the acquired three-dimensional point cloud data to an image to be detected at the view angle of the aerial view.
In an exemplary embodiment, a lidar device is used to acquire the three-dimensional point cloud data. The working principle of the lidar device can be as follows: the lidar device emits laser light toward a target object along a certain scanning track; when the laser light strikes the surface of the target object, the lidar device receives the laser signal reflected by at least one point on the target object, and this signal carries information such as the direction of and distance to that point relative to the lidar device. From the laser signal, the three-dimensional point cloud data corresponding to the target object can be obtained. The three-dimensional point cloud data comprises a matrix of at least 3 columns and N rows, where each row vector includes at least an (x, y, z) triple and may further include intensity or depth information; specifically, (x, y, z) describes the spatial position of a point on the surface of the target object.
The acquired three-dimensional point cloud data are compressed into an image to be detected under a bird's-eye view; specifically, the three-dimensional point cloud data in the world coordinate system are converted into the image coordinate system to obtain the image to be detected. For example, the point position of each row vector of the three-dimensional point cloud data in the world coordinate system is mapped to a pixel position in the image coordinate system; since point positions in the world coordinate system may be negative while pixel positions in the image coordinate system cannot be, the mapped pixel positions are shifted so that the minimum position coordinate becomes (0, 0). Further, the z-coordinate value corresponding to each point position is mapped to the pixel value of the pixel position corresponding to that point, and the pixel values are filled into the image to obtain the image to be detected.
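As an illustration of this projection step, the following is a minimal sketch in Python/NumPy. The detection range, resolution and min-max normalization of the z values are assumptions made for the example and are not taken from the embodiment.

```python
import numpy as np

def point_cloud_to_bev(points, x_range=(0.0, 80.0), y_range=(-40.0, 40.0),
                       z_range=(-2.0, 2.0), resolution=0.1):
    """Compress an (N, 3+) point cloud into a single-channel bird's-eye-view image.

    points: array of shape (N, >=3) holding (x, y, z, ...) in the world/lidar frame.
    Returns a 2D uint8 image whose pixel values encode the (normalized) height z.
    The ranges and resolution are illustrative assumptions, not values from the patent.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Keep only points inside the chosen detection range.
    mask = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z <= z_range[1]))
    x, y, z = x[mask], y[mask], z[mask]

    # Map world coordinates to pixel indices; shifting by the range minimum
    # plays the role of "setting the minimum mapped position to (0, 0)".
    cols = ((x - x_range[0]) / resolution).astype(np.int32)
    rows = ((y - y_range[0]) / resolution).astype(np.int32)

    h = int((y_range[1] - y_range[0]) / resolution)
    w = int((x_range[1] - x_range[0]) / resolution)
    bev = np.zeros((h, w), dtype=np.uint8)

    # Fill each pixel with the normalized height of the point falling on it;
    # if several points share a pixel, the last one written wins in this sketch.
    z_norm = (z - z_range[0]) / (z_range[1] - z_range[0])
    bev[rows, cols] = (z_norm * 255).astype(np.uint8)
    return bev
```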
S120, inputting the image to be detected into a pre-trained target detection model, and determining the type and the position of an object contained in the image to be detected according to the output result of the target detection model.
In this embodiment, the target detection model includes a backbone network and a detector, and the detector includes at least two sub-detection modules connected in parallel; the backbone network is used for extracting features of the image to be detected, each sub-detection module detects the type and position of objects of the size corresponding to that sub-detection module based on the feature map matrix extracted by the backbone network, and different sub-detection modules correspond to different sizes.
In an embodiment, optionally, the backbone network adopts an Xception network structure, the number of channels in the Xception network structure is less than 728, and the middle flow module in the Xception network structure is repeated fewer than 8 times. The advantage of this arrangement is that, compared with the prior art, the backbone network adopted in this embodiment uses fewer channels and fewer repetitions of the middle flow module, which improves the running speed of the target detection model while maintaining the accuracy of the output result, thereby satisfying the efficiency requirement of the automatic driving field on outputting recognition results.
Fig. 2 is a schematic structural diagram of a specific example of a backbone network according to an embodiment of the present invention. As shown in fig. 2, the backbone network includes an input module, a middle flow module and an output module. In fig. 2, "s" denotes the stride with which the convolution kernel is moved; for example, when s is not given, s = (1, 1). The summation symbol in fig. 2 denotes an element-wise sum (element-wise add), "3 × 3" denotes the size of the convolution kernel, and "32", "128" and "256" denote the number of channels of the corresponding convolution layers. In one embodiment, optionally, when the size of the image to be detected is 513 × 1025 × 3, the size of the feature map matrix output by the backbone network is 129 × 65 × 256. Specifically, the image to be detected is an RGB image.
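The exact layer arrangement of Fig. 2 is not reproduced in the text. The following is a minimal PyTorch sketch of a reduced, Xception-style backbone built from depthwise separable convolutions, with strides chosen so that the spatial sizes match the 513 × 1025 → 129 × 65 example above; the channel counts of 32/128/256 follow the figure description, while everything else (number of middle-flow repeats, normalization, exact block layout) is an assumption.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class ReducedXceptionBackbone(nn.Module):
    """Xception-style backbone with reduced channel counts and fewer middle-flow
    repetitions, as the embodiment describes. Strides give an overall stride of 8,
    so a 3x513x1025 input yields a 256-channel feature map of spatial size 65x129."""
    def __init__(self, middle_repeats=4):
        super().__init__()
        self.entry = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),   # /2
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            SeparableConv2d(32, 128, stride=2),                      # /4
            SeparableConv2d(128, 256, stride=2),                     # /8
        )
        self.middle = nn.Sequential(*[SeparableConv2d(256, 256)
                                      for _ in range(middle_repeats)])

    def forward(self, x):
        x = self.entry(x)
        # Residual (element-wise add) connections around the middle-flow blocks,
        # echoing the element-wise sums shown in Fig. 2.
        for block in self.middle:
            x = x + block(x)
        return x

# Shape check for the sizes mentioned in the embodiment.
feat = ReducedXceptionBackbone()(torch.randn(1, 3, 513, 1025))
print(feat.shape)  # torch.Size([1, 256, 65, 129])
```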
Specifically, the sizes of the target objects corresponding to different sub-detection modules are different. For example, assuming that the image to be detected contains 3 target objects of different sizes, such as an object A, an object B and an object C, the sub-detection modules respectively output the types and positions of the objects of the sizes to which they correspond, i.e. of object A, object B and object C.
In one embodiment, optionally, each sub-detection module includes a dimensionality reduction module and a rectification module; the dimensionality reduction module is used for performing dimensionality reduction processing on the feature map matrix output by the backbone network so as to output a dimensionality-reduced feature map matrix; the rectification module is used for converting the feature map matrix output by the dimensionality reduction module into a feature vector; and the size of the dimensionality-reduced feature map matrix output by each dimensionality reduction module is inversely related to the object size corresponding to the sub-detection module to which that dimensionality reduction module belongs. The detector further includes a first merging module, which is used for merging the feature vectors output by the rectification modules to obtain the output result of the target detection model.
Specifically, the larger the size of the target object corresponding to the sub-detection module to which a dimensionality reduction module belongs, the smaller the size of the feature map matrix output by that dimensionality reduction module. For example, if the size of the target object corresponding to sub-detection module A is larger than the size of the target object corresponding to sub-detection module B, then the size of the feature map matrix output by the dimensionality reduction module in sub-detection module A is smaller than the size of the feature map matrix output by the dimensionality reduction module in sub-detection module B. In one embodiment, optionally, in the field of automatic driving, the target object corresponding to the first sub-detection module may be a truck or a bus, the target object corresponding to the second sub-detection module may be a passenger car, and the target object corresponding to the third sub-detection module may be a pedestrian or a bicycle, where the passenger car may be a car or an off-road vehicle, etc. The receptive field represents the size of the area on the image to be detected onto which the feature map matrix is mapped. In this embodiment, the receptive fields corresponding to the feature map matrices output by the first sub-detection module, the second sub-detection module and the third sub-detection module decrease in turn.
In one embodiment, optionally, the at least two parallel sub-detection modules comprise: the device comprises a first sub-detection module, a second sub-detection module and a third sub-detection module; wherein: the first sub-detection module corresponds to a first size, the second sub-detection module corresponds to a second size, the third sub-detection module corresponds to a third size, the first size is larger than the second size, and the second size is larger than the third size.
In one embodiment, optionally, the detector further comprises a sharing module, which is used for down-sampling the feature map matrix output by the backbone network and outputting the result to the first sub-detection module, the second sub-detection module and the third sub-detection module. The first sub-detection module includes: a first down-sampling module, used for down-sampling the feature map matrix output by the sharing module and outputting the result; a first dimensionality reduction module, used for performing dimensionality reduction processing on the feature map matrix output by the first down-sampling module to obtain a feature map matrix of a fourth size; and a first rectification module, used for converting the feature map matrix of the fourth size output by the first dimensionality reduction module into a feature vector and outputting it. The second sub-detection module includes: a second dimensionality reduction module, used for performing dimensionality reduction processing on the feature map matrix output by the sharing module to obtain a feature map matrix of a fifth size; and a second rectification module, used for converting the feature map matrix of the fifth size output by the second dimensionality reduction module into a feature vector and outputting it. The third sub-detection module includes: a preprocessing module, used for extracting features from the feature map matrix output by the backbone network and outputting the result to a second merging module; an up-sampling module, used for up-sampling the feature map matrix output by the sharing module and outputting the result to the second merging module; the second merging module, used for merging the feature map matrix output by the preprocessing module and the feature map matrix output by the up-sampling module and outputting the merged feature map matrix to a third dimensionality reduction module; the third dimensionality reduction module, used for performing dimensionality reduction processing on the feature map matrix output by the second merging module to obtain a feature map matrix of a sixth size; and a third rectification module, used for converting the feature map matrix of the sixth size output by the third dimensionality reduction module into a feature vector and outputting it. The fourth size is smaller than the fifth size, and the fifth size is smaller than the sixth size.
Fig. 3 is a schematic structural diagram of a detector according to an embodiment of the present invention. Fig. 3 shows 3 sub-detection modules connected in parallel, where a network structure in a box with the longest dotted line represents a first sub-detection module, a network structure in a box with the second longest dotted line represents a second sub-detection module, and a network structure in a box with the shortest dotted line represents a third sub-detection module, where a first size, a second size, and a third size corresponding to the first sub-detection module, the second sub-detection module, and the third sub-detection module respectively decrease sequentially. Specifically, the second merge module may employ a concatenation function (Concat).
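To make the parallel structure of Fig. 3 concrete, here is a minimal PyTorch sketch of a three-branch detector. Only the branch topology (sharing module, down-sampled branch, direct branch, up-sampled and concatenated branch, flattening and merging) follows the embodiment; the channel widths, the use of stride-2 separable convolutions for down-sampling, and the per-anchor output width of 12 (the 12-dimensional tuple described in Embodiment two) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sep_conv(in_ch, out_ch, stride=1):
    """Depthwise separable conv block (depthwise 3x3 + pointwise 1x1), per the embodiment."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class MultiSizeDetector(nn.Module):
    """Three parallel sub-detection modules on top of a sharing module.
    out_dim = 12 assumes the (s, dx, dy, log w, log l, cos, sin, c0..c4) tuple of Embodiment two."""
    def __init__(self, in_ch=256, out_dim=12):
        super().__init__()
        self.share = sep_conv(in_ch, in_ch, stride=2)        # sharing module: /2
        # First sub-detection module (largest objects): extra down-sampling, coarsest map.
        self.down1 = sep_conv(in_ch, in_ch, stride=2)
        self.reduce1 = nn.Conv2d(in_ch, out_dim, 1)
        # Second sub-detection module (medium objects): works on the shared map directly.
        self.reduce2 = nn.Conv2d(in_ch, out_dim, 1)
        # Third sub-detection module (smallest objects): full-resolution branch.
        self.pre3 = sep_conv(in_ch, 128)                     # preprocessing on the backbone map
        self.up3 = sep_conv(in_ch, 256)                      # refine the up-sampled shared map
        self.reduce3 = nn.Conv2d(128 + 256, out_dim, 1)      # merged map has 384 channels

    def forward(self, feat):                                 # feat: backbone output, e.g. (N, 256, 129, 65)
        shared = self.share(feat)
        # Branch 1: coarse map for large objects.
        p1 = self.reduce1(self.down1(shared))
        # Branch 2: medium-resolution map.
        p2 = self.reduce2(shared)
        # Branch 3: bilinearly up-sample the shared map back to the backbone resolution
        # and concatenate it with the preprocessed backbone features.
        up = F.interpolate(self.up3(shared), size=feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        p3 = self.reduce3(torch.cat([self.pre3(feat), up], dim=1))
        # First merging module: flatten ("rectify") each map to (N, H*W, out_dim)
        # per-anchor vectors and concatenate over all anchors.
        flat = [p.flatten(2).transpose(1, 2) for p in (p1, p2, p3)]
        return torch.cat(flat, dim=1)
```

With a backbone feature map of spatial size 129 × 65, this sketch produces per-branch maps of 33 × 17, 65 × 33 and 129 × 65, i.e. 11091 anchor points in total, matching the sizes quoted below and in Embodiment two.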
In an embodiment, optionally, in the field of automatic driving, when the output of the backbone network is a feature map matrix of 129 × 65 × 256, the fourth size corresponding to the first sub-detection module is 33 × 17, the fifth size corresponding to the second sub-detection module is 65 × 33, and the sixth size corresponding to the third sub-detection module is 129 × 65.
In one embodiment, optionally, the sharing module, the first downsampling module, the first dimensionality reduction module, the second dimensionality reduction module, the preprocessing module, the upsampling module, and the third dimensionality reduction module each include at least one depth separable convolution.
Fig. 4A is a schematic structural diagram of an embodiment of a shared module according to an embodiment of the present invention, fig. 4B is a schematic structural diagram of an embodiment of a first sub-detection module according to an embodiment of the present invention, fig. 4C is a schematic structural diagram of an embodiment of a second sub-detection module according to an embodiment of the present invention, and fig. 4D is a schematic structural diagram of an embodiment of a third sub-detection module according to an embodiment of the present invention. In one embodiment, optionally, the upsampling module of the third sub-detection module performs an upsampling operation by using a bilinear interpolation algorithm. Specifically, the size of the feature map matrix output by the Upsampling network structure (Upsampling) in the Upsampling module is 130 × 66, and the size of the feature map matrix output by the Upsampling module is 129 × 65. In one embodiment, the size of the feature map matrix output by the second merging module is optionally 129 × 65 × 384.
According to the technical scheme of this embodiment, the image to be detected is input into a target detection model comprising a backbone network and a detector, wherein the detector comprises at least two parallel sub-detection modules corresponding to different sizes, which solves the problem that the multi-scale methods used in RGB image detection are not suitable for BEV image detection and improves the recognition of target objects of different sizes.
Example two
Fig. 5 is a flowchart of a target object detection method provided in the second embodiment of the present invention, where the present embodiment performs optimization based on the foregoing embodiments, and optionally, the target detection model training method includes: acquiring training sample data, wherein the training sample data comprises a sample image, and the type and the position center point of a sample object contained in the sample image; inputting the training sample data into a target detection model to be trained, and determining a point matched with the position central point as a target anchor (anchor) central point in each point of a characteristic diagram matrix output by a sub-detection module corresponding to the size of the sample object; obtaining a prediction result of the target detection model, and determining a loss function value according to the prediction result, the type of the sample object and the center point of the target anchor; and reversely adjusting the weight parameters in the target detection model according to the loss function values. Explanations of the same or corresponding terms in this embodiment as those in the above embodiments are omitted here for brevity.
The specific implementation steps of this embodiment include:
and S210, acquiring training sample data.
In this embodiment, the training sample data includes the sample image, and the type and position center point of each sample object contained in the sample image. The sample image is obtained by compressing the acquired three-dimensional point cloud data into a bird's-eye view, and the sample image contains at least one sample object. For example, in the field of automatic driving, the sample image may include at least one of a passenger car, a pedestrian, a bicycle, a truck and a bus, where the number of each kind of sample object is at least one. Specifically, each sample object in the sample image corresponds to a type and a position center point.
S220, inputting training sample data into a target detection model to be trained, and determining a point matched with the position central point as a target anchor central point in each point of a characteristic diagram matrix output by the sub-detection module corresponding to the size of the sample object.
Specifically, the number of points in the feature map matrix output by a sub-detection module corresponds to the size of that feature map matrix. For ease of understanding, assume that the size of the sample image is 1025 × 513 and the size of the feature map matrix output by the sub-detection module is 129 × 65; since the feature map matrix output by the sub-detection module is obtained by down-sampling the sample image, the 129 × 65 = 8385 points of the feature map matrix can be understood as being arranged uniformly and equidistantly over the 1025 × 513 sample image. Specifically, the points in the feature map matrix may be referred to as anchor points (anchors), and their coordinate values on the sample image may be referred to as anchor center points.
In an embodiment, optionally, determining, as the center point of the target anchor, a point that matches the center point of the position among the points of the feature map matrix output by the sub-detection modules corresponding to the size of the sample object includes: and respectively determining the distance value between the position center point and each point of the characteristic diagram matrix output by the sub-detection module corresponding to the size of the sample object, and determining the point corresponding to the minimum distance value as the center point of the target anchor.
Wherein, the target anchor central point satisfies the formula:
a* = argmin_{a ∈ F} dist(C_a, C_GT)
where dist denotes the Euclidean distance, F is the feature map matrix output by the current sub-detection module, whose size is (H_F, W_F, CH_F), with H_F denoting the height of F, W_F the width of F, and CH_F the number of channels of F; a denotes a point in the feature map matrix, i.e. an anchor point, and a ∈ F denotes the position of point a in the projection of F onto the (H, W) plane; C_a denotes the coordinate value of the point of the feature map matrix with respect to the sample image, i.e. the anchor center point, and C_GT denotes the coordinate value of the position center point of the sample object in the training sample data.
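A minimal sketch of this nearest-anchor matching in Python/NumPy follows. The anchor-center grid construction assumes the anchors are laid out uniformly and equidistantly over the sample image, as described above; the helper names and the example coordinates are illustrative.

```python
import numpy as np

def anchor_centers(image_hw, feat_hw):
    """Coordinates (on the sample image) of the anchor points of a feature map
    laid out uniformly and equidistantly over an image of size image_hw."""
    img_h, img_w = image_hw
    f_h, f_w = feat_hw
    ys = (np.arange(f_h) + 0.5) * img_h / f_h   # uniform, equidistant grid (assumed convention)
    xs = (np.arange(f_w) + 0.5) * img_w / f_w
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_x, grid_y], axis=-1)   # shape (f_h, f_w, 2), (x, y) order

def match_target_anchor(position_center, image_hw, feat_hw):
    """Return the (row, col) index and coordinates of the anchor center C_a that
    minimizes the Euclidean distance to the sample object's position center C_GT."""
    centers = anchor_centers(image_hw, feat_hw)
    dist = np.linalg.norm(centers - np.asarray(position_center), axis=-1)
    idx = np.unravel_index(np.argmin(dist), dist.shape)
    return idx, centers[idx]

# Example: a sample object centered at (x=500.0, y=260.0) on a 513 x 1025 sample image,
# matched against the 129 x 65 feature map of the corresponding sub-detection module.
idx, c_a = match_target_anchor((500.0, 260.0), image_hw=(513, 1025), feat_hw=(65, 129))
print(idx, c_a)
```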
For example, if the sample image includes a sample object a and a sample object B, and the size of the sample object a is greater than that of the sample object B, the point of the feature map matrix output by the first sub-detection module is matched with the position center point a corresponding to the sample object a, and the point of the feature map matrix output by the second sub-detection module is matched with the position center point B corresponding to the sample object B.
S230, obtaining a prediction result of the target detection model, and determining a loss function value according to the prediction result, the type of the sample object and the center point of the target anchor; and reversely adjusting the weight parameters in the target detection model according to the loss function values.
In an embodiment, optionally, determining the loss function value according to the prediction result, the type of the sample object, and the target anchor center point includes: determining loss function values respectively aiming at each element in the prediction result according to the prediction result, the type of the sample object and the center point of the target anchor; and adding the loss function values corresponding to the elements respectively to obtain the final loss function value.
In one embodiment, optionally, the prediction result includes a confidence element, a predicted position element and a predicted category element; determining a loss function value for each element in the prediction result according to the prediction result, the type of the sample object and the target anchor center point includes: determining a sigmoid focal loss function value corresponding to the confidence element according to the prediction result and the target anchor center point; determining a smooth L1 (moderated minimum absolute value deviation) loss function value corresponding to the predicted position element according to the prediction result and the position information of each point of the sample object; and determining a cross entropy loss function value corresponding to the predicted category element according to the prediction result and the type of the sample object.
Specifically, each element in the prediction result is given for a single point in the feature map matrix output by a sub-detection module; that is, each point in the feature map matrix corresponds to a confidence element, a predicted position element and a predicted category element. Illustratively, the confidence element is denoted by "s" and characterizes the confidence that the position of a point in the feature map matrix belongs, or does not belong, to a sample object, where "does not belong to a sample object" may mean belonging to the background of the sample image. Specifically, the predicted position element represents the offset of a point in the feature map matrix with respect to the target anchor center point, the size feature values of the bounding box (bounding-box) of the predicted sample object, and the direction of the predicted sample object. In one embodiment, optionally, the predicted position elements include a horizontal-axis offset, a vertical-axis offset, a width feature value of the bounding box, a length feature value of the bounding box, a direction cosine feature value and a direction sine feature value. Illustratively, the horizontal-axis offset is denoted by "Δx", the vertical-axis offset by "Δy", the width feature value of the bounding box by "log(w)", where "w" denotes the width of the bounding box, the length feature value of the bounding box by "log(l)", where "l" denotes the length of the bounding box, the direction cosine feature value by "cos(θ)" and the direction sine feature value by "sin(θ)", where "θ" denotes the orientation of the sample object. Specifically, the predicted category element characterizes the predicted type of the sample object and is illustratively denoted by "c0, c1, c2, c3, c4".
In one embodiment, when the target detection model to be trained includes the first sub-detection module, the second sub-detection module and the third sub-detection module, and the feature map matrices output by the first, second and third sub-detection modules have sizes of 33 × 17, 65 × 33 and 129 × 65, respectively, the prediction result of the target detection model to be trained includes 12-dimensional tuples corresponding to 11091 points. Specifically, each 12-dimensional tuple is (s, Δx, Δy, log(w), log(l), cos(θ), sin(θ), c0, c1, c2, c3, c4). In this embodiment, the loss function values are calculated based on the 12-dimensional tuple corresponding to each point in the prediction result.
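The embodiment does not spell out how such a tuple is turned back into a box at inference time; the sketch below shows one conventional decoding, purely as an illustration of what the tuple components represent. The offset convention, the use of atan2, and the box format are all assumptions, not taken from the patent.

```python
import math

def decode_prediction(tuple12, anchor_center):
    """Interpret one 12-dimensional prediction tuple relative to its anchor center.

    tuple12: (s, dx, dy, log_w, log_l, cos_t, sin_t, c0, c1, c2, c3, c4)
    anchor_center: (ax, ay) coordinates of the matched anchor on the BEV image.
    Returns confidence, box (cx, cy, w, l, theta) and the index of the best class.
    The decoding conventions here are illustrative assumptions.
    """
    s, dx, dy, log_w, log_l, cos_t, sin_t, *class_scores = tuple12
    ax, ay = anchor_center
    confidence = 1.0 / (1.0 + math.exp(-s))       # sigmoid of the confidence element
    cx, cy = ax + dx, ay + dy                     # offsets are relative to the anchor center
    w, l = math.exp(log_w), math.exp(log_l)       # size feature values are log-encoded
    theta = math.atan2(sin_t, cos_t)              # orientation from its sine/cosine features
    best_class = max(range(len(class_scores)), key=lambda i: class_scores[i])
    return confidence, (cx, cy, w, l, theta), best_class

print(decode_prediction((2.0, 1.5, -0.5, math.log(16), math.log(40), 1.0, 0.0,
                         0.1, 3.2, 0.4, -1.0, 0.2), anchor_center=(512.0, 256.0)))
```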
The sigmoid focal loss function value corresponding to the confidence element is determined according to the prediction result and the target anchor center point, and satisfies the following formulas:
sigmoid_focal_loss(p_t) = -α · (1 - p_t)^γ · log(p_t)
p_t = sigmoid(x) if label = 1, and p_t = 1 - sigmoid(x) if label = 0
where sigmoid_focal_loss(p_t) denotes the sigmoid focal loss function value and x denotes the confidence element in the prediction result; if the current point in the feature map matrix is the target anchor center point matched with the position center point, then label = 1, otherwise label = 0. In one embodiment, α is 0.25 and γ is 2.
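A minimal PyTorch sketch of this confidence loss follows, under the assumption that x is a raw logit per anchor point and that label marks the matched target anchor center points; the epsilon term is a numerical safeguard added for the example.

```python
import torch

def sigmoid_focal_loss(x, label, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss over the confidence elements.

    x:     raw confidence logits, one per anchor point (any shape).
    label: 1.0 where the anchor is the target anchor center point matched to a
           position center point, 0.0 otherwise (same shape as x).
    """
    p = torch.sigmoid(x)
    p_t = torch.where(label > 0.5, p, 1.0 - p)
    eps = 1e-8                                   # numerical safety, not part of the formula
    return -alpha * (1.0 - p_t) ** gamma * torch.log(p_t + eps)

loss = sigmoid_focal_loss(torch.tensor([2.3, -1.7]), torch.tensor([1.0, 0.0]))
print(loss.sum())
```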
The moderated minimum absolute value deviation (smooth L1) loss function value corresponding to the predicted position element is determined according to the prediction result and the position information of each point of the sample object; specifically, it satisfies the following formulas:
L1 = Σ_{i=1}^{n} |β_i - GT_i|
smooth_L1_loss(x) = 0.5 · x² if |x| < 1, and |x| - 0.5 otherwise, with x = L1
where smooth_L1_loss denotes the moderated minimum absolute value deviation loss function value, i denotes the index of a component of the position information vector, n denotes the total number of components of that vector, β denotes the position information in the prediction result, GT denotes the position information in the training sample data, β_i - GT_i denotes the difference of each vector component between the position information in the prediction result and the position information in the training sample data, L1 denotes the sum of the absolute values of these differences, and x is L1. The position information in this embodiment includes Δx, Δy, log(w), log(l), cos(θ) and sin(θ).
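A short sketch of the same computation, in the PyTorch style used above; stacking the six position components into one vector per matched anchor is an assumption about data layout.

```python
import torch

def smooth_l1_position_loss(pred_pos, gt_pos):
    """Smooth L1 (moderated minimum absolute value deviation) loss on the position elements.

    pred_pos, gt_pos: tensors of shape (n,) holding (dx, dy, log w, log l, cos, sin)
    for one matched anchor. x is the summed absolute deviation L1, as in the embodiment.
    """
    x = torch.sum(torch.abs(pred_pos - gt_pos))   # L1 = sum of |beta_i - GT_i|
    return 0.5 * x ** 2 if x < 1.0 else x - 0.5

pred = torch.tensor([0.3, -0.1, 2.7, 3.6, 0.99, 0.05])
gt = torch.tensor([0.2, 0.0, 2.8, 3.7, 1.00, 0.00])
print(smooth_l1_position_loss(pred, gt))
```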
The cross entropy loss function value corresponding to the predicted category element is determined according to the prediction result and the type of the sample object; specifically, it satisfies the following formula:
H(P, Q) = -Σ_i P(x_i) · log(Q(x_i))
where H(P, Q) denotes the cross entropy loss function value, P denotes the type of the sample object in the training sample data, Q denotes the type of the sample object in the prediction result, i indexes the classes, and x_i denotes c0, c1, c2, c3 or c4.
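To round out the three loss terms, here is a sketch of the cross entropy on the category elements; applying a softmax to the raw scores c0..c4 to obtain Q is an assumption, since the embodiment does not state how Q is normalized. Per the description above, the sigmoid focal loss, the smooth L1 loss and this cross entropy term computed for each element are then added together to obtain the final loss function value.

```python
import torch
import torch.nn.functional as F

def cross_entropy_category_loss(pred_scores, target_class):
    """Cross entropy H(P, Q) between the one-hot sample type P and the predicted distribution Q.

    pred_scores: raw predicted category scores (c0..c4) for one anchor point.
    target_class: integer index of the sample object's type.
    """
    q = F.log_softmax(pred_scores, dim=-1)   # turn c0..c4 into a distribution Q (assumed normalization)
    p = F.one_hot(target_class, num_classes=pred_scores.shape[-1]).float()
    return -(p * q).sum()

print(cross_entropy_category_loss(torch.tensor([0.1, 3.2, 0.4, -1.0, 0.2]), torch.tensor(1)))
```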
S240, compressing the acquired three-dimensional point cloud data into an image to be detected under a bird's-eye view.
And S250, inputting the image to be detected into a pre-trained target detection model, and determining the type and the position of an object contained in the image to be detected according to the output result of the target detection model.
According to the technical scheme of the embodiment, the obtained training sample data is input into the target detection model to be trained, the center point of the target anchor is determined based on each point of the characteristic diagram matrix output by each sub-detection module in the target detection model to be trained, and the weight parameters in the target detection model are reversely adjusted based on the prediction result, the type of the sample object and the loss function value determined by the center point of the target anchor, so that the training problem of the target detection model is solved. The method provided by the embodiment of the invention is applied to the field of automatic driving, and the aim of improving the safety in the automatic driving process can be further fulfilled.
The following is an embodiment of an apparatus for detecting a target object according to an embodiment of the present invention, which belongs to the same inventive concept as the method for detecting a target object according to the above embodiments, and reference may be made to the above embodiment of the method for detecting a target object for details that are not described in detail in the embodiment of the apparatus for detecting a target object.
EXAMPLE III
Fig. 6 is a schematic diagram of a target object detection apparatus according to a third embodiment of the present invention, which is applicable to a case of target object identification for different physical sizes, and especially applicable to an automatic driving scenario. The target object detection device includes: an image to be detected acquisition module 310 and an image to be detected input module 320.
The image to be detected acquisition module 310 is used for compressing the acquired three-dimensional point cloud data into an image to be detected under a bird's-eye view;
the image to be detected input module 320 is configured to input an image to be detected into a pre-trained target detection model, and determine the type and position of an object included in the image to be detected according to an output result of the target detection model;
the target detection model comprises a backbone network and a detector, wherein the detector comprises at least two sub-detection modules connected in parallel; the backbone network is used for extracting features of the image to be detected, each sub-detection module is used for detecting the type and position of objects of the size corresponding to that sub-detection module based on the feature map matrix extracted by the backbone network, and different sub-detection modules correspond to different sizes.
According to the technical scheme of this embodiment, the image to be detected is input into a target detection model comprising a backbone network and a detector, wherein the detector comprises at least two parallel sub-detection modules corresponding to different sizes, which solves the problem that the multi-scale methods used in RGB image detection are not suitable for BEV image detection and improves the recognition of target objects of different sizes.
On the basis of the above technical solution, optionally, each sub-detection module includes a dimensionality reduction module and a rectification module; the dimensionality reduction module is used for performing dimensionality reduction processing on the feature map matrix output by the backbone network so as to output a dimensionality-reduced feature map matrix; the rectification module is used for converting the feature map matrix output by the dimensionality reduction module into a feature vector; and the size of the dimensionality-reduced feature map matrix output by each dimensionality reduction module is inversely related to the object size corresponding to the sub-detection module to which that dimensionality reduction module belongs;
the detector further comprises: and the first merging module is used for merging the characteristic vectors output by the rectifying modules to obtain an output result of the target detection model.
On the basis of the above technical solution, optionally, at least two sub-detection modules connected in parallel include: the device comprises a first sub-detection module, a second sub-detection module and a third sub-detection module; wherein:
the first sub-detection module corresponds to a first size, the second sub-detection module corresponds to a second size, the third sub-detection module corresponds to a third size, the first size is larger than the second size, and the second size is larger than the third size.
On the basis of the above technical solution, optionally, the detector further includes: the sharing module is used for performing downsampling processing on the characteristic diagram matrix output by the main network and outputting the characteristic diagram matrix to the first sub-detection module, the second sub-detection module and the third sub-detection module;
the first sub-detection module includes: the first down-sampling module is used for outputting the characteristic diagram matrix output by the sharing module after down-sampling; the first dimension reduction module is used for carrying out dimension reduction processing on the characteristic diagram matrix output by the first downsampling module to obtain a characteristic diagram matrix with a fourth size; the first rectifying module is used for converting the characteristic diagram matrix of the fourth size output by the first dimensionality reduction module into a characteristic vector and outputting the characteristic vector;
the second sub-detection module includes: the second dimension reduction module is used for carrying out dimension reduction processing on the characteristic diagram matrix output by the sharing module to obtain a characteristic diagram matrix with a fifth size; the second rectifying module is used for converting the characteristic diagram matrix of the fifth size output by the second dimensionality reduction module into a characteristic vector and outputting the characteristic vector;
the third sub-detection module includes: the preprocessing module is used for extracting the characteristics of the characteristic diagram matrix output by the backbone network and outputting the characteristic diagram matrix to the second merging module; the up-sampling module is used for up-sampling the characteristic diagram matrix output by the sharing module and outputting the characteristic diagram matrix to the second merging module; the second merging module is used for merging the characteristic diagram matrix output by the preprocessing module and the characteristic diagram matrix output by the up-sampling module and outputting the merged characteristic diagram matrix to the third dimension reduction module; the third dimension reduction module is used for carrying out dimension reduction processing on the characteristic diagram matrix output by the second combination module to obtain a characteristic diagram matrix with a sixth size; the third rectifying module is used for converting the characteristic diagram matrix with the sixth size output by the third dimensionality reduction module into a characteristic vector and outputting the characteristic vector; wherein the fourth dimension is smaller than the fifth dimension, and the fifth dimension is smaller than the sixth dimension.
Based on the above technical solution, optionally, the sharing module, the first downsampling module, the first dimensionality reduction module, the second dimensionality reduction module, the preprocessing module, the upsampling module, and the third dimensionality reduction module all include at least one depth separable convolution.
On the basis of the above technical solution, optionally, the apparatus further includes a target detection model training module, including:
the training sample data acquisition unit is used for acquiring training sample data, wherein the training sample data comprises a sample image, and the type and the position center point of a sample object contained in the sample image;
the training sample data input unit is used for inputting training sample data into a target detection model to be trained, and determining a point matched with the position central point as a target anchor central point in each point of a characteristic diagram matrix output by the sub-detection module corresponding to the size of the sample object;
the target detection model training unit is used for acquiring a prediction result of the target detection model and determining a loss function value according to the prediction result, the type of the sample object and a target anchor central point; and reversely adjusting the weight parameters in the target detection model according to the loss function values.
On the basis of the above technical solution, optionally, the training sample data input unit is specifically configured to:
and respectively determining the distance value between the position center point and each point of the characteristic diagram matrix output by the sub-detection module corresponding to the size of the sample object, and determining the point corresponding to the minimum distance value as the center point of the target anchor.
On the basis of the above technical solution, optionally, the target detection model training unit includes:
the loss function value determining subunit is used for determining a loss function value aiming at each element in the prediction result according to the prediction result, the type of the sample object and the target anchor central point;
and the loss function value summation subunit is used for adding the loss function values respectively corresponding to the elements to obtain the final loss function value.
On the basis of the above technical solution, optionally, the prediction result includes a confidence element, a prediction position element, and a prediction category element; a loss function value determining subunit, configured to:
determining a sigmoid focal loss function value corresponding to the confidence element according to the prediction result and the target anchor center point;
determining a smooth L1 (moderated minimum absolute value deviation) loss function value corresponding to the predicted position element according to the prediction result and the position information of each point of the sample object;
and determining a cross entropy loss function value corresponding to the prediction category element according to the prediction result and the type of the sample object.
The detection device of the target object provided by the embodiment of the invention can execute the method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the detection apparatus for the target object, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example four
Fig. 7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. FIG. 7 illustrates a block diagram of an exemplary electronic device 12 suitable for use in implementing embodiments of the present invention. The electronic device 12 shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in FIG. 7, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, and commonly referred to as a "hard drive"). Although not shown in FIG. 7, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing by executing programs stored in the system memory 28, for example, to implement a target object detection method provided by any of the embodiments of the present invention.
EXAMPLE five
The present embodiment provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing a method of detecting a target object according to any embodiment of the present invention, the method comprising:
compressing the acquired three-dimensional point cloud data into an image to be detected under a bird's-eye view;
inputting an image to be detected into a pre-trained target detection model, and determining the type and position of an object contained in the image to be detected according to the output result of the target detection model;
the target detection model comprises a backbone network and a detector, wherein the detector comprises at least two sub-detection modules connected in parallel; the backbone network is used for extracting features of the image to be detected, each sub-detection module is used for detecting the type and position of objects of the size corresponding to that sub-detection module based on the feature map matrix extracted by the backbone network, and different sub-detection modules correspond to different sizes.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer-readable storage medium may be, for example but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It will be understood by those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of computing devices, and they may optionally be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device, or they may each be fabricated as a separate integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (13)

1. A method of detecting a target object, comprising:
compressing the acquired three-dimensional point cloud data into an image to be detected under a bird's-eye view;
inputting the image to be detected into a pre-trained target detection model, and determining the type and the position of an object contained in the image to be detected according to an output result of the target detection model;
the target detection model comprises a backbone network and a detector, wherein the detector comprises at least two sub-detection modules connected in parallel; the backbone network is used for extracting features from the image to be detected; each sub-detection module is used for detecting, based on the feature map matrix extracted by the backbone network, the type and position of objects whose size corresponds to that sub-detection module; and different sub-detection modules correspond to different sizes.
2. The method of claim 1, wherein each of the sub-detection modules comprises a dimension reduction module and a rectification module; the dimension reduction module is used for performing dimension reduction on the feature map matrix output by the backbone network and outputting the dimension-reduced feature map matrix; the rectification module is used for converting the feature map matrix output by the dimension reduction module into a feature vector; and the size of the dimension-reduced feature map matrix output by each dimension reduction module is inversely related to the size corresponding to the sub-detection module to which that dimension reduction module belongs;
the detector further comprises a first merging module, which is used for merging the feature vectors output by the rectification modules to obtain the output result of the target detection model.
3. The method of claim 2, wherein the at least two parallel sub-detection modules comprise a first sub-detection module, a second sub-detection module, and a third sub-detection module; wherein:
the first sub-detection module corresponds to a first size, the second sub-detection module corresponds to a second size, the third sub-detection module corresponds to a third size, the first size is larger than the second size, and the second size is larger than the third size.
4. The method of claim 3, wherein the detector further comprises a sharing module, which is used for downsampling the feature map matrix output by the backbone network and outputting the result to the first sub-detection module, the second sub-detection module, and the third sub-detection module;
the first sub-detection module comprises: a first downsampling module, used for downsampling the feature map matrix output by the sharing module and outputting the result; a first dimension reduction module, used for performing dimension reduction on the feature map matrix output by the first downsampling module to obtain a feature map matrix of a fourth size; and a first rectification module, used for converting the feature map matrix of the fourth size output by the first dimension reduction module into a feature vector and outputting the feature vector;
the second sub-detection module comprises: a second dimension reduction module, used for performing dimension reduction on the feature map matrix output by the sharing module to obtain a feature map matrix of a fifth size; and a second rectification module, used for converting the feature map matrix of the fifth size output by the second dimension reduction module into a feature vector and outputting the feature vector;
the third sub-detection module comprises: a preprocessing module, used for extracting features from the feature map matrix output by the backbone network and outputting the result to a second merging module; an upsampling module, used for upsampling the feature map matrix output by the sharing module and outputting the result to the second merging module; the second merging module, used for merging the feature map matrix output by the preprocessing module with the feature map matrix output by the upsampling module and outputting the merged feature map matrix to a third dimension reduction module; the third dimension reduction module, used for performing dimension reduction on the feature map matrix output by the second merging module to obtain a feature map matrix of a sixth size; and a third rectification module, used for converting the feature map matrix of the sixth size output by the third dimension reduction module into a feature vector and outputting the feature vector; wherein the fourth size is smaller than the fifth size, and the fifth size is smaller than the sixth size.
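For illustration only, claims 3-4 can be read as the following sketch, in which the coarsest feature map serves the largest objects and the finest serves the smallest. This is an assumption about one possible arrangement: channel counts, kernel sizes, and the upsampling mode are placeholders rather than values disclosed in the patent.

```python
import torch
import torch.nn as nn

class ThreeBranchDetector(nn.Module):
    """Illustrative three-branch detector: coarser feature maps for larger
    objects, finer feature maps for smaller objects."""
    def __init__(self, c):
        super().__init__()
        self.share = nn.Conv2d(c, c, 3, stride=2, padding=1)   # shared downsampling module
        # first branch (largest objects): extra downsampling, smallest map
        self.down1 = nn.Conv2d(c, c, 3, stride=2, padding=1)
        self.reduce1 = nn.Conv2d(c, 32, 1)
        # second branch (medium objects): works on the shared map directly
        self.reduce2 = nn.Conv2d(c, 32, 1)
        # third branch (smallest objects): upsampled shared map merged with a
        # preprocessed backbone map, largest resolution
        self.pre = nn.Conv2d(c, c, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.reduce3 = nn.Conv2d(2 * c, 32, 1)

    def forward(self, backbone_feat):
        shared = self.share(backbone_feat)
        v1 = self.reduce1(self.down1(shared)).flatten(1)        # fourth (smallest) size
        v2 = self.reduce2(shared).flatten(1)                    # fifth size
        merged = torch.cat([self.pre(backbone_feat), self.up(shared)], dim=1)
        v3 = self.reduce3(merged).flatten(1)                    # sixth (largest) size
        return torch.cat([v1, v2, v3], dim=1)                   # first merging module
```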
5. The method of claim 4, wherein the sharing module, the first downsampling module, the first dimension reduction module, the second dimension reduction module, the preprocessing module, the upsampling module, and the third dimension reduction module each comprise at least one depthwise separable convolution.
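For reference, a depthwise separable convolution factorizes a standard convolution into a per-channel depthwise convolution followed by a 1x1 pointwise convolution, which is how the modules above can keep their parameter count and computation low. A generic sketch of this standard building block, not specific to the patent:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one filter per input channel) followed by a
    1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```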
6. The method of claim 1, wherein the backbone network adopts an Xception network structure, the number of channels in the Xception network structure is less than 728, and the number of repetitions of the middle flow module in the Xception network structure is less than 8.
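The standard Xception middle flow repeats a block of three 728-channel separable convolutions eight times; this claim covers a slimmed variant. The sketch below is a hypothetical configuration: the values 256 and 4 are placeholders that merely satisfy the "fewer than 728 channels" and "fewer than 8 repetitions" conditions, not numbers disclosed in the patent.

```python
import torch.nn as nn

def separable_conv(channels):
    """Depthwise 3x3 convolution followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
        nn.Conv2d(channels, channels, 1, bias=False),
        nn.BatchNorm2d(channels),
    )

class SlimMiddleFlow(nn.Module):
    """Xception-style middle flow with fewer channels and fewer repetitions."""
    def __init__(self, channels=256, repeats=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.ReLU(), separable_conv(channels),
                          nn.ReLU(), separable_conv(channels),
                          nn.ReLU(), separable_conv(channels))
            for _ in range(repeats)
        )

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)   # residual connection around every repeated block
        return x
```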
7. The method according to any one of claims 1-6, wherein the training method of the target detection model comprises:
acquiring training sample data, wherein the training sample data comprises a sample image, and the type and the position center point of a sample object contained in the sample image;
inputting the training sample data into a target detection model to be trained, and determining, among the points of the feature map matrix output by the sub-detection module corresponding to the size of the sample object, the point that matches the position center point as the target anchor center point;
obtaining a prediction result of the target detection model, and determining a loss function value according to the prediction result, the type of the sample object, and the target anchor center point; and adjusting the weight parameters in the target detection model through back-propagation according to the loss function value.
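For illustration only, the training procedure of claim 7 maps onto a standard gradient-descent step; "adjusting the weight parameters through back-propagation" corresponds to the backward pass. The sketch assumes a PyTorch-style model, optimizer, and a loss function such as the one sketched after claim 10.

```python
def train_step(model, optimizer, sample_image, targets, loss_fn):
    """One training iteration: forward pass, loss against the matched target
    anchor center points, back-propagation, and weight update."""
    prediction = model(sample_image)
    loss = loss_fn(prediction, targets)
    optimizer.zero_grad()
    loss.backward()        # back-propagate to adjust the weight parameters
    optimizer.step()
    return loss.detach()
```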
8. The method according to claim 7, wherein determining, among the points of the feature map matrix output by the sub-detection module corresponding to the size of the sample object, the point that matches the position center point as the target anchor center point comprises:
determining a distance value between the position center point and each point of the feature map matrix output by the sub-detection module corresponding to the size of the sample object, and determining the point corresponding to the minimum distance value as the target anchor center point.
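For illustration only, the nearest-point matching of claim 8 reduces to an argmin over Euclidean distances. The sketch assumes the feature-map points have already been mapped into the same coordinate system as the labelled center point; the function name is hypothetical.

```python
import numpy as np

def match_anchor_center(center_xy, grid_points_xy):
    """Return the index of the feature-map point closest to the labelled
    object center; that point becomes the target anchor center point.

    center_xy:      (2,) labelled position center point in image coordinates
    grid_points_xy: (N, 2) coordinates of the feature-map points of the
                    sub-detection module matching the object's size
    """
    distances = np.linalg.norm(grid_points_xy - center_xy, axis=1)
    return int(np.argmin(distances))
```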
9. The method of claim 7, wherein determining a loss function value according to the prediction result, the type of the sample object, and the target anchor center point comprises:
determining, for each element in the prediction result, a loss function value according to the prediction result, the type of the sample object, and the target anchor center point; and
summing the loss function values corresponding to the respective elements to obtain a final loss function value.
10. The method according to claim 9, wherein the prediction result comprises a confidence element, a prediction position element and a prediction category element;
determining, for each element in the prediction result, a loss function value according to the prediction result, the type of the sample object, and the target anchor center point comprises:
determining a sigmoid focal loss function value corresponding to the confidence element according to the prediction result and the target anchor center point;
determining a smooth L1 (smoothed least absolute deviation) loss function value corresponding to the predicted position element according to the prediction result and the position information of each point of the sample object;
and determining a cross entropy loss function value corresponding to the prediction category element according to the prediction result and the type of the sample object.
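For illustration only, the three per-element losses of claim 10, summed as in claim 9, could be computed as below. The sketch assumes PyTorch with torchvision's sigmoid_focal_loss available; the tensor shapes and reduction modes are placeholders rather than details from the patent.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(pred_conf, pred_box, pred_cls, gt_conf, gt_box, gt_cls):
    """Sum of the three per-element losses described in claims 9-10."""
    conf_loss = sigmoid_focal_loss(pred_conf, gt_conf, reduction='mean')  # confidence element
    box_loss = F.smooth_l1_loss(pred_box, gt_box)                         # predicted position element
    cls_loss = F.cross_entropy(pred_cls, gt_cls)                          # prediction category element
    return conf_loss + box_loss + cls_loss
```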
11. An apparatus for detecting a target object, comprising:
an image-to-be-detected acquisition module, used for compressing the acquired three-dimensional point cloud data into an image to be detected under a bird's-eye view;
an image-to-be-detected input module, used for inputting the image to be detected into a pre-trained target detection model and determining the type and position of an object contained in the image to be detected according to an output result of the target detection model;
wherein the target detection model comprises a backbone network and a detector, the detector comprises at least two sub-detection modules connected in parallel; the backbone network is used for extracting features from the image to be detected; each sub-detection module is used for detecting, based on the feature map matrix extracted by the backbone network, the type and position of objects whose size corresponds to that sub-detection module; and different sub-detection modules correspond to different sizes.
12. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of detecting a target object according to any one of claims 1-10.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of detecting a target object according to any one of claims 1 to 10.
CN202011196088.0A 2020-10-30 2020-10-30 Target object detection method, device, equipment and storage medium Active CN113762003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011196088.0A CN113762003B (en) 2020-10-30 2020-10-30 Target object detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113762003A true CN113762003A (en) 2021-12-07
CN113762003B CN113762003B (en) 2024-03-05

Family

ID=78785922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011196088.0A Active CN113762003B (en) 2020-10-30 2020-10-30 Target object detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113762003B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190025852A1 (en) * 2017-07-19 2019-01-24 Symbol Technologies, Llc Methods and apparatus to coordinate movement of automated vehicles and freight dimensioning components
CN111144242A (en) * 2019-12-13 2020-05-12 中国科学院深圳先进技术研究院 Three-dimensional target detection method and device and terminal
CN111191600A (en) * 2019-12-30 2020-05-22 深圳元戎启行科技有限公司 Obstacle detection method, obstacle detection device, computer device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI MINGXI et al.: "A Survey of Vehicle Object Detection Algorithms in Computer Vision", Computer Engineering and Applications, vol. 55, no. 24, pages 20-28 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495042A (en) * 2022-01-27 2022-05-13 北京百度网讯科技有限公司 Target detection method and device
CN114495042B (en) * 2022-01-27 2023-08-29 北京百度网讯科技有限公司 Target detection method and device
CN114387346A (en) * 2022-03-25 2022-04-22 阿里巴巴达摩院(杭州)科技有限公司 Image recognition and prediction model processing method, three-dimensional modeling method and device
CN116189132A (en) * 2023-04-23 2023-05-30 深圳佑驾创新科技有限公司 Training method for target detection model of road information, target detection method and device
CN116189132B (en) * 2023-04-23 2023-09-29 深圳佑驾创新科技股份有限公司 Training method for target detection model of road information, target detection method and device

Also Published As

Publication number Publication date
CN113762003B (en) 2024-03-05

Similar Documents

Publication Publication Date Title
US11379699B2 (en) Object detection method and apparatus for object detection
Yang et al. Pixor: Real-time 3d object detection from point clouds
CN113762003B (en) Target object detection method, device, equipment and storage medium
CN112287860B (en) Training method and device of object recognition model, and object recognition method and system
WO2021052283A1 (en) Method for processing three-dimensional point cloud data and computing device
CN112016638B (en) Method, device and equipment for identifying steel bar cluster and storage medium
CN113761999B (en) Target detection method and device, electronic equipment and storage medium
US20220156483A1 (en) Efficient three-dimensional object detection from point clouds
CN115082674A (en) Multi-mode data fusion three-dimensional target detection method based on attention mechanism
CN112287859A (en) Object recognition method, device and system, computer readable storage medium
CN115272416A (en) Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion
CN112395962A (en) Data augmentation method and device, and object identification method and system
CN113095152A (en) Lane line detection method and system based on regression
CN115187844A (en) Image identification method and device based on neural network model and terminal equipment
CN113762508A (en) Training method, device, equipment and medium for image classification network model
KR20210098515A (en) Target detection, intelligent driving method, apparatus, device and storage medium
CN116844129A (en) Road side target detection method, system and device for multi-mode feature alignment fusion
CN115100741A (en) Point cloud pedestrian distance risk detection method, system, equipment and medium
CN115147809A (en) Obstacle detection method, device, equipment and storage medium
CN111709377B (en) Feature extraction method, target re-identification method and device and electronic equipment
CN116953702A (en) Rotary target detection method and device based on deduction paradigm
Wu et al. Detection algorithm for dense small objects in high altitude image
CN113721240B (en) Target association method, device, electronic equipment and storage medium
CN114429631B (en) Three-dimensional object detection method, device, equipment and storage medium
CN116263504A (en) Vehicle identification method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant