WO2020199834A1 - Object detection method and apparatus, and network device and storage medium - Google Patents

Object detection method and apparatus, and network device and storage medium

Info

Publication number
WO2020199834A1
WO2020199834A1 (PCT/CN2020/077721)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
area
network
information
candidate object
Prior art date
Application number
PCT/CN2020/077721
Other languages
French (fr)
Chinese (zh)
Inventor
杨泽同
孙亚楠
贾佳亚
戴宇荣
沈小勇
Original Assignee
Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Publication of WO2020199834A1 publication Critical patent/WO2020199834A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/255: Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06V 20/00: Scenes; scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/60: Type of objects
    • G06V 20/64: Three-dimensional objects
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/30: Subject of image; context of image processing
    • G06T 2207/30248: Vehicle exterior or interior
    • G06T 2207/30252: Vehicle exterior; vicinity of vehicle
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/06: Recognition of objects for industrial automation

Definitions

  • This application relates to the field of artificial intelligence technology, specifically to object detection technology.
  • Object detection refers to determining the location and category of objects in a scene.
  • Object detection technology has been widely used in various scenarios, such as autonomous driving and drones.
  • Current object detection schemes generally collect scene images, extract features from the scene images, and then determine the position and category of objects in the scene image based on the extracted features.
  • However, current object detection schemes suffer from problems such as low detection accuracy, especially in 3D object detection scenes.
  • the embodiments of the present application provide an object detection method, device, network device, and storage medium, which can improve the accuracy of object detection.
  • An embodiment of the present application provides an object detection method, executed by a network device, which includes: detecting foreground points from a point cloud of a scene; constructing candidate object areas corresponding to the foreground points based on the foreground points and a predetermined size, and obtaining initial positioning information of the candidate object areas; performing feature extraction on all points in the point cloud based on a point cloud network, to obtain a feature set corresponding to the point cloud; constructing area feature information of the candidate object areas based on the feature set; predicting the type and positioning information of the candidate object areas based on an area prediction network and the area feature information, to obtain the predicted type and predicted positioning information of the candidate object areas; and optimizing the candidate object areas based on their initial positioning information, predicted type, and predicted positioning information, to obtain the target object detection area and the positioning information of the target object detection area.
  • An embodiment of the present application also provides an object detection device, including:
  • a detection unit, configured to detect foreground points from a point cloud of a scene;
  • an area construction unit, configured to construct candidate object areas corresponding to the foreground points based on the foreground points and a predetermined size, to obtain initial positioning information of the candidate object areas;
  • a feature extraction unit, configured to perform feature extraction on all points in the point cloud based on a point cloud network, to obtain a feature set corresponding to the point cloud;
  • a feature construction unit, configured to construct area feature information of the candidate object areas based on the feature set;
  • a prediction unit, configured to predict the type and positioning information of the candidate object areas based on an area prediction network and the area feature information, to obtain the predicted type and predicted positioning information of the candidate object areas;
  • an optimization unit, configured to optimize the candidate object areas based on the initial positioning information, predicted type, and predicted positioning information of the candidate object areas, to obtain the target object detection area and the positioning information of the target object detection area.
  • An embodiment of the present application also provides a network device, including a memory and a processor; the memory stores multiple instructions, and the processor loads the instructions in the memory to execute the steps in any object detection method provided in the embodiments of the present application.
  • An embodiment of the present application further provides a storage medium that stores a plurality of instructions, the instructions being suitable for loading by a processor to execute the steps in any object detection method provided in the embodiments of the present application.
  • embodiments of the present application also provide a computer program product, including instructions, which when run on a computer, cause the computer to execute the steps in any object detection method provided in the embodiments of the present application.
  • The embodiment of the present application can detect foreground points from the point cloud of a scene; construct candidate object areas corresponding to the foreground points based on the foreground points and a predetermined size, and determine the initial positioning information of the candidate object areas; perform feature extraction on all points in the point cloud to obtain the feature set corresponding to the point cloud; construct the area feature information of the candidate object areas based on the feature set; predict the type and positioning information of the candidate object areas based on the area prediction network and the area feature information, to obtain the predicted type and predicted positioning information of the candidate object areas; and optimize the candidate object areas based on their initial positioning information, predicted type, and predicted positioning information, to obtain the target object detection area and the positioning information of the target object detection area.
  • This solution uses the point cloud data of the scene for object detection, generates a candidate object area for every foreground point in the point cloud, and optimizes the candidate object areas based on their area features; therefore, it can greatly improve the accuracy of object detection, and for 3D object detection in particular the detection effect is significantly improved.
  • FIG. 1a is a schematic diagram of a scene of an object detection method provided by an embodiment of the present application.
  • Figure 1b is a flowchart of an object detection method provided by an embodiment of the present application.
  • Figure 1c is a schematic structural diagram of a point cloud network provided by an embodiment of the present application.
  • Figure 1d is a schematic diagram of the PointNet++ network structure provided by an embodiment of the present application.
  • Figure 1e is a schematic diagram of an object detection effect in an automatic driving scene provided by an embodiment of the present application.
  • Figure 2a is a schematic diagram of image semantic segmentation provided by an embodiment of the present application.
  • FIG. 2b is a schematic diagram of point cloud segmentation provided by an embodiment of the present application.
  • Figure 2c is a schematic diagram of candidate region generation provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of feature construction of candidate regions provided by an embodiment of the present application.
  • Figure 4a is a schematic structural diagram of a regional prediction network provided by an embodiment of the present application.
  • FIG. 4b is another schematic diagram of the structure of the regional prediction network provided by an embodiment of the present application.
  • FIG. 5a is a schematic diagram of another process of object detection provided by an embodiment of the present application.
  • FIG. 5b is an architecture diagram of object detection provided by an embodiment of the present application.
  • FIG. 5c is a schematic diagram of test experiment results provided by an embodiment of the present application.
  • Figure 6a is a schematic structural diagram of an object detection device provided by an embodiment of the present application.
  • FIG. 6b is another schematic structural diagram of the object detection device provided by an embodiment of the present application.
  • FIG. 6c is another schematic structural diagram of the object detection device provided by an embodiment of the present application.
  • FIG. 6d is another schematic structural diagram of the object detection device provided by an embodiment of the present application.
  • FIG. 6e is another schematic structural diagram of the object detection device provided by an embodiment of the present application.
  • Fig. 7 is a schematic structural diagram of a network device provided by an embodiment of the present application.
  • the embodiments of the present application provide an object detection method, device, network device, and storage medium.
  • The object detection device may be integrated in a network device, and the network device may be a server, a terminal, or another device; for example, the network device may include a vehicle-mounted device, a micro-processing box, and the like.
  • The so-called object detection refers to determining or recognizing the location and category of objects in a scene, for example, recognizing the category and location of objects in a road scene, such as street lights and vehicles and their locations.
  • An embodiment of the present application provides an object detection system including a network device, a collection device, and the like; the network device and the collection device are communicatively connected, for example, through a wired or wireless network.
  • the network device and the collection device may be integrated into one device.
  • the collection device can be used to collect point cloud data or image data of the scene.
  • the collection device can upload the collected point cloud data to a network device for processing.
  • The network device can be used for object detection. Specifically, it can detect foreground points from the point cloud of a scene; construct candidate object areas corresponding to the foreground points based on the foreground points and a predetermined size, to obtain the initial positioning information of the candidate object areas; perform feature extraction on all points in the point cloud based on the point cloud network, to obtain the feature set corresponding to the point cloud; construct the area feature information of the candidate object areas based on the feature set; predict the type and positioning information of the candidate object areas based on the area prediction network and the area feature information, to obtain the predicted type and predicted positioning information of the candidate object areas; and optimize the candidate object areas based on their initial positioning information, predicted type, and predicted positioning information, to obtain the target object detection area and its positioning information.
  • the detected objects can be identified in the scene image according to the location information.
  • the detected objects can be selected in the scene image in the form of a detection frame.
  • the type of the detected object may also be identified in the scene image.
  • the object detection device can be integrated in a network device.
  • The network device can be a server or a terminal, where the terminal can include a mobile phone, a tablet, a notebook computer, a personal computer (PC), a micro-processing terminal, and other equipment.
  • An object detection method provided by an embodiment of the present application may be executed by a processor of a network device. As shown in FIG. 1b, the specific process of the object detection method may be as follows:
  • A point cloud is a set of points representing the surface characteristics of a scene or target.
  • the points in the point cloud may include the position information of the points, such as three-dimensional coordinates, and may also include color information (RGB) or reflection intensity information (Intensity).
  • the point cloud can be detected by the principle of laser measurement or photogrammetry, for example, the point cloud of the object can be obtained by scanning with a laser scanner or a photographic scanner.
  • The principle of laser-based point cloud acquisition is as follows: when a laser beam irradiates the surface of an object, the reflected laser carries information such as position and distance. If the laser beam is scanned along a certain track, the reflected laser point information is recorded while scanning; because the scanning is extremely fine, a large number of laser points can be obtained, thereby forming a laser point cloud. Common point cloud formats include *.las, *.pcd, and *.txt.
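  • As a concrete illustration (not from the patent), a whitespace-separated *.txt point cloud export can be loaded into an N x 4 array of (x, y, z, intensity) values; the file name below is hypothetical.

```python
import numpy as np

def load_point_cloud_txt(path: str) -> np.ndarray:
    """Load a point cloud stored one point per line as: x y z intensity."""
    points = np.loadtxt(path, dtype=np.float32)
    assert points.ndim == 2 and points.shape[1] >= 3, "expected one point per row"
    return points

# cloud = load_point_cloud_txt("scene_000001.txt")  # hypothetical file; shape (N, 4)
```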
  • The point cloud data of the scene can be collected by the network device itself, collected by other devices and then obtained by the network device from them, or retrieved from a network database, and so on.
  • There can be many kinds of scenes, for example, a road scene in automatic driving, an aerial scene in a drone flight, and so on.
  • Foreground points are defined relative to background points.
  • A scene can be divided into a background and a foreground.
  • The points in the background are called background points, and the points in the foreground are called foreground points.
  • The point cloud of the scene can be semantically segmented to identify the foreground points in the point cloud of the scene.
  • For example, the point cloud of the scene can be semantically segmented directly to obtain the foreground points in the point cloud.
  • Semantic Segmentation refers to classifying each point in the point cloud of a scene so as to identify points belonging to a certain type.
  • For example, 2D or 3D semantic segmentation can be used to perform semantic segmentation on the point cloud.
  • Alternatively, the image of the scene may be semantically segmented to obtain foreground pixels, and the foreground pixels are then mapped to the point cloud to obtain the foreground points.
  • the step of "detecting the front scenic spot from the point cloud of the scene" may include:
  • the point corresponding to the foreground pixel in the point cloud of the scene is determined as the front scenic spot.
  • Specifically, the foreground pixels can be mapped to the point cloud of the scene to obtain the target points in the point cloud corresponding to the foreground pixels. For example, the mapping can be realized based on the mapping relationship between pixels in the image and points in the point cloud (such as a location mapping relationship), and the target points having a mapping relationship with the foreground pixels are determined as foreground points.
  • the points in the point cloud can be projected into the image of the scene.
  • the points in the point cloud can be projected into the image of the scene through the mapping relationship matrix or transformation matrix between the point cloud and the pixels.
  • The segmentation result corresponding to a point in the image (such as foreground pixel or background pixel) is used as the segmentation result of that point, and based on this segmentation result, it is determined whether the point is a foreground point, so that each foreground point is determined from the point cloud. Specifically, when the segmentation result of a point is a foreground pixel, the point is determined to be a foreground point.
  • the semantic segmentation in the embodiments of the present application can be implemented by a segmentation network based on deep learning.
  • For example, an Xception-based DeepLabV3 segmentation network can be used, and the image of the scene can be segmented through this network to obtain foreground pixels, such as the foreground pixels of cars, pedestrians, and cyclists in autonomous driving. Each point in the point cloud is then projected into the image of the scene, and its corresponding segmentation result in the image is used as the segmentation result of that point, thereby determining the foreground points in the point cloud. This method can accurately detect the foreground points in the point cloud.
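  • The projection step above can be sketched as follows, assuming a 3x4 camera projection matrix P from the sensor calibration and a segmentation mask in which nonzero pixels are foreground; the function and label conventions are illustrative, not taken from the patent.

```python
import numpy as np

def label_points_from_mask(points_xyz, P, seg_mask):
    """points_xyz: (N, 3); P: (3, 4) projection matrix; seg_mask: (H, W) labels."""
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1), dtype=points_xyz.dtype)])
    proj = homo @ P.T                          # (N, 3) homogeneous image coordinates
    z = proj[:, 2]
    front = z > 1e-6                           # keep only points in front of the camera
    u = np.round(proj[front, 0] / z[front]).astype(int)
    v = np.round(proj[front, 1] / z[front]).astype(int)
    h, w = seg_mask.shape
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.zeros(n, dtype=np.int64)
    idx = np.flatnonzero(front)[ok]
    labels[idx] = seg_mask[v[ok], u[ok]]       # copy the pixel's segmentation result
    return labels                              # nonzero entries mark foreground points
```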
  • The embodiment of the present application may construct the object area corresponding to each foreground point based on the foreground point and the predetermined size, and use the object area corresponding to the foreground point as a candidate object area.
  • the candidate object area may be a two-dimensional area, that is, a 2D area, or a three-dimensional area, that is, a 3D area, which may be determined according to actual requirements.
  • the predetermined size can be set according to actual needs, and the predetermined size can include predetermined size parameters, for example, length l*width w in the 2D area, and length l*width w*height h in the 3D area.
  • Specifically, the foreground point can be used as the center point, and the candidate object area corresponding to the foreground point can be generated according to a predetermined size.
  • the location information of the candidate object area may include position information, size information, and so on of the candidate object area.
  • The position information of the candidate object area may be represented by the position information of a reference point in the candidate object area, and the reference point may be set according to actual requirements; for example, the center point of the candidate object area can be used as the reference point.
  • the position information of the candidate object area may include the 3D coordinates of the center point such as (x, y, z).
  • the size information of the candidate object area may include the size parameter of the area.
  • For a 2D area, the size information of the candidate object area may include length l * width w;
  • for a 3D area, it may include length l * width w * height h, and so on.
  • In object detection, the orientation of the object is also important reference information. Therefore, in some embodiments, the positioning information of the candidate object area may also include the orientation of the candidate object area, such as forward, backward, upward, downward, etc.; the orientation of the candidate object area can indicate the orientation of the object in the scene. In practical applications, the orientation of the candidate object area can be expressed as an angle; for example, two orientations can be defined, 0° and 90° respectively.
  • The candidate object area may be identified in the form of a detection frame, for example, a 2D detection frame or a 3D detection frame.
  • For example, a 2D segmentation network can be used to semantically segment the image to obtain the image segmentation result (including foreground pixels, etc.); then, referring to Figure 2b, the image segmentation result is mapped to the point cloud to obtain the point cloud segmentation result (including the foreground points). Then, with each foreground point as the center, a candidate object area is generated.
  • A schematic diagram of candidate object area generation is shown in Figure 2c: with each foreground point as the center, a 3D detection frame of artificially specified size is generated as a candidate object area.
  • The embodiment of the present application uses two orientations, 0° and 90° respectively.
  • In this way, the embodiment of the application can generate a candidate object area, such as a 3D candidate object detection frame, for each foreground point.
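  • A minimal sketch of this point-based proposal generation is given below: every foreground point becomes the center of a 3D box of predetermined size, once per orientation (0° and 90°, as stated above). The size values are placeholders, not values from the patent.

```python
import numpy as np

def generate_proposals(fg_points, size=(3.9, 1.6, 1.56), angles=(0.0, np.pi / 2)):
    """fg_points: (M, 3) foreground points.
    Returns (M * len(angles), 7) boxes encoded as (x, y, z, l, w, h, angle)."""
    l, w, h = size                             # placeholder predetermined size
    boxes = [[x, y, z, l, w, h, a]
             for (x, y, z) in fg_points
             for a in angles]                  # one box per point per orientation
    return np.asarray(boxes, dtype=np.float32)
```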
  • the point cloud network may be a network based on deep learning, for example, it may be a point cloud network such as PointNet and PointNet++.
  • the time sequence between step 103 and step 102 is not limited by the sequence number, and step 102 may be executed before step 103, or step 103 may be executed before step 102, or simultaneously.
  • all points in the point cloud can be input to the point cloud network, and the point cloud network performs feature extraction on the input points to obtain a feature set corresponding to the point cloud.
  • the point cloud network may include a first sampling network and a second sampling network; wherein, the first sampling network is connected to the second sampling network.
  • the first sampling network can be called an encoder
  • the second sampling network can be a decoder.
  • the feature downsampling process is performed on all points in the point cloud through the first sampling network to obtain the initial feature of the point cloud; the initial feature is upsampled through the second sampling network to obtain the feature set of the point cloud.
  • the first sampling network includes a plurality of set abstraction layers (SA) connected in sequence
  • The second sampling network includes a plurality of feature propagation (FP) layers connected in sequence, corresponding one-to-one with the set abstraction layers (SA) in the first sampling network.
  • the SA in the first sampling network corresponds to the FP in the second sampling network, and the number can be set according to actual needs.
  • the first sampling network and the second sampling network include three layers of SA and FP respectively.
  • For example, the first sampling network can include three downsampling steps (that is, the encoding stage includes three downsampling steps), with point counts of 1024, 256, and 64, respectively;
  • the second sampling network can include three upsampling steps (that is, the decoding stage includes three upsampling steps), with point counts of 256, 1024, and N, respectively.
  • the feature extraction process of the point cloud network is as follows:
  • All the points of the point cloud are input to the first sampling network; the points in the point cloud are then divided into local areas through each set abstraction (SA) layer in the first sampling network, and the features of the center point of each local area are extracted, to obtain the initial features of the point cloud.
  • For example, the output point cloud feature is 64 × 1024.
  • PointNet++ uses the idea of hierarchical feature extraction, and each round is called a set abstraction. It is divided into three parts: a sampling layer, a grouping layer, and a feature extraction layer. Consider the sampling layer first: in order to extract some relatively important center points from the dense point cloud, the farthest point sampling (FPS) method is adopted, although random sampling is also possible. Next is the grouping layer, which searches for the k nearest neighbors within a certain range of each center point extracted by the previous layer to form a patch. The feature extraction layer then performs convolution and pooling on these k points through a small PointNet network, and the obtained features are used as the features of this center point and sent to the next layer. In this way, the center points obtained at each layer are a subset of the center points of the previous layer, and as the depth increases, the number of center points decreases while each center point contains more and more information.
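  • The following is a generic numpy sketch of farthest point sampling, the center-point selection the sampling layer uses; it is a standard implementation of the technique, not code from the patent.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """points: (N, 3); returns the indices of k centers spread over the cloud."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=np.int64)
    chosen[0] = np.random.randint(n)           # arbitrary first center
    dist = np.full(n, np.inf)
    for i in range(1, k):
        diff = points - points[chosen[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(dist))       # pick the point farthest from all chosen
    return chosen
```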
  • the first sampling network in the embodiment of the present application is composed of multiple SA layers. At each level, a set of points is processed and abstracted to generate a new set with fewer elements.
  • The set abstraction layer consists of three key parts: a sampling layer, a grouping layer, and a point cloud network layer (PointNet layer).
  • sampling layer selects a set of points from the input points, which define the centroid of the local area.
  • grouping layer constructs a set of local regions by finding the "adjacent" points around the centroid.
  • The point cloud network layer uses a mini point cloud network (mini-PointNet) to encode the local area set into a feature vector.
  • the embodiment of the present application proposes an improved SA layer.
  • The grouping layer in the SA layer can use multi-scale grouping (MSG).
  • That is, the local features under each radius are extracted during grouping and then combined; the idea is to sample multi-scale features in the grouping layer and concatenate them.
  • For example, MSG is used in the first and second SA layers.
  • Single-scale grouping (SSG) may also be used in an SA layer, for example in the SA layer that produces the output.
  • The initial features of the point cloud can be input to the second sampling network, and the second sampling network performs up-sampling processing, such as residual up-sampling, on the initial features.
  • For example, the three FP layers of the second sampling network perform up-sampling processing on the 64 × 1024 features and then output N × 128 features.
  • the step of "upsampling the initial features through the second sampling network to obtain the feature set of the point cloud” includes:
  • the current input feature is up-sampled through the current feature propagation layer to obtain the feature set of the point cloud.
  • The previous layer of the current FP layer can be an SA layer or another FP layer, and its output features serve as part of the current input features.
  • For example, for the first FP layer: after the 64*1024 point cloud features are input to the first FP layer, the first FP layer determines the 64*1024 point cloud features and the 256*256 features that were input to the third SA layer as the current input features, performs up-sampling on them, and outputs the obtained features to the second FP layer.
  • The second FP layer takes the 256*128 features output by the previous FP layer and the 1024*128 features that were input to the second SA layer as the input features of the current layer, performs up-sampling on them, and obtains 1024*128 features that are input to the third FP layer.
  • The third FP layer uses the 1024*128 features output by the second FP layer and the N*4 features that were input to the first SA layer as the input features of the current layer, and performs up-sampling processing to output the final features of the point cloud.
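  • As an illustration of the skip-connected up-sampling just described, here is a generic numpy sketch of one PointNet++-style feature propagation step: features are interpolated from the coarse level by inverse-distance weighting of the three nearest neighbors and concatenated with the encoder's skip features. This is a common reading of an FP layer, not the patent's code, and the shared MLP that normally follows is omitted.

```python
import numpy as np

def feature_propagation(dense_xyz, coarse_xyz, coarse_feat, skip_feat, k=3):
    """dense_xyz: (N, 3) target points; coarse_xyz: (M, 3) source points;
    coarse_feat: (M, C1) source features; skip_feat: (N, C2) skip features.
    Returns (N, C1 + C2) up-sampled features."""
    d = np.linalg.norm(dense_xyz[:, None, :] - coarse_xyz[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]                       # k nearest coarse points
    nd = np.take_along_axis(d, idx, axis=1) + 1e-8
    w = (1.0 / nd) / (1.0 / nd).sum(axis=1, keepdims=True)   # inverse-distance weights
    interp = (coarse_feat[idx] * w[..., None]).sum(axis=1)   # (N, C1) interpolated
    return np.concatenate([interp, skip_feat], axis=1)       # fuse with skip connection
```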
  • feature extraction can be performed on all points in the point cloud to obtain a feature set of the point cloud, which prevents information loss and improves the accuracy of object detection.
  • For example, the features of some points can be selected from the feature set as the feature information of the candidate object area to which those points belong;
  • the position information of some points can also be used as feature information of the candidate object area to which they belong;
  • alternatively, the features and position information of some points can be assembled together to construct the area feature information.
  • the step of "constructing the region feature information of the candidate object region based on the feature set" may include:
  • the first part of feature information and the second part of feature information are fused to obtain the regional features of the candidate object region.
  • The number of target points and the selection method can be set according to actual needs; for example, a certain number of points can be selected in the candidate object area randomly or according to a certain selection method (such as selection based on the distance from the center point), for example, 512 points.
  • the feature of the target point can be extracted from the feature set of the point cloud, and the extracted feature of the target point is used as the first part of the feature information of the candidate object area (which can be represented by F1).
  • For example, if 512 target points are selected, the features of these 512 points can be extracted from the feature set of the point cloud to form the first part of the feature information, F1.
  • In some embodiments, the position information of the target points can be directly used as the second part of the feature information of the candidate object area (which can be represented by F2).
  • the step of "constructing the second part of the feature information of the candidate object region based on the position information of the target point" may include:
  • The position information of the target point may include the coordinate information of the target point, such as the 3D coordinates xyz, and the standardization method for the position information can be set according to actual needs.
  • For example, the position information of the target point can be adjusted based on the position information of the center point of the candidate object area; for example, the 3D coordinates of the center of the candidate object area can be subtracted from the 3D coordinates of the target point.
  • the first part of feature information and standardized location information are fused to obtain the fused feature information of the target point.
  • For example, the two can be fused using Concat (concatenation) to obtain the fused features (B, N, C+3).
  • the fusion feature can also be spatially transformed.
  • For the transformation, a spatial transformation network may be used, for example, a supervised spatial transformation network such as T-Net.
  • For example, the fused features (B, N, C+3) can be spatially transformed through T-Net to obtain the transformed coordinates (B, 3).
  • the normalized position value of the target point can be subtracted from the transformed position value to obtain the second partial feature F2 of the candidate object region.
  • the normalized 3D coordinates (B, N, 3) of the target point can be subtracted from the transformed 3D coordinates (B, 3) to obtain the second partial feature F2.
  • the geometric stability or spatial invariance of the position feature can be improved, thereby improving the accuracy of feature extraction.
  • the first part feature information and the second part feature information of each candidate object area can be obtained by the above method, and then the two parts of features are fused to obtain the area feature information of each candidate object area.
  • F1 and F2 can be concatenated (Concat) to obtain the connected features (B, N, C+3) of the candidate object region, and this feature is used as the regional feature of the candidate object region.
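  • A hedged sketch of the region-feature construction described above: target points are sampled inside a candidate box, their point cloud features are gathered (F1), their coordinates are normalized to the box center (F2), and the two are concatenated. The T-Net refinement of F2 is noted in a comment but stubbed out; all names are illustrative.

```python
import numpy as np

def build_region_feature(point_features, points_xyz, inside_idx, box_center,
                         num_target=512):
    """point_features: (N, C) per-point features; points_xyz: (N, 3);
    inside_idx: indices of points inside the candidate box; box_center: (3,)."""
    pick = np.random.choice(inside_idx, num_target,
                            replace=len(inside_idx) < num_target)
    f1 = point_features[pick]                  # (512, C) first-part features
    f2 = points_xyz[pick] - box_center         # (512, 3) normalized coordinates
    # In the full pipeline, a T-Net predicts a residual offset from the fused
    # (f1, f2) features that is further subtracted from f2 before fusion.
    return np.concatenate([f1, f2], axis=1)    # (512, C + 3) region feature
```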
  • The area prediction network can be used to predict the type and positioning information of the candidate object areas; for example, it can classify and locate the candidate object areas to obtain the predicted type and predicted positioning information of the candidate object areas.
  • The area prediction network can be a network based on deep learning.
  • the region prediction network can be trained from the point cloud or image of the sample object.
  • the predicted positioning information may include predicted position information such as 2D or 3D coordinates, dimensions such as length, width, and height.
  • it may also include predicted orientation information such as 0° or 90°.
  • the regional prediction network may include a feature extraction network, a classification network, and a regression network.
  • the classification network and the regression network are respectively connected to the feature extraction network. as follows:
  • the feature extraction network is used to perform feature extraction on input information, for example, perform feature extraction on the area feature information of the candidate object area to obtain the global feature information of the candidate object area.
  • the classification network is used to classify the area.
  • the candidate object area can be classified based on the global feature information of the candidate object area to obtain the prediction type of the candidate object area.
  • the regression network is used to locate the area, for example, to locate the candidate object area to obtain the predicted location information of the candidate object area. Because the regression network is used to predict the positioning, the output predicted positioning information can also be called regression information, such as predicted regression information.
  • the step of "predicting the type and location information of the candidate object area based on the area prediction network and area feature information to obtain the prediction type and predicted location information of the candidate object area” may include:
  • classifying the candidate object area based on the classification network and the global feature information, to obtain the predicted type of the candidate object area;
  • locating the candidate object area based on the regression network and the global feature information, to obtain the predicted positioning information of the candidate object area.
  • The feature extraction network in the embodiment of the present application may include a plurality of sequentially connected set abstraction (SA) layers. The classification network may include a plurality of sequentially connected fully connected (fc) layers; as shown in Figure 4b, it includes multiple fc layers for classification, such as cls-fc1, cls-fc2, and cls-pred. The regression network likewise includes a plurality of sequentially connected fully connected layers; as shown in Figure 4b, it includes multiple fc layers for regression, such as reg-fc1, reg-fc2, and reg-pred. In the embodiment of the present application, the numbers of SA layers and fc layers can be set according to actual requirements.
  • the process of extracting the global feature information of the region may include: sequentially performing feature extraction on the region feature information through each set abstraction layer in the feature extraction network to obtain the global feature information of the candidate object region.
  • For the structure of the set abstraction layer, refer to the introduction above.
  • The grouping in these SA layers can be single-scale, that is, SSG is used, to improve the accuracy and efficiency of global feature extraction.
  • For example, the area prediction network can perform feature extraction on the area feature information through three SA layers in turn; when the input is M × 131 features, features such as 128 × 128 and 32 × 256 are obtained after the three SA layers. After the SA feature extraction, the global feature information is obtained, which can then be input to the classification network and the regression network respectively.
  • The classification network uses the first two layers, cls-fc1 and cls-fc2, to reduce the dimensionality of the global feature information, performs classification prediction through the final cls-pred layer, and outputs the predicted type of the candidate object area.
  • The regression network uses the first two layers, reg-fc1 and reg-fc2, to reduce the dimensionality of the global feature information, and performs regression prediction through the final reg-pred layer to obtain the predicted positioning information of the candidate object area.
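  • The two-branch head named above (cls-fc1/cls-fc2/cls-pred and reg-fc1/reg-fc2/reg-pred) can be sketched in PyTorch as follows; the layer widths, class count, and activation choices are assumptions, and the SA-based feature extraction network is abstracted away. The 7-value regression output follows the (x, y, z, l, h, w, angle) parameterization given later in the text.

```python
import torch
import torch.nn as nn

class BoxPredictionHead(nn.Module):
    def __init__(self, feat_dim=256, num_classes=2, reg_dim=7):
        super().__init__()
        # classification branch: two fc layers reduce dimensionality,
        # then a prediction layer outputs the type
        self.cls = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes))
        # regression branch: predicts (x, y, z, l, h, w, angle)
        self.reg = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, reg_dim))

    def forward(self, global_feat):            # global_feat: (B, feat_dim)
        return self.cls(global_feat), self.reg(global_feat)

# head = BoxPredictionHead()
# cls_logits, reg_params = head(torch.randn(4, 256))
```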
  • The types of candidate object areas can be set according to actual needs; for example, according to whether there are objects in the area, areas can be divided into those containing objects and empty ones, or, according to quality, into high, medium, and low quality.
  • the type and positioning information of each candidate object area can be predicted.
  • the positioning information of the candidate object area may be adjusted based on the predicted positioning information first, and then the candidate object area may be filtered based on the prediction type.
  • the candidate object regions may be screened based on the prediction type first, and then the positioning information may be adjusted.
  • the step of "optimizing the candidate object area based on the initial positioning information, prediction type, and predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area" may include:
  • the initial location information of the filtered object area is optimized and adjusted to obtain the target object detection area and the location information of the target object detection area.
  • For example, the candidate object areas whose predicted type is "empty area" can be filtered out, and then, based on the predicted positioning information of the remaining candidate object areas, their initial positioning information is optimized and adjusted.
  • the positioning information optimization adjustment method can be adjusted based on the difference information between the predicted positioning information and the initial positioning information, for example, the difference in the 3D coordinates of the area, the difference in size, and so on.
  • Alternatively, optimal positioning information can be determined based on the predicted positioning information and the initial positioning information, and the positioning information of the candidate object area is then adjusted to this optimal positioning information; for example, optimal 3D coordinates and length, width, and height are determined for the area.
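  • A minimal sketch of this optimization step, under the assumption that class index 0 denotes an empty area and that the regression output is used directly as the refined box parameters (rather than as residuals added to the initial box):

```python
import numpy as np

def optimize_proposals(init_boxes, pred_types, pred_boxes, empty_class=0):
    """init_boxes, pred_boxes: (M, 7) boxes as (x, y, z, l, w, h, angle);
    pred_types: (M,) predicted class indices."""
    keep = pred_types != empty_class           # screen out empty candidate areas
    return pred_boxes[keep], keep              # refined boxes and the kept mask
```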
  • the object detection area can also be identified in the scene image based on the location information of the target object detection area.
  • The object detection method provided by the embodiment of the present application can accurately detect the position, size, and orientation of objects on the current road in an automatic driving scene, which facilitates decision-making and judgment in autonomous driving.
  • the object detection provided by the embodiments of the present application may be applicable to various scenarios, such as scenarios such as autonomous driving, drones, and security monitoring.
  • The embodiment of the present application can detect foreground points from the point cloud of a scene; construct the object areas corresponding to the foreground points based on the foreground points and the predetermined size, to obtain the initial positioning information of the candidate object areas; perform feature extraction on all points of the point cloud based on the point cloud network, to obtain the feature set corresponding to the point cloud; construct the area feature information of the candidate object areas based on the feature set; predict the type and positioning information of the candidate object areas based on the area prediction network and the area feature information, to obtain the predicted type and predicted positioning information of the candidate object areas; and optimize the candidate object areas based on their initial positioning information, predicted type, and predicted positioning information, to obtain the target object detection area and the positioning information of the target object detection area.
  • Using the point cloud data of the scene for object detection can improve the accuracy of object detection.
  • This solution can also generate candidate object areas for each foreground point in the point cloud, which can avoid information loss.
  • Since candidate object areas are generated for every foreground point, a corresponding candidate area will be generated for any object; therefore, detection is not affected by object scale changes or severe occlusion, which improves the effectiveness and success rate of object detection.
  • this solution can also optimize the candidate object region based on the region characteristics of the candidate object region; therefore, the accuracy and quality of object detection can be further improved.
  • In the following description, the object detection device being specifically integrated in a network device is taken as an example.
  • the network device can obtain a training set of the semantic segmentation network, which includes sample images labeled with pixel types (such as foreground pixels, background pixels, etc.).
  • The network device can train the semantic segmentation network based on the training set and a loss function.
  • the sample image can be semantically segmented through the semantic segmentation network to obtain foreground pixels of the sample image, and then the segmented pixel type and the labeled pixel type are converged based on the loss function to obtain the trained semantic segmentation network.
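  • As a generic illustration of this convergence step (not the patent's training code), a single training step can drive the segmentation output toward the labeled pixel types with a cross-entropy loss; seg_net and optimizer are assumed to be provided.

```python
import torch.nn.functional as F

def train_step(seg_net, optimizer, images, pixel_labels):
    """images: (B, 3, H, W); pixel_labels: (B, H, W) integer pixel types."""
    optimizer.zero_grad()
    logits = seg_net(images)                   # (B, num_classes, H, W)
    loss = F.cross_entropy(logits, pixel_labels)
    loss.backward()                            # converge predictions toward labels
    optimizer.step()
    return loss.item()
```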
  • the network device obtains a training set of the point cloud network, and the training set includes sample point clouds of sample objects or scenes.
  • the network device can train the point cloud network based on the sample point cloud training set.
  • The network device obtains the training set of the area prediction network, which may include sample point clouds labeled with object area types and positioning information; the area prediction network is trained with this training set. Specifically, the object area type and positioning information of the sample point cloud are predicted, the predicted type is converged toward the true type, and the predicted positioning information is converged toward the true positioning information, to obtain the trained area prediction network.
  • the foregoing network training may be performed by the network device itself, or it may be obtained by the network device after the training of other devices is completed. It should be understood that the network applied in the embodiment of the present application is not limited to training in the foregoing manner, and may also be trained in other manners.
  • An object detection method is provided, the specific process of which can be as follows:
  • the network device acquires an image and a point cloud of the scene.
  • The network device can obtain the scene image and the point cloud from an image acquisition device and a point cloud acquisition device, respectively.
  • the network device uses a semantic segmentation network to perform semantic segmentation on the image of the scene to obtain foreground pixels.
  • a road scene image can be collected first, and a 2D semantic segmentation network can be used to segment the scene image to obtain a segmentation result, including foreground pixels, background pixels, and so on.
  • The network device maps the foreground pixels to the point cloud of the scene to obtain the foreground points in the point cloud.
  • For example, an Xception-based DeepLabV3 can be used as the segmentation network, and the image of the scene can be segmented through this network to obtain foreground pixels, such as the foreground pixels of cars, pedestrians, and cyclists in autonomous driving. Each point in the point cloud is then projected into the image of the scene, and its corresponding segmentation result in the image is used as the segmentation result of that point, thereby obtaining the foreground points in the point cloud. This method can accurately detect the foreground points in the point cloud.
  • The network device constructs a three-dimensional candidate object area corresponding to each foreground point based on each foreground point and a predetermined size, and obtains the initial positioning information of the candidate object areas.
  • For example, each foreground point is used as a center point, and the three-dimensional candidate object area corresponding to that foreground point is generated according to a predetermined size.
  • the location information of the candidate object area may include position information, size information, and so on of the candidate object area.
  • Specifically, the candidate object area corresponding to a foreground point can be generated according to a predetermined size with the foreground point as the center point, that is, point-based proposal generation.
  • the network device performs feature extraction on all points in the point cloud through the point cloud network to obtain a feature set corresponding to the point cloud.
  • all points in the point cloud (B, N, 4) can be input to PointNet++, and the feature of the point cloud can be extracted through PointNet++ to obtain (B, N, C).
  • the network device constructs regional feature information of the candidate object region based on the feature set.
  • the network device can generate the area feature information of the candidate object area based on the feature set of the point cloud (ie, Proposal Feature Generation).
  • The network device selects multiple target points in the candidate object area; extracts the features of the target points from the feature set to obtain the first part of the feature information of the candidate object area; standardizes the position information of the target points to obtain the standardized position information of the target points; fuses the first part of the feature information with the standardized position information to obtain the fused feature information of the target points; spatially transforms the fused feature information of the target points to obtain the transformed position information of the target points; adjusts the standardized position information of the target points based on the transformed position information to obtain the second part of the feature information of the candidate object area; and fuses the first part of the feature information with the second part of the feature information to obtain the area features of the candidate object area.
  • the region feature generation can refer to the above-mentioned embodiment and the description of FIG. 3.
  • the network device predicts the type and location information of the candidate object area based on the area prediction network and the area feature information, and obtains the prediction type and predicted location information of the candidate object area.
  • For example, the candidate areas can be classified (cls) and regressed (reg) through the box prediction network (Box Prediction Net), so as to predict the type and regression parameters of the candidate object areas.
  • The regression parameters are the predicted positioning information, including parameters such as the three-dimensional coordinates, length, width, height, and orientation, i.e., (x, y, z, l, h, w, angle).
  • the network device optimizes the candidate object area based on the initial positioning information of the candidate object area, the prediction type of the candidate object area, and the predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area.
  • For example, the network device can filter the candidate object areas based on their predicted types to obtain the filtered object areas; then, according to the predicted positioning information of the filtered object areas, the initial positioning information of the filtered object areas is optimized and adjusted to obtain the optimized object detection area and its positioning information.
  • the object detection area can also be identified in the scene image based on the location information of the target object detection area.
  • The object detection method provided by the embodiment of the present application can accurately detect the position, size, and orientation of objects on the current road in an automatic driving scene, which facilitates decision-making and judgment in autonomous driving.
  • The embodiment of the present application may use the entire point cloud as input, use a PointNet++ structure to generate a feature for each point in the point cloud, then use each point in the point cloud as an anchor point to generate candidate areas, and finally use the feature of each point as input to optimize the candidate areas and generate the final detection result.
  • the algorithm capabilities provided by the embodiments of this application have been tested on some data sets.
  • For example, the capabilities of the algorithm provided by the embodiments of this application have been tested on an open-source autonomous driving data set, namely the KITTI data set.
  • The KITTI data set is an autonomous driving data set that contains objects of various sizes and distances at the same time and is therefore very challenging.
  • The algorithm of the embodiment of this application surpasses all existing 3D object detection algorithms on KITTI, reaching a new state of the art, and is at the same time far superior to the previous best algorithm on the hard difficulty set.
  • The point clouds of 7481 training images and 7518 test images covering three categories (cars, pedestrians, and cyclists) are tested.
  • In extensive experiments, the average precision (AP) is compared with that of other methods.
  • The other methods include MV3D (Multi-View 3D object detection), AVOD (Aggregate View Object Detection), VoxelNet, F-PointNet (Frustum PointNet), and AVOD-FPN (AVOD with a feature pyramid network).
  • Figure 5c shows the test results.
  • the accuracy of the object detection method (Ours in FIG. 5c) provided by the embodiment of the present application is significantly higher than other methods.
  • an embodiment of the present application also provides an object detection device.
  • the object detection device can be integrated in a network device.
  • The network device can be a server, a terminal, a vehicle-mounted device, equipment such as a drone, or a miniature processing box.
  • the object detection device may include a detection unit 601, a region construction unit 602, a feature extraction unit 603, a feature construction unit 604, a prediction unit 605, and an optimization unit 606, as follows:
  • The detection unit 601 is configured to detect foreground points from the point cloud of a scene;
  • the area construction unit 602 is configured to construct the candidate object areas corresponding to the foreground points based on the foreground points and a predetermined size, and determine the initial positioning information of the candidate object areas;
  • the feature extraction unit 603 is configured to perform feature extraction on all points in the point cloud based on the point cloud network to obtain a feature set corresponding to the point cloud;
  • the feature construction unit 604 is configured to construct the area feature information of the candidate object area based on the feature set;
  • the prediction unit 605 is configured to predict the type and location information of the candidate object area based on the area prediction network and the area feature information, and obtain the prediction type and predicted location information of the candidate object area;
  • the optimization unit 606 is configured to perform optimization processing on the candidate object area based on the initial positioning information of the candidate object area, the prediction type of the candidate object area, and the predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area.
  • the detection unit 601 is specifically configured to:
  • determine the points corresponding to the foreground pixels in the point cloud of the scene as foreground points.
  • the area construction unit 602 is specifically configured to:
  • generate, with the foreground point as the center point, the candidate object area corresponding to the foreground point according to a predetermined size.
  • the feature construction unit 604 specifically includes:
  • the selection subunit 6041 is configured to select multiple target points in the candidate object area
  • An extraction subunit 6042 configured to extract the feature of the target point from the feature set to obtain the first part of feature information of the candidate object region;
  • the constructing subunit 6043 is configured to construct the second part of the feature information of the candidate object region based on the position information of the target point;
  • the fusion subunit 6045 is configured to fuse the first partial feature information and the second partial feature information to obtain the region feature information of the candidate object region.
  • The construction subunit 6043 is specifically configured to:
  • the standardized position information of the target point is adjusted to obtain the second partial feature information of the candidate object region.
  • the point cloud network includes: a first sampling network, and a second sampling network connected to the first sampling network; the feature extraction unit 603 specifically includes:
  • a down-sampling subunit 6031 configured to perform feature down-sampling processing on all points in the point cloud through the first sampling network to obtain initial features of the point cloud;
  • the up-sampling subunit 6032 is configured to perform up-sampling processing on the initial features through the second sampling network to obtain a feature set of the point cloud.
  • the first sampling network includes a plurality of set abstraction layers connected in sequence
  • the second sampling network includes a plurality of feature propagation layers connected in sequence, corresponding one-to-one with the set abstraction layers in the first sampling network.
  • the downsampling subunit 6031 is specifically used for:
  • the points in the point cloud are sequentially divided into local areas through the set abstraction layer, and the characteristics of the central points of the local areas are extracted to obtain the initial characteristics of the point cloud;
  • the up-sampling subunit 6032 is specifically used for:
  • the current input feature is up-sampled through the current feature propagation layer to obtain the feature set of the point cloud.
  • the region prediction network includes a feature extraction network, a classification network connected to the feature extraction network, and a regression network connected to the feature extraction network; referring to FIG. 6d, the prediction unit 605 specifically includes:
  • the global feature extraction subunit 6051, configured to perform feature extraction on the region feature information through the feature extraction network to obtain the global feature information of the candidate object region;
  • the classification subunit 6052, configured to classify the candidate object region based on the classification network and the global feature information to obtain the prediction type of the candidate object region;
  • the regression subunit 6053, configured to locate the candidate object region based on the regression network and the global feature information to obtain the predicted positioning information of the candidate object region.
  • the feature extraction network includes a plurality of sequentially connected set abstraction layers;
  • the classification network includes a plurality of sequentially connected fully connected layers;
  • the regression network includes a plurality of sequentially connected fully connected layers;
  • the global feature extraction subunit 6051 is specifically configured to perform feature extraction on the region feature information in turn through the set abstraction layers in the feature extraction network to obtain the global feature information of the candidate object region.
  • the optimization unit 606 specifically includes:
  • the screening subunit 6061 is configured to screen the candidate object regions based on the prediction types of the candidate object regions to obtain the filtered object regions;
  • the optimization subunit 6062 is configured to optimize and adjust the initial positioning information of the filtered object areas according to the predicted positioning information of the filtered object areas, to obtain the target object detection area and the positioning information of the target object detection area (see the sketch following this list).
  • each of the above units can be implemented as an independent entity, or can be combined arbitrarily and implemented as the same entity or as several entities.
  • for the specific implementation of each of the above units, please refer to the previous method embodiments, which will not be repeated here.
  • the object detection device of this embodiment can detect foreground points from the point cloud of the scene through the detection unit 601; the region construction unit 602 then constructs the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, and obtains the initial positioning information of the candidate object area; the feature extraction unit 603 performs feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; the feature construction unit 604 constructs the region feature information of the candidate object area based on the feature set; the prediction unit 605 predicts the type and positioning information of the candidate object area based on the region prediction network and the region feature information, and obtains the prediction type and predicted positioning information of the candidate object area; and the optimization unit 606 optimizes the candidate object area based on the initial positioning information, the prediction type, and the predicted positioning information of the candidate object area to obtain the target object detection area and its positioning information.
  • this solution can use the point cloud data of the scene for object detection, generate a candidate object area for each foreground point, and optimize the candidate object areas based on their region features; therefore, it can greatly improve the accuracy of object detection and is especially suitable for 3D object detection.
  • FIG. 7 shows a schematic structural diagram of the network device involved in the embodiment of the present application, specifically:
  • the network device may include a processor 701 with one or more processing cores, a memory 702 with one or more computer-readable storage media, a power supply 703, an input unit 704, and other components.
  • the structure shown in FIG. 7 does not constitute a limitation on the network device, which may include more or fewer components than shown in the figure, combine some components, or use a different arrangement of components. Specifically:
  • the processor 701 is the control center of the network device. It uses various interfaces and lines to connect the various parts of the entire network device, runs or executes the software programs and/or modules stored in the memory 702, calls the data stored in the memory 702, performs the various functions of the network device, and processes data, thereby monitoring the network device as a whole.
  • the processor 701 may include one or more processing cores; preferably, the processor 701 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, while the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may alternatively not be integrated into the processor 701.
  • the memory 702 may be used to store software programs and modules.
  • the processor 701 executes various functional applications and data processing by running the software programs and modules stored in the memory 702.
  • the memory 702 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like; the storage data area may store data created according to the use of the network device, and the like.
  • the memory 702 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.
  • the network device also includes a power supply 703 for supplying power to various components.
  • the power supply 703 may be logically connected to the processor 701 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
  • the power supply 703 may also include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
  • the network device may further include an input unit 704, which can be used to receive input digital or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
  • the network device may also include a display unit, etc., which will not be repeated here.
  • the processor 701 in the network device loads the executable files corresponding to the processes of one or more applications into the memory 702 according to the following instructions, and runs the applications stored in the memory 702, thereby realizing various functions, as follows:
  • detect foreground points from the point cloud of the scene; construct the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size to obtain the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on the region prediction network and the region feature information to obtain the prediction type and predicted positioning information of the candidate object area; and optimize the candidate object area based on the initial positioning information, the prediction type, and the predicted positioning information of the candidate object area to obtain the target object detection area and the positioning information of the target object detection area.
  • the network device of this embodiment detects foreground points from the point cloud of the scene; constructs the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size to obtain the initial positioning information of the candidate object area; performs feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; constructs the region feature information of the candidate object area based on the feature set; predicts the type and positioning information of the candidate object area based on the region prediction network and the region feature information to obtain the prediction type and predicted positioning information of the candidate object area; and optimizes the candidate object area based on the initial positioning information, the prediction type, and the predicted positioning information of the candidate object area to obtain the target object detection area and its positioning information.
  • this solution can use the point cloud data of the scene for object detection, generate a candidate object area for each foreground point, and optimize the candidate object areas based on their region features; therefore, it can greatly improve the accuracy of object detection and is especially suitable for 3D object detection.
  • an embodiment of the present application further provides a storage medium in which multiple instructions are stored, and the instructions can be loaded by a processor to execute the steps in any object detection method provided in the embodiments of the present application.
  • the instruction can perform the following steps:
  • detect foreground points from the point cloud of the scene; construct the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size to obtain the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on the region prediction network and the region feature information to obtain the prediction type and predicted positioning information of the candidate object area; and optimize the candidate object area based on the initial positioning information, the prediction type, and the predicted positioning information of the candidate object area to obtain the target object detection area and the positioning information of the target object detection area.
  • the storage medium may include: read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.
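To make the screening-and-adjustment step referenced in the list above concrete, here is a rough Python sketch. It is an illustration only, not the patent's implementation: the score threshold, the residual layout (dx, dy, dz, dl, dh, dw, dangle), and the function name are all assumptions.

```python
import numpy as np

def refine_proposals(boxes, scores, residuals, score_thresh=0.5):
    """Screen proposals by predicted type confidence, then adjust the
    surviving boxes with the predicted positioning residuals.

    boxes:     (K, 7) initial (x, y, z, l, h, w, angle) proposals.
    scores:    (K,) per-proposal confidence for the predicted type.
    residuals: (K, 7) predicted offsets for the seven box parameters.
    """
    keep = scores >= score_thresh            # screening by prediction type
    refined = boxes[keep] + residuals[keep]  # adjust initial positioning info
    return refined, scores[keep]
```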

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are an object detection method and apparatus, and a network device and a storage medium. In the embodiments of the present application, a foreground point can be detected from a point cloud of a scene; a candidate object area corresponding to the foreground point is constructed on the basis of the foreground point and a predetermined size to obtain initial positioning information of the candidate object area; feature extraction is carried out on all points in the point cloud on the basis of a point cloud network to obtain a feature set corresponding to the point cloud; area feature information of the candidate object area is constructed on the basis of the feature set; the type and positioning information of the candidate object area are predicted on the basis of an area prediction network and the area feature information to obtain the predicted type and the predicted positioning information of the candidate object area; and the candidate object area is optimized on the basis of the initial positioning information of the candidate object area and the predicted type and the predicted positioning information of the candidate object area to obtain a target object detection area and positioning information of the target object detection area. The solution can improve the accuracy of object detection.

Description

Object detection method, device, network equipment and storage medium
This application claims priority to Chinese patent application No. 201910267019.5, filed with the Chinese Patent Office on April 3, 2019 and entitled "Object detection method, device, network equipment and storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and specifically to object detection technology.
Background
Object detection refers to determining the location, category, and other attributes of objects in a scene. At present, object detection technology is widely used in various scenarios, such as autonomous driving and drones.
Current object detection schemes generally collect scene images, extract features from the scene images, and then determine the position and category of objects in the scene image based on the extracted features. However, practice has shown that current object detection schemes suffer from problems such as low detection accuracy, especially in 3D object detection scenes.
Summary of the invention
The embodiments of the present application provide an object detection method, device, network device, and storage medium, which can improve the accuracy of object detection.
An embodiment of the present application provides an object detection method, executed by a network device, including:
detecting foreground points from a point cloud of a scene;
constructing a candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, and determining initial positioning information of the candidate object area;
performing feature extraction on all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud;
constructing region feature information of the candidate object area based on the feature set;
predicting the type and positioning information of the candidate object area based on a region prediction network and the region feature information, to obtain a prediction type and predicted positioning information of the candidate object area;
optimizing the candidate object area based on the initial positioning information, the prediction type, and the predicted positioning information of the candidate object area, to obtain a target object detection area and positioning information of the target object detection area.
Correspondingly, an embodiment of the present application also provides an object detection device, including:
a detection unit, configured to detect foreground points from a point cloud of a scene;
an area construction unit, configured to construct a candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, to obtain initial positioning information of the candidate object area;
a feature extraction unit, configured to perform feature extraction on all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud;
a feature construction unit, configured to construct region feature information of the candidate object area based on the feature set;
a prediction unit, configured to predict the type and positioning information of the candidate object area based on a region prediction network and the region feature information, to obtain a prediction type and predicted positioning information of the candidate object area;
an optimization unit, configured to optimize the candidate object area based on the initial positioning information, the prediction type, and the predicted positioning information of the candidate object area, to obtain a target object detection area and positioning information of the target object detection area.
An embodiment of the present application also provides a network device, including a memory and a processor; the memory stores multiple instructions, and the processor loads the instructions in the memory to execute the steps in any of the object detection methods provided in the embodiments of the present application.
In addition, an embodiment of the present application provides a storage medium storing multiple instructions, the instructions being suitable for loading by a processor to execute the steps in any of the object detection methods provided in the embodiments of the present application.
In addition, an embodiment of the present application provides a computer program product including instructions which, when run on a computer, cause the computer to execute the steps in any of the object detection methods provided in the embodiments of the present application.
The embodiments of the present application can detect foreground points from the point cloud of a scene; construct the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, and determine the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud based on a point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on a region prediction network and the region feature information, to obtain the prediction type and predicted positioning information of the candidate object area; and optimize the candidate object area based on the initial positioning information, the prediction type, and the predicted positioning information of the candidate object area, to obtain the target object detection area and its positioning information. Since this solution can use the point cloud data of the scene for object detection, generate a candidate object area for each foreground point in the point cloud, and optimize the candidate object areas based on their region features, it can greatly improve the accuracy of object detection; the improvement is particularly notable for 3D object detection.
Description of the drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings needed in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative work.
FIG. 1a is a schematic diagram of a scene of an object detection method provided by an embodiment of the present application;
FIG. 1b is a flowchart of an object detection method provided by an embodiment of the present application;
FIG. 1c is a schematic structural diagram of a point cloud network provided by an embodiment of the present application;
FIG. 1d is a schematic diagram of the PointNet++ network structure provided by an embodiment of the present application;
FIG. 1e is a schematic diagram of an object detection effect in an autonomous driving scene provided by an embodiment of the present application;
FIG. 2a is a schematic diagram of image semantic segmentation provided by an embodiment of the present application;
FIG. 2b is a schematic diagram of point cloud segmentation provided by an embodiment of the present application;
FIG. 2c is a schematic diagram of candidate region generation provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of candidate region feature construction provided by an embodiment of the present application;
FIG. 4a is a schematic structural diagram of a region prediction network provided by an embodiment of the present application;
FIG. 4b is another schematic structural diagram of the region prediction network provided by an embodiment of the present application;
FIG. 5a is another schematic flowchart of object detection provided by an embodiment of the present application;
FIG. 5b is an architecture diagram of object detection provided by an embodiment of the present application;
FIG. 5c is a schematic diagram of test experiment results provided by an embodiment of the present application;
FIG. 6a is a schematic structural diagram of an object detection device provided by an embodiment of the present application;
FIG. 6b is another schematic structural diagram of the object detection device provided by an embodiment of the present application;
FIG. 6c is another schematic structural diagram of the object detection device provided by an embodiment of the present application;
FIG. 6d is another schematic structural diagram of the object detection device provided by an embodiment of the present application;
FIG. 6e is another schematic structural diagram of the object detection device provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a network device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of this application.
The embodiments of the present application provide an object detection method, device, network device, and storage medium. The object detection device may be integrated in a network device, and the network device may be a server, a terminal, or another device; for example, the network device may include a vehicle-mounted device, a micro processing box, and other devices.
Object detection refers to determining or recognizing the location, category, and other attributes of objects in a scene, for example, recognizing the category and location of objects in a road scene, such as street lights, vehicles, and their positions.
Referring to FIG. 1a, an embodiment of the present application provides an object detection system including a network device, a collection device, and the like; the network device and the collection device are communicatively connected, for example, through a wired or wireless network. In an embodiment, the network device and the collection device may be integrated into one device.
The collection device can be used to collect point cloud data or image data of the scene. In an embodiment, the collection device can upload the collected point cloud data to the network device for processing.
The network device can be used for object detection. Specifically, it can detect foreground points from the point cloud of a scene; construct the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, to obtain the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud based on a point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on a region prediction network and the region feature information, to obtain the prediction type and predicted positioning information of the candidate object area; and optimize the candidate object area based on the initial positioning information, the prediction type, and the predicted positioning information of the candidate object area, to obtain the target object detection area and its positioning information. In practical applications, after the positioning information of the target object detection area is obtained, the detected object can be marked in the scene image according to the positioning information, for example, framed in the scene image in the form of a detection box. In an embodiment, the type of the detected object can also be marked in the scene image.
Detailed descriptions are given below. It should be noted that the description order of the following embodiments does not limit the preferred order of the embodiments.
This embodiment will be described from the perspective of an object detection device. The object detection device can be integrated in a network device, and the network device can be a server, a terminal, or another device; the terminal can include a mobile phone, a tablet computer, a notebook computer, a personal computer (PC), a micro processing terminal, and other devices.
An embodiment of the present application provides an object detection method, which may be executed by a processor of a network device. As shown in FIG. 1b, the specific flow of the object detection method may be as follows:
101. Detect foreground points from the point cloud of the scene.
A point cloud is a set of points describing the surface characteristics of a scene or target. The points in a point cloud may contain position information such as three-dimensional coordinates, and may also include color information (RGB) or reflection intensity information (Intensity).
A point cloud can be obtained through the principle of laser measurement or photogrammetry; for example, the point cloud of an object can be obtained by scanning with a laser scanner or a photographic scanner. The principle of laser-based point cloud acquisition is as follows: when a laser beam irradiates the surface of an object, the reflected laser carries information such as direction and distance. If the laser beam is scanned along a certain track, the reflected laser point information is recorded while scanning. Because the scanning is extremely fine, a large number of laser points can be obtained, thus forming a laser point cloud. Point cloud formats include *.las, *.pcd, *.txt, and so on.
In the embodiments of the present application, the point cloud data of the scene can be collected by the network device itself, collected by another device and then obtained by the network device from that device, retrieved from a network database, and so on.
There can be many kinds of scenes, for example, road scenes in autonomous driving, aerial scenes in drone flight, and so on.
Foreground points are defined relative to background points: a scene can be divided into a background and a foreground, points in the background are called background points, and points in the foreground are called foreground points. In the embodiments of the present application, the point cloud of the scene can be semantically segmented to identify the foreground points in the scene point cloud.
In the embodiments of the present application, there are many ways to detect foreground points from the point cloud. For example, semantic segmentation can be performed directly on the point cloud of the scene to obtain the foreground points in the point cloud. Semantic segmentation refers to classifying each point in the point cloud of a scene so as to identify the points belonging to a certain type. There are many ways to perform semantic segmentation; for example, 2D semantic segmentation or 3D semantic segmentation can be used to segment the point cloud.
For another example, in order to detect more foreground points and improve the credibility and accuracy of foreground point detection, in one embodiment, semantic segmentation may first be performed on the image of the scene to obtain foreground pixels, and the foreground pixels are then mapped into the point cloud to obtain the foreground points. Specifically, the step of "detecting foreground points from the point cloud of the scene" may include:
performing semantic segmentation on the image of the scene to obtain foreground pixels;
determining the points in the point cloud of the scene that correspond to the foreground pixels as foreground points.
In one embodiment, the foreground pixels can be mapped into the point cloud of the scene to obtain the target points in the point cloud corresponding to the foreground pixels. For example, the mapping can be implemented based on the mapping relationship between the pixels in the image and the points in the point cloud (such as a position mapping relationship), and the target points that have a mapping relationship with the foreground pixels are determined as foreground points.
In another embodiment, the points in the point cloud can be projected into the image of the scene, for example, through a mapping relationship matrix or transformation matrix between the point cloud and the pixels. Then, the segmentation result corresponding to each point in the image (such as foreground pixel or background pixel) is taken as the segmentation result of that point, and whether the point is a foreground point is determined based on this segmentation result, thereby determining each foreground point in the point cloud. Specifically, when the segmentation result of a point is a foreground pixel, the point is determined to be a foreground point.
In order to improve the accuracy of semantic segmentation, the semantic segmentation in the embodiments of the present application can be implemented by a segmentation network based on deep learning. For example, DeepLabV3 based on Xception can be used as the segmentation network, and the image of the scene is segmented through this segmentation network to obtain foreground pixels, such as the foreground pixels of cars, pedestrians, and cyclists in autonomous driving. Then, the points in the point cloud are projected into the image of the scene, and the corresponding segmentation result of each point in the image is taken as the segmentation result of that point, thereby determining the foreground points in the point cloud. This method can accurately detect the foreground points in the point cloud.
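As a rough illustration of the projection-based labeling described above, the following Python sketch maps each 3D point onto the 2D segmentation mask and keeps the points that land on foreground pixels. It is not part of the patent; the 3x4 projection matrix, the binary mask format, and the function name are assumptions made for illustration.

```python
import numpy as np

def foreground_points_from_mask(points_xyz, mask, proj_mat):
    """Label each 3D point as foreground if it projects onto a foreground pixel.

    points_xyz: (N, 3) point coordinates in the sensor frame.
    mask:       (H, W) binary array, 1 where the 2D segmentation network
                predicted foreground.
    proj_mat:   (3, 4) camera projection matrix mapping homogeneous 3D
                points to image coordinates (assumed given by calibration).
    """
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])     # (N, 4)
    uvw = homo @ proj_mat.T                             # (N, 3)
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)  # perspective divide

    h, w = mask.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    # Keep only points that fall inside the image and in front of the camera.
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    fg = np.zeros(n, dtype=bool)
    fg[valid] = mask[v[valid], u[valid]] > 0            # inherit the pixel label
    return points_xyz[fg]
```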
102. Construct the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, and determine the initial positioning information of the candidate object area.
After obtaining the foreground points, the embodiments of the present application can construct the object area corresponding to each foreground point based on the foreground point and a predetermined size, and take the object area corresponding to the foreground point as a candidate object area.
The candidate object area can be a two-dimensional (2D) area or a three-dimensional (3D) area, which can be determined according to actual requirements. The predetermined size can be set according to actual requirements and can include predetermined size parameters, for example, length l * width w for a 2D area, and length l * width w * height h for a 3D area.
For example, in order to improve the accuracy of object detection, the foreground point can be taken as the center point, and the candidate object area corresponding to the foreground point is generated according to the predetermined size.
The positioning information of the candidate object area can include the position information, size information, and so on of the candidate object area.
For example, in one embodiment, to facilitate subsequent calculations in the object detection process, the position information of the candidate object area can be represented by the position information of a reference point in the candidate object area, and the reference point can be set according to actual requirements; for example, the center point of the candidate object area can be used as the reference point. Taking a three-dimensional area as an example, the position information of the candidate object area can include the 3D coordinates of the center point, such as (x, y, z).
The size information of the candidate object area can include the size parameters of the area. For example, when the candidate object area is a 2D area, its size information can include length l * width w; when the candidate object area is a 3D area, its size information can include length l * width w * height h, and so on.
In addition, in some scenes, the orientation of an object is also important reference information. Therefore, in some embodiments, the positioning information of the candidate object area can also include the orientation of the candidate object area, such as forward, backward, downward, or upward; the orientation of the candidate object area can indicate the orientation of the object in the scene. In practical applications, the orientation of the candidate object area can be expressed as an angle; for example, two orientations can be defined, 0° and 90° respectively.
In practical applications, to facilitate object detection and user observation, the candidate object area can be identified in the form of a detection box, for example, a 2D detection box or a 3D detection box.
For example, taking a driving road scene as an example, referring to FIG. 2a, a 2D segmentation network can be used to semantically segment the image to obtain the image segmentation result (including foreground pixels, etc.); then, referring to FIG. 2b, the image segmentation result is mapped into the point cloud to obtain the point cloud segmentation result (including the foreground points). Next, a candidate object area is generated with each foreground point as the center. A schematic diagram of candidate object area generation is shown in FIG. 2c. With each foreground point as the center, a 3D detection box of a manually specified size is generated as the candidate object area. The candidate object area is represented as (x, y, z, l, h, w, angle), where x, y, z are the 3D coordinates of the center point, and l, h, w are the length, height, and width set for the candidate area. In the actual experiment, l = 3.8, h = 1.6, w = 1.5. angle represents the orientation of the 3D candidate area; when generating the candidate object areas, the embodiment of the present application uses two orientations, 0° and 90° respectively.
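To make the proposal generation step concrete, here is a minimal Python sketch. The fixed size l = 3.8, h = 1.6, w = 1.5 and the two orientations come from the text above, while the function name and data layout are illustrative assumptions.

```python
import numpy as np

# Predetermined box size from the text: length, height, width.
L, H, W = 3.8, 1.6, 1.5
ORIENTATIONS = (0.0, np.pi / 2)  # the two orientations, 0° and 90°

def generate_proposals(foreground_xyz):
    """Build one (x, y, z, l, h, w, angle) proposal per foreground point
    and per predefined orientation, centered on the point."""
    proposals = []
    for x, y, z in foreground_xyz:
        for angle in ORIENTATIONS:
            proposals.append((x, y, z, L, H, W, angle))
    return np.array(proposals, dtype=np.float32)  # shape (2N, 7)
```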
Through the above steps, the embodiment of the present application can generate a candidate object area, such as a 3D candidate object detection box, for each foreground point.
103. Perform feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud.
The point cloud network can be a network based on deep learning, for example, a point cloud network such as PointNet or PointNet++. In the embodiments of the present application, the order of step 103 and step 102 is not limited by their numbering: step 102 can be executed before step 103, step 103 can be executed before step 102, or they can be executed simultaneously.
Specifically, all points in the point cloud can be input into the point cloud network, and the point cloud network performs feature extraction on the input points to obtain the feature set corresponding to the point cloud.
The following takes PointNet++ as an example to introduce the point cloud network. As shown in FIG. 1c, the point cloud network can include a first sampling network and a second sampling network, where the first sampling network is connected to the second sampling network. In practical applications, the first sampling network can be called the encoder and the second sampling network the decoder. Specifically, feature down-sampling is performed on all points in the point cloud through the first sampling network to obtain the initial features of the point cloud; the initial features are then up-sampled through the second sampling network to obtain the feature set of the point cloud.
Referring to FIG. 1d, the first sampling network includes a plurality of sequentially connected set abstraction (SA) layers, and the second sampling network includes a plurality of sequentially connected feature propagation (FP) layers, each corresponding one-to-one to a set abstraction layer in the first sampling network. The SA layers in the first sampling network correspond to the FP layers in the second sampling network, and their number can be set according to actual requirements; for example, the first sampling network and the second sampling network each include three SA or FP layers.
Referring to FIG. 1d, the first sampling network can include three down-sampling steps (that is, the encoding stage includes three down-sampling steps), with 1024, 256, and 64 points respectively; the second sampling network can include three up-sampling steps (that is, the decoding stage includes three up-sampling steps), with 256, 1024, and N points respectively. The feature extraction process of the point cloud network is as follows:
All points of the point cloud are input into the first sampling network; each set abstraction (SA) layer in the first sampling network in turn divides the points in the point cloud into local regions and extracts the features of the center points of the local regions, obtaining the initial features of the point cloud. For example, referring to FIG. 1d, with a point cloud of size N×4 as input, after three layers of SA down-sampling, the output point cloud features are 64×1024.
In the embodiments of the present application, PointNet++ uses the idea of hierarchical feature extraction, and each round is called a set abstraction. It is divided into three parts: a sampling layer, a grouping layer, and a feature extraction layer. First, the sampling layer: in order to extract some relatively important center points from the dense point cloud, the farthest point sampling (FPS) method is adopted, although random sampling is also possible. Then comes the grouping layer, which searches for the k nearest neighbor points within a certain range of each center point extracted by the previous layer to form a patch. The feature extraction layer performs convolution and pooling on these k points through a small PointNet network, and the obtained feature is used as the feature of this center point and sent to the next layer. In this way, the center points obtained at each layer are a subset of the center points of the previous layer, and as the number of layers deepens, the number of center points becomes smaller and smaller, but each center point contains more and more information.
According to the above description, the first sampling network in the embodiments of the present application is composed of multiple SA layers. At each level, a set of points is processed and abstracted to produce a new set with fewer elements. The set abstraction layer consists of three key layers: a sampling layer, a grouping layer, and a PointNet layer. The sampling layer selects a set of points from the input points, which define the centroids of local regions. The grouping layer constructs local region sets by finding the "neighboring" points around each centroid. The PointNet layer uses a mini-PointNet to encode each local region set into a feature vector.
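As a concrete illustration of the sampling layer just described, the following Python sketch implements farthest point sampling, a standard algorithm; the function name and array layout are illustrative assumptions rather than text from the patent.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already-selected set.

    points:    (N, 3) array of coordinates.
    n_samples: number of centroids to select.
    Returns the indices of the selected centroids.
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=int)
    # Distance from every point to the nearest selected centroid so far.
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)  # arbitrary starting point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(dist))  # farthest remaining point
    return selected
```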
In one embodiment, considering that real point clouds are rarely uniformly distributed, small-scale sampling should be used in dense regions to capture the finest details, while large-scale sampling should be used in sparse regions, because too small a scale leads to insufficient sampling in sparse areas. Therefore, the embodiments of the present application propose an improved SA layer. Specifically, the grouping layer in the SA layer can use multi-scale grouping (MSG): during grouping, the local features under each radius are extracted and then combined together. The idea is to sample multi-scale features in the grouping layer and concatenate (concat) them. For example, referring to FIG. 1d, MSG grouping is used in the first and second SA layers.
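A minimal sketch of the multi-scale grouping idea follows; it is illustrative only, and the radii, the stand-in feature extractor (a simple max-pool instead of the mini-PointNet), and the function name are assumptions.

```python
import numpy as np

def multi_scale_group(points, feats, centroid, radii=(0.1, 0.2, 0.4), k=16):
    """Gather neighbors of one centroid at several radii and concatenate
    the per-scale features, in the spirit of PointNet++ MSG grouping."""
    per_scale = []
    d = np.linalg.norm(points - centroid, axis=1)
    for r in radii:
        idx = np.where(d < r)[0][:k]             # up to k neighbors within radius r
        if idx.size == 0:
            idx = np.array([int(np.argmin(d))])  # fall back to the nearest point
        # Stand-in for the mini-PointNet: max-pool the neighbor features.
        per_scale.append(feats[idx].max(axis=0))
    return np.concatenate(per_scale)             # one multi-scale feature vector
```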
In addition, in one embodiment, to improve robustness to variations in sampling density, single-scale grouping (SSG) can also be used in an SA layer, for example, in the SA layer serving as the output.
After the first sampling network outputs the initial features of the point cloud, the initial features can be input into the second sampling network, and the second sampling network up-samples the initial features, for example, with residual up-sampling. For example, referring to FIG. 1d, after the three FP layers of the second sampling network up-sample the 64×1024 features, features of size N×128 are output.
In one implementation, to better prevent feature gradients from degrading or information from being lost, the up-sampling in the second sampling network also needs to take into account the features output by each SA layer in the first sampling network. Specifically, the step of "up-sampling the initial features through the second sampling network to obtain the feature set of the point cloud" includes:
determining the output feature of the previous layer and the input feature of the set abstraction layer corresponding to the current feature propagation layer as the current input feature of the current feature propagation layer;
up-sampling the current input feature through the current feature propagation layer to obtain the feature set of the point cloud.
The output feature of the previous layer can come from the SA layer or FP layer immediately preceding the current FP layer. For example, referring to FIG. 1d, the 64×1024 point cloud features are input to the first FP layer; the first FP layer takes the 64×1024 point cloud features and the 256×256 features input to the third SA layer as the current input features, up-samples them, and outputs the resulting features to the second FP layer. The second FP layer takes the 256×128 output features of the previous FP layer and the 1024×128 features input to the second SA layer as the current input features, up-samples them, and obtains 1024×128 features that are input to the third FP layer. The third FP layer takes the 1024×128 features output by the second FP layer and the N×4 features input to the first SA layer as the current input features, performs up-sampling, and outputs the final features of the point cloud.
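The skip-connection pattern described above can be sketched as follows. This is a rough Python/NumPy illustration; the inverse-distance interpolation scheme and the names are assumptions based on the usual PointNet++ feature propagation layer, not a quotation of the patent's implementation.

```python
import numpy as np

def feature_propagation(xyz_dst, xyz_src, feats_src, feats_skip, k=3):
    """Propagate features from a sparse level (src) back to a denser level (dst).

    xyz_dst:    (M, 3) coordinates of the denser point set.
    xyz_src:    (S, 3) coordinates of the sparser point set.
    feats_src:  (S, C) features coming from the previous (deeper) layer.
    feats_skip: (M, D) input features of the corresponding SA layer
                (the skip connection).
    Returns (M, C + D) concatenated features, ready for a shared MLP.
    """
    # Inverse-distance-weighted interpolation over the k nearest source points.
    d = np.linalg.norm(xyz_dst[:, None, :] - xyz_src[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]                    # (M, k)
    w = 1.0 / np.maximum(np.take_along_axis(d, idx, 1), 1e-8)
    w = w / w.sum(axis=1, keepdims=True)
    interp = (feats_src[idx] * w[..., None]).sum(axis=1)  # (M, C)
    return np.concatenate([interp, feats_skip], axis=1)   # skip connection
```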
Through the above steps, feature extraction can be performed on all points in the point cloud to obtain the feature set of the point cloud, which prevents information loss and improves the accuracy of object detection.
104. Construct the region feature information of the candidate object area based on the feature set.
In the embodiments of the present application, there are many ways to construct the feature information of a candidate object area based on the feature set of the point cloud. For example, the features of some points can be selected from the feature set as the feature information of the candidate object area to which they belong; for another example, the position information of some points can be selected from the feature set as the feature information of the candidate object area to which they belong.
For another example, to improve the accuracy of region feature extraction, the features and position information of some points can be combined to construct the region feature information. Specifically, the step of "constructing the region feature information of the candidate object area based on the feature set" can include:
selecting multiple target points in the candidate object area;
extracting the features of the target points from the feature set to obtain the first partial feature information of the candidate object area;
constructing the second partial feature information of the candidate object area based on the position information of the target points;
fusing the first partial feature information and the second partial feature information to obtain the region feature of the candidate object area.
The number of target points and the selection method can be set according to actual requirements. For example, a certain number of points, such as 512 points, can be selected in the candidate object area either randomly or according to a certain selection method (such as selection based on the distance from the center point).
After the target points are selected from the candidate object area, the features of the target points can be extracted from the feature set of the point cloud, and the extracted features of the target points serve as the first partial feature information of the candidate object area (which can be denoted F1). For example, after randomly selecting 512 points, the features of these 512 points can be extracted from the feature set of the point cloud to form the first partial feature information F1.
For example, referring to FIG. 3, the features of the 512 target points within the candidate object area can be cropped from the point cloud feature set (B, N, C) to form F1 (B, M, C), where M is the number of target points, e.g., M = 512, and N is the number of points in the point cloud.
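A minimal illustration of this cropping step is given below. The box-membership test and names are assumptions: an axis-aligned box is used for simplicity, whereas the patent's proposals also carry an orientation angle.

```python
import numpy as np

def crop_region_features(points, feats, center, size, m=512):
    """Select up to m target points inside an axis-aligned box and return
    their features (F1) and coordinates.

    points: (N, 3) point coordinates; feats: (N, C) per-point features.
    center: (3,) box center; size: (3,) box extents (l, w, h).
    """
    half = np.asarray(size) / 2.0
    inside = np.all(np.abs(points - center) <= half, axis=1)
    idx = np.where(inside)[0]
    if idx.size == 0:
        return None  # empty proposal, skipped in practice
    # Randomly pick m target points (with replacement if too few fall inside).
    choice = np.random.choice(idx, size=m, replace=idx.size < m)
    return feats[choice], points[choice]
```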
其中,基于目标点的位置信息构建候选物体区域的第二部分特征的方式可以有多种,比如,可以将目标点的位置信息直接作为候选物体区域的第二部分特征信息(可以用F2表示)。Among them, there are many ways to construct the second part of the feature of the candidate object area based on the location information of the target point. For example, the location information of the target point can be directly used as the second part of the feature information of the candidate object area (which can be represented by F2) .
又比如,为了提升位置特征的提取精确性,还可以在对位置信息做一些变换后构建候选物体区域的第二部分特征。比如,步骤“基于目标点的位置信息构建候选物体区域的第二部分特征信息”,可以包括:For another example, in order to improve the accuracy of location feature extraction, it is also possible to construct the second part of the feature of the candidate object region after some transformation of the location information. For example, the step of "constructing the second part of the feature information of the candidate object region based on the position information of the target point" may include:
(1)、对目标点的位置信息进行标准化处理,得到目标点的标准化位置信息。(1) Standardize the position information of the target point to obtain the standardized position information of the target point.
其中,目标点的位置信息可以包括目标点的坐标信息如3D坐标xyz,位置信息的标准化处理(Normalize)可以根据实际需求设定,比如,可以基于候选物体区域的中心点位置信息对目标点的位置信息进行调整。譬如,将目标点的3D坐标减去候选物体区域中心的3D坐标等。Among them, the position information of the target point may include the coordinate information of the target point, such as 3D coordinates xyz, and the normalize of the position information can be set according to actual needs. For example, the target point can be determined based on the position information of the center point of the candidate object area. Position information is adjusted. For example, subtract the 3D coordinates of the center of the candidate object from the 3D coordinates of the target point.
(2) Fusing the first part of the feature information with the normalized position information to obtain the fused feature information of the target points.
For example, referring to Figure 3, the normalized positions (e.g., 3D coordinates) of the M=512 points can be fused with the first partial features F1; specifically, the two can be fused by concatenation (Concat), yielding the fused features (B, M, C+3).
(3) Applying a spatial transformation to the fused feature information of the target points to obtain their transformed position information.
To further improve the accuracy of the second part of the features, a spatial transformation can be applied to the fused features.
For example, in one embodiment, a spatial transformer network (STN) can be used, such as a supervised variant like T-Net. Referring to Figure 3, the fused features (B, M, C+3) can be spatially transformed by T-Net to obtain the transformed coordinates (B, 3).
(4) Adjusting the normalized position information of the target points based on the transformed position information, to obtain the second part of the feature information of the candidate object region.
For example, the transformed position values can be subtracted from the normalized position values of the target points to obtain the second partial feature F2. Referring to Figure 3, the transformed 3D coordinates (B, 3) can be subtracted from the normalized 3D coordinates of the target points (B, M, 3) to obtain F2.
Because the features are spatially transformed and the transformed position is subtracted from the position features, the geometric stability (spatial invariance) of the position features is improved, which in turn improves the accuracy of feature extraction.
Through the above steps, the first and second parts of the feature information of each candidate object region are obtained; fusing these two parts then yields the region feature information of each candidate object region. For example, referring to Figure 3, F1 and F2 can be concatenated (Concat) to obtain the concatenated features (B, M, C+3) of the candidate object region, which serve as its region features.
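As a sketch of steps (1) through (4) and the final fusion, the following PyTorch code follows the shapes given above; the internal layer widths of the T-Net and the max-pooling readout are illustrative assumptions, since the description above only specifies that a supervised spatial transformer such as T-Net maps the fused (B, M, C+3) features to transformed coordinates of shape (B, 3).

```python
import torch
import torch.nn as nn

class TNetSketch(nn.Module):
    """Illustrative T-Net: predicts one 3D offset per proposal from the fused
    per-point features of shape (B, M, C+3); layer widths are assumptions."""
    def __init__(self, in_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU())
        self.head = nn.Linear(64, 3)

    def forward(self, x):                       # x: (B, M, C+3)
        x = self.mlp(x).max(dim=1).values       # (B, 64), max-pool over the M points
        return self.head(x)                     # (B, 3) transformed coordinates

def build_region_feature(F1, xyz, center, tnet):
    """F1: (B, M, C) cropped features; xyz: (B, M, 3) target-point coordinates;
    center: (B, 3) proposal center. Returns the region feature (B, M, C+3)."""
    norm_xyz = xyz - center.unsqueeze(1)            # step (1): normalize positions
    fused = torch.cat([F1, norm_xyz], dim=-1)       # step (2): Concat -> (B, M, C+3)
    t = tnet(fused)                                 # step (3): spatial transform -> (B, 3)
    F2 = norm_xyz - t.unsqueeze(1)                  # step (4): adjusted positions (B, M, 3)
    return torch.cat([F1, F2], dim=-1)              # final fusion of F1 and F2

tnet = TNetSketch(in_dim=128 + 3)
region_feat = build_region_feature(torch.randn(2, 512, 128),
                                   torch.randn(2, 512, 3),
                                   torch.randn(2, 3), tnet)   # (2, 512, 131)
```

With C = 128, the final region feature has width C + 3 = 131 per point, consistent with the M×131 input to the region prediction network mentioned below.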
105. Predict the type and localization information of the candidate object region based on the region prediction network and the region feature information, obtaining the predicted type and the predicted localization information of the candidate object region.
The region prediction network is used to predict the type and localization information of candidate object regions; for example, it can classify and localize a candidate object region to obtain its predicted type and predicted localization information. The network can be a deep-learning-based region prediction network trained on point clouds or images of sample objects.
The predicted localization information may include predicted position information such as 2D or 3D coordinates and predicted dimensions such as length, width, and height; in one embodiment it may further include predicted orientation information, such as 0° or 90°.
The structure of the region prediction network is described below. Referring to Figure 4a, the region prediction network may include a feature extraction network, a classification network, and a regression network, with the classification network and the regression network each connected to the feature extraction network, as follows:
The feature extraction network performs feature extraction on the input information; for example, it extracts features from the region feature information of a candidate object region to obtain the global feature information of that region.
The classification network classifies regions; for example, a candidate object region can be classified based on its global feature information to obtain its predicted type.
The regression network localizes regions; for example, it localizes a candidate object region to obtain its predicted localization information. Because a regression network is used for this prediction, the output localization information may also be called regression information, e.g., predicted regression information.
For example, the step of "predicting the type and localization information of the candidate object region based on the region prediction network and the region feature information to obtain the predicted type and predicted localization information of the candidate object region" may include:
performing feature extraction on the region feature information through the feature extraction network to obtain the global feature information of the candidate object region;
classifying the candidate object region based on the classification network and the global feature information to obtain the predicted type of the candidate object region;
localizing the candidate object region based on the regression network and the global feature information to obtain the predicted localization information of the candidate object region.
To improve prediction accuracy, referring to Figure 4b, in this embodiment of the present application the feature extraction network may include several sequentially connected set abstraction (SA) layers. The classification network may include several sequentially connected fully connected (fc) layers; as shown in Figure 4b, it includes several fc layers for classification, such as cls-fc1, cls-fc2, and cls-pred. The regression network likewise includes several sequentially connected fully connected layers; as shown in Figure 4b, it includes several fc layers for regression, such as reg-fc1, reg-fc2, and reg-pred. In this embodiment, the numbers of SA and fc layers can be set according to actual requirements.
In this embodiment, extracting the global feature information of a region may include: performing feature extraction on the region feature information sequentially through the set abstraction layers of the feature extraction network to obtain the global feature information of the candidate object region.
For the structure of the set abstraction layer, refer to the description above. In one embodiment, grouping within the SA layers can be done at a single scale, i.e., single-scale grouping (SSG), which improves the accuracy and efficiency of global feature extraction.
Referring to Figure 4b, the region prediction network can extract features from the region feature information through three SA layers in sequence; for example, when the input is an M×131 feature, the three SA layers successively produce features such as 128×128 and 32×256. After the SA-layer feature extraction, the global feature information is obtained and can then be fed into the classification network and the regression network respectively.
The classification network reduces the dimensionality of the global feature information through the first two layers, cls-fc1 and cls-fc2, performs classification through the final cls-pred layer, and outputs the predicted type of the candidate object region.
The regression network reduces the dimensionality of the global feature information through the first two layers, reg-fc1 and reg-fc2, and performs regression through the final reg-pred layer to obtain the predicted localization information of the candidate object region.
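A minimal sketch of these two prediction heads is given below, mirroring the cls-fc1/cls-fc2/cls-pred and reg-fc1/reg-fc2/reg-pred structure; the layer widths, the number of classes, and the 7-dimensional regression output (x, y, z, l, h, w, angle) are illustrative assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class PredictionHeadsSketch(nn.Module):
    """cls-fc1/cls-fc2/cls-pred and reg-fc1/reg-fc2/reg-pred branches over the
    global feature vector; widths and output sizes are illustrative."""
    def __init__(self, in_dim=512, num_classes=2, reg_dim=7):
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),   # cls-fc1
                                 nn.Linear(256, 128), nn.ReLU(),      # cls-fc2
                                 nn.Linear(128, num_classes))         # cls-pred
        self.reg = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),   # reg-fc1
                                 nn.Linear(256, 128), nn.ReLU(),      # reg-fc2
                                 nn.Linear(128, reg_dim))             # reg-pred

    def forward(self, global_feat):           # (B, in_dim) from the SA layers
        return self.cls(global_feat), self.reg(global_feat)

heads = PredictionHeadsSketch()
cls_logits, reg_params = heads(torch.randn(4, 512))   # (4, 2) and (4, 7)
```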
The types of candidate object regions can be defined according to actual requirements; for example, regions can be divided by whether they contain an object (object region vs. empty region), or graded by quality as high, medium, or low.
Through the above steps, the type and localization information of each candidate object region can be predicted.
106. Optimize the candidate object region based on the initial localization information, the predicted type, and the predicted localization information, obtaining the target object detection region and the localization information of the target object detection region.
Various optimization strategies are possible. For example, the localization information of the candidate object regions can first be adjusted based on the predicted localization information, and the regions then filtered based on the predicted types. Alternatively, in one embodiment, the candidate object regions can first be filtered based on the predicted types, and the localization information adjusted afterwards.
For example, the step of "optimizing the candidate object region based on the initial localization information, the predicted type, and the predicted localization information to obtain the target object detection region and its localization information" may include:
filtering the candidate object regions based on their predicted types to obtain the filtered object regions;
optimizing and adjusting the initial localization information of the filtered object regions according to their predicted localization information, to obtain the target object detection region and the localization information of the target object detection region.
For example, when the predicted types distinguish object regions from empty regions, the candidate object regions predicted to be empty can be filtered out; the initial localization information of the candidate regions remaining after filtering is then optimized based on their predicted localization information.
Specifically, the localization information can be adjusted, for example, based on the difference between the predicted and the initial localization information, such as differences in the region's 3D coordinates or dimensions.
As another example, an optimal localization can be determined from the predicted and the initial localization information, and the localization of the candidate object region then set to this optimal value, e.g., optimal region 3D coordinates and length, width, and height. A sketch of the first, difference-based variant is given below.
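For illustration, the following sketch implements the filter-then-adjust variant: regions predicted empty are discarded, and the remaining boxes are adjusted by the predicted residuals. Treating the regression output as additive residuals and the 0.5 score threshold are assumptions of the sketch, not details fixed by the description above.

```python
import torch

def refine_proposals(init_boxes, cls_logits, reg_deltas, score_thresh=0.5):
    """init_boxes: (K, 7) initial (x, y, z, l, h, w, angle) per candidate region;
    cls_logits: (K, 2) scores for (empty, object); reg_deltas: (K, 7) predicted
    localization residuals. Residual-style adjustment is one of the options
    described above, not the only possible choice."""
    probs = torch.softmax(cls_logits, dim=-1)
    keep = probs[:, 1] > score_thresh               # filter out regions predicted empty
    refined = init_boxes[keep] + reg_deltas[keep]   # adjust by the predicted difference
    return refined, probs[keep, 1]

boxes, scores = refine_proposals(torch.randn(100, 7),
                                 torch.randn(100, 2),
                                 0.1 * torch.randn(100, 7))
```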
In practical applications, the object detection region can further be marked in the scene image based on the localization information of the target object detection region. For example, referring to Figure 1e, the object detection method provided in this embodiment of the present application can accurately detect the position, size, and orientation of objects on the current road in an autonomous driving scene, which benefits the decision-making and judgment of autonomous driving.
The object detection provided in the embodiments of the present application is applicable to various scenarios, such as autonomous driving, drones, and security surveillance.
As can be seen from the above, this embodiment of the present application can detect foreground points from the point cloud of a scene; construct the object region corresponding to each foreground point based on the foreground point and a predetermined size, obtaining the initial localization information of the candidate object regions; perform feature extraction on all points in the point cloud based on a point cloud network, obtaining the feature set corresponding to the point cloud; construct the region feature information of the candidate object regions based on the feature set; predict the type and localization information of the candidate object regions based on the region prediction network and the region feature information, obtaining their predicted types and predicted localization information; and optimize the candidate object regions based on their initial localization information, predicted types, and predicted localization information, obtaining the target object detection region and its localization information. Because this scheme performs object detection on the point cloud data of the scene, it can improve the accuracy of object detection.
Moreover, this scheme generates a candidate object region for every foreground point in the point cloud, which avoids information loss. Since a candidate region is generated for each foreground point, every object gives rise to a corresponding candidate region, so detection is not affected by changes in object scale or by severe occlusion, which improves the effectiveness and success rate of object detection.
In addition, this scheme optimizes the candidate object regions based on their region features, which can further improve the precision and quality of object detection.
The method described in the above embodiments is illustrated below in further detail with an example.
In this embodiment, the description takes as an example an object detection apparatus integrated in a network device.
(1) The semantic segmentation network, the point cloud network, and the region prediction network are trained separately, as follows:
1. Training of the semantic segmentation network.
First, the network device can obtain a training set for the semantic segmentation network; the training set includes sample images annotated with pixel types (e.g., foreground pixels, background pixels).
The network device can train the semantic segmentation network based on this training set and a loss function. Specifically, the sample images can be semantically segmented by the network to obtain their foreground pixels; the predicted pixel types are then made to converge to the annotated pixel types under the loss function, yielding the trained semantic segmentation network.
2. Training of the point cloud network.
The network device obtains a training set for the point cloud network; the training set includes sample point clouds of sample objects or scenes. The network device can train the point cloud network on this training set.
3. Training of the region prediction network.
The network device obtains a training set for the region prediction network, which may include sample point clouds annotated with object region types and localization information. The region prediction network is trained on this set: specifically, it predicts the object region types and localization information of the sample point clouds, the predicted types are made to converge to the ground-truth types, and the predicted localization information is made to converge to the ground-truth localization information, yielding the trained region prediction network.
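As a hedged sketch of the convergence objective described above, the following loss combines a classification term over the annotated region types with a regression term over the annotated localization; the specific loss functions (cross-entropy and smooth-L1) are common choices assumed here, not quoted from the patent.

```python
import torch
import torch.nn.functional as F

def region_prediction_loss(cls_logits, reg_params, gt_types, gt_boxes):
    """cls_logits: (K, num_types) predicted type scores; reg_params, gt_boxes:
    (K, 7) predicted and annotated localization; gt_types: (K,) annotated types,
    with 0 meaning an empty region in this sketch."""
    cls_loss = F.cross_entropy(cls_logits, gt_types)
    pos = gt_types > 0                         # regress only non-empty regions
    if pos.any():
        reg_loss = F.smooth_l1_loss(reg_params[pos], gt_boxes[pos])
    else:
        reg_loss = reg_params.sum() * 0.0      # keep the graph valid without positives
    return cls_loss + reg_loss

loss = region_prediction_loss(torch.randn(8, 2), torch.randn(8, 7),
                              torch.randint(0, 2, (8,)), torch.randn(8, 7))
```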
The above training can be performed by the network device itself, or the network device can obtain the networks after they have been trained by other devices. It should be understood that the networks applied in the embodiments of the present application are not limited to the training described above and can also be trained in other ways.
(2) With the trained semantic segmentation network, point cloud network, and region prediction network, object detection can be performed based on the point cloud; see Figures 5a and 5b for details.
As shown in Figure 5a, the specific flow of an object detection method can be as follows:
501. The network device acquires an image and a point cloud of the scene.
For example, the network device can obtain the image and the point cloud of the scene from an image acquisition device and a point cloud acquisition device respectively.
502. The network device performs semantic segmentation on the image of the scene with the semantic segmentation network to obtain foreground pixels.
Referring to Figure 5b, taking an autonomous driving scene as an example, a road scene image can be collected first, and a 2D semantic segmentation network can be used to segment it, producing a segmentation result that includes foreground pixels, background pixels, and so on.
503. The network device maps the foreground pixels into the point cloud of the scene to obtain the foreground points in the point cloud.
For example, DeepLabV3 with an Xception backbone can be used as the segmentation network; the scene image is segmented through this network to obtain foreground pixels, e.g., the pixels of cars, pedestrians, and cyclists in autonomous driving. Each point in the point cloud is then projected into the scene image, and the segmentation result of the corresponding pixel is taken as the segmentation result of that point, thereby producing the foreground points of the point cloud. This approach can accurately detect the foreground points in the point cloud.
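For illustration, this projection step can be sketched as follows in NumPy; the KITTI-style 3×4 camera projection matrix and the handling of out-of-image points are assumptions of the sketch.

```python
import numpy as np

def label_points_from_image(points_xyz, seg_mask, proj_matrix):
    """points_xyz: (N, 3) LiDAR points; seg_mask: (H, W) per-pixel class ids from
    the 2D segmentation network; proj_matrix: (3, 4) camera projection matrix,
    assumed to come from the sensor calibration (e.g., KITTI-style).
    Returns (N,) per-point class ids; points outside the image get -1."""
    N = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((N, 1))])   # homogeneous coordinates (N, 4)
    uvw = homo @ proj_matrix.T                        # project to the image plane (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]                     # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    H, W = seg_mask.shape
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.full(N, -1, dtype=np.int64)
    labels[valid] = seg_mask[v[valid], u[valid]]      # copy the pixel's segmentation
    return labels   # points labeled with a foreground class become foreground points
```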
504. The network device constructs a three-dimensional candidate object region for each foreground point based on the point and a predetermined size, obtaining the initial localization information of the candidate object regions.
For example, each foreground point is taken as a center point, and the three-dimensional candidate object region corresponding to that point is generated according to the predetermined size.
The localization information of a candidate object region may include its position information, size information, and so on.
For example, referring to Figure 5b, after the foreground points are obtained, the candidate object region corresponding to each foreground point can be generated by taking the point as the center and applying the predetermined size, i.e., point-based proposal generation.
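A minimal sketch of point-based proposal generation follows; the per-class preset sizes and the zero initial orientation are hypothetical values for illustration, since the predetermined size is a design choice of the deployment rather than something fixed by the patent.

```python
import numpy as np

# Hypothetical per-class preset sizes (l, h, w) in meters, loosely KITTI-like;
# the actual predetermined sizes are a deployment design choice.
PRESET_SIZES = {"car": (3.9, 1.6, 1.6), "pedestrian": (0.8, 1.7, 0.6)}

def point_based_proposals(foreground_xyz, cls="car"):
    """foreground_xyz: (P, 3) coordinates of the foreground points. Returns
    (P, 7) proposals (x, y, z, l, h, w, angle): one box per foreground point,
    centered on the point, with the preset size and a fixed initial angle."""
    l, h, w = PRESET_SIZES[cls]
    P = foreground_xyz.shape[0]
    size_and_angle = np.tile([l, h, w, 0.0], (P, 1))    # angle initialized to 0
    return np.hstack([foreground_xyz, size_and_angle])  # initial localization info

proposals = point_based_proposals(np.random.rand(5, 3) * 20.0)   # (5, 7)
```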
For details of the candidate object regions, refer to Figures 2a and 2b and the related description above.
505. The network device performs feature extraction on all points in the point cloud through the point cloud network, obtaining the feature set corresponding to the point cloud.
Referring to Figure 5b, all points of the point cloud (B, N, 4) can be fed into PointNet++, which extracts the features of the point cloud and produces (B, N, C).
For the specific point cloud network structure and feature extraction process, refer to the description of the foregoing embodiments.
506. The network device constructs the region feature information of the candidate object regions based on the feature set.
Referring to Figure 5b, after obtaining the initial localization information of the candidate object regions and the feature set of the point cloud, the network device can generate the region feature information of the candidate object regions from the feature set of the point cloud (i.e., proposal feature generation).
For example, the network device selects multiple target points in a candidate object region; extracts the features of the target points from the feature set to obtain the first part of the feature information of the candidate object region; normalizes the position information of the target points to obtain their normalized position information; fuses the first part of the feature information with the normalized position information to obtain the fused feature information of the target points; applies a spatial transformation to the fused feature information to obtain the transformed position information of the target points; adjusts the normalized position information of the target points based on the transformed position information to obtain the second part of the feature information of the candidate object region; and fuses the first part of the feature information with the second part to obtain the region features of the candidate region.
Specifically, for the region feature generation, refer to the foregoing embodiments and the description of Figure 3.
507. The network device predicts the type and localization information of the candidate object regions based on the region prediction network and the region feature information, obtaining the predicted types and predicted localization information of the candidate object regions.
For example, referring to Figure 5b, the candidate regions can be classified (cls) and regressed (reg) by a box prediction network (Box Prediction Net), thereby predicting the type and regression parameters of each candidate object region. The regression parameters are the predicted localization information, including parameters such as 3D coordinates, length, width, height, and orientation, e.g., (x, y, z, l, h, w, angle).
508. The network device optimizes the candidate object regions based on their initial localization information, predicted types, and predicted localization information, obtaining the target object detection region and the localization information of the target object detection region.
For example, the network device can filter the candidate object regions based on their predicted types to obtain the filtered object regions, and then optimize and adjust the initial localization information of the filtered regions according to their predicted localization information, obtaining the optimized object detection regions and their localization information.
In practical applications, the object detection region can further be marked in the scene image based on the localization information of the target object detection region. For example, referring to Figure 1e, the object detection method provided in this embodiment of the present application can accurately detect the position, size, and orientation of objects on the current road in an autonomous driving scene, which benefits the decision-making and judgment of autonomous driving.
This embodiment of the present application can take the entire point cloud as input, use a PointNet++ structure to produce a feature for every point in the point cloud, generate a candidate region anchored at each point, and then refine the candidate regions using the per-point features as input, thereby producing the final detection results.
The capability of the algorithm provided in the embodiments of the present application has been tested on several datasets, for example on the open-source autonomous driving dataset KITTI. KITTI is an autonomous driving dataset containing objects of many sizes and distances and is very challenging. On KITTI, the algorithm of this embodiment surpasses all existing 3D object detection algorithms, reaching a new state of the art, and on the hard subset it far exceeds the previous best algorithm.
On the KITTI dataset, the point clouds of 7481 training images and 7518 test images were evaluated for three categories (cars, pedestrians, and cyclists). The most widely used metric, average precision (AP), was used to compare against other methods, including MV3D (Multi-View 3D object detection), AVOD (Aggregate View Object Detection), VoxelNet, F-PointNet (Frustum PointNet), and AVOD-FPN (AVOD with a feature pyramid network). Figure 5c shows the test results: the accuracy of the object detection method provided in this embodiment of the present application ("Ours" in Figure 5c) is significantly higher than that of the other methods.
To better implement the above method, an embodiment of the present application accordingly provides an object detection apparatus. The apparatus can be integrated in a network device, which may be a server, a terminal, a vehicle-mounted device, a drone, or a device such as a micro processing box.
For example, as shown in Figure 6a, the object detection apparatus may include a detection unit 601, a region construction unit 602, a feature extraction unit 603, a feature construction unit 604, a prediction unit 605, and an optimization unit 606, as follows:
the detection unit 601, configured to detect foreground points from the point cloud of a scene;
the region construction unit 602, configured to construct the candidate object region corresponding to each foreground point based on the foreground point and a predetermined size, and to determine the initial localization information of the candidate object region;
the feature extraction unit 603, configured to perform feature extraction on all points in the point cloud based on a point cloud network, obtaining the feature set corresponding to the point cloud;
the feature construction unit 604, configured to construct the region feature information of the candidate object region based on the feature set;
the prediction unit 605, configured to predict the type and localization information of the candidate object region based on the region prediction network and the region feature information, obtaining the predicted type and predicted localization information of the candidate object region;
the optimization unit 606, configured to optimize the candidate object region based on the initial localization information of the candidate object region, its predicted type, and its predicted localization information, obtaining the target object detection region and the localization information of the target object detection region.
In one embodiment, the detection unit 601 is specifically configured to:
perform semantic segmentation on the image of the scene to obtain foreground pixels;
determine the points in the point cloud of the scene that correspond to the foreground pixels as the foreground points.
In one embodiment, the region construction unit 602 is specifically configured to:
generate, with the foreground point as a center point, the candidate object region corresponding to the foreground point according to the predetermined size.
In one embodiment, referring to Figure 6b, the feature construction unit 604 specifically includes:
a selection subunit 6041, configured to select multiple target points in the candidate object region;
an extraction subunit 6042, configured to extract the features of the target points from the feature set, obtaining the first part of the feature information of the candidate object region;
a construction subunit 6043, configured to construct the second part of the feature information of the candidate object region based on the position information of the target points;
a fusion subunit 6045, configured to fuse the first part of the feature information with the second part of the feature information, obtaining the region feature information of the candidate object region.
In one embodiment, the construction subunit 6043 is specifically configured to:
normalize the position information of the target points to obtain the normalized position information of the target points;
fuse the first part of the feature information with the normalized position information to obtain the fused feature information of the target points;
perform a spatial transformation on the fused feature information of the target points to obtain the transformed position information;
adjust the normalized position information of the target points based on the transformed position information to obtain the second part of the feature information of the candidate object region.
In one embodiment, referring to Figure 6c, the point cloud network includes a first sampling network and a second sampling network connected to the first sampling network; the feature extraction unit 603 specifically includes:
a down-sampling subunit 6031, configured to perform feature down-sampling on all points in the point cloud through the first sampling network, obtaining the initial features of the point cloud;
an up-sampling subunit 6032, configured to up-sample the initial features through the second sampling network, obtaining the feature set of the point cloud.
In one embodiment, the first sampling network includes several sequentially connected set abstraction layers, and the second sampling network includes several sequentially connected feature propagation layers in one-to-one correspondence with the set abstraction layers of the first sampling network.
The down-sampling subunit 6031 is specifically configured to:
divide the points of the point cloud into local regions sequentially through the set abstraction layers and extract the features of the local-region center points, obtaining the initial features of the point cloud;
input the initial features of the point cloud to the second sampling network.
The up-sampling subunit 6032 is specifically configured to:
determine the output features of the previous layer, together with the output features of the set abstraction layer corresponding to the current feature propagation layer, as the current input features of the current feature propagation layer;
up-sample the current input features through the current feature propagation layer, obtaining the feature set of the point cloud.
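For illustration of this data flow only, the following sketch wires two stand-in set abstraction (SA) layers to two stand-in feature propagation (FP) layers with one-to-one skip connections. The SA and FP bodies here are deliberate simplifications (strided subsampling, repetition-based upsampling, single linear layers); a real implementation such as PointNet++ uses farthest point sampling, neighborhood grouping, and distance-weighted interpolation instead.

```python
import torch
import torch.nn as nn

class SASketch(nn.Module):
    """Stand-in for a set abstraction layer: keeps every stride-th point as a
    local-region center and encodes its feature with a single linear layer."""
    def __init__(self, in_c, out_c, stride=4):
        super().__init__()
        self.stride, self.mlp = stride, nn.Linear(in_c, out_c)

    def forward(self, xyz, feat):
        xyz, feat = xyz[:, ::self.stride], feat[:, ::self.stride]
        return xyz, torch.relu(self.mlp(feat))

class FPSketch(nn.Module):
    """Stand-in for a feature propagation layer: upsamples by repetition and
    concatenates the skip feature from the matching set abstraction layer."""
    def __init__(self, in_c, skip_c, out_c):
        super().__init__()
        self.mlp = nn.Linear(in_c + skip_c, out_c)

    def forward(self, feat, skip_feat):
        factor = skip_feat.shape[1] // feat.shape[1]
        up = feat.repeat_interleave(factor, dim=1)        # back to the denser level
        return torch.relu(self.mlp(torch.cat([up, skip_feat], dim=-1)))

# Encoder (first sampling network) -> decoder (second sampling network), with
# one-to-one skip links between SA layers and FP layers.
xyz, f0 = torch.randn(2, 1024, 3), torch.randn(2, 1024, 4)
sa1, sa2 = SASketch(4, 64), SASketch(64, 128)
fp2, fp1 = FPSketch(128, 64, 64), FPSketch(64, 4, 32)
xyz1, f1 = sa1(xyz, f0)           # down-sample: 1024 -> 256 points
xyz2, f2 = sa2(xyz1, f1)          # down-sample: 256 -> 64 points
g1 = fp2(f2, f1)                  # up-sample to 256 points, fused with the SA1 output
feature_set = fp1(g1, f0)         # up-sample to 1024 points: (B, N, 32) feature set
```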
In one embodiment, the region prediction network includes a feature extraction network, a classification network connected to the feature extraction network, and a regression network connected to the feature extraction network; referring to Figure 6d, the prediction unit 605 specifically includes:
a global feature extraction subunit 6051, configured to perform feature extraction on the region feature information through the feature extraction network, obtaining the global feature information of the candidate object region;
a classification subunit 6052, configured to classify the candidate object region based on the classification network and the global feature information, obtaining the predicted type of the candidate region;
a regression subunit 6053, configured to localize the candidate object region based on the regression network and the global feature information, obtaining the predicted localization information of the candidate object region.
In one embodiment, the feature extraction network includes several sequentially connected set abstraction layers, the classification network includes several sequentially connected fully connected layers, and the regression network includes several sequentially connected fully connected layers;
the global feature extraction subunit 6051 is specifically configured to perform feature extraction on the region feature information sequentially through the set abstraction layers of the feature extraction network, obtaining the global feature information of the candidate object region.
In one embodiment, referring to Figure 6e, the optimization unit 606 specifically includes:
a filtering subunit 6061, configured to filter the candidate object regions based on their predicted types, obtaining the filtered object regions;
an optimization subunit 6062, configured to optimize and adjust the initial localization information of the filtered object regions according to their predicted localization information, obtaining the target object detection region and the localization information of the target object detection region.
In specific implementations, the above units can be implemented as independent entities, or combined arbitrarily and implemented as one or several entities; for the specific implementation of each unit, refer to the foregoing method embodiments, which are not repeated here.
As can be seen from the above, the object detection apparatus of this embodiment can detect foreground points from the point cloud of a scene through the detection unit 601; the region construction unit 602 then constructs the candidate object region corresponding to each foreground point based on the foreground point and a predetermined size, obtaining the initial localization information of the candidate object regions; the feature extraction unit 603 performs feature extraction on all points in the point cloud based on a point cloud network, obtaining the feature set corresponding to the point cloud; the feature construction unit 604 constructs the region feature information of the candidate object regions based on the feature set; the prediction unit 605 predicts the type and localization information of the candidate object regions based on the region prediction network and the region feature information, obtaining their predicted types and predicted localization information; and the optimization unit 606 optimizes the candidate object regions based on their initial localization information, predicted types, and predicted localization information, obtaining the target object detection region and its localization information. Because this scheme performs object detection on the point cloud data of the scene, generates a candidate object region for every foreground point, and optimizes the candidate object regions based on their region features, it can greatly improve the precision of object detection and is especially suitable for 3D object detection.
In addition, an embodiment of the present application further provides a network device. Figure 7 shows a schematic structural diagram of the network device involved in the embodiments of the present application. Specifically:
The network device may include a processor 701 with one or more processing cores, a memory 702 with one or more computer-readable storage media, a power supply 703, an input unit 704, and other components. Those skilled in the art will understand that the network device structure shown in Figure 7 does not limit the network device; the device may include more or fewer components than shown, combine certain components, or arrange the components differently. Specifically:
The processor 701 is the control center of the network device. It connects the parts of the entire network device through various interfaces and lines, and performs the various functions of the network device and processes data by running or executing the software programs and/or modules stored in the memory 702 and invoking the data stored in the memory 702, thereby monitoring the network device as a whole. Optionally, the processor 701 may include one or more processing cores. Preferably, the processor 701 may integrate an application processor, which mainly handles the operating system, the user interface, application programs, and so on, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 701.
The memory 702 can be used to store software programs and modules; the processor 701 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area: the program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area can store data created through the use of the network device, and so on. In addition, the memory 702 may include a high-speed random access memory and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Correspondingly, the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.
The network device further includes a power supply 703 that supplies power to the components. Preferably, the power supply 703 can be logically connected to the processor 701 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system. The power supply 703 may further include any components such as one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
The network device may further include an input unit 704, which can be used to receive input digital or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the network device may further include a display unit and the like, which are not described here. Specifically, in this embodiment, the processor 701 loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and runs the application programs stored in the memory 702, thereby implementing various functions as follows:
detecting foreground points from the point cloud of a scene; constructing the candidate object region corresponding to each foreground point based on the foreground point and a predetermined size, obtaining the initial localization information of the candidate object regions; performing feature extraction on all points in the point cloud based on a point cloud network, obtaining the feature set corresponding to the point cloud; constructing the region feature information of the candidate object regions based on the feature set; predicting the type and localization information of the candidate object regions based on the region prediction network and the region feature information, obtaining their predicted types and predicted localization information; and optimizing the candidate object regions based on the initial localization information of the candidate regions, their predicted types, and their predicted localization information, obtaining the target object detection region and the localization information of the target object detection region.
For the specific implementation of the above operations, refer to the foregoing embodiments, which are not repeated here.
As can be seen from the above, the network device of this embodiment detects foreground points from the point cloud of a scene; constructs the candidate object region corresponding to each foreground point based on the foreground point and a predetermined size, obtaining the initial localization information of the candidate object regions; performs feature extraction on all points in the point cloud based on a point cloud network, obtaining the feature set corresponding to the point cloud; constructs the region feature information of the candidate object regions based on the feature set; predicts the type and localization information of the candidate object regions based on the region prediction network and the region feature information, obtaining their predicted types and predicted localization information; and optimizes the candidate object regions based on their initial localization information, predicted types, and predicted localization information, obtaining the target object detection region and its localization information. Because this scheme performs object detection on the point cloud data of the scene, generates a candidate object region for every foreground point, and optimizes the candidate object regions based on their region features, it can greatly improve the precision of object detection and is especially suitable for 3D object detection.
A person of ordinary skill in the art will understand that all or part of the steps of the various methods of the foregoing embodiments can be completed by instructions, or by instructions controlling the relevant hardware; the instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application further provides a storage medium storing multiple instructions that can be loaded by a processor to execute the steps of any object detection method provided in the embodiments of the present application. For example, the instructions can perform the following steps:
detecting foreground points from the point cloud of a scene; constructing the candidate object region corresponding to each foreground point based on the foreground point and a predetermined size, obtaining the initial localization information of the candidate object regions; performing feature extraction on all points in the point cloud based on a point cloud network, obtaining the feature set corresponding to the point cloud; constructing the region feature information of the candidate object regions based on the feature set; predicting the type and localization information of the candidate object regions based on the region prediction network and the region feature information, obtaining their predicted types and predicted localization information; and optimizing the candidate object regions based on their initial localization information, predicted types, and predicted localization information, obtaining the target object detection region and the localization information of the target object detection region.
For the specific implementation of the above operations, refer to the foregoing embodiments, which are not repeated here.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Because the instructions stored in the storage medium can execute the steps of any object detection method provided in the embodiments of the present application, they can achieve the beneficial effects achievable by any such method; see the foregoing embodiments for details, which are not repeated here.
The object detection method, apparatus, network device, and storage medium provided in the embodiments of the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core ideas. Meanwhile, a person skilled in the art may, based on the ideas of the present application, make changes to the specific implementations and the scope of application. In conclusion, the content of this specification should not be construed as a limitation on the present application.

Claims (23)

1. An object detection method, executed by a network device, the method comprising:
    detecting foreground points from a point cloud of a scene;
    constructing a candidate object region corresponding to each foreground point based on the foreground point and a predetermined size, and determining initial localization information of the candidate object region;
    performing feature extraction on all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud;
    constructing region feature information of the candidate object region based on the feature set;
    predicting a type and localization information of the candidate object region based on a region prediction network and the region feature information to obtain a predicted type and predicted localization information of the candidate object region;
    optimizing the candidate object region based on the initial localization information, the predicted type, and the predicted localization information to obtain a target object detection region and localization information of the target object detection region.
  2. 如权利要求1所述的物体检测方法,所述从场景的点云中检测出前景点,包括:The object detection method according to claim 1, wherein the detecting the front scenic spot from the point cloud of the scene includes:
    对所述场景的图像进行语义分割,得到前景像素;Perform semantic segmentation on the image of the scene to obtain foreground pixels;
    将所述场景的点云中与所述前景像素对应的点确定为所述前景点。The point corresponding to the foreground pixel in the point cloud of the scene is determined as the front scenic spot.
  3. The object detection method according to claim 1, wherein the constructing a candidate object area corresponding to the foreground point based on the foreground point and a predetermined size comprises:
    generating, with the foreground point as a center point, the candidate object area corresponding to the foreground point according to the predetermined size.
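Concretely, this box construction attaches one fixed-size, axis-aligned 3D box to every foreground point, so each foreground point proposes exactly one candidate region. A minimal helper; the (w, l, h) value is an assumed placeholder prior.

    import numpy as np

    def candidate_box(fg_point, size=(1.6, 3.9, 1.5)):   # size: assumed (w, l, h) prior
        # Returns the initial positioning information [x, y, z, w, l, h] of the
        # candidate object area centered on the foreground point.
        return np.concatenate([np.asarray(fg_point, float), np.asarray(size, float)])

    print(candidate_box([1.0, 2.0, 0.5]))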
  4. The object detection method according to claim 1, wherein the constructing area feature information of the candidate object area based on the feature set comprises:
    selecting a plurality of target points in the candidate object area;
    extracting features of the target points from the feature set to obtain first partial feature information of the candidate object area;
    constructing second partial feature information of the candidate object area based on position information of the target points;
    fusing the first partial feature information with the second partial feature information to obtain the area feature information of the candidate object area.
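In other words, each region descriptor in claim 4 has two halves: per-point features looked up in the point cloud's feature set for a fixed number of sampled target points, and features derived from those points' positions. The sketch below assumes concatenation as the fusion operator and region-relative coordinates as the raw position feature; the claim fixes neither, and claim 5 elaborates the position half further.

    import numpy as np

    def region_feature(center, size, points, feature_set, k=64, seed=0):
        # points: (N, 3); feature_set: (N, C) per-point features from the point
        # cloud network. Returns a (k, C + 3) fused region feature.
        rng = np.random.default_rng(seed)
        inside = np.flatnonzero(
            np.all(np.abs(points - center) <= np.asarray(size) / 2, axis=1))
        if len(inside) == 0:
            return np.zeros((k, feature_set.shape[1] + 3))
        target = rng.choice(inside, size=k, replace=len(inside) < k)  # target points
        part1 = feature_set[target]               # first partial feature information
        part2 = points[target] - center           # second part: relative positions
        return np.concatenate([part1, part2], axis=1)   # fusion by concatenation

    pts = np.random.default_rng(1).uniform(-5, 5, (500, 3))
    feats = np.ones((500, 8))
    print(region_feature(np.zeros(3), (2.0, 2.0, 2.0), pts, feats).shape)  # (64, 11)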
  5. The object detection method according to claim 4, wherein the constructing second partial feature information of the candidate object area based on the position information of the target points comprises:
    normalizing the position information of the target points to obtain normalized position information of the target points;
    fusing the first partial feature information with the normalized position information to obtain fused feature information of the target points;
    performing a spatial transformation on the fused feature information of the target points to obtain transformed position information;
    adjusting the normalized position information of the target points based on the transformed position information, to obtain the second partial feature information of the candidate object area.
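This normalize-fuse-transform-adjust sequence is reminiscent of the T-Net-style alignment used in point networks: positions are canonicalized, combined with the point features, a transform is regressed from the pooled result, and the positions are adjusted by it. A toy PyTorch sketch; the MLP widths, the pooling choice, and the 3x3 transform parameterization are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PositionRefiner(nn.Module):
        # Toy analogue of claim 5: normalize positions, fuse with point features,
        # regress a spatial transform, and adjust the positions with it.
        def __init__(self, feat_dim):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim + 3, 64), nn.ReLU(),
                nn.Linear(64, 9))                        # predicts a 3x3 transform

        def forward(self, positions, part1_feats):
            # positions: (B, k, 3) target-point coords; part1_feats: (B, k, C)
            norm_pos = positions - positions.mean(dim=1, keepdim=True)
            fused = torch.cat([part1_feats, norm_pos], dim=-1)
            t = self.mlp(fused.mean(dim=1)).view(-1, 3, 3)
            t = t + torch.eye(3, device=t.device)        # bias toward the identity
            return torch.bmm(norm_pos, t)                # adjusted position features

    out = PositionRefiner(16)(torch.randn(2, 64, 3), torch.randn(2, 64, 16))
    print(out.shape)                                     # torch.Size([2, 64, 3])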
  6. The object detection method according to claim 1, wherein the point cloud network comprises a first sampling network and a second sampling network connected to the first sampling network; and the performing feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set of the point cloud comprises:
    performing feature down-sampling on all points in the point cloud through the first sampling network to obtain initial features of the point cloud;
    performing up-sampling on the initial features through the second sampling network to obtain the feature set of the point cloud.
  7. The object detection method according to claim 6, wherein the first sampling network comprises a plurality of sequentially connected set abstraction layers, and the second sampling network comprises a plurality of sequentially connected feature propagation layers in one-to-one correspondence with the set abstraction layers in the first sampling network;
    the performing feature down-sampling on all points in the point cloud through the first sampling network to obtain the initial features of the point cloud comprises:
    dividing the points in the point cloud into local areas sequentially through the plurality of set abstraction layers, and extracting features of center points of the local areas, to obtain the initial features of the point cloud;
    inputting the initial features of the point cloud into the second sampling network; and
    the performing up-sampling on the initial features through the second sampling network to obtain the feature set of the point cloud comprises:
    determining output features of a previous layer and input features of the set abstraction layer corresponding to a current feature propagation layer as current input features of the current feature propagation layer;
    performing up-sampling on the current input features through the current feature propagation layer to obtain the feature set of the point cloud.
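The two sampling networks of claims 6-7 mirror the familiar encoder-decoder pattern of set abstraction (down-sampling) and feature propagation (up-sampling, fused with the matching abstraction layer's input as a skip connection). A heavily simplified, runnable PyTorch sketch: real set abstraction would use farthest-point sampling and ball-query grouping, which are replaced here by strided slicing and nearest-neighbor interpolation purely for brevity.

    import torch
    import torch.nn as nn

    class ToySetAbstraction(nn.Module):
        def __init__(self, c_in, c_out, stride=4):
            super().__init__()
            self.stride = stride
            self.mlp = nn.Sequential(nn.Linear(c_in, c_out), nn.ReLU())

        def forward(self, xyz, feats):
            # "Local area division" stand-in: keep every stride-th point as a center.
            return xyz[:, ::self.stride], self.mlp(feats[:, ::self.stride])

    class ToyFeaturePropagation(nn.Module):
        def __init__(self, c_in, c_skip, c_out):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(c_in + c_skip, c_out), nn.ReLU())

        def forward(self, xyz_dense, skip_feats, xyz_sparse, sparse_feats):
            # Up-sample by copying each dense point's nearest sparse neighbor's feature,
            # then fuse with the matching abstraction layer's input (the skip path).
            idx = torch.cdist(xyz_dense, xyz_sparse).argmin(dim=-1)
            up = torch.gather(sparse_feats, 1,
                              idx.unsqueeze(-1).expand(-1, -1, sparse_feats.shape[-1]))
            return self.mlp(torch.cat([up, skip_feats], dim=-1))

    xyz = torch.randn(2, 256, 3)
    sa1, sa2 = ToySetAbstraction(3, 32), ToySetAbstraction(32, 64)
    fp2, fp1 = ToyFeaturePropagation(64, 32, 32), ToyFeaturePropagation(32, 3, 16)
    xyz1, f1 = sa1(xyz, xyz)                  # raw coordinates as input features
    xyz2, f2 = sa2(xyz1, f1)                  # "initial features" of the point cloud
    g1 = fp2(xyz1, f1, xyz2, f2)
    feature_set = fp1(xyz, xyz, xyz1, g1)     # (2, 256, 16) per-point feature set
    print(feature_set.shape)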
  8. The object detection method according to claim 1, wherein the area prediction network comprises a feature extraction network, a classification network connected to the feature extraction network, and a regression network connected to the feature extraction network;
    the predicting the type and positioning information of the candidate object area based on the area prediction network and the area feature information to obtain the predicted type and predicted positioning information of the candidate object area comprises:
    performing feature extraction on the area feature information through the feature extraction network to obtain global feature information of the candidate object area;
    classifying the candidate object area based on the classification network and the global feature information to obtain the predicted type of the candidate object area;
    locating the candidate object area based on the regression network and the global feature information to obtain the predicted positioning information of the candidate object area.
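Claims 8-9 thus describe one shared extractor feeding two fully connected heads: one classifies the region (predicted type), the other regresses its box (predicted positioning information). A compact PyTorch sketch; the layer widths, the max-pooled global feature, and the 7-value box output (center, size, heading) are assumptions for illustration.

    import torch
    import torch.nn as nn

    class RegionPredictionNet(nn.Module):
        def __init__(self, c_in=19, num_classes=4):
            super().__init__()
            self.extract = nn.Sequential(nn.Linear(c_in, 128), nn.ReLU(),
                                         nn.Linear(128, 256), nn.ReLU())
            self.cls_head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                          nn.Linear(128, num_classes))  # predicted type
            self.reg_head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                          nn.Linear(128, 7))    # box residuals

        def forward(self, region_feats):
            # region_feats: (B, k, c_in) fused region features from claim 4.
            global_feat = self.extract(region_feats).max(dim=1).values
            return self.cls_head(global_feat), self.reg_head(global_feat)

    logits, boxes = RegionPredictionNet()(torch.randn(8, 64, 19))
    print(logits.shape, boxes.shape)          # (8, 4) and (8, 7)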
  9. The object detection method according to claim 8, wherein the feature extraction network comprises a plurality of sequentially connected set abstraction layers, the classification network comprises a plurality of sequentially connected fully connected layers, and the regression network comprises a plurality of sequentially connected fully connected layers;
    the performing feature extraction on the area feature information through the feature extraction network to obtain the global feature information of the candidate object area comprises:
    performing feature extraction on the area feature information sequentially through each set abstraction layer in the feature extraction network, to obtain the global feature information of the candidate object area.
  10. The object detection method according to claim 1, wherein the optimizing the candidate object area based on the initial positioning information, the predicted type, and the predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area comprises:
    screening the candidate object areas based on the predicted type to obtain screened object areas;
    optimizing and adjusting the initial positioning information of the screened object areas according to the predicted positioning information of the screened object areas, to obtain the target object detection area and the positioning information of the target object detection area.
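This post-processing thins the dense per-point proposals into final detections: candidates whose predicted type is background or low-confidence are screened out, and each survivor's initial box is corrected by its regressed residuals. A sketch with a score threshold and a residual-style update as assumed details (the claim fixes neither); in practice a non-maximum suppression step over overlapping boxes would typically follow.

    import numpy as np

    def refine_detections(init_boxes, fg_scores, residuals, score_thresh=0.5):
        # init_boxes: (M, 6) [x, y, z, w, l, h] initial positioning information;
        # fg_scores: (M,) foreground probability from the predicted type;
        # residuals: (M, 6) predicted positioning residuals.
        keep = fg_scores > score_thresh              # screening by predicted type
        boxes = init_boxes[keep].copy()
        boxes[:, :3] += residuals[keep, :3]          # shift the centers
        boxes[:, 3:] *= np.exp(residuals[keep, 3:])  # rescale the sizes
        return boxes, fg_scores[keep]

    init = np.zeros((3, 6)) + [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
    boxes, scores = refine_detections(init, np.array([0.9, 0.2, 0.7]), np.zeros((3, 6)))
    print(boxes.shape, scores)                       # (2, 6) [0.9 0.7]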
  11. An object detection apparatus, comprising:
    a detection unit, configured to detect foreground points from a point cloud of a scene;
    an area construction unit, configured to construct a candidate object area corresponding to the foreground point based on the foreground point and a predetermined size, and determine initial positioning information of the candidate object area;
    a feature extraction unit, configured to perform feature extraction on all points in the point cloud based on a point cloud network, to obtain a feature set corresponding to the point cloud;
    a feature construction unit, configured to construct area feature information of the candidate object area based on the feature set;
    a prediction unit, configured to predict a type and positioning information of the candidate object area based on an area prediction network and the area feature information, to obtain a predicted type and predicted positioning information of the candidate object area;
    an optimization unit, configured to optimize the candidate object area based on the initial positioning information, the predicted type, and the predicted positioning information, to obtain a target object detection area and positioning information of the target object detection area.
  12. The object detection apparatus according to claim 11, wherein the detection unit is specifically configured to:
    perform semantic segmentation on an image of the scene to obtain foreground pixels;
    determine points in the point cloud of the scene that correspond to the foreground pixels as the foreground points.
  13. The object detection apparatus according to claim 11, wherein the area construction unit is specifically configured to:
    generate, with the foreground point as a center point, the candidate object area corresponding to the foreground point according to the predetermined size.
  14. The object detection apparatus according to claim 11, wherein the feature construction unit specifically comprises:
    a selection subunit, configured to select a plurality of target points in the candidate object area;
    an extraction subunit, configured to extract features of the target points from the feature set to obtain first partial feature information of the candidate object area;
    a construction subunit, configured to construct second partial feature information of the candidate object area based on position information of the target points;
    a fusion subunit, configured to fuse the first partial feature information with the second partial feature information to obtain the area feature information of the candidate object area.
  15. The object detection apparatus according to claim 14, wherein the construction subunit is specifically configured to:
    normalize the position information of the target points to obtain normalized position information of the target points;
    fuse the first partial feature information with the normalized position information to obtain fused feature information of the target points;
    perform a spatial transformation on the fused feature information of the target points to obtain transformed position information;
    adjust the normalized position information of the target points based on the transformed position information, to obtain the second partial feature information of the candidate object area.
  16. The object detection apparatus according to claim 11, wherein the point cloud network comprises a first sampling network and a second sampling network connected to the first sampling network, and the feature extraction unit specifically comprises:
    a down-sampling subunit, configured to perform feature down-sampling on all points in the point cloud through the first sampling network to obtain initial features of the point cloud;
    an up-sampling subunit, configured to perform up-sampling on the initial features through the second sampling network to obtain the feature set of the point cloud.
  17. The object detection apparatus according to claim 16, wherein the first sampling network comprises a plurality of sequentially connected set abstraction layers, and the second sampling network comprises a plurality of sequentially connected feature propagation layers in one-to-one correspondence with the set abstraction layers in the first sampling network;
    the down-sampling subunit is specifically configured to:
    divide the points in the point cloud into local areas sequentially through the set abstraction layers, and extract features of center points of the local areas, to obtain the initial features of the point cloud;
    input the initial features of the point cloud into the second sampling network; and
    the up-sampling subunit is specifically configured to:
    determine output features of a previous layer and input features of the set abstraction layer corresponding to a current feature propagation layer as current input features of the current feature propagation layer;
    perform up-sampling on the current input features through the current feature propagation layer to obtain the feature set of the point cloud.
  18. The object detection apparatus according to claim 11, wherein the area prediction network comprises a feature extraction network, a classification network connected to the feature extraction network, and a regression network connected to the feature extraction network, and the prediction unit specifically comprises:
    a global feature extraction subunit, configured to perform feature extraction on the area feature information through the feature extraction network to obtain global feature information of the candidate object area;
    a classification subunit, configured to classify the candidate object area based on the classification network and the global feature information to obtain the predicted type of the candidate object area;
    a regression subunit, configured to locate the candidate object area based on the regression network and the global feature information to obtain the predicted positioning information of the candidate object area.
  19. The object detection apparatus according to claim 18, wherein the feature extraction network comprises a plurality of sequentially connected set abstraction layers, the classification network comprises a plurality of sequentially connected fully connected layers, and the regression network comprises a plurality of sequentially connected fully connected layers;
    the global feature extraction subunit is specifically configured to perform feature extraction on the area feature information sequentially through each set abstraction layer in the feature extraction network, to obtain the global feature information of the candidate object area.
  20. The object detection apparatus according to claim 11, wherein the optimization unit specifically comprises:
    a screening subunit, configured to screen the candidate object areas based on the predicted type of the candidate object areas to obtain screened object areas;
    an optimization subunit, configured to optimize and adjust the initial positioning information of the screened object areas according to the predicted positioning information of the screened object areas, to obtain the target object detection area and the positioning information of the target object detection area.
  21. A storage medium, storing a plurality of instructions, the instructions being suitable for being loaded by a processor to perform the steps in the object detection method according to any one of claims 1 to 10.
  22. A network device, comprising a memory and a processor, the memory storing a plurality of instructions, and the processor loading the instructions in the memory to perform the steps in the object detection method according to any one of claims 1 to 10.
  23. A computer program product, comprising instructions that, when run on a computer, cause the computer to perform the steps of the object detection method according to any one of claims 1 to 10.
PCT/CN2020/077721 2019-04-03 2020-03-04 Object detection method and apparatus, and network device and storage medium WO2020199834A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910267019.5A CN110032962B (en) 2019-04-03 2019-04-03 Object detection method, device, network equipment and storage medium
CN201910267019.5 2019-04-03

Publications (1)

Publication Number Publication Date
WO2020199834A1 (en)

Family

ID=67237387

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077721 WO2020199834A1 (en) 2019-04-03 2020-03-04 Object detection method and apparatus, and network device and storage medium

Country Status (2)

Country Link
CN (1) CN110032962B (en)
WO (1) WO2020199834A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032962B (en) * 2019-04-03 2022-07-08 腾讯科技(深圳)有限公司 Object detection method, device, network equipment and storage medium
CN110400304B (en) * 2019-07-25 2023-12-12 腾讯科技(深圳)有限公司 Object detection method, device, equipment and storage medium based on deep learning
JPWO2021024805A1 (en) 2019-08-06 2021-02-11
CN110837789B (en) * 2019-10-31 2023-01-20 北京奇艺世纪科技有限公司 Method and device for detecting object, electronic equipment and medium
EP4073688A4 (en) * 2019-12-12 2023-01-25 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Target detection method, device, terminal device, and medium
CN111144304A (en) * 2019-12-26 2020-05-12 上海眼控科技股份有限公司 Vehicle target detection model generation method, vehicle target detection method and device
CN111209840B (en) * 2019-12-31 2022-02-18 浙江大学 3D target detection method based on multi-sensor data fusion
CN111145174B (en) * 2020-01-02 2022-08-09 南京邮电大学 3D target detection method for point cloud screening based on image semantic features
CN110807461B (en) * 2020-01-08 2020-06-02 深圳市越疆科技有限公司 Target position detection method
CN111260773B (en) * 2020-01-20 2023-10-13 深圳市普渡科技有限公司 Three-dimensional reconstruction method, detection method and detection system for small obstacle
CN111340766B (en) * 2020-02-21 2024-06-11 北京市商汤科技开发有限公司 Target object detection method, device, equipment and storage medium
CN113496160B (en) * 2020-03-20 2023-07-11 百度在线网络技术(北京)有限公司 Three-dimensional object detection method, three-dimensional object detection device, electronic equipment and storage medium
CN111444839B (en) * 2020-03-26 2023-09-08 北京经纬恒润科技股份有限公司 Target detection method and system based on laser radar
CN111578951B (en) * 2020-04-30 2022-11-08 阿波罗智能技术(北京)有限公司 Method and device for generating information in automatic driving
CN112215861A (en) * 2020-09-27 2021-01-12 深圳市优必选科技股份有限公司 Football detection method and device, computer readable storage medium and robot
CN112183330B (en) * 2020-09-28 2022-06-28 北京航空航天大学 Target detection method based on point cloud
WO2022126523A1 (en) * 2020-12-17 2022-06-23 深圳市大疆创新科技有限公司 Object detection method, device, movable platform, and computer-readable storage medium
CN112598635B (en) * 2020-12-18 2024-03-12 武汉大学 Point cloud 3D target detection method based on symmetric point generation
CN112734931B (en) * 2020-12-31 2021-12-07 罗普特科技集团股份有限公司 Method and system for assisting point cloud target detection
CN113312983B (en) * 2021-05-08 2023-09-05 华南理工大学 Semantic segmentation method, system, device and medium based on multi-mode data fusion
CN113989188A (en) * 2021-09-26 2022-01-28 华为技术有限公司 Object detection method and related equipment thereof
CN114898094B (en) * 2022-04-22 2024-07-12 湖南大学 Point cloud upsampling method and device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017155970A1 (en) * 2016-03-11 2017-09-14 Kaarta, Inc. Laser scanner with real-time, online ego-motion estimation
CN109410238B (en) * 2018-09-20 2021-10-26 中国科学院合肥物质科学研究院 Wolfberry identification and counting method based on PointNet + + network
CN109410307B (en) * 2018-10-16 2022-09-20 大连理工大学 Scene point cloud semantic segmentation method
CN109523552B (en) * 2018-10-24 2021-11-02 青岛智能产业技术研究院 Three-dimensional object detection method based on viewing cone point cloud

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150339541A1 (en) * 2014-05-22 2015-11-26 Nokia Technologies Oy Point cloud matching method
CN108010036A (en) * 2017-11-21 2018-05-08 江南大学 A kind of object symmetry axis detection method based on RGB-D cameras
CN109242951A (en) * 2018-08-06 2019-01-18 宁波盈芯信息科技有限公司 A kind of face's real-time three-dimensional method for reconstructing
CN109345510A (en) * 2018-09-07 2019-02-15 百度在线网络技术(北京)有限公司 Object detecting method, device, equipment, storage medium and vehicle
CN109543601A (en) * 2018-11-21 2019-03-29 电子科技大学 A kind of unmanned vehicle object detection method based on multi-modal deep learning
CN110032962A (en) * 2019-04-03 2019-07-19 腾讯科技(深圳)有限公司 A kind of object detecting method, device, the network equipment and storage medium

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381837A (en) * 2020-11-12 2021-02-19 联想(北京)有限公司 Image processing method and electronic equipment
CN112633376A (en) * 2020-12-24 2021-04-09 南京信息工程大学 Point cloud data ground feature classification method and system based on deep learning and storage medium
CN112766170B (en) * 2021-01-21 2024-04-16 广西财经学院 Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN112766170A (en) * 2021-01-21 2021-05-07 广西财经学院 Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN112884884A (en) * 2021-02-06 2021-06-01 罗普特科技集团股份有限公司 Candidate region generation method and system
CN112862017A (en) * 2021-04-01 2021-05-28 北京百度网讯科技有限公司 Point cloud data labeling method, device, equipment and medium
CN112862017B (en) * 2021-04-01 2023-08-01 北京百度网讯科技有限公司 Point cloud data labeling method, device, equipment and medium
CN113205531B (en) * 2021-04-30 2024-03-08 北京云圣智能科技有限责任公司 Three-dimensional point cloud segmentation method, device and server
CN113205531A (en) * 2021-04-30 2021-08-03 北京云圣智能科技有限责任公司 Three-dimensional point cloud segmentation method and device and server
CN113240656A (en) * 2021-05-24 2021-08-10 浙江商汤科技开发有限公司 Visual positioning method and related device and equipment
CN113674348A (en) * 2021-05-28 2021-11-19 中国科学院自动化研究所 Object grabbing method, device and system
CN113674348B (en) * 2021-05-28 2024-03-15 中国科学院自动化研究所 Object grabbing method, device and system
CN113256793A (en) * 2021-05-31 2021-08-13 浙江科技学院 Three-dimensional data processing method and system
WO2023035822A1 (en) * 2021-09-13 2023-03-16 上海芯物科技有限公司 Target detection method and apparatus, and device and storage medium
CN114372944A (en) * 2021-12-30 2022-04-19 深圳大学 Multi-mode and multi-scale fusion candidate region generation method and related device
CN114372944B (en) * 2021-12-30 2024-05-17 深圳大学 Multi-mode and multi-scale fused candidate region generation method and related device
CN114359561A (en) * 2022-01-10 2022-04-15 北京百度网讯科技有限公司 Target detection method and training method and device of target detection model
CN114092478A (en) * 2022-01-21 2022-02-25 合肥中科类脑智能技术有限公司 Anomaly detection method
CN114549958A (en) * 2022-02-24 2022-05-27 四川大学 Night and disguised target detection method based on context information perception mechanism
CN114549958B (en) * 2022-02-24 2023-08-04 四川大学 Night and camouflage target detection method based on context information perception mechanism
CN114820465B (en) * 2022-04-06 2024-04-26 合众新能源汽车股份有限公司 Point cloud detection model training method and device, electronic equipment and storage medium
CN114820465A (en) * 2022-04-06 2022-07-29 合众新能源汽车有限公司 Point cloud detection model training method and device, electronic equipment and storage medium
CN115937644A (en) * 2022-12-15 2023-04-07 清华大学 Point cloud feature extraction method and device based on global and local fusion
CN115937644B (en) * 2022-12-15 2024-01-02 清华大学 Point cloud feature extraction method and device based on global and local fusion
CN116229388A (en) * 2023-03-27 2023-06-06 哈尔滨市科佳通用机电股份有限公司 Method, system and equipment for detecting motor car foreign matters based on target detection network
CN116229388B (en) * 2023-03-27 2023-09-12 哈尔滨市科佳通用机电股份有限公司 Method, system and equipment for detecting motor car foreign matters based on target detection network
CN116912488B (en) * 2023-06-14 2024-02-13 中国科学院自动化研究所 Three-dimensional panorama segmentation method and device based on multi-view camera
CN116912488A (en) * 2023-06-14 2023-10-20 中国科学院自动化研究所 Three-dimensional panorama segmentation method and device based on multi-view camera
CN116912238B (en) * 2023-09-11 2023-11-28 湖北工业大学 Weld joint pipeline identification method and system based on multidimensional identification network cascade fusion
CN116912238A (en) * 2023-09-11 2023-10-20 湖北工业大学 Weld joint pipeline identification method and system based on multidimensional identification network cascade fusion
CN117475397A (en) * 2023-12-26 2024-01-30 安徽蔚来智驾科技有限公司 Target annotation data acquisition method, medium and device based on multi-mode sensor
CN117475397B (en) * 2023-12-26 2024-03-22 安徽蔚来智驾科技有限公司 Target annotation data acquisition method, medium and device based on multi-mode sensor

Also Published As

Publication number Publication date
CN110032962B (en) 2022-07-08
CN110032962A (en) 2019-07-19

Similar Documents

Publication Publication Date Title
WO2020199834A1 (en) Object detection method and apparatus, and network device and storage medium
WO2020207166A1 (en) Object detection method and apparatus, electronic device, and storage medium
US10078790B2 (en) Systems for generating parking maps and methods thereof
Du et al. Car detection for autonomous vehicle: LIDAR and vision fusion approach through deep learning framework
US9142011B2 (en) Shadow detection method and device
CN113378686B (en) Two-stage remote sensing target detection method based on target center point estimation
CN111951212A (en) Method for identifying defects of contact network image of railway
CN114708585A (en) Three-dimensional target detection method based on attention mechanism and integrating millimeter wave radar with vision
Zhong et al. Multi-scale feature fusion network for pixel-level pavement distress detection
CN110222686B (en) Object detection method, object detection device, computer equipment and storage medium
CN111368600A (en) Method and device for detecting and identifying remote sensing image target, readable storage medium and equipment
US20230102467A1 (en) Method of detecting image, electronic device, and storage medium
CN113706480A (en) Point cloud 3D target detection method based on key point multi-scale feature fusion
KR101907883B1 (en) Object detection and classification method
CN110807362A (en) Image detection method and device and computer readable storage medium
CN115731355B (en) SuperPoint-NeRF-based three-dimensional building reconstruction method
CN112733815B (en) Traffic light identification method based on RGB outdoor road scene image
CN113033516A (en) Object identification statistical method and device, electronic equipment and storage medium
CN113281780A (en) Method and device for labeling image data and electronic equipment
Pellis et al. Assembling an image and point cloud dataset for heritage building semantic segmentation
Drobnitzky et al. Survey and systematization of 3D object detection models and methods
CN113505834A (en) Method for training detection model, determining image updating information and updating high-precision map
CN113139540A (en) Backboard detection method and equipment
Chaturvedi et al. Small object detection using retinanet with hybrid anchor box hyper tuning using interface of Bayesian mathematics
CN113514053B (en) Method and device for generating sample image pair and method for updating high-precision map

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20783617

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20783617

Country of ref document: EP

Kind code of ref document: A1