CN114556445A - Object recognition method, device, movable platform and storage medium - Google Patents

Object recognition method, device, movable platform and storage medium

Info

Publication number
CN114556445A
CN114556445A (application CN202080071443.3A)
Authority
CN
China
Prior art keywords
image
feature
data
point cloud
size
Prior art date
Legal status
Pending
Application number
CN202080071443.3A
Other languages
Chinese (zh)
Inventor
蒋卓键
黄浩洸
栗培梁
Current Assignee
Shenzhen Zhuoyu Technology Co ltd
Original Assignee
SZ DJI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SZ DJI Technology Co Ltd filed Critical SZ DJI Technology Co Ltd
Publication of CN114556445A

Classifications

    • G06N 20/00 Machine learning
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/10 Terrestrial scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

An object recognition method, apparatus, movable platform and storage medium. The method comprises: acquiring point cloud data and image data of an object to be recognized, obtaining a first fusion feature from the point cloud data and the image data, and obtaining a first image feature from the image data (S101); obtaining a first size coefficient from the first fusion feature and a second size coefficient from the first image feature, the first size coefficient and the second size coefficient characterizing the probability that the object to be recognized belongs to an object in a target size range (S102); obtaining a second fusion feature from the first fusion feature and the first size coefficient, and a second image feature from the first image feature and the second size coefficient (S103); and recognizing the object to be recognized according to the second fusion feature and the second image feature (S104). The embodiments thereby enable effective recognition of objects within the target size range.

Description

Object recognition method, device, movable platform and storage medium
Technical Field
The present disclosure relates to the field of object recognition technologies, and in particular, to an object recognition method, an object recognition device, a movable platform, and a storage medium.
Background
With the development of technology, movable platforms such as unmanned vehicles and unmanned aerial vehicles have advanced rapidly. While a movable platform is moving, it needs to sense its surrounding environment and obtain information about the objects present in that environment so that it can be controlled to operate safely and reliably. For example, obstacle avoidance is a key concern for movable platforms such as unmanned vehicles and unmanned aerial vehicles, and the key to obstacle avoidance is accurately recognizing the objects around the platform. How to accurately recognize information about the objects around a movable platform has therefore become an urgent technical problem.
Disclosure of Invention
In view of the above, an object of the present application is to provide an object recognition method, an object recognition apparatus, a movable platform and a storage medium.
In a first aspect, an embodiment of the present application provides an object identification method, including:
acquiring point cloud data and image data of an object to be identified, acquiring a first fusion feature according to the point cloud data and the image data, and acquiring a first image feature according to the image data;
acquiring a first size coefficient according to the first fusion feature, and acquiring a second size coefficient according to the first image feature, wherein the first size coefficient and the second size coefficient represent the probability that the object to be identified belongs to an object in a target size range;
acquiring a second fusion feature according to the first fusion feature and the first size coefficient, and acquiring a second image feature according to the first image feature and the second size coefficient; and
identifying the object to be identified according to the second fusion feature and the second image feature.
In a second aspect, an embodiment of the present application provides an object recognition apparatus, including a processor and a memory storing a computer program;
the processor, when executing the computer program, implements the steps of:
acquiring point cloud data and image data of an object to be identified, acquiring a first fusion feature according to the point cloud data and the image data, and acquiring a first image feature according to the image data;
acquiring a first size coefficient according to the first fusion feature, and acquiring a second size coefficient according to the first image feature, wherein the first size coefficient and the second size coefficient represent the probability that the object to be identified belongs to an object in a target size range;
acquiring a second fusion feature according to the first fusion feature and the first size coefficient, and acquiring a second image feature according to the first image feature and the second size coefficient; and
identifying the object to be identified according to the second fusion feature and the second image feature.
In a third aspect, an embodiment of the present application provides a movable platform, including:
a body;
the power system is arranged in the machine body and is used for providing power for the movable platform; and (c) a second step of,
the object recognition apparatus as described in the second aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to the first aspect.
According to the object recognition method, apparatus, movable platform and storage medium of the present application, after the point cloud data and image data of the object to be identified are acquired, a first fusion feature is obtained from the point cloud data and the image data, and a first image feature is obtained from the image data; a first size coefficient is then obtained from the first fusion feature and a second size coefficient from the first image feature, where the two coefficients represent the probability that the object to be identified belongs to an object in the target size range; next, a second fusion feature is obtained from the first fusion feature and the first size coefficient, and a second image feature from the first image feature and the second size coefficient; finally, the object to be identified is recognized according to the second fusion feature and the second image feature. In this embodiment, the determined first and second size coefficients increase the attention paid to objects in the target size range: they enrich the second fusion feature and the second image feature with information about such objects, so that these features carry more semantic information about objects in the target size range. Small objects can therefore be recognized accurately based on the second fusion feature and the second image feature, improving recognition accuracy. Furthermore, this embodiment performs object recognition by combining point cloud data and image data: the point cloud data provides spatial information for the image data, the image data provides color information for the point cloud data, and the two kinds of data complement each other, which further improves the accuracy of object recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1A is a diagram illustrating an application scenario provided by an embodiment of the present application;
FIG. 1B is a schematic illustration of an unmanned vehicle according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating an object recognition method according to an embodiment of the present application;
FIGS. 3A and 3B are schematic diagrams of different structures of an object recognition model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an object recognition apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a movable platform according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
With the development of technology, movable platforms such as unmanned vehicles and unmanned aerial vehicles have advanced rapidly. While a movable platform is moving, it needs to sense its surrounding environment and obtain information about the objects present in that environment so that it can be controlled to operate safely and reliably. For example, obstacle avoidance is a key concern for movable platforms such as unmanned vehicles and unmanned aerial vehicles, and the key to obstacle avoidance is accurately recognizing the objects around the platform. How to accurately recognize information about the objects around a movable platform has therefore become an urgent technical problem.
The inventors found that, with the object recognition methods in the related art, a large object with a large size yields more detection data from the relevant sensors, so more feature information can be extracted from that detection data, and this richer feature information allows the related-art methods to recognize large objects relatively accurately. For a small object with a smaller size, however, the detection data obtained by the sensors is far less than for a large object, so less feature information can be extracted from it, and the related-art methods therefore recognize small objects with low accuracy. This is difficult to accept in business scenarios that involve small objects; for example, in some extreme scenarios in the field of vehicle driving, the related-art methods' weakness in recognizing small objects may have serious consequences for safe driving.
In view of this, an embodiment of the present application provides an object recognition method. After the point cloud data and image data of the object to be identified are acquired, a first fusion feature is obtained from the point cloud data and the image data, and a first image feature is obtained from the image data; a first size coefficient is then obtained from the first fusion feature and a second size coefficient from the first image feature, where the two coefficients represent the probability that the object to be identified belongs to an object in the target size range; next, a second fusion feature is obtained from the first fusion feature and the first size coefficient, and a second image feature from the first image feature and the second size coefficient; finally, the object to be identified is recognized according to the second fusion feature and the second image feature. In this embodiment, the determined first and second size coefficients increase the attention paid to objects in the target size range and enrich the second fusion feature and the second image feature with information about such objects, so that these features carry more semantic information about objects in the target size range; small objects can therefore be recognized accurately from them, improving recognition accuracy. Furthermore, this embodiment performs object recognition by combining point cloud data and image data: the point cloud data provides spatial information for the image data, the image data provides color information for the point cloud data, and the two complement each other, which further improves the accuracy of object recognition.
The object recognition method of this embodiment may be implemented by an object recognition device. In one possible implementation, the object recognition device may be a computer chip or an integrated circuit with data processing capability, for example a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA). The object recognition device may be mounted in a movable platform. The movable platform of the embodiments of the present application may include an automobile, an unmanned aerial vehicle, an unmanned ship, or a robot, where the automobile may be an unmanned vehicle or a manned vehicle, and the unmanned aerial vehicle may be, for example, a quad-rotor, hexa-rotor, or octo-rotor unmanned aerial vehicle. In another implementation, the object recognition device may itself be the movable platform; the movable platform includes at least an automobile, an unmanned aerial vehicle, an unmanned ship, or a robot, where the automobile may be an unmanned vehicle or a manned vehicle.
In an exemplary application scenario, referring to FIG. 1A and FIG. 1B, the movable platform is an unmanned vehicle that includes the object recognition device. FIG. 1A shows a driving scenario of the unmanned vehicle 100, and FIG. 1B shows a structural diagram of the unmanned vehicle 100. The unmanned vehicle 100 may be equipped with a lidar 10 for acquiring point cloud data and a shooting device 20 for acquiring image data. There may be one or more lidars 10 and shooting devices 20, and their installation positions can be set according to the actual application scenario; for example, one lidar 10 and one shooting device 20 may be installed at the front of the unmanned vehicle 100. While the unmanned vehicle 100 is driving, the lidar 10 collects point cloud data of the objects around the vehicle and transmits it to the object recognition device 30 in the vehicle, and the shooting device 20 collects image data of those objects and transmits it to the object recognition device 30. The object recognition device 30 obtains the point cloud data and image data of the objects around the unmanned vehicle 100 and recognizes the objects based on the object recognition method of the embodiments of the present application to obtain a recognition result. In a first possible implementation, after the recognition result is obtained, the unmanned vehicle 100 may use it for obstacle-avoidance decisions or route planning; in a second possible implementation, the recognition result may be displayed on an interface of the unmanned vehicle 100 or of a terminal communicatively connected to it, so that a user can learn about the driving condition of the vehicle and the road conditions around it; in a third possible implementation, the recognition result may be transmitted to other components of the unmanned vehicle 100, so that those components control the vehicle to operate safely and reliably based on it.
Next, the object recognition method provided in the embodiments of the present application is explained. Referring to FIG. 2, which is a schematic flow chart of an object recognition method according to an embodiment of the present application, the method may be implemented by an object recognition device; the object recognition device may itself be a movable platform, or may be installed in a movable platform as a chip. The method comprises the following steps:
in step S101, point cloud data and image data of an object to be recognized are acquired, and a first fusion feature is acquired according to the point cloud data and the image data; and acquiring a first image characteristic according to the image data.
In step S102, a first size coefficient is obtained according to the first fusion feature; acquiring a second size coefficient according to the first image characteristic; the first size coefficient and the second size coefficient represent the probability that the object to be identified belongs to the object with the target size range.
In step S103, a second fused feature is obtained according to the first fused feature and the first size coefficient; and acquiring a second image characteristic according to the first image characteristic and the second size coefficient.
In step S104, the object to be recognized is recognized according to the second fusion feature and the second image feature.
The point cloud data and the image data of the object to be recognized are obtained by sampling the space through a sensor of the movable platform. Specifically, the point cloud data can be acquired by a laser radar arranged on a movable platform or a shooting device with a depth information acquisition function; and/or the image data may be acquired using a camera disposed on the movable platform.
The lidar transmits a laser pulse sequence into the space where the movable platform is located, receives the laser pulse sequence reflected by the object to be identified, and generates point cloud data from the reflected pulses. In one example, the lidar may determine the time of receipt of the reflected laser pulse sequence, for example by detecting the rising-edge time and/or falling-edge time of an electrical signal pulse. In this way the lidar can compute the time of flight (TOF) from the reception time and the transmission time of the laser pulse sequence, and thereby determine the distance from the object to be identified to the lidar. Because the lidar is an actively emitting sensor, it does not depend on an external light source, is little affected by ambient light, and can operate normally even in a dark, enclosed environment; this facilitates generating a high-precision three-dimensional model in subsequent processing and gives it wide applicability.
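As a concrete illustration of the time-of-flight calculation described above, the following sketch converts emission and reception times into a range estimate; the function name and the assumption of a direct round trip are illustrative and not taken from the patent.

```python
# Illustrative sketch (not from the patent): converting a measured
# time of flight into a range estimate for a single laser pulse.
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def tof_to_distance(emit_time_s: float, receive_time_s: float) -> float:
    """Distance from the lidar to the target, assuming a direct round trip."""
    tof = receive_time_s - emit_time_s      # time of flight in seconds
    return SPEED_OF_LIGHT * tof / 2.0       # halve: the pulse travels out and back
```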
The shooting device with the depth-information acquisition function includes, but is not limited to, a binocular vision sensor or a structured-light depth camera. A binocular vision sensor acquires two images of the target scene from different positions and, based on the parallax principle, obtains three-dimensional geometric information by computing the positional offset between corresponding points of the two images, thereby generating point cloud data. A structured-light depth camera projects light with known structural characteristics into the space and then captures it; because regions of the object to be identified at different depths produce different image phase information, this information can be converted into depth information, from which point cloud data is obtained.
The image data may be a color image, a grayscale image, an infrared image, and so on, and the shooting device used to acquire it includes, but is not limited to, a visible-light camera, a grayscale camera, or an infrared camera. The camera may capture a sequence of images at a specified frame rate and may have adjustable shooting parameters; under different shooting parameters the camera can capture different images even under exactly the same external conditions (e.g., position and illumination). The shooting parameters may include exposure (e.g., exposure time, shutter speed, aperture, film speed), gain, gamma, region of interest, binning/sub-sampling, pixel clock, offset, trigger, ISO, etc. Exposure-related parameters control the amount of light reaching the image sensor in the camera: for example, the shutter speed controls how long light reaches the image sensor, while the aperture controls how much light reaches it in a given time. Gain-related parameters control the amplification of the signal from the optical sensor, and ISO controls the camera's sensitivity to the available light.
In an exemplary embodiment, the movable platform is equipped with a lidar and a visible-light camera. In one example, the lidar and the visible-light camera may operate at the same frame rate. In another example, they may operate at different frame rates, provided that their frame rates allow point cloud data and image data to be acquired within a preset time period.
For example, in the moving process of the movable platform, the laser radar may collect point cloud data of an object to be recognized in real time and transmit the point cloud data to the object recognition device, and the photographing device may collect image data of the object to be recognized in real time and transmit the image data to the object recognition device. In the embodiment, object identification is performed by combining point cloud data and image data, the point cloud data can provide space information for the image data, the image data can provide color information for the point cloud data, and the two data complement each other, so that the object identification accuracy is further improved.
The object recognition device acquires the point cloud data and image data of the object to be recognized, obtains a first fusion feature from the point cloud data and the image data, and obtains a first image feature from the image data. It then obtains a first size coefficient from the first fusion feature and a second size coefficient from the first image feature; the two coefficients characterize the probability that the object to be identified belongs to an object in the target size range. This embodiment thus extracts, from the first fusion feature and the first image feature, information about objects in the target size range (namely the first size coefficient and the second size coefficient), and uses the determined coefficients to increase the attention paid to such objects. A second fusion feature is then obtained from the first fusion feature and the first size coefficient, and a second image feature from the first image feature and the second size coefficient; through the two size coefficients, the second fusion feature and the second image feature carry more semantic information about objects in the target size range, so small objects can be recognized accurately from them and the accuracy of object recognition is improved.
The target size range of this embodiment may include a size range smaller than a preset target size threshold. Considering that different types of movable platforms define the size of a "small object" differently, this embodiment does not limit the preset target size threshold; it may be configured flexibly according to business requirements.
Point cloud data is unstructured and needs to be processed into a format suitable for data analysis, for example by processing the point cloud data to obtain a point cloud density for each voxel. One way to process the point cloud data is three-dimensional meshing: the point cloud is divided into a raster of voxels, and for each voxel the ratio of the number of points it contains to the total number of points in the point cloud constitutes that voxel's point cloud density. The point cloud density reflects how many points a voxel contains; a high density means the voxel is likely to correspond to an object, so the per-voxel point cloud density can serve as feature information of the object. Processing the irregular point cloud into a regular representation also better captures the contour of the object to be recognized.
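The voxelization and per-voxel density described above can be sketched as follows; the grid origin, voxel size, and function name are assumptions chosen purely for illustration.

```python
import numpy as np

def point_cloud_density_grid(points, grid_min, voxel_size, grid_shape):
    """Rasterize an (N, 3) point cloud into a voxel grid; each cell holds the
    fraction of all points that fall inside it (the per-voxel point cloud density)."""
    idx = np.floor((points - grid_min) / voxel_size).astype(int)   # voxel index per point
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[valid]
    counts = np.zeros(grid_shape, dtype=np.float32)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)      # count points per voxel
    return counts / max(len(points), 1)                            # density = count / total

# Hypothetical usage with an H x W x C grid (numbers are illustrative only):
# density_grid = point_cloud_density_grid(points,
#                                         grid_min=np.array([0.0, -40.0, -3.0]),
#                                         voxel_size=np.array([0.4, 0.4, 0.5]),
#                                         grid_shape=(200, 200, 8))
```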
For image data, data analysis may be performed using pixel values in the image data, including but not limited to RGB values or gray scale values, etc.
In some embodiments, the object recognition device may obtain the first fused feature according to a point cloud density of each voxel in the point cloud data and a pixel value in the image data. The first fusion features are fused with feature information of the point cloud data and feature information of the image data, so that the characteristics of the object to be recognized can be better reflected from a three-dimensional angle and a two-dimensional angle, and the accuracy of object recognition can be improved.
In a possible implementation, the point cloud data and the image data may be stitched to obtain stitched data, and feature extraction may be performed on the stitched data to obtain the first fusion feature. In one example, the feature extraction on the stitched data may be performed by a pre-trained object recognition model to obtain the first fusion feature.
The point cloud data may be rasterized into a three-dimensional grid of size H×W×C (where H and W are the length and width and C is the depth of the grid), each cell representing a voxel whose value is that voxel's point cloud density. The image data (taking an RGB image as an example) may be represented as an H×W×3 array (where H and W are the length and width and 3 is the number of RGB channels). The point cloud data and the image data may then be stitched into stitched data of size H×W×(C+3). The stitched data contains the position information of the object to be recognized (from the point cloud data) and its color information (from the image data), and thus reflects the characteristics of the object from both the three-dimensional and two-dimensional perspectives, which helps improve the accuracy of object recognition.
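A minimal sketch of the stitching step is shown below; it assumes the image has already been resampled to the same H×W layout as the rasterized point cloud grid, which the patent does not specify and is therefore an assumption.

```python
import numpy as np

H, W, C = 200, 176, 10                                   # illustrative grid size, not from the patent
density_grid = np.zeros((H, W, C), dtype=np.float32)     # per-voxel densities from the step above
rgb_image = np.zeros((H, W, 3), dtype=np.float32)        # image assumed resampled to H x W

stitched = np.concatenate([density_grid, rgb_image], axis=-1)   # shape H x W x (C + 3)
assert stitched.shape == (H, W, C + 3)
```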
In another possible implementation, to further improve the accuracy of object recognition, the object recognition device may determine a first projection position of the point cloud data in the two-dimensional space based on the point-cloud-to-image projection relationship, obtain the pixel value at the first projection position in the image data, and generate point cloud data that includes the pixel values from those pixel values and the original point cloud data. Taking an RGB image as an example, the object recognition device may assign the RGB value at the first projection position in the image data to the corresponding point, thereby generating a colored point cloud. In this embodiment the point cloud data fuses the color data of the image and strengthens the correspondence between the image data and the point cloud data, so the point cloud data that includes pixel values better reflects the characteristics of the object to be recognized, improving recognition accuracy. The point-cloud-to-image projection relationship can be obtained from the extrinsic parameters from the point cloud coordinate system to the camera coordinate system and the intrinsic parameters of the camera: letting P be the coordinates of a point in the point cloud coordinate system, RT the extrinsic matrix from the point cloud coordinate system to the camera coordinate system, K the camera intrinsic matrix, and p the first projection position of the point in the two-dimensional space, then p = K · RT · P (in homogeneous coordinates).
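The projection-and-coloring step can be sketched as follows; it assumes a 3×3 intrinsic matrix K and a homogeneous extrinsic matrix RT, and for brevity it also assumes every point projects in front of the camera and inside the image bounds, which a real implementation would have to check.

```python
import numpy as np

def colorize_point_cloud(points_xyz, image_rgb, K, RT):
    """Project lidar points into the image with extrinsics RT and intrinsics K,
    then attach the pixel value at each projection to the point (colored point cloud)."""
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # homogeneous coordinates, (N, 4)
    cam = (RT @ pts_h.T)[:3]                           # points in camera frame, (3, N)
    uvw = K @ cam                                      # pixel frame before normalization, (3, N)
    u = (uvw[0] / uvw[2]).astype(int)                  # pixel column per point
    v = (uvw[1] / uvw[2]).astype(int)                  # pixel row per point
    colors = image_rgb[v, u]                           # sampled RGB values, (N, 3)
    return np.hstack([points_xyz, colors])             # (N, 6): x, y, z, r, g, b
```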
The object recognition device may then obtain the first fusion feature from the point cloud data that includes the pixel values and the image data; for example, it may stitch the two into stitched data and perform feature extraction on the stitched data to obtain the first fusion feature. In one example, the feature extraction on the stitched data may be performed by a pre-trained object recognition model.
Here the point cloud data may be rasterized into an H×W×C three-dimensional grid in which each cell represents a voxel whose values are the voxel's point cloud density and the pixel values (such as RGB values) of its points, and the image data (taking an RGB image as an example) may be represented as an H×W×3 array; the point cloud data including the pixel values and the image data may then be stitched into stitched data of size H×W×(C+3). The stitched data contains the position information of the object to be identified, the correspondence between position and color (from the point cloud data that includes pixel values), and the color information (from the image data), and thus comprehensively represents the characteristics of the object from both the three-dimensional and two-dimensional perspectives, which helps improve the accuracy of object recognition.
In some embodiments, the object recognition device performs feature extraction on the image data to obtain the first image feature; in one example, the image data may be subjected to feature extraction by a pre-trained object recognition model to obtain the first image feature.
After the first fusion feature and the first image feature are obtained, the object recognition device obtains a first size coefficient from the first fusion feature and a second size coefficient from the first image feature, where the first size coefficient and the second size coefficient represent the probability that the object to be identified belongs to an object in the target size range.
In one possible implementation, the first fusion feature and the first image feature may each be represented as a feature map. The object recognition device may obtain, for each position in the feature map containing the first fusion feature, a first size coefficient at that position; the first size coefficient characterizes the probability that the position belongs to an object in the target size range. Because the first fusion feature fuses point cloud information and image information, the first size coefficient obtained from it also reflects both, and therefore carries better semantic information. Likewise, the object recognition device may obtain, for each position in the feature map containing the first image feature, a second size coefficient at that position, characterizing the probability that the position belongs to an object in the target size range.
In one example, the object recognition apparatus may obtain a first size coefficient from the first fused feature and a second size coefficient from the first image feature using a pre-trained object recognition model.
After the first size coefficient and the second size coefficient are obtained, the object recognition device obtains a second fusion feature from the first fusion feature and the first size coefficient, obtains a second image feature from the first image feature and the second size coefficient, and finally recognizes the object to be identified according to the second fusion feature and the second image feature. Through the first and second size coefficients, this embodiment enriches the second fusion feature and the second image feature with information about objects in the target size range, so these features fuse more semantic information about such objects; small objects can therefore be recognized accurately from them, improving recognition accuracy. Of course, when the point cloud data and image data also contain data collected from large objects, the second fusion feature and second image feature obtained from them also contain feature information extracted from the large objects, so large objects can be recognized accurately as well.
In one example, the second fused feature may be a sum of the first fused feature and the first size factor; and/or the second image feature may be a sum of the first image feature and the second size coefficient, thereby enabling the second fused feature and the second image feature to fuse more semantic information about objects of a target size range. In other examples, the second fusion feature or the second image feature may also be obtained through other operation manners, for example, in other possible embodiments, the second fusion feature is a product of the first fusion feature and the first size coefficient, or the second image feature is a product of the first image feature and the second size coefficient, and may be specifically set according to an actual application scenario.
Illustratively, the first fusion feature, the first image feature, the second fusion feature, and the second image feature may each be represented as a feature map. The object recognition device may obtain a first size coefficient for each position in the feature map containing the first fusion feature and a second size coefficient for each position in the feature map containing the first image feature; that is, every position in the former corresponds to a first size coefficient and every position in the latter to a second size coefficient. The value at each position of the feature map containing the first fusion feature can therefore be added to the first size coefficient at that position to obtain the feature map containing the second fusion feature, and the value at each position of the feature map containing the first image feature can be added to the second size coefficient at that position to obtain the feature map containing the second image feature.
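In tensor terms, this per-position addition can be sketched as follows; the channel and spatial sizes are illustrative, and broadcasting the single-channel coefficient map across the feature channels is an assumption about how the per-position sum is applied.

```python
import torch

# Illustrative shapes only; none of these sizes come from the patent.
first_fused = torch.randn(1, 64, 100, 88)   # feature map containing the first fusion feature
size_coeff = torch.rand(1, 1, 100, 88)      # first size coefficient, one value per position

# Second fusion feature: the value at each position is added to the size
# coefficient at that position (the coefficient is broadcast over channels).
second_fused = first_fused + size_coeff
```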
When recognizing the object to be identified from the second fusion feature and the second image feature, the object recognition device may, to further improve accuracy, fuse the two features: it determines a second projection position of the second fusion feature in the two-dimensional space based on the point-cloud-to-image projection relationship, obtains the image feature of the second image feature at the second projection position, obtains a third fusion feature from that image feature and the corresponding second fusion feature, and finally recognizes the object to be identified according to the third fusion feature to obtain a recognition result. The third fusion feature fuses image information and point cloud information and further incorporates more information about objects in the target size range, so it carries better semantic information.
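A sketch of this fusion step is shown below; combining the gathered image features with the second fusion feature by concatenation is an assumption, since the patent only states that the third fusion feature is obtained from the image feature and the corresponding second fusion feature.

```python
import torch

def fuse_point_cloud_and_image_features(second_fused, second_image, proj_uv):
    """Gather, for each position of the point-cloud-side feature map, the image feature
    at its projected pixel (the second projection position), and combine the two.
    second_fused: (B, C_pc, H, W); second_image: (B, C_img, H_img, W_img);
    proj_uv: (H, W, 2) integer pixel coordinates from the point-cloud-to-image projection."""
    u = proj_uv[..., 0]                      # (H, W) pixel columns
    v = proj_uv[..., 1]                      # (H, W) pixel rows
    gathered = second_image[:, :, v, u]      # (B, C_img, H, W) image features at projections
    return torch.cat([second_fused, gathered], dim=1)   # third fusion feature
```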
The point-cloud-to-image projection relationship can be obtained from the extrinsic parameters from the point cloud coordinate system to the camera coordinate system and the intrinsic parameters of the camera. The recognition result includes at least the confidence and/or state information of the object to be identified, where the confidence represents the probability that the object to be identified is an obstacle, and the state information includes at least one of size information, position information, and orientation information of the object.
In some embodiments, the point cloud data and the image data may be input into a pre-trained object recognition model, and the point cloud data and the image data may be processed by the object recognition model for further object recognition.
In some embodiments, the object recognition method according to the embodiments of the present application may be implemented by a trained object recognition model, and the object recognition device may be preset with the object recognition model, so as to perform an object recognition process by using the object recognition model.
The training process of the object recognition model may proceed as follows: first represent the problem as a model through modeling, then evaluate the model by constructing an evaluation function, and finally optimize the evaluation function using sample data and an optimization method so as to tune the model to its best.
Modeling is the conversion of an actual problem into a problem that can be understood by a computer, i.e., the conversion of an actual problem into a way that can be represented by a computer. Modeling generally refers to a process of estimating an objective function of a model based on a large amount of sample data.
Evaluation aims to judge the quality of the built model: for the model built in the first step, evaluation provides an index of how good the model is, which involves the design of evaluation criteria and evaluation functions, and in machine learning such indices can be evaluated in a targeted manner. For example, after modeling is completed, a loss function needs to be designed for the model to evaluate its output error.
The object of optimization is the evaluation function: an optimization method is used to solve the evaluation function and find the model with the best evaluation. For example, the minimum of the loss function's output error (the optimal solution) can be found with an optimization method such as gradient descent, so that the model parameters are adjusted optimally.
It can be understood that, before a model is trained, a suitable parameter estimation method is determined, and then each parameter in the objective function of the model is estimated by using the parameter estimation method, so as to determine the final mathematical expression of the objective function.
Object recognition models in the related art can achieve very good recognition performance and recognize objects with very high accuracy. However, the inventors found that in actual business scenarios involving movable platforms, such models usually focus on large targets such as people, vehicles, roads or trees, and large targets usually account for a large share of the sample data. During training the model tends toward a global optimum; the features of larger targets are more salient and more easily attended to, while the features of small objects are comparatively subtle and receive little of the model's attention, so the model is biased toward extracting features of large objects. This bias ultimately means the model recognizes large objects well but cannot attend well to small objects, which is the deficiency of object recognition models in the related art. From the business-scenario perspective, for example in the field of vehicle driving, this subtle deficiency of the object recognition model may in some extreme scenarios have serious consequences for safe driving. From the technical perspective, remedying the deficiency on top of an already high accuracy is extremely challenging: as described above, machine learning involves many links from the modeling stage to the training stage, such as the selection and processing of sample data, the design of data features, the design of the model, the design of the loss function, and the design of the optimization method, and a subtle difference in any one of them can contribute to the model's deficiency.
In view of this, referring to FIG. 3A, an embodiment of the present application provides an object recognition model for object recognition. The object recognition model includes a first feature extraction network, a second feature extraction network, a first size coefficient extraction network, a second size coefficient extraction network, and an object prediction network. The first feature extraction network is used to acquire a first fusion feature from the point cloud data and the image data; the second feature extraction network is used to acquire a first image feature from the image data; the first size coefficient extraction network is used to acquire a first size coefficient from the first fusion feature and to acquire a second fusion feature from the first fusion feature and the first size coefficient; the second size coefficient extraction network is used to acquire a second size coefficient from the first image feature and to acquire a second image feature from the first image feature and the second size coefficient; and the object prediction network is used to recognize the object to be identified from the second fusion feature and the second image feature to obtain a recognition result. The recognition result includes at least the confidence and/or state information of the object to be identified, where the confidence represents the probability that the object to be identified is an obstacle, and the state information includes at least one of size information, position information, and orientation information of the object to be identified.
The first feature extraction network may perform feature extraction on the stitched data (the stitched data is formed by stitching point cloud data and the image data, or formed by stitching the point cloud data including the pixel value and the image data) to obtain the first fusion feature.
The second feature extraction network is configured to perform feature extraction on the image data to obtain the first image feature.
The first size coefficient extraction network and the second size coefficient extraction network both comprise convolution layers, and the first size coefficient is obtained by performing convolution operation on the first fusion feature through the first size coefficient extraction network; and the second size coefficient is obtained by performing convolution operation on the first image feature through the second size coefficient extraction network. It can be understood that the number of the convolutional layers may be specifically set according to an actual application scenario, for example, the number of the convolutional layers is at least 2, and the first size coefficient is obtained by performing at least two convolution operations on the first fusion feature through the first size coefficient extraction network; and the second size coefficient is obtained by performing convolution operation on the first image characteristic at least twice through the second size coefficient extraction network. The first size coefficient and the second size coefficient represent the probability that the object to be identified belongs to the object with the target size range.
Because the first fusion feature fuses point cloud information and image information, the first size coefficient obtained from it also reflects both kinds of information and therefore carries better semantic information. Further, this embodiment determines information about objects in the target size range (i.e., the first size coefficient and the second size coefficient) from the first fusion feature and the first image feature, and uses the determined coefficients to increase the attention paid to objects in the target size range, which helps improve the model's recognition accuracy for small objects.
In some embodiments, the first size coefficient extraction network is further configured to obtain a second fused feature from a sum of the first fused feature and the first size coefficient; the second size coefficient extraction network is further configured to obtain a second image feature according to a sum of the first image feature and the second size coefficient, but is not limited thereto. For example, the second fusion feature or the second image feature may also be obtained through other operation manners, for example, in some other possible embodiments, the second fusion feature is a product of the first fusion feature and the first size coefficient, or the second image feature is a product of the first image feature and the second size coefficient, and may be specifically set according to an actual application scenario.
For example, referring to FIG. 3B, the first size coefficient extraction network includes a first size coefficient extraction sub-network and a first fusion sub-network: the former performs convolution operations on the first fusion feature to obtain the first size coefficient, and the latter obtains the second fusion feature from the first fusion feature and the first size coefficient. The second size coefficient extraction network likewise includes a second size coefficient extraction sub-network and a second fusion sub-network: the former performs convolution operations on the first image feature to obtain the second size coefficient, and the latter obtains the second image feature from the first image feature and the second size coefficient. In this embodiment the determined size coefficients increase the attention paid to objects in the target size range and enrich the second fusion feature and the second image feature with information about such objects, so that these features contain more semantic information about objects in the target size range and small objects can be recognized accurately from them. A minimal sketch of such a branch is given below.
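The following PyTorch-style sketch shows one possible form of a size coefficient extraction network; the sigmoid output, the number of channels, and the additive fusion follow the description above, but the concrete layer choices are assumptions rather than details given in the patent.

```python
import torch
import torch.nn as nn

class SizeCoefficientBranch(nn.Module):
    """Hypothetical size coefficient extraction network (in the spirit of FIG. 3B):
    an extraction sub-network of at least two convolutions ending in a sigmoid, so each
    position gets a probability of belonging to an object in the target size range,
    followed by a fusion sub-network that adds the coefficient back onto the input."""

    def __init__(self, in_channels: int, hidden_channels: int = 64):
        super().__init__()
        self.coeff_extractor = nn.Sequential(           # size coefficient extraction sub-network
            nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feature_map: torch.Tensor):
        size_coeff = self.coeff_extractor(feature_map)   # (B, 1, H, W) size coefficient
        fused = feature_map + size_coeff                 # fusion sub-network: per-position sum
        return fused, size_coeff
```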
The object prediction network is configured to obtain a third fusion feature from the second fusion feature and the second image feature and then recognize the object to be identified according to the third fusion feature; specifically, the object prediction network may obtain the third fusion feature in the manner described above and recognize the object accordingly to obtain a recognition result. The recognition result includes at least the confidence and/or state information of the object to be identified, where the confidence represents the probability that the object to be identified is an obstacle, and the state information includes at least one of size information, position information, and orientation information.
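Putting the pieces together, a hypothetical forward pass over the five networks of FIG. 3A might be wired as follows; all attribute names are illustrative, and the size-coefficient branches are assumed to return the enhanced feature together with the coefficient, as in the branch sketch above.

```python
import torch

def object_recognition_forward(model, stitched_data, image_data):
    """Hypothetical wiring of the five networks; attribute names are not from the patent."""
    first_fused = model.first_feature_net(stitched_data)        # point cloud + image features
    first_image = model.second_feature_net(image_data)          # image-only features
    second_fused, _ = model.first_size_branch(first_fused)      # first size coefficient applied
    second_image, _ = model.second_size_branch(first_image)     # second size coefficient applied
    return model.object_prediction_net(second_fused, second_image)   # confidence + state info
```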
Next, the training process of the object recognition model is described. In this embodiment, sample data for training may be prepared in advance. The sample data may include point cloud sample data and image sample data, both of which contain data of objects inside the target size range and data of objects outside it, so that the trained object recognition model can accurately recognize both small and large objects.
The model training in this embodiment may be supervised or unsupervised. In some examples, supervised training can be used to speed up training: ground-truth values are labeled in the sample data, and the supervised regime improves both the speed and accuracy of model training. The labeled state information may comprise one or more kinds of information configured according to business needs. As an example, the ground truth includes: the confidence that an object belongs to the target size range (representing the probability that the object to be recognized is a small object), the object confidence (representing the probability that the object to be recognized is an obstacle), and the state information of the object, where the state information may include at least one of size information, position information, and orientation information.
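As an illustration, a ground-truth label for one annotated object might look like the following; the field names and values are hypothetical, and only the kinds of information follow the description above.

```python
# Hypothetical ground-truth label for one annotated object.
label = {
    "small_object": 1.0,              # confidence of belonging to the target size range
    "obstacle": 1.0,                  # object confidence: is the object an obstacle?
    "state": {
        "size": [0.6, 0.5, 1.1],      # length, width, height in meters (illustrative)
        "position": [12.3, -2.1, 0.4],
        "orientation": 1.57,          # heading angle in radians
    },
}
```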
In some examples, the sample data may be data obtained by performing feature engineering on raw data. The characteristic engineering refers to a process of finding out some characteristics with physical significance from original data to participate in model training, and the process involves data cleaning, data dimension reduction, characteristic extraction, characteristic normalization, characteristic evaluation and screening, characteristic dimension reduction or characteristic coding and the like.
For example, for point cloud data, the point cloud data is unstructured data and needs to be processed into a format that can be input to an object recognition model, for example, the point cloud data is processed to obtain a point cloud density corresponding to each voxel of the point cloud data, and the point cloud density corresponding to each voxel of the point cloud data is used as an input of the object recognition model. With respect to the image data, the pixel values of the image data may be used as input to the object recognition model.
In some embodiments, to improve the accuracy of object recognition, the sample data input into the object recognition model in this embodiment includes: stitched sample data obtained from the point cloud sample data and the image sample data, together with the image sample data itself.
In one example, the point cloud sample data may be rasterized into a three-dimensional grid of size H×W×C (where H and W are the length and width and C is the depth of the grid), each cell representing a voxel whose value is that voxel's point cloud density; the image sample data (taking an RGB image as an example) may be represented as an H×W×3 array (where H and W are the length and width and 3 is the number of RGB channels); and the point cloud sample data and the image sample data may be stitched into stitched sample data of size H×W×(C+3). The stitched sample data contains the position information of the object (from the point cloud data) and its color information (from the image data), and reflects the characteristics of the object from both the three-dimensional and two-dimensional perspectives, which helps improve the accuracy of object recognition.
In another example, point cloud sample data that includes pixel values may be generated from the point cloud sample data and the image sample data (the generation method may refer to the aforementioned method for generating point cloud data including pixel values, which is not repeated here). The point cloud sample data may be rasterized into an H × W × C three-dimensional grid, where each cell represents a voxel whose values are the point cloud density of that voxel and the pixel values (such as RGB values) of the points it contains; the image sample data (taking an RGB image as an example) may be represented as H × W × 3 data, and the point cloud sample data including pixel values and the image data may then be stitched into stitched sample data of size H × W × (C + 3).
With such sample data, the object recognition model can be obtained by training a machine learning model. The machine learning model may be a neural network model or the like, such as a deep-learning-based neural network model. The specific structural design of the object recognition model is one of the important aspects of the training process. In this embodiment, the structure of the object recognition model includes at least: a first feature extraction network, a second feature extraction network, a first size coefficient extraction network, a second size coefficient extraction network, and an object prediction network.
According to the embodiment of the application, the first size coefficient extraction network and the second size coefficient extraction network respectively extract information about objects in the target size range to obtain the first size coefficient and the second size coefficient. This information is then used to enhance the first fusion feature and the first image feature, yielding a second fusion feature and a second image feature that contain more semantic information about objects in the target size range, thereby strengthening the model's recognition of objects in that size range.
In a possible implementation, the first feature extraction network and the second feature extraction network are backbone networks, the first size coefficient extraction network is a branch network of the first feature extraction network, and the second size coefficient extraction network is a branch network of the second feature extraction network. In this way, the first size coefficient and the second size coefficient can be extracted on top of the extracted first fusion feature and first image feature, and the second fusion feature and the second image feature are then obtained.
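As a hedged sketch of such a backbone-plus-branch layout (the layer sizes, kernel sizes, and the use of PyTorch are assumptions for illustration; the patent does not fix a concrete architecture), each size-coefficient branch can be a small convolutional head whose sigmoid output is added back onto the backbone feature map:

```python
import torch
import torch.nn as nn

class BackboneWithSizeBranch(nn.Module):
    """A backbone that outputs a feature map, plus a branch that predicts a
    per-location size coefficient; the enhanced feature is feature + coefficient."""

    def __init__(self, in_channels, feat_channels=64):
        super().__init__()
        # Backbone: extracts the first (fusion or image) feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Branch: a convolutional size-coefficient head producing one value per
        # spatial location, interpreted as the probability that a target-size-range
        # object is present at that location.
        self.size_branch = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, x):
        first_feature = self.backbone(x)                      # first fusion/image feature
        size_coeff = torch.sigmoid(self.size_branch(first_feature))
        second_feature = first_feature + size_coeff           # enhanced feature
        return first_feature, size_coeff, second_feature
```

In this sketch, one instance of the module would process the stitched point-cloud/image data and another the image data alone, mirroring the first and second feature extraction networks and their respective size coefficient branches described above.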
In an exemplary embodiment in the field of automatic driving, three-dimensional object detection is a core problem, and small objects can be difficult to detect when a sensor such as a laser radar is used.
Another important aspect of the training process is that a suitable loss function needs to be designed according to the business requirements. In a supervised training scenario, the sample data is labeled with true values, and the loss function measures the error between the model's predicted values and the true values. The loss function is crucial to the recognition accuracy of the model, and designing it well based on the available sample data and the requirements of the model is not trivial. In some examples, the loss function for a given scenario may be constructed from existing loss functions, such as a logarithmic loss function, a quadratic loss function, an exponential loss function, a 0/1 loss function, and the like.
Based on the requirements of this embodiment, as an example, the loss function adopted by the object recognition model in the training process includes at least: a first loss function for optimizing the first size coefficient extraction network, a second loss function for optimizing the second size coefficient extraction network, a third loss function for describing state differences, and a fourth loss function for describing confidence differences. The first and second loss functions make the model focus on objects in the target size range and distinguish small objects more clearly.
The optimization objective of the first loss function includes: increasing the first size coefficient if the sample data indicates an object that belongs to the target size range; the sample data comprises point cloud sample data and image sample data. More specifically, the first size coefficient extraction network extracts the first size coefficient from the first fusion feature, and the second fusion feature is then derived from the first size coefficient and the first fusion feature; the first size coefficient extraction network predicts the probability that the second fusion feature belongs to an object of the target size range. Therefore, the optimization objective of the first loss function specifically includes: increasing the second fusion feature if the second fusion feature obtained from the sample data indicates an object that belongs to the target size range, and decreasing the second fusion feature if it indicates an object that does not belong to the target size range.
In one example, take the second fusion feature represented as a feature map: the optimization objective of the first loss function specifically includes increasing the feature value at a position in the feature map if that position indicates an object belonging to the target size range, and decreasing it otherwise, so that the object recognition model can distinguish small objects more clearly.
The first loss function may be configured to describe the difference between the confidence prediction value, obtained by the object recognition model from sample data, of an object belonging to the target size range and the confidence true value of an object belonging to the target size range that corresponds to the sample data. This can be illustrated by a formula of the following general form (the exact expression appears only as an image in the original publication, so the form below is reconstructed from the surrounding definitions):

L_{seg1} = \sum_{k} \ell\big(f_{seg}(x_{k1}),\, \hat{y}_{k1}\big)

where \hat{y}_{k1} denotes the labeled true value, i.e., the confidence true value that the object to be recognized belongs to the target size range; f_{seg}(x_{k1}) denotes the confidence prediction value, predicted by the object recognition model from the sample data, of an object belonging to the target size range; and \ell(\cdot,\cdot) is a per-sample error term.
The optimization objective of the second loss function includes: increasing the second size coefficient if the image sample data indicates an object belonging to the target size range. More specifically, the second size coefficient extraction network extracts the second size coefficient from the first image feature, and the second image feature is then derived from the second size coefficient and the first image feature; the second size coefficient extraction network predicts the probability that the second image feature belongs to an object of the target size range. Therefore, the optimization objective of the second loss function specifically includes: increasing the second image feature if the second image feature obtained from the image sample data indicates an object belonging to the target size range, and decreasing the second image feature if it indicates an object that does not belong to the target size range.
In one example, take the second image feature represented as a feature map: the optimization objective of the second loss function specifically includes increasing the feature value at a position in the feature map if that position indicates an object belonging to the target size range, and decreasing it otherwise, so that the object recognition model can distinguish small objects more clearly.
The second loss function is used to describe the difference between the confidence prediction value, obtained by the object recognition model from the image sample data, of an object belonging to the target size range and the confidence true value of an object belonging to the target size range that corresponds to the image sample data. This can likewise be illustrated by a formula of the following general form (the exact expression appears only as an image in the original publication):

L_{seg2} = \sum_{k} \ell\big(f_{seg}(x_{k2}),\, \hat{y}_{k2}\big)

where \hat{y}_{k2} denotes the labeled true value, i.e., the confidence true value that the object to be recognized belongs to the target size range, and f_{seg}(x_{k2}) denotes the confidence prediction value, predicted by the object recognition model from the image sample data, of an object belonging to the target size range.
The optimization objective of the third loss function includes: reducing the difference between the predicted value of the object's state information, which the object recognition model obtains from the sample data, and the true state information of the object corresponding to the sample data. The object state information includes at least one of: size information, position information, and orientation information of the object.
The third loss function describes the difference between the predicted size, predicted position and/or predicted orientation of the object, which the object recognition model obtains from the sample data, and the real size, real position and/or real orientation of the object corresponding to the sample data; the sample data includes point cloud sample data and image sample data. This can be illustrated by a formula of the following general form (the exact expression appears only as an image in the original publication):

L_{loc} = \sum_{i} \ell\big(f_{loc}(x_{i}),\, \hat{y}_{i}^{loc}\big)

where \hat{y}_{i}^{loc} denotes the labeled true value, i.e., the real size, real position and/or real orientation of the object labeled in the sample data, and f_{loc}(x_{i}) denotes the predicted size, predicted position and/or predicted orientation of the object obtained by the object recognition model from the sample data.
The optimization objective of the fourth loss function includes: reducing the difference between the confidence prediction value of the object obtained by the object recognition model from the sample data and the confidence true value of the object corresponding to the sample data. The confidence of the object represents the probability that the predicted object is an obstacle.
The fourth loss function describes the difference between the confidence prediction value of the object obtained by the object recognition model from the sample data and the confidence true value of the object corresponding to the sample data. This can be illustrated by a formula of the following general form (the exact expression appears only as an image in the original publication):

L_{pred} = \sum_{i} \ell\big(f_{pred}(x_{i}),\, \hat{y}_{i}^{pred}\big)

where \hat{y}_{i}^{pred} denotes the labeled true value, i.e., the confidence true value of the object labeled in the sample data, and f_{pred}(x_{i}) denotes the confidence prediction value of the object obtained by the object recognition model from the sample data.
In summary, the loss function adopted by the object recognition model of this embodiment during training may combine the four terms above, for example (the exact expression appears only as an image in the original publication):

L = L_{seg1} + L_{seg2} + L_{loc} + L_{pred}
the specific formula of the loss function is only a schematic illustration, and in practical application, the specific mathematical description of the function may be flexibly configured as needed, and whether to add a regularization term may also be determined as needed, which is not limited in this embodiment.
During training, an optimization method is needed to solve for the model that minimizes the loss, i.e., the best-scoring model. For example, the minimum of the loss function (the optimal solution) can be sought with an optimization method such as gradient descent, adjusting the model parameters toward the optimum, that is, solving for the optimal coefficients of each network layer in the model. In some examples, the solving process computes the model output and the loss value, and from them the gradients used to adjust the model parameters. As an example, a back-propagation function may be invoked to compute gradients, and the result of the loss function is propagated back into the object recognition model so that the model updates its parameters.
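An illustrative training loop under these assumptions, reusing the total_loss sketch above (the optimizer choice and learning rate are placeholders, not values from the patent):

```python
import torch

def train(model, data_loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient-descent optimizer
    for _ in range(epochs):
        for stitched, image, targets in data_loader:
            # Forward pass: the model consumes stitched data and image data.
            size_f, size_i, box_pred, conf_logits = model(stitched, image)
            size_target, box_target, conf_target = targets
            loss = total_loss(size_f, size_i, size_target,
                              box_pred, box_target, conf_logits, conf_target)
            optimizer.zero_grad()
            loss.backward()      # back-propagate the loss into the model
            optimizer.step()     # update model parameters
    return model
```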
In some examples, the loss function may be solved with a separate solver. In other examples, taking the object recognition model as a neural network model, branch networks may be arranged on top of the main network to compute the network's loss. As an example, the loss can be divided into the four functions described above: the first loss function for optimizing the first size coefficient extraction network, the second loss function for optimizing the second size coefficient extraction network, the third loss function for describing state differences, and the fourth loss function for describing confidence differences jointly guide the updating of the neural network parameters, giving the network better prediction performance.
After this training process is completed, the object recognition model is obtained; the obtained model can then be evaluated with test samples to check its recognition accuracy. The final object recognition model may be deployed in an object recognition device, which may be, for example, a movable platform, or the object recognition device may be mounted in the movable platform as a chip.
While the movable platform moves, point cloud data can be acquired by a laser radar or by a shooting device with a depth information acquisition function mounted on the platform, and image data can be acquired by a shooting device mounted on the platform. The object recognition device in the movable platform then obtains spliced data from the point cloud data and the image data and inputs the spliced data and the image data into the object recognition model, obtaining the recognition result output by the model, which includes the confidence and state information of the object. Further, to facilitate subsequent processing based on the recognition result, the recognition result contains only data whose confidence is greater than a preset threshold; data whose confidence is not greater than the preset threshold is treated as not being an obstacle and requires no further processing, and the preset threshold can be set according to the actual application scenario. As an example, for an input object to be recognized, a series of candidate boxes may be produced, each corresponding to one possible object; based on the confidence of each candidate box, the probability that it corresponds to an obstacle can be determined, the candidate boxes are ranked by confidence, the ranked boxes are screened against the set threshold, and the objects above the threshold are recognized to give the final recognition result.
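A hedged sketch of this post-processing step (the threshold value and the box/score layout are assumptions):

```python
def filter_candidates(boxes, confidences, threshold=0.5):
    """Rank candidate boxes by obstacle confidence and keep those above a threshold.

    boxes: list of candidate box parameters (size/position/orientation).
    confidences: list of obstacle-confidence scores, one per box.
    """
    ranked = sorted(zip(confidences, boxes), key=lambda cb: cb[0], reverse=True)
    return [(conf, box) for conf, box in ranked if conf > threshold]
```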
After obtaining the recognition result, in a first possible implementation manner, the movable platform may use the recognition result to perform an obstacle avoidance decision or perform route planning.
In a second possible implementation, the recognition result may be displayed on an interface of the movable platform or on an interface of a terminal communicatively connected to the movable platform, so that a user can learn the driving condition of the movable platform and the road conditions around it. Furthermore, the point cloud data and the image data can be displayed on such an interface with the recognition result overlaid on them, so that the user can view the result together with the concrete scene and better understand the actual driving situation.
In a third possible implementation manner, the identification result may be transmitted to other components in the movable platform, so that the other components control the movable platform to work safely and reliably based on the identification result.
Correspondingly, referring to fig. 4, an object recognition device 30 is further provided in the embodiment of the present application, where the object recognition device 30 may be a movable platform, or the object recognition device 30 is installed in the movable platform as a chip; the object recognition device 30 comprises a processor 31 and a memory 32 in which a computer program is stored;
the processor 31, when executing the computer program, realizes the steps of:
acquiring point cloud data and image data of an object to be identified, and acquiring a first fusion characteristic according to the point cloud data and the image data; acquiring a first image characteristic according to the image data;
acquiring a first size coefficient according to the first fusion characteristic; acquiring a second size coefficient according to the first image characteristic; the first size coefficient and the second size coefficient represent the probability that the object to be identified belongs to the object with the target size range;
acquiring a second fusion characteristic according to the first fusion characteristic and the first size coefficient; acquiring a second image characteristic according to the first image characteristic and the second size coefficient;
and identifying the object to be identified according to the second fusion characteristic and the second image characteristic.
In one embodiment, the point cloud data and the image data of the object to be recognized are obtained by sampling the space through a sensor of the movable platform.
In an embodiment, the processor 31 is further configured to: acquiring a first fusion characteristic according to the point cloud density of each voxel in the point cloud data and the pixel value in the image data; wherein the voxel is obtained by performing grid division on the point cloud data.
In an embodiment, the first fused feature and the first image feature are represented in the form of a feature map.
The first size coefficient characterizes a probability that each location in a feature map that includes the first fused feature belongs to an object of a target size range; and the second size factor characterizes a probability that each location in a feature map comprising the first image feature belongs to an object of a target size range.
In an embodiment, the processor 31 is further configured to: determining a first projection position of the point cloud data in a two-dimensional space based on the projection relation of the point cloud to an image; acquiring a pixel value at the first projection position in the image data; generating point cloud data comprising the pixel values according to the pixel values and the point cloud data; and acquiring the first fusion feature according to the point cloud data comprising the pixel values and the image data.
In one embodiment, the projection relationship of the point cloud to the image is obtained based on external parameters of a point cloud coordinate system to a camera coordinate system and internal parameters of a camera.
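A brief sketch of this projection, assuming a pinhole camera model with extrinsics [R|t] from the point cloud (laser radar) coordinate system to the camera coordinate system and an intrinsic matrix K (the variable names are illustrative):

```python
import numpy as np

def project_points_to_image(points, R, t, K):
    """Project (N, 3) point cloud points into pixel coordinates.

    R, t: external parameters from the point cloud frame to the camera frame.
    K: 3 x 3 camera internal parameter (intrinsic) matrix.
    Returns (N, 2) pixel coordinates and a mask of points in front of the camera.
    """
    cam = points @ R.T + t          # point cloud frame -> camera frame
    in_front = cam[:, 2] > 0        # keep points with positive depth
    uv = (K @ cam.T).T              # apply intrinsics
    uv = uv[:, :2] / uv[:, 2:3]     # perspective division
    return uv, in_front
```

The pixel value at each projected position can then be sampled from the image and attached to the corresponding point, giving point cloud data that includes pixel values as described above.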
In an embodiment, the processor 31 is further configured to: splicing the point cloud data and the image data to obtain spliced data and inputting the spliced data into an object identification model; and performing feature extraction on the splicing information through a first feature extraction network in the object recognition model to obtain the first fusion feature.
In an embodiment, the first image feature is obtained by performing feature extraction on the image data.
In an embodiment, the processor 31 is further configured to: inputting the image data into an object recognition model, and performing feature extraction on the image data through a second feature extraction network in the object recognition model to obtain the first image feature.
In one embodiment, the first size coefficient is obtained from the first fused feature through a first size coefficient extraction network in an object recognition model; and the second size coefficient is obtained from the first image feature through a second size coefficient extraction network in the object recognition model.
In one embodiment, the first size coefficient extraction network and the second size coefficient extraction network each include convolutional layers.
In an embodiment, the first size coefficient is obtained by performing a convolution operation on the first fusion feature through the first size coefficient extraction network; and the second size coefficient is obtained by performing convolution operation on the first image feature through the second size coefficient extraction network.
In one embodiment, the loss function used by the object recognition model in the training process at least comprises: a first loss function for optimizing the first size coefficient extraction network and a second loss function for optimizing the second size coefficient extraction network.
In one embodiment, the optimization objective of the first loss function includes: increasing the first size factor if the sample data indicates an object that belongs to a target size range; the sample data comprises point cloud sample data and image sample data; and, the optimization objective of the second loss function comprises: increasing the second size factor if the image sample data indicates an object belonging to a target size range.
In an embodiment, the first loss function is used for describing a difference between a confidence prediction value of an object belonging to the target size range, which is obtained by the object recognition model from sample data, and a confidence true value of the object belonging to the target size range, which corresponds to the sample data; the sample data includes point cloud sample data and image sample data.
The second loss function is used for describing the difference between the confidence degree predicted value of the object which belongs to the target size range and is obtained by the object recognition model from the image sample data and the confidence degree true value of the object which belongs to the target size range and corresponds to the image sample data.
In one embodiment, the second fused feature is a sum of the first fused feature and the first size factor; and/or the second image feature is a sum of the first image feature and the second size factor.
In an embodiment, the processor 31 is further configured to: determining a second projection position of the second fusion feature in the two-dimensional space based on the projection relation from the point cloud to the image; acquiring an image feature of the second image features at the second projection position; and obtaining a third fusion feature according to the image feature and the corresponding second fusion feature, and identifying the object to be identified according to the third fusion feature.
In an embodiment, the third fused feature is a mean value of the image feature and the corresponding second fused feature; alternatively, the third fused feature is the larger of the image feature and the corresponding second fused feature.
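A small sketch of this combination rule (mean, or the larger value taken element-wise), with the second fusion feature and the image feature sampled at its projected position given as arrays; the names are illustrative:

```python
import numpy as np

def third_fused_feature(fused_feat, image_feat_at_proj, mode="mean"):
    """Combine the second fusion feature with the image feature sampled at its
    projected position, by mean or by taking the larger value element-wise."""
    if mode == "mean":
        return (fused_feat + image_feat_at_proj) / 2.0
    return np.maximum(fused_feat, image_feat_at_proj)
```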
In an embodiment, the processor 31 is further configured to: identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result; the recognition result at least comprises: confidence and/or state information of the object to be identified; the confidence coefficient is used for representing the probability that the object to be recognized belongs to the obstacle.
In one embodiment, the status information includes at least one of: size information, position information, and orientation information.
In an embodiment, the confidence and the state information of the object to be recognized are obtained by processing the second fusion feature and the second image feature by using an object prediction network in an object recognition model.
In an embodiment, the loss functions used by the object recognition model in the training process at least include a third loss function for describing state differences and a fourth loss function for describing confidence level differences.
The state difference includes: the object identification model obtains the difference between the predicted size, the predicted position and/or the predicted orientation of the object from the sample data and the real size, the real position and/or the real orientation of the object corresponding to the sample data respectively; the sample data includes point cloud sample data and image sample data.
The confidence difference comprises: and the object recognition model obtains the difference between the confidence coefficient predicted value of the object obtained from the sample data and the confidence coefficient real value of the object corresponding to the sample data.
In one embodiment, the recognition result is data with a confidence level greater than a preset threshold.
In an embodiment, the processor 31 is further configured to: and identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result, wherein the identification result is used for carrying out obstacle avoidance decision or mobile route planning on the movable platform.
In an embodiment, the processor 31 is further configured to: identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result; and the identification result is used for displaying on an interface of the movable platform or an interface of a terminal device in communication connection with the movable platform.
In one embodiment, the point cloud data is acquired by using a laser radar arranged on a movable platform or a shooting device with a depth information acquisition function; and/or the image data is acquired by a shooting device arranged on the movable platform.
In one embodiment, the movable platform comprises: an unmanned aerial vehicle, an automobile, an unmanned ship, or a robot.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The various embodiments described herein may be implemented using a computer-readable medium such as computer software, hardware, or any combination thereof. For a hardware implementation, the embodiments described herein may be implemented using at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a processor, a controller, a microcontroller, a microprocessor, and an electronic unit designed to perform the functions described herein. For a software implementation, the implementation such as a process or a function may be implemented with a separate software module that allows performing at least one function or operation. The software codes may be implemented by software applications (or programs) written in any suitable programming language, which may be stored in memory and executed by the controller.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Accordingly, referring to fig. 5, an embodiment of the present application further provides a movable platform 100, including: a body 101; a power system 102, installed in the body 101, for providing power to the movable platform 100; and the object recognition device 30 described above.
Optionally, the movable platform 100 is a vehicle, a drone, an unmanned ship, or a movable robot.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of an apparatus to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium, instructions in the storage medium, when executed by a processor of a terminal, enable the terminal to perform the above-described method.
The method and apparatus provided by the embodiments of the present application are described in detail above, and the principle and the embodiments of the present application are explained herein by applying specific examples, and the description of the embodiments above is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, the specific implementation manner and the application scope may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (56)

1. An object recognition method, comprising:
acquiring point cloud data and image data of an object to be identified, and acquiring a first fusion characteristic according to the point cloud data and the image data; acquiring a first image characteristic according to the image data;
acquiring a first size coefficient according to the first fusion characteristic; acquiring a second size coefficient according to the first image characteristic; the first size coefficient and the second size coefficient represent the probability that the object to be identified belongs to the object with the target size range;
acquiring a second fusion characteristic according to the first fusion characteristic and the first size coefficient; acquiring a second image characteristic according to the first image characteristic and the second size coefficient;
and identifying the object to be identified according to the second fusion characteristic and the second image characteristic.
2. The method of claim 1, wherein the point cloud data and the image data of the object to be identified are obtained by spatially sampling a sensor of a movable platform.
3. The method of claim 1, wherein the obtaining a first fused feature from the point cloud data and the image data comprises:
acquiring a first fusion characteristic according to the point cloud density of each voxel in the point cloud data and the pixel value in the image data; wherein the voxel is obtained by performing grid division on the point cloud data.
4. The method of claim 1, wherein the first fused feature and the first image feature are represented in the form of a feature map;
the first size coefficient characterizes a probability that each location in a feature map that includes the first fused feature belongs to an object of a target size range;
the second size factor characterizes a probability that each location in a feature map that includes the first image feature belongs to an object of a target size range.
5. The method of claim 1, wherein the obtaining a first fused feature from the point cloud data and the image data comprises:
determining a first projection position of the point cloud data in a two-dimensional space based on the projection relation of the point cloud to an image;
acquiring a pixel value at the first projection position in the image data;
generating point cloud data comprising the pixel values according to the pixel values and the point cloud data;
and acquiring the first fusion characteristic according to the point cloud data comprising the pixel value and the image data.
6. The method of claim 5, wherein the projection relationship of the point cloud to the image is derived based on external parameters of the point cloud coordinate system to the camera coordinate system and internal parameters of the camera.
7. The method of claim 1, wherein the obtaining a first fused feature from the point cloud data and the image data comprises:
splicing the point cloud data and the image data to obtain spliced data and inputting the spliced data into an object identification model;
and performing feature extraction on the splicing information through a first feature extraction network in the object recognition model to obtain the first fusion feature.
8. The method of claim 1, wherein the first image feature is obtained by feature extraction of the image data.
9. The method of claim 1 or 8, wherein said obtaining a first image feature from said image data comprises:
inputting the image data into an object recognition model, and performing feature extraction on the image data through a second feature extraction network in the object recognition model to obtain the first image feature.
10. The method of claim 1, wherein the first size coefficient is obtained from the first fused feature by a first size coefficient extraction network in an object recognition model;
and the second size coefficient is obtained from the first image feature through a second size coefficient extraction network in the object recognition model.
11. The method of claim 10, wherein the first size coefficient extraction network and the second size coefficient extraction network each comprise convolutional layers.
12. The method according to claim 10 or 11, wherein the first size coefficient is obtained by performing a convolution operation on the first fused feature by the first size coefficient extraction network;
and the second size coefficient is obtained by performing convolution operation on the first image feature through the second size coefficient extraction network.
13. The method of claim 10, wherein the loss function employed by the object recognition model in the training process comprises at least: a first loss function for optimizing the first size coefficient extraction network and a second loss function for optimizing the second size coefficient extraction network.
14. The method of claim 13, wherein the optimization objective of the first loss function comprises: increasing the first size factor if the sample data indicates an object that belongs to a target size range; the sample data comprises point cloud sample data and image sample data;
and, the optimization objective of the second loss function comprises: increasing the second size factor if the image sample data indicates an object belonging to a target size range.
15. The method according to claim 13 or 14,
the first loss function is used for describing the difference between a confidence coefficient predicted value of an object which is acquired by the object recognition model from sample data and belongs to the target size range and a confidence coefficient true value of the object which corresponds to the sample data and belongs to the target size range; the sample data comprises point cloud sample data and image sample data;
the second loss function is used for describing the difference between the confidence degree predicted value of the object which belongs to the target size range and is obtained by the object recognition model from the image sample data and the confidence degree true value of the object which belongs to the target size range and corresponds to the image sample data.
16. The method of claim 1, wherein the second fused feature is a sum of the first fused feature and the first size factor;
and/or the second image feature is a sum of the first image feature and the second size factor.
17. The method according to claim 1, wherein the identifying the object to be identified according to the second fusion feature and the second image feature comprises:
determining a second projection position of the second fusion feature in the two-dimensional space based on the projection relation from the point cloud to the image;
acquiring an image feature of the second image features at the second projection position;
and obtaining a third fusion feature according to the image feature and the corresponding second fusion feature, and identifying the object to be identified according to the third fusion feature.
18. The method of claim 17, wherein the third fused feature is a mean of the image feature and a corresponding second fused feature;
alternatively, the third fused feature is the larger of the image feature and the corresponding second fused feature.
19. The method of claim 1, further comprising:
identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result; the recognition result at least comprises: confidence and/or state information of the object to be identified; the confidence coefficient is used for representing the probability that the object to be recognized belongs to the obstacle.
20. The method of claim 19, wherein the status information comprises at least one of: size information, position information, and orientation information.
21. The method according to claim 19, wherein the confidence and state information of the object to be recognized are obtained by processing the second fusion feature and the second image feature using an object prediction network in an object recognition model.
22. The method of claim 21, wherein the loss functions employed by the object recognition model in the training process include at least a third loss function for describing state variance and a fourth loss function for describing confidence variance;
the state difference includes: the object identification model obtains the predicted size, the predicted position and/or the predicted orientation of the object from the sample data, and the difference between the real size, the real position and/or the real orientation of the object corresponding to the sample data respectively; the sample data comprises point cloud sample data and image sample data;
the confidence difference comprises: and the object recognition model obtains the difference between the confidence coefficient predicted value of the object obtained from the sample data and the confidence coefficient real value of the object corresponding to the sample data.
23. The method of claim 19, wherein the recognition result is data with a confidence level greater than a preset threshold.
24. The method of claim 1, further comprising:
and identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result, wherein the identification result is used for carrying out obstacle avoidance decision or mobile route planning on the movable platform.
25. The method of claim 1, further comprising:
identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result;
and displaying the identification result on an interface of a movable platform or an interface of a terminal device in communication connection with the movable platform.
26. The method according to claim 1, wherein the point cloud data is acquired by using a laser radar or a camera with a depth information acquisition function, which is disposed on a movable platform; and/or the image data is acquired by a shooting device arranged on the movable platform.
27. The method of claim 26, wherein the movable platform comprises: an unmanned aerial vehicle, an automobile, an unmanned ship, or a robot.
28. An object recognition apparatus comprising a processor and a memory storing a computer program;
the processor, when executing the computer program, implements the steps of:
acquiring point cloud data and image data of an object to be identified, and acquiring a first fusion characteristic according to the point cloud data and the image data; acquiring a first image characteristic according to the image data;
acquiring a first size coefficient according to the first fusion characteristic; acquiring a second size coefficient according to the first image characteristic; the first size coefficient and the second size coefficient represent the probability that the object to be identified belongs to the object with the target size range;
acquiring a second fusion characteristic according to the first fusion characteristic and the first size coefficient; acquiring a second image characteristic according to the first image characteristic and the second size coefficient;
and identifying the object to be identified according to the second fusion characteristic and the second image characteristic.
29. The apparatus of claim 28, wherein the point cloud data and the image data of the object to be identified are obtained by spatially sampling a sensor of the movable platform.
30. The apparatus of claim 28, wherein the processor is further configured to: acquiring a first fusion characteristic according to the point cloud density of each voxel in the point cloud data and the pixel value in the image data; wherein the voxel is obtained by performing grid division on the point cloud data.
31. The apparatus of claim 28, wherein the first fused feature and the first image feature are represented in a feature map;
the first size coefficient characterizes a probability that each location in a feature map that includes the first fused feature belongs to an object of a target size range;
the second size factor characterizes a probability that each location in a feature map that includes the first image feature belongs to an object of a target size range.
32. The apparatus of claim 28, wherein the processor is further configured to:
determining a first projection position of the point cloud data in a two-dimensional space based on the projection relation of the point cloud to an image;
acquiring a pixel value at the first projection position in the image data;
generating point cloud data comprising the pixel values according to the pixel values and the point cloud data;
and acquiring the first fusion feature according to the point cloud data comprising the pixel values and the image data.
33. The apparatus of claim 32, wherein the projection relationship of the point cloud to the image is derived based on external parameters of the point cloud coordinate system to the camera coordinate system and internal parameters of the camera.
34. The apparatus of claim 28, wherein the processor is further configured to:
splicing the point cloud data and the image data to obtain spliced data and inputting the spliced data into an object identification model;
and performing feature extraction on the splicing information through a first feature extraction network in the object recognition model to obtain the first fusion feature.
35. The apparatus of claim 28, wherein the first image feature is obtained by performing feature extraction on the image data.
36. The apparatus of claim 28 or 35, wherein the processor is further configured to: inputting the image data into an object recognition model, and performing feature extraction on the image data through a second feature extraction network in the object recognition model to obtain the first image feature.
37. The apparatus of claim 28, wherein the first size coefficient is obtained from the first fused feature through a first size coefficient extraction network in an object recognition model;
and the second size coefficient is obtained from the first image feature through a second size coefficient extraction network in the object recognition model.
38. The apparatus of claim 37, wherein the first size coefficient extraction network and the second size coefficient extraction network each comprise convolutional layers.
39. The apparatus according to claim 37 or 38, wherein the first size coefficient is obtained by performing a convolution operation on the first fused feature by the first size coefficient extraction network;
and the second size coefficient is obtained by performing convolution operation on the first image feature through the second size coefficient extraction network.
40. The apparatus of claim 37, wherein the loss function employed by the object recognition model in the training process comprises at least: a first loss function for optimizing the first size coefficient extraction network and a second loss function for optimizing the second size coefficient extraction network.
41. The apparatus of claim 40, wherein the optimization objective of the first loss function comprises: increasing the first size factor if the sample data indicates an object that belongs to a target size range; the sample data comprises point cloud sample data and image sample data;
and, the optimization objective of the second loss function comprises: increasing the second size factor if the image sample data indicates an object belonging to a target size range.
42. The apparatus according to claim 40 or 41, wherein the first loss function is used to describe a difference between a confidence prediction value of an object belonging to the target size range, obtained by the object recognition model from sample data, and a confidence true value of an object belonging to the target size range corresponding to the sample data; the sample data comprises point cloud sample data and image sample data;
the second loss function is used for describing the difference between the confidence degree predicted value of the object which belongs to the target size range and is obtained by the object recognition model from the image sample data and the confidence degree true value of the object which belongs to the target size range and corresponds to the image sample data.
43. The apparatus of claim 28, wherein the second fused feature is a sum of the first fused feature and the first size factor;
and/or the second image feature is a sum of the first image feature and the second size factor.
44. The apparatus of claim 28, wherein the processor is further configured to:
determining a second projection position of the second fusion feature in the two-dimensional space based on the projection relation from the point cloud to the image;
acquiring an image feature of the second image features at the second projection position;
and obtaining a third fusion feature according to the image feature and the corresponding second fusion feature, and identifying the object to be identified according to the third fusion feature.
45. The apparatus according to claim 44, wherein the third fused feature is a mean of the image feature and the corresponding second fused feature;
alternatively, the third fused feature is the larger of the image feature and the corresponding second fused feature.
46. The apparatus of claim 28, wherein the processor is further configured to: identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result; the recognition result at least comprises: confidence and/or state information of the object to be identified; the confidence coefficient is used for representing the probability that the object to be recognized belongs to the obstacle.
47. The apparatus of claim 46, wherein the status information comprises at least one of: size information, position information, and orientation information.
48. The apparatus according to claim 46, wherein the confidence level and the state information of the object to be recognized are obtained by processing the second fusion feature and the second image feature using an object prediction network in an object recognition model.
49. The apparatus according to claim 48, wherein the loss functions employed by the object recognition model in the training process include at least a third loss function for describing state variance and a fourth loss function for describing confidence variance;
the state difference includes: the object identification model obtains the difference between the predicted size, the predicted position and/or the predicted orientation of the object from the sample data and the real size, the real position and/or the real orientation of the object corresponding to the sample data respectively; the sample data comprises point cloud sample data and image sample data;
the confidence difference comprises: and the object recognition model obtains the difference between the confidence coefficient predicted value of the object obtained from the sample data and the confidence coefficient real value of the object corresponding to the sample data.
50. The apparatus according to claim 46, wherein the recognition result is data with a confidence level greater than a preset threshold.
51. The apparatus of claim 28, wherein the processor is further configured to: and identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result, wherein the identification result is used for carrying out obstacle avoidance decision or mobile route planning on the movable platform.
52. The apparatus of claim 28, wherein the processor is further configured to: identifying the object to be identified according to the second fusion characteristic and the second image characteristic to generate an identification result; and the identification result is used for displaying on an interface of the movable platform or an interface of a terminal device in communication connection with the movable platform.
53. The apparatus of claim 28, wherein the point cloud data is obtained by a laser radar disposed on a movable platform or a camera with a depth information acquisition function; and/or the image data is acquired by a shooting device arranged on the movable platform.
54. The apparatus of claim 53, wherein the movable platform comprises: an unmanned aerial vehicle, an automobile, an unmanned ship, or a robot.
55. A movable platform, comprising:
a body;
the power system is arranged in the machine body and used for providing power for the movable platform; and,
an object identification device as claimed in any of claims 28 to 54.
56. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 27.
CN202080071443.3A 2020-12-17 2020-12-17 Object recognition method, device, movable platform and storage medium Pending CN114556445A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/137298 WO2022126522A1 (en) 2020-12-17 2020-12-17 Object recognition method, apparatus, movable platform, and storage medium

Publications (1)

Publication Number Publication Date
CN114556445A true CN114556445A (en) 2022-05-27

Family

ID=81668390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080071443.3A Pending CN114556445A (en) 2020-12-17 2020-12-17 Object recognition method, device, movable platform and storage medium

Country Status (2)

Country Link
CN (1) CN114556445A (en)
WO (1) WO2022126522A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882020A (en) * 2022-07-06 2022-08-09 深圳市信润富联数字科技有限公司 Method, device and equipment for detecting defects of product and computer readable medium
CN114998567A (en) * 2022-07-18 2022-09-02 中国科学院长春光学精密机械与物理研究所 Infrared point group target identification method based on multi-mode feature discrimination

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740873B (en) * 2023-08-08 2023-10-03 深圳市劳恩科技有限公司 Measurement detection system and method based on optical sensing technology

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305556A (en) * 2016-04-20 2017-10-31 索尼公司 Device and method for 3D printing
CN111742344A (en) * 2019-06-28 2020-10-02 深圳市大疆创新科技有限公司 Image semantic segmentation method, movable platform and storage medium
CN111199230B (en) * 2020-01-03 2023-07-07 腾讯科技(深圳)有限公司 Method, device, electronic equipment and computer readable storage medium for target detection
CN111339973A (en) * 2020-03-03 2020-06-26 北京华捷艾米科技有限公司 Object identification method, device, equipment and storage medium


Also Published As

Publication number Publication date
WO2022126522A1 (en) 2022-06-23

Similar Documents

Publication Publication Date Title
CN110400363B (en) Map construction method and device based on laser point cloud
CN114556445A (en) Object recognition method, device, movable platform and storage medium
EP3506158A1 (en) Method, apparatus, and device for determining lane line on road
US20230099113A1 (en) Training method and apparatus for a target detection model, target detection method and apparatus, and medium
CN111222395A (en) Target detection method and device and electronic equipment
CN111602138B (en) Object detection system and method based on artificial neural network
CN105627932A (en) Distance measurement method and device based on binocular vision
CN111860695A (en) Data fusion and target detection method, device and equipment
EP3621041B1 (en) Three-dimensional representation generating system
CN112634340A (en) Method, device, equipment and medium for determining BIM (building information modeling) model based on point cloud data
CN114863380B (en) Lane line identification method and device and electronic equipment
CN111739099B (en) Falling prevention method and device and electronic equipment
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN115731545A (en) Cable tunnel inspection method and device based on fusion perception
US20220215576A1 (en) Information processing device, information processing method, and computer program product
CN111813882B (en) Robot map construction method, device and storage medium
CN112699748B (en) Human-vehicle distance estimation method based on YOLO and RGB image
CN113420704A (en) Object identification method and device based on visual sensor and robot
CN116543283A (en) Multimode target detection method considering modal uncertainty
CN111611836A (en) Ship detection model training and ship tracking method based on background elimination method
CN111553474A (en) Ship detection model training method and ship tracking method based on unmanned aerial vehicle video
CN114359891B (en) Three-dimensional vehicle detection method, system, device and medium
CN114140660A (en) Vehicle detection method, device, equipment and medium
US10896333B2 (en) Method and device for aiding the navigation of a vehicle
CN113624223A (en) Indoor parking lot map construction method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240515

Address after: Building 3, Xunmei Science and Technology Plaza, No. 8 Keyuan Road, Science and Technology Park Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, 518057, 1634

Applicant after: Shenzhen Zhuoyu Technology Co.,Ltd.

Country or region after: China

Address before: 518057 Shenzhen Nanshan High-tech Zone, Shenzhen, Guangdong Province, 6/F, Shenzhen Industry, Education and Research Building, Hong Kong University of Science and Technology, No. 9 Yuexingdao, South District, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: SZ DJI TECHNOLOGY Co.,Ltd.

Country or region before: China