CN113255444A

CN113255444A - Training method of image recognition model, image recognition method and device

Info

Publication number: CN113255444A
Application number: CN202110421118.1A
Authority: CN
Inventors: 彭亮; 刘飞; 邓丹; 钱炜; 杨政; 何晓飞
Original assignee: Hangzhou Fabu Technology Co Ltd
Current assignee: Hangzhou Fabu Technology Co Ltd
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2021-08-13

Abstract

The application provides a training method of an image recognition model, an image recognition method and a device, wherein the image recognition model is applied to a monocular detector, and the method comprises the following steps: acquiring first point cloud data acquired by laser radar equipment; determining a three-dimensional object frame corresponding to a target object in the first point cloud data according to the first point cloud data; and training an initial image recognition model according to the three-dimensional object frame to obtain the image recognition model, wherein the image recognition model is used for recognizing the object in the image to be recognized. The method and the device can improve the identification accuracy of the image identification model, and the image identification model runs in any monocular 3D detector, so that the cost of object detection can be reduced while the detection accuracy and the identification accuracy of the target object are ensured.

Description

Training method of image recognition model, image recognition method and device

Technical Field

The present application relates to the field of image processing, and in particular, to a training method for an image recognition model, an image recognition method, and an image recognition device.

Background

In the field of automatic driving, three-dimensional object detection is generally required to improve the safety of vehicle driving and to avoid collision between the vehicle and other objects on the road.

Currently, three-dimensional object detection is generally performed with a laser radar device, but laser radar devices are expensive and have a limited working range. To address this, a monocular detector can be used instead of a laser radar device to detect three-dimensional objects. However, it is difficult for a monocular method based on a monocular detector to capture accurate depth information from an image. To enable the monocular detector to capture depth information, a depth image predicted by a pre-trained depth estimator can be used as a network input to guide the monocular detector in learning depth, so that the depth information in the image is captured.

However, in the above manner, the depth image predicted by the depth estimator may lose a part of information, resulting in low detection accuracy of the three-dimensional object.

Disclosure of Invention

The embodiment of the application provides a training method of an image recognition model, an image recognition method and a device, which can improve the recognition accuracy of the image recognition model, and the image recognition model is operated in any monocular 3D detector, so that the detection accuracy and the recognition accuracy of a target object are ensured, and simultaneously, the cost of object detection can be reduced.

In a first aspect, an embodiment of the present application provides a training method for an image recognition model, where the image recognition model is applied to a monocular detector, and the method includes:

acquiring first point cloud data acquired by laser radar equipment;

determining a three-dimensional object frame corresponding to a target object in the first point cloud data according to the first point cloud data;

training an initial image recognition model according to the first point cloud data and the three-dimensional object frame to obtain the image recognition model, wherein the image recognition model is used for recognizing an object in an image to be recognized.

In a possible implementation manner, the determining, according to the first point cloud data, a three-dimensional object frame corresponding to a target object in the first point cloud data includes:

and inputting the first point cloud data into a pre-trained three-dimensional recognition model based on the laser radar to obtain a three-dimensional object frame corresponding to a target object in the first point cloud data, wherein the three-dimensional recognition model is obtained by training an initial recognition model by adopting a three-dimensional object marking frame corresponding to each object in the second point cloud data.

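In a possible implementation manner, the determining, according to the first point cloud data, a three-dimensional object frame corresponding to a target object in the first point cloud data includes:

acquiring an RGB image corresponding to the first point cloud data;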

segmenting the RGB image to obtain a two-dimensional frame and a semantic mask;

and determining a three-dimensional object frame corresponding to the target object in the first point cloud data according to the two-dimensional frame and the semantic mask.

In a possible implementation manner, the determining, according to the two-dimensional box and the semantic mask, a three-dimensional box corresponding to a target object in the first point cloud data includes:

determining third point cloud data corresponding to a target object in the first point cloud data according to the two-dimensional frame and the semantic mask;

and determining a minimum three-dimensional bounding box covering third point cloud data corresponding to the target object, and determining the minimum three-dimensional bounding box as a three-dimensional object frame corresponding to the target object.
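As a minimal sketch of the bounding-box step, the Python snippet below computes the tightest axis-aligned box around a set of object points. The axis-aligned form is a simplifying assumption made here for illustration; the text only requires a minimum three-dimensional bounding box and does not fix its orientation. The function name `axis_aligned_box` and the sample points are hypothetical.

```python
import numpy as np

def axis_aligned_box(points):
    """Return the tightest axis-aligned 3D box covering an (N, 3) point array.

    Assumption: an axis-aligned box stands in for the "minimum
    three-dimensional bounding box"; an oriented box would be tighter.
    """
    points = np.asarray(points, dtype=float)
    if points.ndim != 2 or points.shape[1] != 3 or len(points) == 0:
        raise ValueError("expected a non-empty (N, 3) array of x, y, z points")
    lo = points.min(axis=0)  # smallest x, y, z
    hi = points.max(axis=0)  # largest x, y, z
    return {"center": (lo + hi) / 2.0, "size": hi - lo}

# Hypothetical points belonging to one target object.
object_points = np.array([[1.0, 2.0, 0.0],
                          [3.0, 2.5, 1.5],
                          [2.0, 4.0, 0.5]])
box = axis_aligned_box(object_points)
```

Representing the box as a center plus a size keeps it easy to compare against a predicted box later in training.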

In a possible implementation manner, the determining, according to the two-dimensional frame and the semantic mask, third point cloud data corresponding to a target object in the first point cloud data includes:

determining initial point cloud data corresponding to the target object according to the two-dimensional frame and the semantic mask;

clustering the initial point cloud data corresponding to the target object to obtain a plurality of clusters;

and determining the initial point cloud data in the cluster with the most initial point cloud data in the plurality of clusters as the third point cloud data.
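The three steps above (select points by the two-dimensional frame and mask, cluster them, keep the largest cluster) can be sketched as follows. Several things here are assumptions for illustration only: the lidar points are taken to be already projected to pixel coordinates (`pixels`), the clustering is a simple distance-threshold grouping chosen because the text does not name a clustering algorithm, and all function names and the `radius` parameter are hypothetical.

```python
import numpy as np

def select_points(points, pixels, box_2d, mask):
    """Keep points whose pixel projection lies inside the 2D box and on the mask."""
    u = pixels[:, 0].astype(int)
    v = pixels[:, 1].astype(int)
    x1, y1, x2, y2 = box_2d
    h, w = mask.shape
    inside = ((u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
              & (u >= 0) & (u < w) & (v >= 0) & (v < h))
    keep = np.zeros(len(points), dtype=bool)
    keep[inside] = mask[v[inside], u[inside]].astype(bool)
    return points[keep]

def cluster_by_distance(points, radius):
    """Label points so that points linked by gaps smaller than `radius` share a label."""
    labels = np.full(len(points), -1, dtype=int)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            dist = np.linalg.norm(points - points[i], axis=1)
            neighbours = np.where((dist < radius) & (labels == -1))[0]
            labels[neighbours] = current
            stack.extend(neighbours.tolist())
        current += 1
    return labels

def largest_cluster(points, radius):
    """Return the points of the cluster that contains the most points."""
    if len(points) == 0:
        return points
    labels = cluster_by_distance(points, radius)
    return points[labels == np.bincount(labels).argmax()]

# Hypothetical data: five lidar points with their pixel projections.
points = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [0.1, 0.1, 0.0],
                   [5.0, 5.0, 5.0], [9.0, 9.0, 9.0]])
pixels = np.array([[2, 2], [3, 2], [2, 3], [3, 3], [8, 8]])
mask = np.zeros((10, 10), dtype=bool)
mask[1:5, 1:5] = True
selected = select_points(points, pixels, (1, 1, 4, 4), mask)
object_points = largest_cluster(selected, radius=0.5)
```

Keeping only the largest cluster discards stray points, such as background points that happen to project inside the mask, before the bounding box is fitted.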

In a second aspect, the present application provides an image recognition method, including:

acquiring an image to be identified;

inputting the image to be recognized into an image recognition model to obtain an object in the image to be recognized, wherein the image recognition model is obtained by training an initial image recognition model according to point cloud data and a three-dimensional object frame corresponding to a target object in the point cloud data, and the point cloud data is acquired through laser radar equipment.

In a third aspect, an embodiment of the present application provides a training apparatus for an image recognition model, including:

an acquisition unit, configured to acquire first point cloud data collected by the laser radar device;

a processing unit, configured to determine, according to the first point cloud data, a three-dimensional object frame corresponding to a target object in the first point cloud data; and

a training unit, configured to train an initial image recognition model according to the three-dimensional object frame to obtain the image recognition model, where the image recognition model is used for recognizing an object in an image to be recognized.

In a possible implementation manner, the processing unit is specifically configured to:

acquiring an RGB image corresponding to the first point cloud data; segmenting the RGB image to obtain a two-dimensional frame and a semantic mask; and determining a three-dimensional object frame corresponding to the target object in the first point cloud data according to the two-dimensional frame and the semantic mask.

determining third point cloud data corresponding to a target object in the first point cloud data according to the two-dimensional frame and the semantic mask; and determining a minimum three-dimensional bounding box covering third point cloud data corresponding to the target object, and determining the minimum three-dimensional bounding box as a three-dimensional object box corresponding to the target object.

determining initial point cloud data corresponding to the target object according to the two-dimensional frame and the semantic mask; clustering the initial point cloud data corresponding to the target object to obtain a plurality of clusters; and determining the initial point cloud data in the cluster with the most initial point cloud data in the plurality of clusters as the third point cloud data.

In a fourth aspect, the present application provides an image recognition apparatus comprising:

an acquisition unit, configured to acquire an image to be recognized; and

a processing unit, configured to input the image to be recognized into an image recognition model to obtain an object in the image to be recognized, where the image recognition model is obtained by training an initial image recognition model according to point cloud data and a three-dimensional object frame corresponding to a target object in the point cloud data, and the point cloud data is collected by a laser radar device.

In a fifth aspect, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the training method for the image recognition model described in any one of the possible implementation manners of the first aspect or the image recognition method described in any one of the possible implementation manners of the second aspect.

In a sixth aspect, an embodiment of the present application further provides a server, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for training an image recognition model in any one of the possible implementations of the first aspect.

In a seventh aspect, an embodiment of the present application further provides a vehicle, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the image recognition method in any one of the possible implementation manners of the second aspect.

In an eighth aspect, an embodiment of the present application further provides a computer program product, where the computer program product includes: a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the method for training an image recognition model described in any of the possible implementations of the first aspect above or the method for image recognition described in any of the possible implementations of the second aspect above.

Therefore, according to the training method of the image recognition model, the image recognition method and the device provided by the embodiments of the present application, when the image recognition model is trained, the three-dimensional object frame corresponding to the target object is obtained directly from the first point cloud data collected by the laser radar device, and the training of the initial image recognition model is guided by this three-dimensional object frame, so that no information about the target object is lost during training and the recognition accuracy of the image recognition model is improved. In addition, the training method is intuitive, simple and effective, and the image recognition model can run in any monocular 3D detector, so that the cost of object detection can be reduced while the detection precision and recognition accuracy for the target object are ensured.

Drawings

Fig. 1 is a system architecture diagram of a training method of an image recognition model according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a training method for an image recognition model according to an embodiment of the present disclosure;

Fig. 3 is a schematic flowchart of another training method for an image recognition model according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of a further training method for an image recognition model according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of yet another training method for an image recognition model according to an embodiment of the present application;

Fig. 6 is a schematic flowchart of an image recognition method according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a training apparatus for an image recognition model according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of a vehicle according to an embodiment of the present application.

Specific embodiments of the present disclosure are shown in the foregoing drawings and are described in more detail below. The drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone, where A and B may each be singular or plural. In the text of the present application, the character "/" generally indicates an "or" relationship between the objects before and after it.

The training method and the image recognition method for the image recognition model provided by the embodiment of the application can be applied to scenes such as automatic driving or intelligent transportation and the like, and can also be applied to other scenes needing to detect three-dimensional objects. In the present application, an automatic driving scenario will be described as an example.

In the field of automatic driving, it is very important for a vehicle to perform three-dimensional object detection, which can avoid collision with other objects on the road. Therefore, in order to improve the safety of vehicle travel, a three-dimensional object detection device in an autonomous vehicle plays an important role.

Currently, because lidar devices are expensive and have a limited operating range, monocular detectors are commonly used to detect three-dimensional objects. However, because recovering depth from a single monocular image is an ill-posed problem, it is difficult for a monocular method to capture accurate depth information from the image. A lidar point cloud, by contrast, provides accurate depth measurements of a scene and can therefore guide the monocular detector in learning depth information. To this end, multi-stage pipelines based on depth maps have been developed. This type of method splits the training process into multiple stages. In the first stage, the lidar point cloud is projected onto the image plane to train a depth estimator. In the second stage, the depth map predicted by the pre-trained depth estimator is used as a network input to train a monocular detector based on depth maps. However, such complex pipelines use the lidar point cloud only implicitly, through the intermediate depth estimator, and therefore lose part of the valuable information, such as part of the depth information, which results in poor accuracy of the three-dimensional objects detected by the monocular detector.

In the embodiment of the application, the above problems are taken into consideration, and the method for training the image recognition model is provided. Furthermore, when the object in the image to be recognized is detected through the image recognition model, the accuracy of the detected object is high.

Fig. 1 is a system architecture diagram of a training method of an image recognition model according to an embodiment of the present application, as shown in fig. 1, the system includes a laser radar device 11, a server 12, and a vehicle 13, where the vehicle 13 is provided with a monocular detector. The networks used between them may include various types of wireless networks, such as, but not limited to: the internet, local area networks, WIFI, WLAN, cellular communication networks (GPRS, CDMA, 2G/3G/4G/5G cellular networks), satellite communication networks, and so forth.

As shown in Fig. 1, for example, the laser radar device 11 may acquire a radar point cloud map in real time and send it to the server 12 in real time over the wireless network. When the autonomous vehicle 13 travels on a road, the on-board monocular detector may acquire RGB images of the road in real time, and the hardware terminal of the vehicle 13 may send the position information of the vehicle 13 and the acquired RGB images to the server 12 in real time over the wireless network. After receiving the information sent by the hardware terminal of the vehicle 13, the server 12 trains the image recognition model according to the received information and sends the trained image recognition model to the vehicle 13, which uses it to recognize objects in an image to be recognized.

Hereinafter, a method for training an image recognition model provided by the present application will be described in detail by using specific embodiments. It is to be understood that the following detailed description may be combined with other embodiments, and that the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 2 is a flowchart illustrating a training method of an image recognition model according to an embodiment of the present disclosure, where the training method of the image recognition model may be executed by software and/or a hardware device, for example, the hardware device may be a terminal or a server. For example, referring to fig. 2, the training method of the image recognition model may include:

S201, acquiring first point cloud data collected by a laser radar device.

For example, the first point cloud data includes point cloud data of the scene in which the target object to be detected is located. The first point cloud data may be captured directly by the laser radar device, or may be obtained through offline collection, which can reduce cost.

The target object may be a person, a car, a sign, or the like. The first point cloud data comprises accurate depth information, and the depth information can accurately determine the position of the target object, so that the image recognition model trained by the first point cloud data with the depth information has higher detection precision.

In this step, when the laser radar device collects the initial point cloud data, the initial point cloud data can be sent to the server in real time, and the server determines the first point cloud data from the initial point cloud data according to the position of the target object.

S202, determining a three-dimensional object frame corresponding to the target object in the first point cloud data according to the first point cloud data.

In this step, the first point cloud data may include at least one target object, and when the first point cloud data includes a plurality of target objects, each target object corresponds to a respective three-dimensional object frame, where the three-dimensional object frame is used to identify the target object.

For example, the process of determining the three-dimensional object frame corresponding to the target object in the first point cloud data may be implemented in two ways, one is in a supervised mode, and the other is in an unsupervised mode. A specific process for determining the three-dimensional object frame in the above two ways is explained in the following embodiments.

The three-dimensional object frame obtained by the method is obtained by directly operating the first point cloud data, so that any relevant information about the target object is not lost, and the integrity of the information of the target object is ensured.

S203, training the initial image recognition model according to the three-dimensional object frame to obtain an image recognition model, wherein the image recognition model is used for recognizing an object in the image to be recognized.

The initial image recognition model is mainly applied to a monocular detector, so it may adopt the image recognition model of an existing monocular 3D detector, such as SMOKE or CenterNet.

For example, an initial recognition frame of the target object may be obtained by using the initial image recognition model to recognize an RGB image acquired by the monocular detector. The RGB image and the first point cloud data are acquired from the same scene at the same time, and the objects in the RGB image correspond one-to-one to the objects included in the first point cloud data.

When the initial image recognition model is trained, the initial image recognition model can be trained according to the three-dimensional object frame and the initial recognition frame, so that the image recognition model is obtained. Specifically, the training process of the initial image recognition model is to evaluate the consistency degree of an initial recognition frame of a target object and a three-dimensional object frame corresponding to the target object in the first point cloud data through a monocular loss function, if the consistency degree reaches a preset threshold value, the training of the initial image recognition model is completed, and the trained initial image recognition model is the final image recognition model; if the consistency degree does not reach the preset threshold value, the parameters in the initial image recognition model need to be adjusted, the initial image recognition model after the parameters are adjusted is determined as a new initial image recognition model, and the training process is repeatedly executed until the consistency degree reaches the preset threshold value.
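The stop-when-consistent loop described above can be sketched as follows. This is a toy stand-in under stated assumptions, not the network used in the application: the "model" is a single linear map, the consistency measure is a squared error rather than the monocular loss, and the names `train`, `loss_threshold` and `lr` are hypothetical. A low loss plays the role of the degree of consistency reaching the preset threshold.

```python
import numpy as np

def box_loss(features, weights, target_boxes):
    """Mean over samples of the squared error between predicted and target boxes."""
    error = features @ weights - target_boxes
    return float(np.mean(np.sum(error ** 2, axis=1)))

def train(features, target_boxes, loss_threshold=1e-3, lr=0.5, max_steps=1000):
    """Adjust the weights until predictions agree with the lidar-derived boxes."""
    weights = np.zeros((features.shape[1], target_boxes.shape[1]))
    for _ in range(max_steps):
        if box_loss(features, weights, target_boxes) <= loss_threshold:
            break  # consistency reached the preset threshold: training is done
        error = features @ weights - target_boxes
        grad = 2.0 * features.T @ error / len(features)
        weights -= lr * grad  # adjusted model becomes the new initial model
    return weights, box_loss(features, weights, target_boxes)

# Hypothetical data: three samples (one-hot features) and their target boxes.
features = np.eye(3)
target_boxes = np.array([[10.0, 2.0, 1.5],
                         [20.0, -3.0, 1.6],
                         [5.0, 0.0, 1.4]])
weights, loss = train(features, target_boxes)
```

The `max_steps` cap is an added safeguard so that the loop terminates even if the threshold is never reached.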

The monocular loss function is given by formulas (1) to (4):

L = L_{cls} + L_{2D} + L_{3D}    (1)

L_{2D} = -\log(\mathrm{IoU}(b'_{2D}, b_{2D}))    (3)

L_{3D} = \mathrm{SmoothL1}(b'_{3D} - b_{3D})    (4)

where L_{cls} measures the accuracy of the object class prediction; the smaller the value of L_{cls}, the more accurate the predicted class. c denotes the true class of the target object, c_i denotes the predicted probability that an object identified by the initial image recognition model belongs to the i-th class, and n_c is the total number of object classes stored in the terminal or the server.

L_{2D} indicates the degree of matching between the initial recognition frame and the two-dimensional frame of the target object; the smaller the value of L_{2D}, the better the initial recognition frame matches. The two-dimensional frame of the target object is obtained by removing the height information of the target object from the three-dimensional object frame. b'_{2D} denotes the initial recognition frame, b_{2D} denotes the two-dimensional frame of the target object, and IoU is the intersection-over-union operator.

L_{3D} indicates the degree of matching between the initially recognized three-dimensional object frame and the three-dimensional object frame, where the initially recognized three-dimensional object frame is obtained by adding the height information of the target object to the initial recognition frame.

SmoothL1 takes the standard form shown below.

When |b'_{3D} - b_{3D}| is less than 1:

\mathrm{SmoothL1}(b'_{3D} - b_{3D}) = 0.5 (b'_{3D} - b_{3D})^2

When |b'_{3D} - b_{3D}| is not less than 1:

\mathrm{SmoothL1}(b'_{3D} - b_{3D}) = |b'_{3D} - b_{3D}| - 0.5

where b'_{3D} denotes the initially recognized three-dimensional object frame and b_{3D} denotes the three-dimensional object frame. SmoothL1 compares the coverage of b'_{3D} and b_{3D}; the smaller the value of L_{3D}, the higher the recognition accuracy of the initially recognized three-dimensional object frame.

L represents the overall degree of consistency between the initial recognition frame and the three-dimensional object frame corresponding to the target object in the first point cloud data. As described above, the smaller the value of L, the higher the accuracy of the trained initial image recognition model.
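A minimal Python sketch of the loss terms follows, offered as an illustration under stated assumptions rather than as the implementation used in the application. It assumes axis-aligned two-dimensional frames given as (x1, y1, x2, y2), uses the standard Smooth L1 definition, sums Smooth L1 over the box parameters (the text does not say whether a sum or a mean is used), and takes the classification term as an input because its formula is not reproduced in the text. All function names are hypothetical.

```python
import numpy as np

def iou_2d(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def loss_2d(pred, target, eps=1e-7):
    """L_2D = -log(IoU); the IoU is clipped to avoid log(0) for disjoint boxes."""
    return float(-np.log(np.clip(iou_2d(pred, target), eps, 1.0)))

def smooth_l1(x):
    """Standard Smooth L1: 0.5 * x**2 where |x| < 1, otherwise |x| - 0.5."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def loss_3d(pred, target):
    """L_3D as Smooth L1 summed over box parameters (the sum is an assumption)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(smooth_l1(diff)))

def total_loss(l_cls, pred_2d, target_2d, pred_3d, target_3d):
    """L = L_cls + L_2D + L_3D, with L_cls supplied by the caller."""
    return l_cls + loss_2d(pred_2d, target_2d) + loss_3d(pred_3d, target_3d)

# Hypothetical boxes: the 2D frames overlap by one third.
total = total_loss(0.2, (0, 0, 2, 2), (1, 0, 3, 2),
                   [1.0, 2.0, 3.0], [1.5, 2.0, 5.0])
```

The quadratic branch of Smooth L1 keeps gradients small near a perfect match, while the linear branch limits the influence of large box errors.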

For example, once the image recognition model is obtained, an image to be recognized that is captured by the monocular detector may be input to the image recognition model to recognize the objects in it.

In the embodiment of the present application, when the image recognition model is trained, the three-dimensional object frame corresponding to the target object is obtained directly from the first point cloud data collected by the laser radar device, and the training of the initial image recognition model is guided by this three-dimensional object frame, so that no information about the target object is lost during training and the recognition accuracy of the image recognition model is improved. In addition, the training method is intuitive, simple and effective, and the image recognition model can run in any monocular 3D detector, so that the cost of object detection can be reduced while the detection precision and recognition accuracy for the target object are ensured.

Based on the embodiment shown in Fig. 2, to facilitate understanding of how, in S202, the three-dimensional object frame corresponding to the target object in the first point cloud data is determined according to the first point cloud data, the second embodiment shown in Fig. 3 describes in detail how this determination is made in the supervised mode.

Fig. 3 is a schematic flowchart of another training method for an image recognition model provided in an embodiment of the present application. This embodiment describes in detail the process, in S202 of the embodiment shown in Fig. 2, of determining the three-dimensional object frame corresponding to the target object in the first point cloud data according to the first point cloud data; the embodiment shown in Fig. 3 determines the three-dimensional object frame in the supervised mode. As shown in Fig. 3, the method includes:

S301, acquiring first point cloud data collected by a laser radar device.

S301 is similar to S201, and is not described herein again.

S302, inputting the first point cloud data into a pre-trained three-dimensional recognition model based on the laser radar, and obtaining a three-dimensional object frame corresponding to the target object in the first point cloud data.

And the three-dimensional recognition model is obtained by training the initial recognition model by adopting a three-dimensional object marking frame corresponding to each object in the second point cloud data.

Specifically, the second point cloud data includes radar point cloud data collected for a scene required for training the three-dimensional identification model based on the laser radar. The second point cloud data can be collected through laser radar equipment and can also be acquired in an off-line mode, and the cost can be reduced through the off-line acquisition mode.

For example, the three-dimensional object labeling frame corresponding to each object in the second point cloud data is obtained by manually labeling key identification points on the second point cloud data, which is why this mode is called the supervised mode. The three-dimensional recognition model based on the laser radar may adopt SECOND or F-PointNet. An initial recognition frame is obtained by applying the initial recognition model to the second point cloud data, and the initial recognition model is trained according to the initial recognition frame and the three-dimensional object labeling frame corresponding to each object in the second point cloud data, so as to obtain the three-dimensional recognition model based on the laser radar. Specifically, the training process evaluates, through a lidar loss function, the degree of consistency between the initial recognition frame of a target object and the three-dimensional object labeling frame corresponding to that target object in the second point cloud data. If the degree of consistency reaches a preset threshold, training is complete and the trained initial recognition model is the final three-dimensional recognition model; if not, the parameters of the initial recognition model are adjusted, the adjusted model is taken as the new initial recognition model, and the training process is repeated until the degree of consistency reaches the preset threshold.

The lidar loss function is similar to the monocular loss function in the embodiment shown in fig. 2, and is not described herein again.

When the three-dimensional object frame corresponding to the target object in the first point cloud data is determined, the first point cloud data can be input to a trained three-dimensional recognition model based on the laser radar, and the three-dimensional object frame is obtained.

S303, training an initial image recognition model according to the three-dimensional object frame to obtain the image recognition model, wherein the image recognition model is used for recognizing an object in the image to be recognized.

S303 is similar to S203, and is not described herein again.

The three-dimensional recognition model obtained by the method is realized based on the point cloud data acquired by the laser radar, so that the three-dimensional recognition model has higher detection precision, and the value information of the target object in any first point cloud data is not lost in the process of training the three-dimensional recognition model by using the laser radar point cloud. Therefore, the three-dimensional object frame obtained by the target object in the first point cloud data identified by the three-dimensional identification model does not lose any value object related to the target object. And the three-dimensional object frame corresponding to the target object in the first point cloud data predicted by the three-dimensional detector based on the laser radar has quite high precision due to accurate depth measurement, and can be directly used for training detector detection models of other non-laser radars.

In this embodiment, in the supervision mode, in the process of determining the three-dimensional object frame corresponding to the target object in the first point cloud data according to the first point cloud data, the obtained three-dimensional object frame does not lose any information related to the target object, and the information of the object included in the three-dimensional object frame can ensure the recognition accuracy of the image recognition model. In addition, manual labeling on the second point cloud data is limited to key data points of the target object, which greatly reduces the workload of workers and the cost of manual labeling.

Fig. 4 is a schematic flowchart of a training method for an image recognition model according to an embodiment of the present application. This embodiment describes in detail how, in S202 of the embodiment shown in fig. 2, the three-dimensional object frame corresponding to the target object in the first point cloud data is determined according to the first point cloud data. The embodiment shown in fig. 4 differs from the embodiment shown in fig. 3 in that it determines the three-dimensional object frame in an unsupervised mode. As shown in fig. 4, the method includes:

S401, acquiring first point cloud data collected by laser radar equipment.

S401 is similar to S201, and is not described herein again.

S402, acquiring an RGB color mode (RGB) image corresponding to the first point cloud data.

The RGB color mode (RGB) image may be acquired by a monocular detector. The RGB color mode (RGB) image and the first point cloud data are acquired in the same scene at the same time, and the objects in the RGB image correspond one to one to the objects included in the first point cloud data.

S403, segmenting the RGB color mode (RGB) image to obtain a two-dimensional frame and a semantic mask.

In this step, an offline 2D instance segmentation model may be used to segment an RGB color mode (RGB) image to obtain a two-dimensional frame (2D box) and a semantic mask (mask).

S404, determining a three-dimensional object frame corresponding to the target object in the first point cloud data according to the two-dimensional frame and the semantic mask.

In this step, a camera view cone may be constructed through a two-dimensional box (2D box) and a semantic mask (mask) to determine a three-dimensional object box corresponding to the target object in the first point cloud data.
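Illustratively, the selection of laser radar points through the camera view cone can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: the points are assumed to be already expressed in camera coordinates with z pointing forward, the camera is assumed to be an ideal pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, and the function name `select_frustum_points` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in camera coordinates, z forward (assumed)


def select_frustum_points(
    points: Sequence[Point],
    box: Tuple[int, int, int, int],   # (u_min, v_min, u_max, v_max) in pixels
    mask: Sequence[Sequence[bool]],   # mask[v][u] is True on the segmented instance
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics
) -> List[Point]:
    """Keep the points whose image projection falls inside the 2D box and the mask."""
    u_min, v_min, u_max, v_max = box
    height, width = len(mask), len(mask[0])
    selected = []
    for x, y, z in points:
        if z <= 0.0:
            continue  # the point is behind the camera and cannot be projected
        # Pinhole projection of the 3D point onto the image plane.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # the projection falls outside the image
        if u_min <= u <= u_max and v_min <= v <= v_max and mask[v][u]:
            selected.append((x, y, z))
    return selected
```

A 2D box for which this function returns an empty list contains no laser radar points and would be ignored.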

In a possible implementation manner, when the three-dimensional frame corresponding to the target object in the first point cloud data is determined according to the two-dimensional frame and the semantic mask, the third point cloud data corresponding to the target object in the first point cloud data may be determined according to the two-dimensional frame and the semantic mask, the minimum three-dimensional bounding box covering the third point cloud data corresponding to the target object is determined, and the minimum three-dimensional bounding box is determined as the three-dimensional object frame corresponding to the target object.

Specifically, a camera view cone is constructed from the two-dimensional frame (2D box) and the semantic mask (mask) to select the laser radar points related to the target object, and thereby determine the third point cloud data corresponding to the target object in the first point cloud data. Illustratively, the initial point cloud data corresponding to the target object is determined based on the camera view cone, and 2D detection boxes that contain no laser radar points are ignored. However, the laser radar points located in the same view cone consist of points on the target object mixed with background points or occluding points around the target object. To delete these mixed background points or occluding points from the initial point cloud data, the initial point cloud data is clustered with the DBSCAN clustering method to obtain a plurality of clusters, and the initial point cloud data in the cluster containing the most points among the plurality of clusters is determined as the third point cloud data.

Most of the initial point cloud data is point cloud data corresponding to the target object, and this point cloud data is more densely concentrated. The initial point cloud data is therefore clustered with the DBSCAN clustering method and the cluster containing the most points is selected as the third point cloud data, so that the point cloud data in the third point cloud data is all point cloud data of the target object and the point cloud data of the surrounding scene is eliminated.
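Illustratively, the clustering and the selection of the largest cluster can be sketched as follows. This is a simplified, quadratic-time DBSCAN written only for illustration; the parameter values `eps=0.5` and `min_pts=3` and the function names are assumptions and are not taken from the present application.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbours(i: int) -> List[int]:
        # The neighbourhood includes the point itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned; border points are not expanded
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:
                queue.extend(nb)  # j is a core point, so the cluster grows through it
    return [label if label is not None else -1 for label in labels]


def largest_cluster(points: Sequence[Point], eps: float = 0.5, min_pts: int = 3) -> List[Point]:
    """Return the points of the cluster that contains the most points."""
    labels = dbscan(points, eps, min_pts)
    counts = Counter(label for label in labels if label != -1)
    if not counts:
        return []
    best, _ = counts.most_common(1)[0]
    return [p for p, label in zip(points, labels) if label == best]
```

The points returned by `largest_cluster` play the role of the third point cloud data; points in smaller clusters and noise points are discarded as background or occluding points.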

After the third point cloud data is obtained, it is projected horizontally to obtain a bird's-eye view (BEV). A convex hull is obtained from the BEV as follows: the lowest point in the BEV, taking the rightmost one when several points share the minimum y (y is minimum, x is maximum), is selected and denoted P0; the remaining points of the third point cloud data are sorted in ascending order of the counterclockwise angle between the x axis and the line from P0 to each point; and if two points have the same angle, the one closer to P0 is deleted. The points in the BEV are then traversed in this order to form a closed convex hull. Next, each side of the convex hull polygon is enumerated, a circumscribed rectangle is constructed on that side, the areas of the circumscribed rectangles are compared, and the rectangle with the smallest area is selected as the bird's-eye-view outline of the minimum three-dimensional bounding box, which is the three-dimensional object frame corresponding to the target object. The other parameters of the three-dimensional object frame may be calculated from statistics of the remaining points; for example, the height may be expressed as the maximum spatial offset of the points along the y axis, and the longitudinal center coordinate is calculated by averaging the longitudinal coordinates of the points. At the same time, the size of the minimum three-dimensional bounding box is used to eliminate objects that are likely outliers, because the three-dimensional sizes of most valid objects are very close. Although some potential targets are ignored and filtered out in this way, the final result still allows the image recognition model applied to the monocular detection method to obtain accurate detection results.
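Illustratively, the construction of the three-dimensional object frame from the third point cloud data can be sketched as follows. This is a sketch under stated assumptions: the convex hull is computed with Andrew's monotone chain algorithm, which is substituted here for the angle-sorting procedure described above and yields the same hull; the bird's-eye view is assumed to be the x-z plane with y as the vertical axis; the "longitudinal" coordinate is read as the y coordinate; and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def convex_hull(points: Sequence[Point2D]) -> List[Point2D]:
    """Andrew's monotone chain; returns the hull vertices counterclockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point2D, a: Point2D, b: Point2D) -> float:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point2D] = []
    upper: List[Point2D] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rectangle(hull: Sequence[Point2D]):
    """Try one circumscribed rectangle per hull edge and keep the smallest.

    Returns (area, length, width, yaw, center).
    """
    best = None
    n = len(hull)
    for i in range(n):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % n]
        yaw = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(yaw), math.sin(yaw)
        # Coordinates of every hull vertex in a frame aligned with this edge.
        us = [x * c + y * s for x, y in hull]
        vs = [-x * s + y * c for x, y in hull]
        length, width = max(us) - min(us), max(vs) - min(vs)
        area = length * width
        if best is None or area < best[0]:
            uc, vc = (max(us) + min(us)) / 2, (max(vs) + min(vs)) / 2
            center = (uc * c - vc * s, uc * s + vc * c)  # back to BEV coordinates
            best = (area, length, width, yaw, center)
    return best


def box_from_points(points: Sequence[Point3D]) -> Dict[str, object]:
    """Build a 3D box: BEV rectangle from x-z, height and vertical center from y."""
    hull = convex_hull([(x, z) for x, _, z in points])
    _, length, width, yaw, (cx, cz) = min_area_rectangle(hull)
    ys = [y for _, y, _ in points]
    height = max(ys) - min(ys)   # maximum offset of the points along the y axis
    cy = sum(ys) / len(ys)       # mean of the y coordinates
    return {"center": (cx, cy, cz), "size": (length, width, height), "yaw": yaw}
```

A size-based outlier filter could then compare the `size` entry against the typical dimensions of the target class and discard boxes that deviate strongly; the tolerance used for that comparison is not specified here.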

S405, training an initial image recognition model according to the three-dimensional object frame to obtain the image recognition model, wherein the image recognition model is used for recognizing an object in an image to be recognized.

S405 is similar to S203, and is not described herein again.

In this embodiment, in an unsupervised mode, in the process of determining the three-dimensional object frame corresponding to the target object in the first point cloud data according to the first point cloud data, the obtained three-dimensional object frame does not lose any information related to the target object, and the information in the three-dimensional object frame is only the information of the target object and does not have information of a mixed background point or a shielding point around the target object, so that the information of the object contained in the three-dimensional object frame can ensure the recognition accuracy of the image recognition model.

Fig. 5 is a schematic flowchart of a training method for an image recognition model according to an embodiment of the present disclosure, and the embodiment takes a target recognition object, specifically a vehicle, as an example to describe in detail an operation manner of the training method for an image recognition model according to the present disclosure.

As shown in fig. 5, the first point cloud data is obtained in the first step; the specific manner of obtaining it is similar to S201 and is not described herein again. After the first point cloud data is acquired, the three-dimensional object frame of the first point cloud data is acquired through the supervision mode or the unsupervised mode.

Illustratively, in the supervision mode, similar to S302, the initial recognition model is trained with the three-dimensional object labeling frames corresponding to the objects in the second point cloud data to obtain the three-dimensional recognition model based on the laser radar. As shown in fig. 5, specifically, the second point cloud data acquired in advance is recognized by the initial three-dimensional recognition model based on the laser radar, and an initial recognition frame of the target object in the second point cloud data is obtained. The degree of consistency between the initial recognition frame of the target object and the three-dimensional object labeling frame corresponding to the target object in the second point cloud data is evaluated through a LiDAR loss function. If the degree of consistency reaches a preset threshold, training of the initial recognition model is complete, and the trained initial recognition model is the final three-dimensional recognition model based on the laser radar; if the degree of consistency does not reach the preset threshold, the parameters in the initial recognition model are adjusted, the adjusted model is determined as a new initial recognition model, and the training process is repeated until the degree of consistency reaches the preset threshold. After the final three-dimensional recognition model is obtained, it recognizes the first point cloud data to obtain the three-dimensional object frame of the first point cloud data.

Illustratively, the unsupervised mode is similar to the embodiment shown in fig. 4. As shown in fig. 5, the first point cloud data and the RGB color mode (RGB) image corresponding to the first point cloud data are first obtained at the same time. Then, the RGB image is segmented to obtain a two-dimensional frame (2D box) and a semantic mask (mask). A camera view cone is constructed from the two-dimensional frame (2D box) and the semantic mask (mask), and initial point cloud data is determined from the first point cloud data based on the camera view cone. As shown in fig. 5, a mixed background point cloud exists around the target object (for example, a vehicle) in the initial point cloud data. After the initial point cloud data is processed with the DBSCAN clustering method, the initial point cloud data in the cluster containing the most points among the plurality of clusters is selected and determined as the third point cloud data, so that the mixed background point cloud is eliminated from the selected target point cloud data. As shown in the clustered point cloud in fig. 5, the DBSCAN clustering method divides the initial point cloud data into 4 clusters, and the cluster containing the most points is the point cloud data of the target object (for example, a vehicle). The third point cloud data is then projected horizontally to obtain a bird's-eye view (BEV), and a convex hull is obtained from the bird's-eye view. As further shown in fig. 5, the convex hull is converted into a minimum bounding box in the bird's-eye view, and height information is added to the minimum bounding box to obtain the three-dimensional object frame corresponding to the target object (for example, a vehicle).

After the three-dimensional object frame of the first point cloud data is obtained through the supervision mode or the unsupervised mode, the degree of consistency between the initial recognition frame of the target object and the three-dimensional object frame corresponding to the target object in the first point cloud data is evaluated through a monocular loss function. If the degree of consistency reaches a preset threshold, training of the initial image recognition model is complete, and the trained initial image recognition model is the final image recognition model; if the degree of consistency does not reach the preset threshold, the parameters in the initial image recognition model are adjusted, the adjusted model is determined as a new initial image recognition model, and the training process is repeated until the degree of consistency reaches the preset threshold. The initial recognition frame is obtained by the initial image recognition model recognizing an RGB color mode (RGB) image acquired by a monocular detector; the RGB color mode (RGB) image and the first point cloud data are acquired in the same scene at the same time, and the objects in the RGB color mode (RGB) image correspond to the objects included in the first point cloud data.
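Illustratively, the threshold-controlled training loop can be sketched as follows. This is a toy sketch, not the training procedure of the present application: the "model" is reduced to a vector of six box parameters instead of a network operating on an image, the degree of consistency is assumed to be the intersection over union of two axis-aligned boxes, the parameter adjustment is a plain gradient step on a squared error, and the names `iou_3d` and `train_until_consistent` are hypothetical.

```python
from typing import Sequence, Tuple

# (cx, cy, cz, sx, sy, sz): box center and box size along each axis (axis-aligned, assumed)
Box = Tuple[float, float, float, float, float, float]


def iou_3d(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        if hi <= lo:
            return 0.0  # the boxes do not overlap along this axis
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def train_until_consistent(
    params: Sequence[float],
    target: Sequence[float],      # three-dimensional object frame from the point cloud
    threshold: float = 0.9,       # preset consistency threshold (assumed value)
    lr: float = 0.2,
    max_steps: int = 1000,
):
    """Adjust the parameters until the consistency reaches the threshold.

    Returns (final parameters, final consistency, number of adjustment steps).
    """
    params = list(params)
    for step in range(max_steps):
        consistency = iou_3d(params, target)
        if consistency >= threshold:
            return tuple(params), consistency, step  # training is complete
        # One gradient step on 0.5 * squared error toward the target frame.
        params = [p - lr * (p - t) for p, t in zip(params, target)]
    return tuple(params), iou_3d(params, target), max_steps
```

In a real monocular detector the adjusted parameters would be the network weights and the prediction would come from the RGB image; only the control flow of evaluating, comparing against the threshold, adjusting and repeating is illustrated here.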

In the description of this embodiment, in the unsupervised mode, in the process of determining the three-dimensional object frame corresponding to the target object in the first point cloud data according to the first point cloud data, the obtained three-dimensional object frame does not lose any information related to the target object, the information in the three-dimensional object frame is only the information of the target object, and the information of the mixed background point or the shielding point around the target object is removed by the technical means described in the embodiments, so that the information of the object included in the three-dimensional object frame can ensure the recognition accuracy of the image recognition model.

Fig. 6 is a flowchart illustrating an image recognition method according to an embodiment of the present application, where the image recognition method may be executed by an in-vehicle hardware device, and referring to fig. 6, the image recognition method may include:

S601, acquiring an image to be identified.

The image to be identified, which contains the object to be identified, is acquired by a monocular detector.

S602, inputting the image to be recognized into an image recognition model to obtain the object in the image to be recognized.

The image recognition model is obtained by training an initial image recognition model according to point cloud data and a three-dimensional object frame corresponding to a target object in the point cloud data, wherein the point cloud data is acquired through laser radar equipment.

The image recognition model is similar to the image recognition model obtained by the training method shown in any one of the above embodiments, and details are not repeated here.

The image to be recognized is input into the image recognition model for image recognition, and monocular three-dimensional image recognition information is obtained.

Furthermore, the image recognition model sends the obtained monocular three-dimensional image recognition information to the vehicle-mounted automatic driving device, and the automatic driving device can perceive the real three-dimensional world through the received monocular three-dimensional image recognition information and avoid collision with other objects on the road.

Fig. 7 is a schematic structural diagram of an image recognition model training apparatus 700 according to an embodiment of the present application. For example, referring to fig. 7, the image recognition model training apparatus 700 may include:

an obtaining unit 701, configured to obtain first point cloud data collected by a laser radar device.

A processing unit 702, configured to determine, according to the first point cloud data, a three-dimensional object frame corresponding to a target object in the first point cloud data.

A training unit 703, configured to train an initial image recognition model according to the three-dimensional object frame, to obtain the image recognition model, where the image recognition model is used to recognize an object in an image to be recognized.

Optionally, the processing unit 702 is specifically configured to input the first point cloud data into a pre-trained three-dimensional recognition model based on a laser radar, so as to obtain a three-dimensional object frame corresponding to a target object in the first point cloud data, where the three-dimensional recognition model is obtained by training an initial recognition model by using the three-dimensional object frame corresponding to each object in the second point cloud data.

Optionally, the processing unit 702 is specifically configured to obtain an RGB color mode (RGB) image corresponding to the first point cloud data; segmenting the RGB color mode (RGB) image to obtain a two-dimensional frame and a semantic mask; and determining a three-dimensional object frame corresponding to the target object in the first point cloud data according to the two-dimensional frame and the semantic mask.

Optionally, the processing unit 702 is specifically configured to determine, according to the two-dimensional frame and the semantic mask, third point cloud data corresponding to a target object in the first point cloud data; and determining a minimum three-dimensional bounding box covering third point cloud data corresponding to the target object, and determining the minimum three-dimensional bounding box as a three-dimensional object frame corresponding to the target object.

Optionally, the processing unit 702 is specifically configured to determine, according to the two-dimensional frame and the semantic mask, initial point cloud data corresponding to the target object; clustering the initial point cloud data corresponding to the target object to obtain a plurality of clusters; and determining the initial point cloud data in the cluster with the most initial point cloud data in the plurality of clusters as the third point cloud data.

The device 700 for training the image recognition model according to the embodiment of the present application can execute the method for training the image recognition model according to any one of the embodiments, and the implementation principle and the beneficial effect thereof are similar to those of the method for training the image recognition model.

Fig. 8 is a schematic structural diagram of an image recognition apparatus 800 according to an embodiment of the present application. For example, referring to fig. 8, the image recognition apparatus 800 may include:

an acquiring unit 801, configured to acquire an image to be identified.

A processing unit 802, configured to input the image to be recognized into an image recognition model to obtain an object in the image to be recognized, where the image recognition model is obtained by training an initial image recognition model according to point cloud data and a three-dimensional object frame corresponding to a target object in the point cloud data, and the point cloud data is acquired through laser radar equipment.

The image recognition apparatus 800 according to the embodiment of the present application can execute the image recognition method according to any of the above embodiments; its implementation principle and beneficial effect are similar to those of the image recognition method, to which reference may be made, and are not described herein again.

Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application. For example, referring to fig. 9, the server includes:

the memory 901, the processor 902, and a computer program stored in the memory 901 and capable of being run on the processor 902, where the processor 902 implements the training method of the image recognition model shown in any of the above embodiments when executing the program, and the implementation principle and the beneficial effect thereof are similar to those of the training method of the image recognition model, and reference may be made to the implementation principle and the beneficial effect of the training method of the image recognition model, which is not described herein again.

Fig. 10 is a schematic structural diagram of a vehicle according to an embodiment of the present application. For example, referring to fig. 10, the vehicle includes:

the memory 1001, the processor 1002, and a computer program stored in the memory 1001 and capable of being run on the processor 1002, where the processor 1002 implements the image recognition method shown in any of the above embodiments when executing the program, and the implementation principle and the beneficial effect thereof are similar to those of the image recognition method, to which reference may be made, and are not described herein again.

The embodiment of the present application further provides a readable storage medium in which a computer program is stored. When the program is executed by a processor, the method for training an image recognition model shown in any of the above embodiments is implemented; the implementation principle and the beneficial effect are similar to those of the method for training an image recognition model, to which reference may be made, and are not described herein again.

An embodiment of the present application further provides a computer program product, where the computer program product includes: a computer program, where the computer program is stored in a readable storage medium, where the computer program can be read by at least one processor of an electronic device from the readable storage medium, and the at least one processor executes the computer program to enable the electronic device to execute the training method of the image recognition model shown in any of the above embodiments, where the implementation principle and the beneficial effect of the method are similar to those of the training method of the image recognition model, and reference may be made to the implementation principle and the beneficial effect of the training method of the image recognition model, and details are not repeated here.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

The memory may comprise a high-speed RAM, and may further comprise a non-volatile memory (NVM), such as at least one disk memory, and may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, etc.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.

The computer-readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for training an image recognition model, wherein the image recognition model is applied to a monocular detector, the method comprising:

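acquiring first point cloud data acquired by laser radar equipment;

determining a three-dimensional object frame corresponding to a target object in the first point cloud data according to the first point cloud data;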

and training an initial image recognition model according to the three-dimensional object frame to obtain the image recognition model, wherein the image recognition model is used for recognizing the object in the image to be recognized.

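2. The method according to claim 1, wherein the determining, according to the first point cloud data, a three-dimensional object frame corresponding to a target object in the first point cloud data comprises:

inputting the first point cloud data into a pre-trained three-dimensional recognition model based on the laser radar to obtain the three-dimensional object frame corresponding to the target object in the first point cloud data, wherein the three-dimensional recognition model is obtained by training an initial recognition model by using the three-dimensional object frame corresponding to each object in second point cloud data.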

3. The method according to claim 1, wherein the determining, according to the first point cloud data, a three-dimensional object frame corresponding to a target object in the first point cloud data comprises:

acquiring an RGB image corresponding to the first point cloud data;

segmenting the RGB image to obtain a two-dimensional object frame and a semantic mask;

and determining a three-dimensional object frame corresponding to the target object in the first point cloud data according to the two-dimensional object frame and the semantic mask.

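4. The method according to claim 3, wherein the determining a three-dimensional object frame corresponding to the target object in the first point cloud data according to the two-dimensional frame and the semantic mask comprises:

determining third point cloud data corresponding to the target object in the first point cloud data according to the two-dimensional frame and the semantic mask;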

and determining a minimum three-dimensional bounding box covering third point cloud data corresponding to the target object, and determining the minimum three-dimensional bounding box as a three-dimensional object box corresponding to the target object.

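5. The method according to claim 4, wherein the determining, according to the two-dimensional box and the semantic mask, third point cloud data corresponding to a target object in the first point cloud data comprises:

determining initial point cloud data corresponding to the target object according to the two-dimensional box and the semantic mask;

clustering the initial point cloud data corresponding to the target object to obtain a plurality of clusters;

and determining the initial point cloud data in the cluster with the most initial point cloud data in the plurality of clusters as the third point cloud data.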

6. An image recognition method, comprising:

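acquiring an image to be identified;

and inputting the image to be identified into an image recognition model to obtain an object in the image to be identified, wherein the image recognition model is obtained by training an initial image recognition model according to point cloud data and a three-dimensional object frame corresponding to a target object in the point cloud data, and the point cloud data is acquired through laser radar equipment.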

7. An apparatus for training an image recognition model, comprising:

the acquisition unit is used for acquiring first point cloud data acquired by laser radar equipment;

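the processing unit is used for determining a three-dimensional object frame corresponding to a target object in the first point cloud data according to the first point cloud data;

and the training unit is used for training an initial image recognition model according to the three-dimensional object frame to obtain the image recognition model, wherein the image recognition model is used for recognizing an object in an image to be recognized.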

8. An image recognition apparatus, comprising:

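the acquisition unit is used for acquiring an image to be recognized;

and the processing unit is used for inputting the image to be recognized into an image recognition model to obtain an object in the image to be recognized, wherein the image recognition model is obtained by training an initial image recognition model according to point cloud data and a three-dimensional object frame corresponding to a target object in the point cloud data, and the point cloud data is acquired through laser radar equipment.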

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of training an image recognition model according to any one of claims 1 to 5 or a method of image recognition according to claim 6.

10. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of training an image recognition model according to any one of claims 1 to 5 when executing the program.

11. A vehicle comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image recognition method of claim 6 when executing the program.

12. A computer program product comprising a computer program which, when executed by a processor, implements the method of training an image recognition model of any one of claims 1-5 or the method of image recognition of claim 6.