WO2023201723A1 - Object detection model training method, and object detection method and apparatus - Google Patents

Object detection model training method, and object detection method and apparatus Download PDF

Info

Publication number
WO2023201723A1
Authority
WO
WIPO (PCT)
Prior art keywords
camera
training
internal parameters
parameters
image
Prior art date
Application number
PCT/CN2022/088566
Other languages
French (fr)
Chinese (zh)
Inventor
毕舒展
张洁
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2022/088566 (WO2023201723A1)
Priority to CN202280005788.8A (CN117280385A)
Publication of WO2023201723A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration

Definitions

  • the present application relates to the field of computer vision technology, and in particular to a training method for a target detection model, a target detection method and a device.
  • Target detection is a traditional task in the field of computer vision. Different from image recognition, target detection requires the position of the target object to be given in the form of a minimum bounding box. In 3D (three-dimensional) target detection, the 3D bounding box of the target object needs to be given. Taking the field of autonomous driving as an example, 3D target detection obtains the 3D coordinates of the target object, then obtains a 3D frame based on the 3D coordinates, and finally visualizes the 3D frame on the image and aerial view as shown in Figure 1.
  • Some 3D target detection methods based on monocular cameras have been proposed in related technologies.
  • This method uses a target detection model to process images collected by a monocular camera, and can obtain the 3D vertices of the target object, and then obtain the 3D frame.
  • However, this method cannot perform data amplification through image-based geometric transformation, so the generalization ability of the model is poor.
  • This application provides a training method for a target detection model, a target detection method and a device, which are used to improve the generalization ability of a 3D target detection model based on a monocular camera.
  • this application provides a training method for a target detection model, including:
  • the target detection model is trained N times, where the first camera is any one of the at least one monocular camera, and N is an integer greater than 1;
  • the N times of training include at least one first training, and the first training includes the following steps:
  • the three-dimensional position information is used as the annotated position information of the target object in the sample image
  • parameters of the target detection model are adjusted.
  • the internal parameters of the monocular camera are transformed into the internal parameters of another monocular camera (i.e., the extended internal parameters), the image collected by the monocular camera is mapped to an image collected by the extended camera, and the target detection model is then trained.
  • in the training phase, when generating 3D coordinates, the intrinsic parameters of the extended camera are used for the images of the extended camera, so that the camera's intrinsic parameters match the image; the model can thus be applied to the extended camera, improving its generalization ability.
  • the same target detection model can be applied to cameras with different internal parameters.
  • the N times of training include at least one second training, and the second training includes the following steps:
  • parameters of the target detection model are adjusted.
  • the target detection model is trained based on real cameras in the camera set, and the target detection model can be well applied to any camera in the camera set.
  • if S of the N trainings are first trainings that transform the internal parameters of the first camera, the extended internal parameters obtained by the transformation in each of the S first trainings are all different, where S is an integer less than or equal to N.
  • the target detection model can use more samples for training, thereby further improving the generalization ability of the target detection model.
  • the transformation of the internal parameters of the first camera to obtain the extended internal parameters used in the first training includes:
  • Random perturbation makes the internal parameters of the first camera fluctuate within a range, increasing the amount of data and improving the robustness of the target detection model.
  • the random perturbation of the internal parameters of the first camera includes:
  • the sub-parameters in the intrinsic parameters of the first camera are replaced with extended sub-parameters of the sub-parameters.
  • performing a geometric transformation on the image collected by the first camera according to the internal parameters of the first camera and the extended internal parameters to obtain the sample image used for the first training includes:
  • the dedistorted image is processed based on the extended internal parameters to obtain the sample image.
  • dedistortion can ensure that the image collected by the first camera is accurately mapped to the image space of the extended camera, thereby improving the accuracy of the model inference results.
  • this application provides a target detection method, which is applied to the process of detecting a target object using a target detection model trained by the method described in any implementation of the first aspect.
  • the method includes:
  • Coordinate transformation is performed on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera to obtain the three-dimensional position information of the target object in the coordinate system of the monocular camera.
  • this application also provides a training device for a target detection model, including:
  • An information acquisition module configured to acquire internal parameters of at least one monocular camera and images collected by the at least one monocular camera
  • a training module configured to train the target detection model N times based on the internal parameters of the first camera and the images collected by the first camera, where the first camera is any one of the at least one monocular camera, and the N is an integer greater than 1;
  • the N times of training include at least one first training, and the first training includes the following steps:
  • the three-dimensional position information is used as the annotated position information of the target object in the sample image
  • parameters of the target detection model are adjusted.
  • the training module is also used to perform at least one second training in the N trainings, and the second training includes the following steps:
  • parameters of the target detection model are adjusted.
  • if S of the N trainings are first trainings that transform the internal parameters of the first camera, the extended internal parameters obtained by the transformation in each of the S first trainings are all different, where S is an integer less than or equal to N.
  • the training module is used to randomly perturb the internal parameters of the first camera to obtain extended internal parameters used in this training.
  • when performing the random perturbation of the internal parameters of the first camera, the training module is specifically configured to:
  • the sub-parameters in the intrinsic parameters of the first camera are replaced with extended sub-parameters of the sub-parameters.
  • the training module is used to:
  • the dedistorted image is processed based on the extended internal parameters to obtain the sample image.
  • this application also provides a target detection device, which is applied to the process of detecting a target object using the target detection model obtained by the device described in any implementation of the third aspect.
  • the device includes:
  • a to-be-detected image acquisition module, used to acquire the image to be detected collected by the monocular camera and the internal parameters of the monocular camera;
  • a two-dimensional information acquisition module used to detect the image to be detected and obtain the two-dimensional position information and depth information of the target object in the coordinate system of the image to be detected;
  • a three-dimensional information determination module is configured to perform coordinate transformation on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera to obtain the three-dimensional position information of the target object in the coordinate system of the monocular camera.
  • this application provides a chip system, including: a memory for storing a computer program; and a processor; when the processor calls and runs the computer program from the memory, the electronic device installed with the chip system executes the method described in any one of the first aspect and the second aspect.
  • the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method described in any one of the first and second aspects.
  • the present application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform the method described in any one of the first aspect and the second aspect.
  • this application also provides an electronic device, including:
  • A memory used to store a readable program;
  • At least one processor configured to call and run the readable program from the memory, so that the electronic device implements the method described in any one of the first aspect and the second aspect.
  • Figure 1 is a schematic diagram of visualizing 3D boxes on images and aerial views
  • Figure 2 is a schematic diagram of an application scenario provided by the embodiment of the present application.
  • Figure 3 is a schematic diagram of another application scenario provided by the embodiment of the present application.
  • Figure 4 is a schematic diagram of a monocular camera installed in an autonomous vehicle
  • Figure 5 is a schematic flowchart of the training method of the target detection model provided by the embodiment of the present application.
  • Figure 6 is another schematic flowchart of the training method of the target detection model provided by the embodiment of the present application.
  • Figure 7 is a schematic flow chart of the target detection method provided by the embodiment of the present application.
  • Figure 8 is a schematic diagram of the same target detection model used by multiple monocular cameras according to an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a training device for a target detection model provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a target detection device provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • When performing 3D target detection based on images collected by a monocular camera, the target detection model cannot use image-based geometric transformation, a commonly used data amplification method, for data amplification, and thus cannot be trained with amplified data, resulting in poor generalization ability of the model. This is because image-based geometric transformation affects the mapping relationship from 2D to 3D, which is explained below.
  • Using the target detection model to obtain the 3D box includes the following steps:
  • the image collected by the monocular camera is input into the target detection model.
  • the target detection model can be a depth prediction model based on two-dimensional images. This model infers on the images collected by the monocular camera, detecting the 2D coordinates (u, v) of the target object in the image and predicting the depth Z corresponding to those 2D coordinates. Usually, the model infers the 2D coordinates and depth of the center point of the 3D box in the image; it may also infer the 2D coordinates and depths of several 3D box vertices in the image.
  • the imaging principle of the monocular camera is used to analyze the detection results of the target detection model to obtain the 3D coordinates of the target object in the camera coordinate system.
  • Equation (1) describes the mapping relationship from 2D to 3D:

    Z · (u, v, 1)ᵀ = K · (X, Y, Z)ᵀ, i.e. (X, Y, Z)ᵀ = Z · K⁻¹ · (u, v, 1)ᵀ    (1)

    where K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
  • (u, v) represents the 2D coordinates of the key point; Z represents the depth of the key point; K represents the intrinsic parameter matrix of the monocular camera, which can be obtained through camera calibration; fx, fy, cx and cy are all sub-parameters of the internal parameters; K⁻¹ is the inverse matrix of K; and (X, Y, Z) are the 3D coordinates of the pixel in the camera coordinate system of the monocular camera.
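  • As a concrete illustration of Equation (1), here is a minimal numpy sketch that unprojects a predicted key point; the intrinsic values are illustrative assumptions, not values from this application.

```python
import numpy as np

# Illustrative pinhole intrinsics: fx, fy on the diagonal, (cx, cy) in the last column.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

def unproject(u, v, Z, K):
    """Recover (X, Y, Z) in the camera coordinate system from pixel
    coordinates (u, v) and predicted depth Z, per Equation (1):
    (X, Y, Z)^T = Z * K^-1 * (u, v, 1)^T."""
    return Z * np.linalg.inv(K) @ np.array([u, v, 1.0])

point_3d = unproject(u=800.0, v=400.0, Z=12.5, K=K)  # -> array([2.0, 0.5, 12.5])
```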
  • the 3D coordinates of the target object can be obtained, and then the length, width, height and orientation angle of the target object can be inferred.
  • the target detection model of a monocular camera is bound to specific internal parameters, and one model cannot be applied to multiple cameras with different internal parameters. That is, each monocular camera requires its own target detection model, and a target detection model is only applicable to the monocular camera to which it is bound. Therefore, simple image-based data amplification of sample images is not suitable for the target detection model of a monocular camera, and the generalization ability of the target detection model cannot be improved in this way.
  • embodiments of the present application provide a feasible data amplification method to train the target detection model and improve the generalization ability of the model.
  • based on the internal camera parameter K of the monocular camera, geometric transformation can be performed on the images collected by the monocular camera to achieve data amplification.
  • the internal parameter K′ (hereinafter also referred to as the extended internal parameter) can be constructed, and the original image is projected into an image captured by an extended camera with the internal parameter K′.
  • If K′ differs from K only in cx, the above transformation is equivalent to performing a left-right translation operation on the original image; if only cy is different, it is equivalent to an up-down translation operation; if only fx is different, it is equivalent to a horizontal scaling operation; if only fy is different, it is equivalent to a vertical scaling operation.
  • for the transformed image, the correct 3D coordinates can be obtained by left-multiplying by the inverse matrix K′⁻¹ of K′.
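  • The following numpy sketch illustrates this property: mapping a pixel into the extended camera's image space via K′K⁻¹ and then unprojecting with K′⁻¹ recovers exactly the same 3D point as unprojecting the original pixel with K⁻¹. All numeric values are illustrative assumptions.

```python
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
K_ext = np.array([[1100.0, 0.0, 600.0],   # extended intrinsics: fx and cx changed
                  [0.0, 1000.0, 360.0],
                  [0.0, 0.0, 1.0]])

# For undistorted pinhole images, mapping into the extended camera's image
# space is the planar transformation H = K' K^-1.
H = K_ext @ np.linalg.inv(K)

p = np.array([800.0, 400.0, 1.0])          # a pixel in the original image
p_ext = H @ p                              # the same point in the extended image
Z = 12.5                                   # depth is unchanged by this mapping

P_from_K = Z * np.linalg.inv(K) @ p
P_from_K_ext = Z * np.linalg.inv(K_ext) @ p_ext
assert np.allclose(P_from_K, P_from_K_ext)  # K'^-1 yields the correct 3D point
```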
  • for any monocular camera, the internal parameters of the monocular camera are transformed into the internal parameters of another monocular camera (that is, the extended internal parameters of the extended camera), the images collected by the monocular camera are mapped to images collected by the extended camera, and the target detection model is then trained.
  • in the training phase, when generating 3D coordinates, the intrinsic parameters of the extended camera are used for the images of the extended camera, so that the camera's intrinsic parameters match the image; the model can thus be applied to the extended camera, improving its generalization ability.
  • the same target detection model can be applied to cameras with different internal parameters.
  • the target detection model can be trained on the server side or the terminal side. After training, the target detection model can be applied in the terminal.
  • the terminal can be, for example, a car, a mobile phone, a robot, or other equipment equipped with a monocular camera.
  • Figure 2 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • the application scenario includes a car 101 and a target object 102.
  • the car 101 collects images containing the target object through the monocular camera installed on the car 101, and the image can be input to the target detection model 103 in the car 101 to detect the 3D position of the target object 102.
  • the target detection model 103 can also be applied in the server.
  • the application scenario includes a car 101, a target object 102 and a server 104.
  • the monocular camera of the car 101 collects an image containing the target object, and then the car 101 sends the image to the server 104.
  • the server 104 uses its built-in target detection model 103 to infer the image to obtain the 3D position of the target object.
  • without the solution of this application, each monocular camera needs to train a target detection model separately; for example, for 16 monocular cameras, a total of 16 target detection models need to be trained.
  • with the solution of this application, one target detection model can be trained for each type of camera; if these cameras fall into four types, ultimately only four target detection models need to be trained in total.
  • Each target detection model is suitable for all monocular cameras of the same type, so the generalization ability of the target detection model is improved.
  • FIG. 5 is a schematic flow chart of the training method of the target detection model in the embodiment of the present application, including:
  • Step 501 Obtain internal parameters of at least one monocular camera and images collected by at least one monocular camera.
  • monocular cameras of the same camera type may constitute a camera set.
  • the cameras of a camera set are used to jointly train the same object detection model.
  • however, a camera set is not limited to the same camera type; monocular cameras of different types can also form a camera set for training the same target detection model.
  • any monocular camera in the camera set can be used as the first camera to train the target detection model.
  • Step 502 Train the target detection model N times based on the internal parameters of the first camera and the images collected by the first camera, where N is an integer greater than 1.
  • the N times of training include at least one first training, and the first training includes the following steps:
  • Step 5021 Transform the internal parameters of the first camera to obtain extended internal parameters used in the first training.
  • the expanded internal parameters obtained by each transformation in the S times of first training are all different, where S is an integer less than or equal to N.
  • the internal parameter K of the first camera is transformed to obtain K′₁, K′₂, K′₃, …, K′ₘ, a total of m different extended internal parameters.
  • for any first camera, assuming that it collects a total of p images and m extended internal parameters are generated, this is equivalent to adding p×m sample images, thus increasing the sample images for training the target detection model and further improving its generalization ability.
  • the internal parameters of the first camera can be transformed in a variety of ways; for example, any sub-parameter in the internal parameters can be shifted with equal or unequal steps, or adjusted according to the expected direction of change.
  • if K′ differs from K only in cx, it is equivalent to performing a left-right translation operation on the original image; if only cy is different, it is equivalent to an up-down translation operation; if only fx is different, it is equivalent to a horizontal scaling operation; if only fy is different, it is equivalent to a vertical scaling operation.
  • if the extended internal parameters deviate too far from the real ones, the reasoning ability of the target detection model may be limited, causing the target detection model to fail to converge. Therefore, during implementation, the extended internal parameters should be distributed around the internal parameters of the first camera as much as possible, for example, within a threshold distance of the internal parameters of the first camera, so as to ensure that the accuracy of the inference results of the target detection model meets the expected requirements and that training converges as soon as possible.
  • in one implementation, the extended internal parameters used in this training can be obtained by randomly perturbing the internal parameters of the first camera within a threshold. The random perturbation makes the internal parameters of the first camera fluctuate within a range, increasing the amount of data and improving the robustness of the target detection model.
  • the embodiment of the present application adopts a data amplification method with variable internal parameters, so that the model adapts to different internal parameters during training and can therefore adapt to multiple cameras with different internal parameters during use. The generalization ability of the model is thus improved: the same target detection model can be adapted to multiple cameras with different internal parameters, and one model can be shared by multiple monocular cameras, reducing development costs.
  • the embodiment of the present application uses the original value (i.e., a sub-parameter) of the internal parameter K of the first camera as the center point and, based on a preset standard deviation, uses a normal distribution to generate random values to replace the original values in K.
  • for example, to obtain an extended internal parameter of the first camera, construct a normal distribution curve 1 for fx, and then obtain a point fx′ from the normal distribution curve 1 within a specified range centered on fx; the extended internal parameter is obtained by replacing fx with fx′.
  • the extended internal parameters are not limited to differing from the internal parameters of the first camera in only one sub-parameter; multiple sub-parameters may differ. For example, not only can the normal distribution curve 1 of fx be constructed, but the normal distribution curve 2 of cx can also be constructed.
  • the extended sub-parameters of one or more of the sub-parameters can be selected to construct the extended internal parameters.
  • the parameter differences between different extended internal parameters can be differences in the same sub-parameters or differences in different sub-parameters.
  • from the normal distribution curve 1 for fx, multiple (possibly thousands of) different fx′ can be obtained, yielding different extended internal parameters.
  • if not only fx′ is different but also other sub-parameters such as cx′ are different, different extended internal parameters will likewise be obtained.
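  • A minimal sketch of this sampling scheme follows; the standard deviation and clipping range are illustrative assumptions rather than values specified by this application.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_intrinsics(K, rel_sigma=0.03, rel_range=0.1):
    """Draw each sub-parameter (fx, fy, cx, cy) from a normal distribution
    centered on its original value, clipped to a specified range around
    that value, and substitute it into a copy of K."""
    K_ext = K.copy()
    for i, j in [(0, 0), (1, 1), (0, 2), (1, 2)]:   # fx, fy, cx, cy
        center = K[i, j]
        sample = rng.normal(loc=center, scale=rel_sigma * center)
        K_ext[i, j] = np.clip(sample, (1 - rel_range) * center,
                              (1 + rel_range) * center)
    return K_ext

K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
extended = [perturb_intrinsics(K) for _ in range(8)]   # m = 8 different K'
```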
  • the number of model training cycles can be as high as hundreds of thousands, each cycle using a different internal parameter K′, where K′ is either derived from the internal parameters of a real camera or generated by randomly perturbing the internal parameters of a real camera.
  • the model uses a large amount of training samples for training, which can improve the generalization ability of the model.
  • Step 5022 Perform geometric transformation on the image collected by the first camera according to the internal parameters and extended internal parameters of the first camera to obtain the sample image used for the first training, and use the three-dimensional position information of the target object in the image collected by the first camera as the annotated position information of the target object in the sample image.
  • Image geometric transformation, also known as image space transformation, maps the coordinate position in one image to a new coordinate position in another image without changing the pixel values of the image.
  • Geometric transformations of images can include translation, rotation, scaling, orthographic (parallel) projection, etc.
  • the geometric transformation of the image can be achieved through spatial transformation and interpolation algorithms.
  • the key to geometric transformation is the transformation parameter in the mapping process, which can be one or more of translation components, scaling factors, rotation angles, etc.
  • generally, camera internal parameters do not need to be considered when performing geometric transformation on images collected by a camera. In the embodiment of the present application, however, the extended internal parameters of the extended camera are used as the transformation parameters to realize operations such as translation and scaling of the images collected by the first camera.
  • for example, changing cx is equivalent to performing a left-right translation operation on the original image.
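  • The following sketch verifies these translation and scaling equivalences numerically under the pinhole model; all values are illustrative.

```python
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

def remap_pixel(u, v, K, K_ext):
    """Where pixel (u, v) of the original (undistorted) image lands in the
    extended camera's image: p' = K' K^-1 p."""
    p = K_ext @ np.linalg.inv(K) @ np.array([u, v, 1.0])
    return p[:2] / p[2]

K_cx = K.copy(); K_cx[0, 2] += 20.0   # cx' = cx + 20 -> shift right by 20 px
assert np.allclose(remap_pixel(800, 400, K, K_cx), [820.0, 400.0])

K_fx = K.copy(); K_fx[0, 0] *= 1.1    # fx' = 1.1 fx -> horizontal scaling about cx
assert np.allclose(remap_pixel(800, 400, K, K_fx), [816.0, 400.0])
```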
  • Image distortion is caused by deviations in lens manufacturing precision and assembly processes, which lead to distortion of the original image.
  • Lens distortion is divided into two categories: radial distortion and tangential distortion.
  • Radial distortion is caused by the inherent properties of the convex lens itself: light rays bend more the farther they are from the center of the lens. The distortion is distributed along the radius of the lens and mainly includes barrel distortion and pincushion distortion.
  • Tangential distortion is caused by the lens itself not being parallel to the camera sensor plane (imaging plane), mostly as a result of installation deviations when the lens is attached to the lens module.
  • therefore, in the embodiment of the present application, the image captured by the first camera is first dedistorted based on the internal parameters of the first camera and the distortion coefficient of the first camera, and the dedistorted image is then processed based on the extended internal parameters to obtain the sample image.
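  • A minimal OpenCV sketch of this two-step mapping, under the assumptions above (pinhole model, calibrated distortion coefficients D); the function and variable names are illustrative.

```python
import cv2
import numpy as np

def make_sample_image(image, K, D, K_ext):
    """Dedistort with the real intrinsics K and distortion coefficients D,
    then map the undistorted image into the image space of the extended
    intrinsics K_ext to obtain the sample image."""
    undistorted = cv2.undistort(image, K, D)            # remove lens distortion
    H = K_ext @ np.linalg.inv(K)                        # pinhole-to-pinhole mapping
    h, w = undistorted.shape[:2]
    return cv2.warpPerspective(undistorted, H, (w, h))

# Equivalently, OpenCV can fold both steps into a single call:
#   sample = cv2.undistort(image, K, D, None, K_ext)
```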
  • Step 5023 Use the target detection model to detect the sample image, and obtain the first two-dimensional position information and the first depth information of the target object in the sample image coordinate system.
  • the first two-dimensional position information is, for example, the position coordinates of each key point of the target object in the sample image, such as the 2D coordinates (u, v) mentioned above.
  • the first depth information is the depth corresponding to the 2D coordinates (u, v), such as Z in Equation (1).
  • Step 5024 Perform coordinate transformation on the first two-dimensional position information and the first depth information according to the extended internal parameters to obtain the first three-dimensional position information of the target object in the camera coordinate system corresponding to the extended internal parameters.
  • the first three-dimensional position information is the 3D coordinates of the key points of the target object calculated through formula (1).
  • Step 5025 Adjust parameters of the target detection model based on the difference between the first three-dimensional position information and the annotated position information.
  • the target detection model can be regarded as a combination of functions, each function having its own parameters; together these define the functionality of the model. The purpose of training is to estimate and adjust the parameters of these functions based on the training data set, so that the model learns the mapping from input images to expected results.
  • the overall training process with extended internal parameters can be summarized as shown in Figure 6: after collecting the image, the 3D coordinates of the key points of the obstacles in the image are annotated to obtain the annotated position of the target object.
  • the annotated position is saved in an annotation file, so the annotation file contains the real 3D coordinates (x, y, z) of the key points of the obstacle (that is, the annotated position of the target object).
  • step 601 the image A, internal parameter K and distortion coefficient D collected by the monocular camera are obtained.
  • step 602 the internal parameter K is transformed into the internal parameter K'.
  • step 603 image A is dedistorted using intrinsic parameter K and distortion coefficient D to obtain image A', and then in step 604, image A' is geometrically transformed using extended intrinsic parameter K' to obtain image B.
  • the execution order of step 602 and step 603 is not limited.
  • step 605 image B is input to the target detection model, and the 2D coordinates of the key points of the obstacle and their depths (u_p, v_p, Z_p) are output.
  • step 606 the predicted 3D coordinates are obtained from (u_p, v_p, Z_p) by left-multiplying with the inverse matrix K′⁻¹ of the extended internal parameter K′, per Equation (1).
  • step 607 the difference between the predicted 3D coordinates and the real 3D coordinates (x, y, z) is determined, and parameters of the target detection model can be adjusted based on the difference.
  • the training samples used in the same training batch belong to the same extended camera.
  • Each training batch includes multiple training samples, and the total difference between the predicted 3D coordinates and the real 3D coordinates of all training samples in the same training batch is calculated to adjust the parameters of the target detection model.
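  • The following condensed sketch strings steps 601 through 607 (Figure 6) together for a single image; `model`, its `update()` method, and the squared-error loss are illustrative placeholders rather than anything specified by this application.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def first_training_step(model, image_A, K, D, gt_points_3d):
    # Step 602: perturb the real intrinsics K to obtain extended intrinsics K'
    # (here only fx is perturbed, for brevity).
    K_ext = K.copy()
    K_ext[0, 0] = rng.normal(K[0, 0], 0.03 * K[0, 0])

    # Steps 603-604: dedistort with (K, D) and remap into the K' image space.
    image_B = cv2.undistort(image_A, K, D, None, K_ext)

    # Step 605: the model predicts (u_p, v_p, Z_p) per key point, shape (n, 3).
    preds = model(image_B)

    # Step 606: unproject with the *extended* intrinsics, per Equation (1).
    uv1 = np.column_stack([preds[:, :2], np.ones(len(preds))])
    pred_3d = preds[:, 2:3] * (np.linalg.inv(K_ext) @ uv1.T).T

    # Step 607: the 3D difference drives the parameter adjustment.
    loss = float(np.mean(np.sum((pred_3d - gt_points_3d) ** 2, axis=1)))
    model.update(loss)   # placeholder for the backward pass / optimizer step
    return loss
```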
  • the aforementioned N times of training also include at least one second training.
  • the second training will use the internal parameters of the first camera to train the target detection model, which can be implemented as:
  • the parameters of the target detection model are adjusted.
  • embodiments of the present application also provide a method for target detection using the above target detection model, as shown in Figure 7, including the following steps:
  • Step 701 Obtain the image to be detected collected by the monocular camera and the internal parameters of the monocular camera.
  • Step 702 Detect the image to be detected and obtain the two-dimensional position information and depth information of the target object in the coordinate system of the image to be detected.
  • Step 703 Perform coordinate transformation on the two-dimensional position information and depth information according to the internal parameters of the monocular camera to obtain the three-dimensional position information of the target object in the monocular camera coordinate system.
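  • A minimal sketch of the unprojection in steps 702 and 703; `preds` is an assumed (n, 3) array of per-key-point model outputs (u, v, Z).

```python
import numpy as np

def detect_3d(preds, K):
    """Unproject the model's 2D detections with the *real* intrinsics K of
    whichever monocular camera captured the image, per Equation (1)."""
    uv1 = np.column_stack([preds[:, :2], np.ones(len(preds))])
    return preds[:, 2:3] * (np.linalg.inv(K) @ uv1.T).T   # rows of (X, Y, Z)
```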
  • the camera set includes the internal parameters C of multiple monocular cameras.
  • multiple extended internal parameters E can be generated from the internal parameters C of the multiple monocular cameras.
  • the internal parameter set F used in the training phase includes the internal parameters C and the extended internal parameters E.
  • the complete 3D box information of the obstacle is inferred based on geometric relationships.
  • suppose the same camera type includes four monocular cameras, namely camera 1, camera 2, camera 3 and camera 4; then these four cameras share the same target detection model.
  • the internal parameter K1 and distortion coefficient D1 of camera 1 are used to dedistort the images it collects, which are then input into the target detection model.
  • the internal parameter K1 of camera 1 is then obtained, and the 3D coordinates of the obstacle are obtained using the inverse matrix of K1.
  • similarly, the internal parameter K2 and distortion coefficient D2 of camera 2 are used to dedistort the images it collects, which are then input into the target detection model.
  • the internal parameter K2 of camera 2 is then obtained, and the 3D coordinates of the obstacle are obtained using the inverse matrix of K2.
  • the processing methods of camera 3 and camera 4 are similar and are not repeated here.
  • this application provides a target detection model that is suitable for one or even multiple camera models. Due to the manufacturing process, even cameras of the same brand and model will have differences in their internal parameters. Training a corresponding model for each camera would obviously be too expensive. With the solution of this application, one model can be trained to suit a certain model, or even several models, of cameras.
  • this application also provides a training device 900 for a target detection model.
  • the training device 900 includes:
  • Information acquisition module 901 used to acquire internal parameters of at least one monocular camera and images collected by the at least one monocular camera;
  • the training module 902 is configured to train the target detection model N times according to the internal parameters of the first camera and the images collected by the first camera, where the first camera is any one of the at least one monocular camera, and N is an integer greater than 1;
  • the N times of training include at least one first training, and the first training includes the following steps:
  • the three-dimensional position information is used as the annotated position information of the target object in the sample image
  • parameters of the target detection model are adjusted.
  • the training module is also used to perform at least one second training in the N trainings, and the second training includes the following steps:
  • parameters of the target detection model are adjusted.
  • if S of the N trainings are first trainings that transform the internal parameters of the first camera, the extended internal parameters obtained by the transformation in each of the S first trainings are all different, where S is an integer less than or equal to N.
  • the training module is used to perform random perturbations on the internal parameters of the first camera to obtain expanded internal parameters used in the first training.
  • when performing the random perturbation of the internal parameters of the first camera, the training module is specifically configured to:
  • the sub-parameters in the intrinsic parameters of the first camera are replaced with extended sub-parameters of the sub-parameters.
  • the training module is used to:
  • the dedistorted image is processed based on the extended internal parameters to obtain the sample image.
  • this application also provides a target detection device 1000, which is used in the process of detecting target objects using the target detection model obtained by the training device 900.
  • the target detection device 1000 includes:
  • the to-be-detected image acquisition module 1001 is used to acquire the image to be detected collected by a monocular camera and the internal parameters of the monocular camera;
  • the two-dimensional information acquisition module 1002 is used to detect the image to be detected and obtain the two-dimensional position information and depth information of the target object in the coordinate system of the image to be detected;
  • the three-dimensional information determination module 1003 is used to perform coordinate transformation on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera to obtain the three-dimensional position information of the target object in the monocular camera coordinate system.
  • this application provides a chip system, including: a memory for storing a computer program; and a processor; when the processor calls and runs the computer program from the memory, it causes the electronic device installed with the chip system to execute any of the target detection model training methods or target detection methods described in this application.
  • this application provides a computer program product containing instructions that, when run on a computer, causes the computer to execute any of the target detection model training methods or target detection methods described in this application.
  • this application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute any of the target detection model training methods or target detection methods described in this application.
  • embodiments of the present application also provide an electronic device.
  • the electronic device may have a structure as shown in Figure 11.
  • the electronic device may be a computer device, or a chip or chip system that can support the computer device to implement the above methods.
  • the electronic device 1100 shown in Figure 11 may include at least one processor 1101, which is configured to be coupled with a memory, and to read and execute instructions in the memory to implement the steps of the target detection model training method or the target detection method of the embodiments of the present application.
  • the electronic device may also include a communication interface 1102 for supporting the electronic device to receive or send signaling or data.
  • the communication interface 1102 in the electronic device can be used to interact with other electronic devices.
  • the processor 1101 may be used to implement the electronic device to perform the steps in the method shown in any one of Figures 5-7.
  • the electronic device may also include a memory 1103 in which computer instructions are stored.
  • the memory 1103 may be coupled with the processor 1101 and/or the communication interface 1102 to support the processor 1101 in calling the computer instructions in the memory 1103 to implement the steps in the method shown in any one of Figures 5-7;
  • the memory 1103 can also be used to store data involved in the method embodiments of the present application, for example, to store data necessary to support the communication interface 1102 in implementing interaction.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • Computer instructions are stored on the computer-readable storage medium. When these computer instructions are called and executed by a computer, they can cause the computer to perform the methods involved in any one of the above method embodiments or the possible designs of the method embodiments.
  • the computer-readable storage medium is not limited. For example, it may be RAM (random-access memory), ROM (read-only memory), etc.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented by software, they may be implemented in whole or in part in the form of computer instructions.
  • when the computer instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center through wired (such as coaxial cable or optical fiber) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, tape), optical media (e.g., DVD), or semiconductor media (e.g., Solid State Disk (SSD)), etc.
  • the steps of the method or algorithm described in the embodiments of this application can be directly embedded in hardware, a software unit executed by a processor, or a combination of the two.
  • the software unit may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, register, hard disk, removable disk, CD-ROM or any other form of storage medium in the art.
  • the storage medium can be connected to the processor, so that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be integrated into the processor.
  • the processor and the storage medium can be installed in the ASIC, and the ASIC can be installed in the terminal device.
  • the processor and the storage medium may also be provided in different components in the terminal device.
  • These computer instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An object detection model training method. For any monocular camera, intrinsic parameters of the monocular camera are transformed into extension intrinsic parameters of another monocular camera, and an image acquired by the monocular camera is mapped into an image acquired by an extension camera; an object detection model is then trained. In the training stage, when 3D coordinates are generated, the intrinsic parameters of the extension camera are used for the image of the extension camera so that the intrinsic parameters match the image; the model can thus be applied to the extension camera, and the generalization capability of the model is improved. Further disclosed are a training apparatus for an object detection model, an object detection method and apparatus, a chip system, a program product, a storage medium and an electronic device.

Description

Training method of target detection model, target detection method and device

Technical field

The present application relates to the field of computer vision technology, and in particular to a training method for a target detection model, a target detection method and a device.

Background art

Target detection is a traditional task in the field of computer vision. Different from image recognition, target detection requires the position of the target object to be given in the form of a minimum bounding box. In 3D (three-dimensional) target detection, the 3D bounding box of the target object needs to be given. Taking the field of autonomous driving as an example, 3D target detection obtains the 3D coordinates of the target object, then obtains a 3D frame based on the 3D coordinates, and finally visualizes the 3D frame on the image and aerial view as shown in Figure 1.

Some 3D target detection methods based on monocular cameras have been proposed in related technologies. These methods use a target detection model to process images collected by a monocular camera, obtaining the 3D vertices of the target object and then the 3D frame. However, such methods cannot perform data amplification through image-based geometric transformation, so the generalization ability of the model is poor.

Summary of the invention

This application provides a training method for a target detection model, a target detection method and a device, which are used to improve the generalization ability of a 3D target detection model based on a monocular camera.
In the first aspect, this application provides a training method for a target detection model, including:

obtaining internal parameters of at least one monocular camera and images collected by the at least one monocular camera;

training the target detection model N times according to the internal parameters of a first camera and the images collected by the first camera, where the first camera is any one of the at least one monocular camera, and N is an integer greater than 1;

wherein the N times of training include at least one first training, and the first training includes the following steps:

transforming the internal parameters of the first camera to obtain the extended internal parameters used in the first training;

performing geometric transformation on the image collected by the first camera according to the internal parameters of the first camera and the extended internal parameters to obtain the sample image used in the first training, and using the three-dimensional position information of the target object in the image collected by the first camera as the annotated position information of the target object in the sample image;

using the target detection model to detect the sample image to obtain the first two-dimensional position information and the first depth information of the target object in the sample image coordinate system;

performing coordinate transformation on the first two-dimensional position information and the first depth information according to the extended internal parameters to obtain the first three-dimensional position information of the target object in the camera coordinate system corresponding to the extended internal parameters;

adjusting the parameters of the target detection model according to the difference between the first three-dimensional position information and the annotated position information.

In this embodiment, for any monocular camera, the internal parameters of the monocular camera are transformed into the internal parameters of another monocular camera (i.e., the extended internal parameters), the image collected by the monocular camera is mapped to an image collected by the extended camera, and the target detection model is then trained. In the training phase, when generating 3D coordinates, the intrinsic parameters of the extended camera are used for the images of the extended camera, so that the camera's intrinsic parameters match the image; the model can thus be applied to the extended camera, improving its generalization ability. In other words, the same target detection model can be applied to cameras with different internal parameters.
In some embodiments, the N times of training include at least one second training, and the second training includes the following steps:

using the target detection model to detect the image collected by the first camera to obtain the second two-dimensional position information and the second depth information of the target object in the image coordinate system of the image collected by the first camera;

performing coordinate transformation on the second two-dimensional position information and the second depth information according to the internal parameters of the first camera to obtain the second three-dimensional position information of the target object in the first camera coordinate system;

adjusting the parameters of the target detection model according to the difference between the second three-dimensional position information and the annotated position information.

In this embodiment, it is ensured that the target detection model is trained based on real cameras in the camera set, so the target detection model can be well applied to any camera in the camera set.

In some embodiments, if S of the N trainings are first trainings that transform the internal parameters of the first camera, the extended internal parameters obtained by the transformation in each of the S first trainings are all different, where S is an integer less than or equal to N.

In this way, the target detection model can be trained with more samples, further improving its generalization ability.
In some embodiments, transforming the internal parameters of the first camera to obtain the extended internal parameters used in the first training includes:

randomly perturbing the internal parameters of the first camera to obtain the extended internal parameters used in the first training.

In this way, the extended internal parameters are kept around the internal parameters of the first camera as much as possible, so as to ensure that the accuracy of the inference results of the target detection model meets the expected requirements and that training converges as soon as possible. The random perturbation makes the internal parameters of the first camera fluctuate within a range, increasing the amount of data and improving the robustness of the target detection model.

In some embodiments, randomly perturbing the internal parameters of the first camera includes:

for a sub-parameter in the internal parameters of the first camera, constructing a normal distribution curve centered on the sub-parameter;

obtaining a point from the normal distribution curve within a specified range centered on the sub-parameter, and using the obtained point as an extended sub-parameter of the sub-parameter;

replacing the sub-parameter in the internal parameters of the first camera with the extended sub-parameter.

In this way, a large number of extended internal parameters can be generated, increasing the training samples and improving the generalization ability of the model.

In some embodiments, performing geometric transformation on the image collected by the first camera according to the internal parameters of the first camera and the extended internal parameters to obtain the sample image used in the first training includes:

dedistorting the image collected by the first camera based on the internal parameters of the first camera and the distortion coefficient of the first camera to obtain a dedistorted image;

processing the dedistorted image based on the extended internal parameters to obtain the sample image.

In the embodiment of the present application, dedistortion ensures that the image collected by the first camera is accurately mapped to the image space of the extended camera, improving the accuracy of the model's inference results.
In the second aspect, this application provides a target detection method, applied to the process of detecting a target object with a target detection model trained by the method of any implementation of the first aspect. The method includes:

obtaining the image to be detected collected by a monocular camera and the internal parameters of the monocular camera;

detecting the image to be detected to obtain the two-dimensional position information and depth information of the target object in the coordinate system of the image to be detected;

performing coordinate transformation on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera to obtain the three-dimensional position information of the target object in the coordinate system of the monocular camera.

In this embodiment, for each of multiple cameras with different internal parameters, because its internal parameters were used during model training and the model has adapted to them, the inverse matrix of those internal parameters can be used during inference to left-multiply (u, v, Z′) and obtain the correct 3D coordinates.
In a third aspect, the present application further provides a training apparatus for a target detection model, including:
an information acquisition module, configured to acquire internal parameters of at least one monocular camera and images collected by the at least one monocular camera;
a training module, configured to train the target detection model N times according to the internal parameters of a first camera and the images collected by the first camera, where the first camera is any one of the at least one monocular camera, and N is an integer greater than 1;
where the N trainings include at least one first training, and the first training includes the following steps:
transforming the internal parameters of the first camera to obtain extended internal parameters used in the first training;
performing geometric transformation on the image collected by the first camera according to the internal parameters of the first camera and the extended internal parameters to obtain a sample image used in the first training, and using the three-dimensional position information of a target object in the image collected by the first camera as the annotated position information of the target object in the sample image;
detecting the sample image using the target detection model to obtain first two-dimensional position information and first depth information of the target object in the sample image coordinate system;
performing coordinate transformation on the first two-dimensional position information and the first depth information according to the extended internal parameters to obtain first three-dimensional position information of the target object in the camera coordinate system corresponding to the extended internal parameters;
adjusting the parameters of the target detection model according to the difference between the first three-dimensional position information and the annotated position information.
In some implementations, the training module is further configured to perform at least one second training among the N trainings, the second training including the following steps:
detecting the image collected by the first camera using the target detection model, to obtain second two-dimensional position information and second depth information of the target object in the image coordinate system of the image collected by the first camera;
performing coordinate transformation on the second two-dimensional position information and the second depth information according to the internal parameters of the first camera, to obtain second three-dimensional position information of the target object in the first camera coordinate system;
adjusting the parameters of the target detection model according to the difference between the second three-dimensional position information and the annotated position information.
In some implementations, if S first trainings among the N trainings transform the internal parameters of the first camera, the extended internal parameters obtained in each of the S first trainings are all different, where S is an integer less than or equal to N.
In some implementations, the training module is configured to randomly perturb the internal parameters of the first camera to obtain the extended internal parameters used in the current training.
In some implementations, when performing the random perturbation of the internal parameters of the first camera, the training module is specifically configured to:
for a sub-parameter of the internal parameters of the first camera, construct a normal distribution curve centered on the sub-parameter;
obtain a point from the normal distribution curve within a specified range centered on the sub-parameter, and use the obtained point as an extended sub-parameter of the sub-parameter;
replace the sub-parameter in the internal parameters of the first camera with the extended sub-parameter.
In some implementations, the training module is configured to:
de-distort the image collected by the first camera based on the internal parameters of the first camera and the distortion coefficients of the first camera, to obtain a de-distorted image;
process the de-distorted image based on the extended internal parameters to obtain the sample image.
In a fourth aspect, the present application further provides a target detection apparatus, applied to a process of detecting a target object with a target detection model obtained by the apparatus according to any one of the third aspect. The apparatus includes:
an image acquisition module, configured to acquire an image to be detected collected by a monocular camera, and the internal parameters of the monocular camera;
a two-dimensional information acquisition module, configured to detect the image to be detected to obtain two-dimensional position information and depth information of a target object in the coordinate system of the image to be detected;
a three-dimensional information determination module, configured to perform coordinate transformation on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera, to obtain three-dimensional position information of the target object in the coordinate system of the monocular camera.
In a fifth aspect, the present application provides a chip system, including: a memory configured to store a computer program; and a processor. When the processor calls and runs the computer program from the memory, an electronic device equipped with the chip system is caused to perform the method according to any one of the first aspect and the second aspect.
In a sixth aspect, the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method according to any one of the first aspect and the second aspect.
In a seventh aspect, the present application provides a computer-readable storage medium including instructions which, when run on a computer, cause the computer to perform the method according to any one of the first aspect and the second aspect.
In an eighth aspect, the present application further provides an electronic device, including:
a memory, configured to store a readable program;
at least one processor, configured to call and run the readable program from the memory, so that the electronic device implements the method according to any one of the first aspect and the second aspect.
Description of the Drawings
Figure 1 is a schematic diagram of visualizing 3D boxes on an image and a bird's-eye view;
Figure 2 is a schematic diagram of an application scenario provided by an embodiment of the present application;
Figure 3 is a schematic diagram of another application scenario provided by an embodiment of the present application;
Figure 4 is a schematic diagram of monocular cameras mounted on an autonomous vehicle;
Figure 5 is a schematic flowchart of the training method for a target detection model provided by an embodiment of the present application;
Figure 6 is another schematic flowchart of the training method for a target detection model provided by an embodiment of the present application;
Figure 7 is a schematic flowchart of the target detection method provided by an embodiment of the present application;
Figure 8 is a schematic diagram of the same target detection model being used by multiple monocular cameras according to an embodiment of the present application;
Figure 9 is a schematic structural diagram of the training apparatus for a target detection model provided by an embodiment of the present application;
Figure 10 is a schematic structural diagram of the target detection apparatus provided by an embodiment of the present application;
Figure 11 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.
Detailed Description
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
As stated in the background, when 3D target detection is performed on images collected by a monocular camera, the target detection model cannot use image-based geometric transformation, a commonly used data augmentation technique, and therefore cannot be trained with augmented data, resulting in poor generalization ability. This is because image-based geometric transformation breaks the 2D-to-3D mapping relationship. This is explained below. Obtaining a 3D box with the target detection model includes the following steps:
In the first step, the image collected by the monocular camera is input into the target detection model. The target detection model may be a depth prediction model based on two-dimensional images: it performs inference on the image, detects the 2D coordinates (u, v) of the target object in the image, and predicts the depth Z corresponding to those 2D coordinates. Typically the model infers the 2D coordinates and depth of the center point of the 3D box in the image, and may also infer the 2D coordinates and depths of a few of the 3D box vertices; the 2D coordinates and depths of the remaining, uninferred vertices can then be computed from the already inferred vertices and the geometric shape of the obstacle. Since a 3D box has 8 vertices plus the center point, the 2D coordinates and depths of 9 key points are obtained.
In the second step, the imaging principle of the monocular camera is used to parse the detection results of the target detection model and obtain the 3D coordinates of the target object in the camera coordinate system. Formula (1) describes the 2D-to-3D mapping relationship:
Z · (u, v, 1)ᵀ = K · (X, Y, Z)ᵀ, i.e. (X, Y, Z)ᵀ = K⁻¹ · (u·Z, v·Z, Z)ᵀ    (1)
In formula (1), (u, v) denotes the 2D coordinates of a key point, Z denotes the depth of the key point, and K denotes the internal parameters of the monocular camera, which can be obtained through camera calibration:
K = [fx, 0, cx; 0, fy, cy; 0, 0, 1]
where fx, fy, cx and cy are the sub-parameters of the internal parameters, K⁻¹ is the inverse matrix of K, and (X, Y, Z) are the 3D coordinates of the pixel in the camera coordinate system of the monocular camera.
Through the above two steps, the 3D coordinates of the target object are obtained, from which the length, width, height, and heading angle of the target object can be inferred.
When a geometric transformation such as translation, scaling, or rotation is applied to the image, the (u, v, Z) values no longer match the K⁻¹ used in the second step, so incorrect 3D coordinates are computed.
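To make the two-step decoding concrete, the following is a minimal Python sketch (an illustration only, not part of the patent text; the intrinsic values and key-point coordinates are made-up examples) of formula (1) and of how a plain image resize breaks the mapping:

```python
import numpy as np

# Hypothetical intrinsics of a monocular camera (example values only).
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

def back_project(u, v, Z, K):
    # Formula (1): recover camera-frame 3D coordinates from a detected
    # 2D key point (u, v) and its predicted depth Z via K^-1.
    return np.linalg.inv(K) @ np.array([u * Z, v * Z, Z])

X, Y, Z = back_project(800.0, 420.0, 12.0, K)    # correct decoding

# After a 0.5x image resize the same key point is detected at (400, 210);
# decoding with the ORIGINAL K^-1 now yields wrong 3D coordinates:
X2, Y2, Z2 = back_project(400.0, 210.0, 12.0, K)
assert not np.allclose([X, Y], [X2, Y2])         # the 2D-to-3D mapping broke
```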
Therefore, the target detection model of a monocular camera is bound to specific internal parameters, and one model cannot be applied to multiple cameras with different internal parameters. In other words, each monocular camera requires its own target detection model, and a target detection model is applicable only to the single monocular camera it is bound to. Consequently, simple image-based data augmentation of sample images is not suitable for the target detection model of a monocular camera, and the model's generalization ability cannot be improved this way.
In view of this, in order to improve the generalization ability of the target detection model, embodiments of the present application provide a feasible data augmentation method for training the target detection model.
In the embodiments of the present application, with the help of the camera's internal parameters K, geometric transformations can be applied to the images collected by the monocular camera to achieve data augmentation. To maintain the correct 2D-to-3D mapping relationship, internal parameters K′ (hereinafter also called the extended internal parameters) are constructed based on K, and the original image is projected into the image that would be captured by an extended camera with internal parameters K′. If K′ differs from K only in cx, the transformation is equivalent to translating the original image left or right; if only cy differs, it is equivalent to translating the original image up or down; if only fx differs, it is equivalent to scaling the original image horizontally; if only fy differs, it is equivalent to scaling it vertically. Moreover, for an image geometrically transformed using K′, left-multiplying any (u, v, Z) point on its plane by the inverse matrix K′⁻¹ yields the correct 3D coordinates.
Based on this, in the embodiments of the present application, for any monocular camera, the internal parameters of that camera are transformed into the internal parameters of another camera (i.e., the extended internal parameters of an extended camera), the image collected by the camera is mapped into an image as if collected by the extended camera, and the target detection model is then trained on it. During training, the extended camera's internal parameters are used when generating 3D coordinates from the extended camera's images, so that the internal parameters match the images. This makes the model applicable to the extended camera and improves its generalization ability; that is, the same target detection model can be applied to cameras with different internal parameters.
In the embodiments of the present application, the target detection model can be trained on the server side or on the terminal side; after training, the model can be deployed in a terminal. The terminal may be, for example, a vehicle, a mobile phone, a robot, or another device equipped with a monocular camera. Taking a vehicle as an example, Figure 2 is a schematic diagram of an application scenario provided by an embodiment of the present application. The scenario includes a vehicle 101 and a target object 102. The vehicle 101 collects an image containing the target object through its mounted monocular camera, and the image can be input into the target detection model 103 in the vehicle 101 to detect the 3D position of the target object 102.
In another application scenario, the target detection model 103 may also be deployed in a server. As shown in Figure 3, this scenario includes a vehicle 101, a target object 102, and a server 104. The monocular camera of the vehicle 101 collects an image containing the target object, the vehicle 101 sends the image to the server 104, and the server 104 uses its built-in target detection model 103 to perform inference on the image and obtain the 3D position of the target object.
To give an intuitive sense of the generalization ability of the target detection model in the embodiments of the present application, this is explained below with reference to Figure 4. As shown in Figure 4, four types of monocular cameras are mounted on the vehicle, 16 cameras in total: 1 long-range camera, 4 mid-range cameras, 7 short-range cameras, and 4 fisheye cameras. In the related art, each monocular camera needs its own target detection model, so 16 models would have to be trained. With the method provided by the embodiments of the present application, since cameras of the same type have only small differences in their internal parameters, one target detection model can be trained per camera type, so only 4 models need to be trained in total. Each target detection model is applicable to all monocular cameras of the same type, and the generalization ability of the target detection model is thus improved.
As shown in Figure 5, the flow of the training method for the target detection model in the embodiments of the present application includes:
Step 501: obtain the internal parameters of at least one monocular camera and the images collected by the at least one monocular camera.
For example, as shown in Figure 4, monocular cameras of the same camera type may form a camera set, and the cameras in the set jointly train the same target detection model. In implementation, the set is not limited to a single camera type: as long as the gap between their internal parameters is smaller than a gap threshold, monocular cameras of different types can also form a camera set for training the same target detection model. During training, any monocular camera in the set can serve as the first camera for training the target detection model, as in the grouping sketch below.
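As one possible illustration (the gap measure and the threshold below are assumptions; the patent does not prescribe them), the following Python sketch groups cameras into sets whose internal-parameter gap to a set representative stays below a threshold:

```python
import numpy as np

def intrinsics_gap(K_a, K_b):
    # A simple gap measure: the largest absolute difference between
    # corresponding sub-parameters (fx, fy, cx, cy).
    return float(np.max(np.abs(K_a - K_b)))

def build_camera_sets(cameras, gap_threshold):
    # Greedily assign each camera to the first set whose representative
    # (the set's first member) is within the gap threshold; each resulting
    # set then shares a single target detection model.
    sets = []
    for name, K in cameras:
        for group in sets:
            if intrinsics_gap(group[0][1], K) < gap_threshold:
                group.append((name, K))
                break
        else:
            sets.append([(name, K)])
    return sets
```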
Step 502: train the target detection model N times according to the internal parameters of the first camera and the images collected by the first camera, where N is an integer greater than 1.
The N trainings include at least one first training, and the first training includes the following steps:
Step 5021: transform the internal parameters of the first camera to obtain the extended internal parameters used in the first training.
In some implementations, if S of the N trainings are first trainings that transform the internal parameters of the first camera, the extended internal parameters obtained in each of the S first trainings are all different, where S is an integer less than or equal to N. For example, the internal parameters K of the first camera are transformed into m mutually different extended internal parameters K′1, K′2, K′3, …, K′m. Then, for any first camera that has collected p images, the m extended internal parameters effectively add p·m sample images, increasing the training data for the target detection model and further improving its generalization ability.
In some implementations, the internal parameters of the first camera can be transformed in multiple ways, for example by translating any sub-parameter of the internal parameters with an equal or unequal step size. For example, given the internal parameters of the first camera
K = [fx, 0, cx; 0, fy, cy; 0, 0, 1],
translating fx by a step size d yields fx′, giving the extended internal parameters
K′ = [fx′, 0, cx; 0, fy, cy; 0, 0, 1].
The translation may be, for example, fx′ = fx + d, or alternatively fx′ = fx − d. The adjustment can also follow the direction of a desired change: if K′ differs from K only in cx, the transformation is equivalent to translating the original image left or right; if only cy differs, it is equivalent to translating the original image up or down; if only fx differs, it is equivalent to scaling the original image horizontally; if only fy differs, it is equivalent to scaling it vertically.
In the embodiments of the present application, if the gap between the extended internal parameters and the first camera's internal parameters is too large, the inference ability of the target detection model may be limited and the model may fail to converge. Therefore, in implementation, the extended internal parameters should as far as possible be distributed around the first camera's internal parameters, e.g., within a threshold distance of them, so that the accuracy of the model's inference results meets expectations and training converges as quickly as possible. In one possible implementation, the extended internal parameters used in a given training are obtained by randomly perturbing the first camera's internal parameters within the threshold. The random perturbation lets the first camera's internal parameters fluctuate within a range, increasing the amount of data and improving the robustness of the target detection model.
Compared with the original practice of binding one target detection model to fixed internal parameters, the embodiments of the present application adopt a data augmentation method with variable internal parameters, so that the model adapts to different internal parameters during training and, in use, can accommodate multiple cameras with different internal parameters. This improves the generalization ability of the model, allows the same target detection model to fit multiple cameras with different internal parameters, enables one model to serve multiple monocular cameras, and reduces development cost.
In some possible implementations, to obtain the extended internal parameters by random perturbation, the embodiments of the present application take an original value (i.e., a sub-parameter) of the first camera's internal parameters K as the center point and, with a preset standard deviation, draw a random value from a normal distribution to replace the original value in K. This can be implemented as: for a sub-parameter of the first camera's internal parameters, construct a normal distribution curve centered on that sub-parameter; then obtain a point from the normal distribution curve within a specified range centered on the sub-parameter, and use the obtained point as the extended sub-parameter; finally, replace the sub-parameter in the first camera's internal parameters with the extended sub-parameter to obtain the extended internal parameters.
As an example, given the internal parameters of the first camera
K = [fx, 0, cx; 0, fy, cy; 0, 0, 1],
a normal distribution curve 1 is constructed for fx, and a point fx′ is obtained from curve 1 within a specified range centered on fx, giving the extended internal parameters
K′ = [fx′, 0, cx; 0, fy, cy; 0, 0, 1].
Of course, in implementation, the extended internal parameters are not limited to differing from the first camera's internal parameters in a single sub-parameter; several sub-parameters may differ. For example, in addition to normal distribution curve 1 for fx, a normal distribution curve 2 can be constructed for cx, and a point cx′ close to cx obtained from curve 2 to replace cx, giving the extended internal parameters
K′ = [fx′, 0, cx′; 0, fy, cy; 0, 0, 1].
Since the internal parameters of the first camera contain four sub-parameters, the extended sub-parameters of any one or more of them can be selected to construct the extended internal parameters.
It should also be noted that the parameter differences between different extended internal parameters may lie in the same sub-parameter or in different sub-parameters. For example, from normal distribution curve 1 for fx, many (possibly thousands of) different values fx′ can be drawn, each giving a different set of extended internal parameters. As another example, if not only fx′ but also other sub-parameters such as cx′ differ, further distinct extended internal parameters are obtained.
The training loop of the model may run hundreds of thousands of iterations, each using different internal parameters K′, where K′ either comes from the internal parameters of a real camera or is generated by randomly perturbing the internal parameters of a real camera. The model is thus trained on a large volume of training samples, which improves its generalization ability.
In the present application, different ways of generating internal parameters produce different effects. For example, an autonomous vehicle carries multiple cameras; these cameras can be calibrated to obtain a list of their internal parameters, and the entries of this list can be cycled through during model training, so that one model fits all the cameras on the vehicle. As another example, a camera model ships with default internal parameter values, but due to manufacturing tolerances the actual internal parameters of each unit differ from the defaults while remaining roughly distributed around them. Perturbed internal parameters can be generated from a normal distribution centered on the defaults and used in model training, so that the model adapts to all cases near the default internal parameters and one model fits that camera model.
Step 5022: perform geometric transformation on the image collected by the first camera according to the first camera's internal parameters and the extended internal parameters, to obtain the sample image used in the first training, and use the three-dimensional position information of the target object in the image collected by the first camera as the annotated position information of the target object in the sample image.
Image geometric transformation, also called image space transformation, maps coordinate positions in one image to new coordinate positions in another image without changing the pixel values. In image processing, geometric transformation is usually applied to remove, as far as possible, geometric distortion caused by imaging angle, perspective, and the like. Geometric transformations of an image include translation, rotation, scaling, orthographic parallel projection, and so on, and can be implemented through spatial transformation and interpolation algorithms. The key to a geometric transformation is the transformation parameters of the mapping, for example one or more of the translation components, scaling factors, and rotation angle. In general, camera internal parameters need not be considered when geometrically transforming an image collected by a camera; in the embodiments of the present application, however, the extended internal parameters of the extended camera are used as the transformation parameters to translate, scale, and otherwise transform the image collected by the first camera. For example, if the extended internal parameters K′ differ from the first camera's internal parameters K only in cx, the transformation is equivalent to translating the original image left or right.
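As a sketch of this step (assuming an already undistorted input image and the OpenCV library; this is one possible realization, not the patent's mandated implementation), the warp from the first camera's image space into the extended camera's image space can be written as the homography H = K′·K⁻¹, which for intrinsics-only changes reduces to a per-axis scale plus shift:

```python
import cv2
import numpy as np

def project_to_extended_camera(image, K, K_ext):
    # Remap an undistorted image taken with intrinsics K into the image
    # space of an extended camera with intrinsics K_ext. With no rotation
    # between the two (virtual) cameras, the mapping is H = K_ext @ K^-1.
    H = K_ext @ np.linalg.inv(K)
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))
```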
In addition, images collected by a monocular camera exhibit image distortion. Image distortion arises because deviations in lens manufacturing precision and assembly processes introduce distortion that warps the original image. Lens distortion falls into two categories: radial and tangential. Radial distortion is caused by the inherent characteristics of the lens's convex elements: light rays bend more at the edge of the lens than near its center. It is distributed along the lens radius and mainly comprises barrel distortion and pincushion distortion. Tangential distortion arises when the lens is not parallel to the camera sensor plane (the imaging plane), mostly due to mounting deviations when the lens is attached to the lens module. Therefore, to ensure that the image collected by the first camera is accurately mapped into the image space of the extended camera, in the embodiments of the present application the image collected by the first camera is first de-distorted based on the first camera's internal parameters and distortion coefficients to obtain a de-distorted image; the de-distorted image is then processed based on the extended internal parameters to obtain the sample image.
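Putting the two operations together, a hedged sketch of sample generation (using OpenCV's undistortion and the warp helper sketched above) might look like this:

```python
import cv2

def make_sample_image(image, K, D, K_ext):
    # De-distort with the calibrated intrinsics K and distortion
    # coefficients D, then project into the extended camera's image space.
    undistorted = cv2.undistort(image, K, D)
    return project_to_extended_camera(undistorted, K, K_ext)
```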
Step 5023: detect the sample image using the target detection model, to obtain the first two-dimensional position information and first depth information of the target object in the sample image coordinate system. In the embodiments of the present application, the first two-dimensional position information is, for example, the position coordinates of each key point of the target object in the sample image, i.e., the 2D coordinates (u, v) described above; the first depth information is the depth corresponding to the 2D coordinates (u, v), i.e., Z in formula (1).
Step 5024: perform coordinate transformation on the first two-dimensional position information and the first depth information according to the extended internal parameters, to obtain the first three-dimensional position information of the target object in the camera coordinate system corresponding to the extended internal parameters. The first three-dimensional position information is the 3D coordinates of the target object's key points computed through formula (1).
Step 5025: adjust the parameters of the target detection model according to the difference between the first three-dimensional position information and the annotated position information.
The target detection model contains multiple functions, each with its own parameters, which together define the model's behavior. The purpose of training is to estimate and adjust the parameters of these functions on the training data set, so that the model learns the mapping from input images to the expected results.
Taking an autonomous vehicle as an example, the overall training flow with extended internal parameters can be summarized as shown in Figure 6. After images are collected, the 3D coordinates of the key points of the obstacles in each image are annotated to obtain the annotated positions of the target objects, which are saved in an annotation file; the annotation file therefore contains the true 3D coordinates (x, y, z) of the obstacle key points (i.e., the annotated positions of the target objects). During training:
First, in step 601, the image A collected by the monocular camera, the internal parameters K, and the distortion coefficients D are obtained.
In step 602, the internal parameters K are transformed into internal parameters K′.
In step 603, image A is de-distorted using the internal parameters K and the distortion coefficients D to obtain image A′; then, in step 604, image A′ is geometrically transformed using the extended internal parameters K′ to obtain image B.
It should be noted that the execution order of step 602 and step 603 is not restricted.
In step 605, image B is input into the target detection model, which outputs the 2D coordinates and depths (u_p, v_p, Z_p) of the obstacle key points.
In step 606, the predicted 3D coordinates are obtained by left-multiplying (u_p, v_p, Z_p) by the inverse matrix K′⁻¹ of the extended internal parameters K′.
In step 607, the difference between the predicted 3D coordinates and the true 3D coordinates (x, y, z) is determined, and the parameters of the target detection model are adjusted according to this difference.
During training, the training samples used in one training batch all belong to the same extended camera. Each training batch includes multiple training samples, and the total difference between the predicted and true 3D coordinates over all samples in the batch is computed to perform one parameter update of the target detection model. A minimal sketch of one such iteration is given below.
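The following Python sketch strings steps 601 to 607 together, reusing the helpers sketched above; the model and optimizer objects are hypothetical placeholders, and the mean-squared loss is an assumption (the patent only requires adjusting parameters according to the difference):

```python
import numpy as np

def first_training_step(model, optimizer, image, K, D, gt_xyz):
    # One iteration: build a sample for a randomly perturbed extended
    # camera, decode predictions with K'^-1, and update the model from
    # the difference to the annotated 3D coordinates.
    K_ext = perturb_intrinsics(K)                    # step 602
    sample = make_sample_image(image, K, D, K_ext)   # steps 603-604
    uvz = model(sample)                              # step 605: (u, v, Z) rows
    K_inv = np.linalg.inv(K_ext)
    pred_xyz = np.stack([K_inv @ np.array([u * z, v * z, z])
                         for u, v, z in uvz])        # step 606
    loss = np.mean((pred_xyz - gt_xyz) ** 2)         # step 607: difference
    optimizer.step(loss)                             # adjust model parameters
    return loss
```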
Besides training the target detection model with extended cameras, in the embodiments of the present application the aforementioned N trainings also include at least one second training, which trains the target detection model with the first camera's own internal parameters and can be implemented as:
detecting the image collected by the first camera using the target detection model, to obtain the second two-dimensional position information and second depth information of the target object in the image coordinate system of the image collected by the first camera;
performing coordinate transformation on the second two-dimensional position information and the second depth information according to the first camera's internal parameters, to obtain the second three-dimensional position information of the target object in the first camera coordinate system;
adjusting the parameters of the target detection model according to the difference between the second three-dimensional position information and the annotated position information.
This ensures that the target detection model is also trained on the real cameras in the camera set, so that it applies well to every camera in the set.
Based on the same inventive concept, embodiments of the present application further provide a target detection method using the above target detection model, as shown in Figure 7, including the following steps:
Step 701: obtain the image to be detected collected by a monocular camera and the internal parameters of the monocular camera.
Step 702: detect the image to be detected, to obtain the two-dimensional position information and depth information of the target object in the coordinate system of the image to be detected.
Step 703: perform coordinate transformation on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera, to obtain the three-dimensional position information of the target object in the monocular camera coordinate system.
Taking an autonomous vehicle as an example, the camera set includes the internal parameters C of multiple monocular cameras, from which multiple extended internal parameters E can be derived. The internal parameter set F used in the training phase thus includes both C and E. When deploying the target detection model on an actual vehicle, the internal parameters K″ and distortion coefficients D of the monocular camera in use are first calibrated. Because variable internal parameters were used during model training, the real internal parameters K″ used at inference time are simply a subset of the internal parameter set F used in training.
The parameters K″ and D are used to de-distort the collected image A, yielding image B. Image B is fed into the target detection model for inference to obtain the 2D coordinates and depths (u, v, Z′) of the obstacle key points, and (u, v, Z′) is then left-multiplied by the inverse matrix of K″ to obtain the 3D coordinates (X′, Y′, Z′) of the obstacle key points. After the 3D coordinates of multiple obstacle key points have been predicted, the complete 3D box information of the obstacle is inferred from their geometric relationships.
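A deployment-time sketch of this inference flow (again assuming OpenCV and a hypothetical model object):

```python
import cv2
import numpy as np

def detect_3d_keypoints(model, image_a, K_real, D_real):
    # De-distort image A with the calibrated K'' and D, run the detector,
    # then decode each (u, v, Z') key point with the real camera's K''^-1.
    # The complete 3D box is afterwards fit from these key points.
    image_b = cv2.undistort(image_a, K_real, D_real)
    K_inv = np.linalg.inv(K_real)
    points = [K_inv @ np.array([u * z, v * z, z])
              for u, v, z in model(image_b)]
    return np.stack(points)
```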
In summary, because the internal parameters of multiple different cameras were all used during model training and the model has adapted to them, the inverse matrix of a given camera's internal parameters can be used at inference time to left-multiply (u, v, Z′) and obtain the correct 3D coordinates.
As shown in Figure 8, one camera type includes four monocular cameras, say camera 1, camera 2, camera 3, and camera 4. These four cameras share the same target detection model. When camera 1 collects an image, the image is de-distorted using camera 1's internal parameters K1 and distortion coefficients D1 and then input into the target detection model; the model obtains camera 1's internal parameters K1 and uses the inverse matrix of K1 to compute the 3D coordinates of the obstacle. Likewise, an image collected by camera 2 is de-distorted using camera 2's internal parameters K2 and distortion coefficients D2 and input into the target detection model; the model obtains K2 and uses its inverse matrix to compute the obstacle's 3D coordinates. Cameras 3 and 4 are handled similarly and are not described again here.
In summary, the present application releases one target detection model that applies to one or even several camera models. Due to manufacturing tolerances, even cameras of the same brand and model differ in their internal parameters; training a separate model for every camera would clearly be too costly. With the solution of the present application, one model can be trained to suit a single camera model or even several camera models.
Based on the same inventive concept, the present application further provides a training apparatus 900 for a target detection model. As shown in Figure 9, the training apparatus 900 includes:
an information acquisition module 901, configured to acquire internal parameters of at least one monocular camera and images collected by the at least one monocular camera;
a training module 902, configured to train the target detection model N times according to the internal parameters of a first camera and the images collected by the first camera, where the first camera is any one of the at least one monocular camera, and N is an integer greater than 1;
where the N trainings include at least one first training, and the first training includes the following steps:
transforming the internal parameters of the first camera to obtain extended internal parameters used in the first training;
performing geometric transformation on the image collected by the first camera according to the internal parameters of the first camera and the extended internal parameters to obtain a sample image used in the first training, and using the three-dimensional position information of a target object in the image collected by the first camera as the annotated position information of the target object in the sample image;
detecting the sample image using the target detection model to obtain first two-dimensional position information and first depth information of the target object in the sample image coordinate system;
performing coordinate transformation on the first two-dimensional position information and the first depth information according to the extended internal parameters to obtain first three-dimensional position information of the target object in the camera coordinate system corresponding to the extended internal parameters;
adjusting the parameters of the target detection model according to the difference between the first three-dimensional position information and the annotated position information.
In some implementations, the training module is further configured to perform at least one second training among the N trainings, the second training including the following steps:
detecting the image collected by the first camera using the target detection model, to obtain second two-dimensional position information and second depth information of the target object in the image coordinate system of the image collected by the first camera;
performing coordinate transformation on the second two-dimensional position information and the second depth information according to the internal parameters of the first camera, to obtain second three-dimensional position information of the target object in the first camera coordinate system;
adjusting the parameters of the target detection model according to the difference between the second three-dimensional position information and the annotated position information.
In some implementations, if S first trainings among the N trainings transform the internal parameters of the first camera, the extended internal parameters obtained in each of the S first trainings are all different, where S is an integer less than or equal to N.
In some implementations, the training module is configured to randomly perturb the internal parameters of the first camera to obtain the extended internal parameters used in the first training.
In some implementations, when performing the random perturbation of the internal parameters of the first camera, the training module is specifically configured to:
for a sub-parameter of the internal parameters of the first camera, construct a normal distribution curve centered on the sub-parameter;
obtain a point from the normal distribution curve within a specified range centered on the sub-parameter, and use the obtained point as an extended sub-parameter of the sub-parameter;
replace the sub-parameter in the internal parameters of the first camera with the extended sub-parameter.
In some implementations, the training module is configured to:
de-distort the image collected by the first camera based on the internal parameters of the first camera and the distortion coefficients of the first camera, to obtain a de-distorted image;
process the de-distorted image based on the extended internal parameters to obtain the sample image.
Based on the same inventive concept, the present application further provides a target detection apparatus 1000, applied to the process of detecting a target object with the target detection model obtained by the training apparatus 900. As shown in Figure 10, the target detection apparatus 1000 includes:
an image acquisition module 1001, configured to acquire an image to be detected collected by a monocular camera, and the internal parameters of the monocular camera;
a two-dimensional information acquisition module 1002, configured to detect the image to be detected to obtain two-dimensional position information and depth information of a target object in the coordinate system of the image to be detected;
a three-dimensional information determination module 1003, configured to perform coordinate transformation on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera, to obtain three-dimensional position information of the target object in the monocular camera coordinate system.
Based on the same inventive concept, the present application provides a chip system, including: a memory configured to store a computer program; and a processor. When the processor calls and runs the computer program from the memory, an electronic device equipped with the chip system is caused to perform any of the target detection model training methods or target detection methods described in the present application.
Based on the same inventive concept, the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the target detection model training methods or target detection methods described in the present application.
Based on the same inventive concept, the present application provides a computer-readable storage medium including instructions which, when run on a computer, cause the computer to perform any of the target detection model training methods or target detection methods described in the present application.
Based on the same inventive concept, embodiments of the present application further provide an electronic device, which may have the structure shown in Figure 11. The electronic device may be a computer device, or a chip or chip system capable of supporting a computer device in implementing the above methods.
The electronic device 1100 shown in Figure 11 may include at least one processor 1101, the at least one processor 1101 being configured to couple with a memory, and to read and execute instructions in the memory to implement the steps of the target detection model training method or the target detection method of the embodiments of the present application. Optionally, the electronic device may further include a communication interface 1102, configured to support the electronic device in receiving or sending signaling or data; the communication interface 1102 in the electronic device may be used to interact with other electronic devices. The processor 1101 may be configured to enable the electronic device to perform the steps of the method shown in any of Figures 5-7. Optionally, the electronic device may further include a memory 1103 storing computer instructions; the memory 1103 may be coupled with the processor 1101 and/or the communication interface 1102 to support the processor 1101 in calling the computer instructions in the memory 1103 to implement the steps of the method shown in any of Figures 5-7. In addition, the memory 1103 may also be used to store data involved in the method embodiments of the present application, for example, data and instructions necessary to support the communication interface 1102 in implementing interaction, and/or configuration information (such as WORM attributes) necessary for the electronic device to perform the methods described in the embodiments of the present application.
Embodiments of the present application further provide a computer-readable storage medium on which computer instructions are stored; when these computer instructions are called and executed by a computer, the computer can be caused to complete the methods involved in the above method embodiments or in any possible design of the method embodiments. In the embodiments of the present application, the computer-readable storage medium is not limited; for example, it may be a RAM (random-access memory), a ROM (read-only memory), or the like.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机指令的形式实现。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据电子设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of computer instructions. When the computer instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, e.g., the computer instructions may be transferred from a website, computer, server, or data center Transmission to another website, computer, server or data center through wired (such as coaxial cable, optical fiber) or wireless (such as infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer or a data electronic device such as a server or data center integrated with one or more available media. The available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), etc.
The steps of the methods or algorithms described in the embodiments of this application can be directly embedded in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium in the art. For example, the storage medium may be connected to the processor so that the processor can read information from the storage medium and write information to the storage medium. Optionally, the storage medium may also be integrated into the processor. The processor and the storage medium may be arranged in an ASIC, and the ASIC may be arranged in a terminal device. Optionally, the processor and the storage medium may also be arranged in different components of the terminal device.
These computer instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the present application has been described in conjunction with specific features and embodiments thereof, it is apparent that various modifications and combinations may be made without departing from the scope of the present application. Accordingly, the specification and drawings are merely illustrative of the application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of the application. Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include them.

Claims (18)

  1. A training method for a target detection model, characterized by comprising:
    obtaining internal parameters of at least one monocular camera and images collected by the at least one monocular camera;
    performing N times of training on a target detection model according to the internal parameters of a first camera and images collected by the first camera, wherein the first camera is any one of the at least one monocular camera, and N is an integer greater than 1;
    wherein the N times of training include at least one first training, and the first training includes the following steps:
    transforming the internal parameters of the first camera to obtain extended internal parameters used in the first training;
    performing geometric transformation on the image collected by the first camera according to the internal parameters of the first camera and the extended internal parameters to obtain a sample image used in the first training, and using three-dimensional position information of a target object in the image collected by the first camera as annotated position information of the target object in the sample image;
    using the target detection model to detect the sample image to obtain first two-dimensional position information and first depth information of the target object in the sample image coordinate system;
    performing coordinate transformation on the first two-dimensional position information and the first depth information according to the extended internal parameters to obtain first three-dimensional position information of the target object in the camera coordinate system corresponding to the extended internal parameters;
    adjusting parameters of the target detection model according to the difference between the first three-dimensional position information and the annotated position information.
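(For orientation only: the first training can be read as one supervised step in which detection happens under a synthetic camera. A compressed numpy sketch follows; the detector fake_detect, its return values, and the elided warping step are hypothetical placeholders rather than part of the claimed method — the perturbation of claim 5 and the geometric warp of claim 6 are sketched in more detail after those claims.)

    import numpy as np

    def fake_detect(sample_image):
        # Hypothetical detector: returns one object's pixel (u, v) and depth z.
        return np.array([700.0, 400.0]), 14.8

    def first_training_loss(image, K_ext, label_3d):
        # The geometric warp of the captured image into the extended camera
        # model (claim 6) is elided here; see the sketch after claim 6.
        sample = image
        (u, v), z = fake_detect(sample)                  # 2D position + depth
        # Back-project with the *extended* intrinsics, not the original ones:
        pred_3d = np.array([(u - K_ext[0, 2]) * z / K_ext[0, 0],
                            (v - K_ext[1, 2]) * z / K_ext[1, 1],
                            z])
        # The difference between prediction and annotation drives the
        # parameter adjustment of the model (back-propagation elided).
        return np.linalg.norm(pred_3d - label_3d)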
  2. The method according to claim 1, characterized in that the N times of training include at least one second training, and the second training includes the following steps:
    using the target detection model to detect the image collected by the first camera to obtain second two-dimensional position information and second depth information of the target object in the image coordinate system of the image collected by the first camera;
    performing coordinate transformation on the second two-dimensional position information and the second depth information according to the internal parameters of the first camera to obtain second three-dimensional position information of the target object in the first camera coordinate system;
    adjusting parameters of the target detection model according to the difference between the second three-dimensional position information and the annotated position information.
  3. The method according to claim 1 or 2, characterized in that if S first trainings among the N times of training transform the internal parameters of the first camera, the extended internal parameters obtained by transformation in each of the S first trainings are different from one another, S being an integer less than or equal to N.
  4. The method according to any one of claims 1-3, characterized in that transforming the internal parameters of the first camera to obtain the extended internal parameters used in the first training comprises:
    randomly perturbing the internal parameters of the first camera to obtain the extended internal parameters used in the first training.
  5. The method according to claim 4, characterized in that randomly perturbing the internal parameters of the first camera comprises:
    for a sub-parameter in the internal parameters of the first camera, constructing a normal distribution curve centered on the sub-parameter;
    obtaining a point from the normal distribution curve within a specified range centered on the sub-parameter, and using the obtained point as an extended sub-parameter of the sub-parameter;
    replacing the sub-parameter in the internal parameters of the first camera with the extended sub-parameter of the sub-parameter.
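(A minimal numpy sketch of this perturbation, under two assumptions not taken from the claims: the sub-parameters are fx, fy, cx, cy of a 3x3 intrinsic matrix, and the "specified range" is ±5% of each sub-parameter.)

    import numpy as np

    def perturb_subparameter(p, sigma_ratio=0.02, range_ratio=0.05):
        # Sample from a normal distribution centered on p, re-drawing until
        # the sample falls inside the specified range around p (here +/-5%).
        lo, hi = p * (1 - range_ratio), p * (1 + range_ratio)
        while True:
            sample = np.random.normal(loc=p, scale=p * sigma_ratio)
            if lo <= sample <= hi:
                return sample

    def perturb_intrinsics(K):
        # Replace fx, fy, cx, cy in K with their extended counterparts.
        K_ext = K.copy()
        for i, j in [(0, 0), (1, 1), (0, 2), (1, 2)]:
            K_ext[i, j] = perturb_subparameter(K[i, j])
        return K_ext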
  6. The method according to any one of claims 1-5, characterized in that performing geometric transformation on the image collected by the first camera according to the internal parameters of the first camera and the extended internal parameters to obtain the sample image used in the first training comprises:
    de-distorting the image collected by the first camera based on the internal parameters of the first camera and the distortion coefficients of the first camera to obtain a de-distorted image;
    processing the de-distorted image based on the extended internal parameters to obtain the sample image.
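(One way to realize these two steps with OpenCV — a sketch assuming the extended intrinsics differ from the originals only in fx, fy, cx, cy, with no rotation between the two camera models; dist holds the first camera's distortion coefficients.)

    import cv2
    import numpy as np

    def make_sample_image(image, K, dist, K_ext):
        h, w = image.shape[:2]
        # Step 1: remove lens distortion using the original intrinsics.
        undistorted = cv2.undistort(image, K, dist)
        # Step 2: re-render the undistorted image as if it had been captured
        # by a camera with the extended intrinsics: x' = K_ext @ inv(K) @ x.
        H = K_ext @ np.linalg.inv(K)
        return cv2.warpPerspective(undistorted, H, (w, h))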
  7. A target detection method, characterized in that it is applied to a process of detecting a target object with a target detection model trained by the method according to any one of claims 1-6, the method comprising:
    obtaining an image to be detected collected by a monocular camera and internal parameters of the monocular camera;
    detecting the image to be detected to obtain two-dimensional position information and depth information of a target object in the coordinate system of the image to be detected;
    performing coordinate transformation on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera to obtain three-dimensional position information of the target object in the monocular camera coordinate system.
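(The coordinate transformation here is the standard pinhole back-projection. A sketch, assuming intrinsics of the form [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] and metric depth z; the numbers in the usage lines are made up.)

    import numpy as np

    def to_camera_frame(u, v, z, K):
        # Map a detected pixel (u, v) with depth z to 3D camera coordinates.
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

    # Usage: a detection at pixel (700, 400) with depth 15 m
    K = np.array([[1200.0, 0, 640], [0, 1200.0, 360], [0, 0, 1]])
    print(to_camera_frame(700, 400, 15.0, K))   # -> [0.75, 0.5, 15.0]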
  8. A training apparatus for a target detection model, characterized by comprising:
    an information acquisition module, configured to acquire internal parameters of at least one monocular camera and images collected by the at least one monocular camera;
    a training module, configured to train the target detection model N times according to the internal parameters of a first camera and images collected by the first camera, wherein the first camera is any one of the at least one monocular camera, and N is an integer greater than 1;
    wherein the N times of training include at least one first training, and the first training includes the following steps:
    transforming the internal parameters of the first camera to obtain extended internal parameters used in the first training;
    performing geometric transformation on the image collected by the first camera according to the internal parameters of the first camera and the extended internal parameters to obtain a sample image used in the first training, and using three-dimensional position information of a target object in the image collected by the first camera as annotated position information of the target object in the sample image;
    using the target detection model to detect the sample image to obtain first two-dimensional position information and first depth information of the target object in the sample image coordinate system;
    performing coordinate transformation on the first two-dimensional position information and the first depth information according to the extended internal parameters to obtain first three-dimensional position information of the target object in the camera coordinate system corresponding to the extended internal parameters;
    adjusting parameters of the target detection model according to the difference between the first three-dimensional position information and the annotated position information.
  9. The apparatus according to claim 8, characterized in that the training module is further configured to perform at least one second training among the N times of training, the second training including the following steps:
    using the target detection model to detect the image collected by the first camera to obtain second two-dimensional position information and second depth information of the target object in the image coordinate system of the image collected by the first camera;
    performing coordinate transformation on the second two-dimensional position information and the second depth information according to the internal parameters of the first camera to obtain second three-dimensional position information of the target object in the first camera coordinate system;
    adjusting parameters of the target detection model according to the difference between the second three-dimensional position information and the annotated position information.
  10. The apparatus according to claim 8 or 9, characterized in that if S first trainings among the N times of training transform the internal parameters of the first camera, the extended internal parameters obtained by transformation in each of the S first trainings are different from one another, S being an integer less than or equal to N.
  11. The apparatus according to any one of claims 8-10, characterized in that the training module is configured to randomly perturb the internal parameters of the first camera to obtain the extended internal parameters used in the first training.
  12. The apparatus according to claim 11, characterized in that, in performing the random perturbation of the internal parameters of the first camera, the training module is specifically configured to:
    for a sub-parameter in the internal parameters of the first camera, construct a normal distribution curve centered on the sub-parameter;
    obtain a point from the normal distribution curve within a specified range centered on the sub-parameter, and use the obtained point as an extended sub-parameter of the sub-parameter;
    replace the sub-parameter in the internal parameters of the first camera with the extended sub-parameter of the sub-parameter.
  13. The apparatus according to any one of claims 8-12, characterized in that the training module is configured to:
    de-distort the image collected by the first camera based on the internal parameters of the first camera and the distortion coefficients of the first camera to obtain a de-distorted image;
    process the de-distorted image based on the extended internal parameters to obtain the sample image.
  14. A target detection apparatus, characterized in that it is applied to a process of detecting a target object with a target detection model obtained by the apparatus according to any one of claims 8-13, the apparatus comprising:
    a to-be-detected image acquisition module, configured to acquire an image to be detected collected by a monocular camera and internal parameters of the monocular camera;
    a two-dimensional information acquisition module, configured to detect the image to be detected to obtain two-dimensional position information and depth information of a target object in the coordinate system of the image to be detected;
    a three-dimensional information determination module, configured to perform coordinate transformation on the two-dimensional position information and the depth information according to the internal parameters of the monocular camera to obtain three-dimensional position information of the target object in the monocular camera coordinate system.
  15. A chip system, characterized by comprising: a memory for storing a computer program; and a processor, wherein, when the processor calls and runs the computer program from the memory, an electronic device equipped with the chip system is caused to execute the method according to any one of claims 1-7.
  16. A computer program product containing instructions, characterized in that, when run on a computer, it causes the computer to execute the method according to any one of claims 1-7.
  17. A computer-readable storage medium, characterized by comprising instructions that, when run on a computer, cause the computer to execute the method according to any one of claims 1-7.
  18. An electronic device, characterized by comprising:
    a memory for storing a readable program; and
    at least one processor, configured to call and run the readable program from the memory, so that the electronic device implements the method according to any one of claims 1-7.
PCT/CN2022/088566 2022-04-22 2022-04-22 Object detection model training method, and object detection method and apparatus WO2023201723A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/088566 WO2023201723A1 (en) 2022-04-22 2022-04-22 Object detection model training method, and object detection method and apparatus
CN202280005788.8A CN117280385A (en) 2022-04-22 2022-04-22 Training method of target detection model, target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/088566 WO2023201723A1 (en) 2022-04-22 2022-04-22 Object detection model training method, and object detection method and apparatus

Publications (1)

Publication Number Publication Date
WO2023201723A1

Family

ID=88418973

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/088566 WO2023201723A1 (en) 2022-04-22 2022-04-22 Object detection model training method, and object detection method and apparatus

Country Status (2)

Country Link
CN (1) CN117280385A (en)
WO (1) WO2023201723A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014178969A (en) * 2013-03-15 2014-09-25 Nec Solution Innovators Ltd Information processor and determination method
CN112668460A (en) * 2020-12-25 2021-04-16 北京百度网讯科技有限公司 Target detection method, electronic equipment, road side equipment and cloud control platform
CN113128434A (en) * 2021-04-27 2021-07-16 南京大学 Method for carrying out 3D target detection on monocular RGB image
CN113947768A (en) * 2021-10-15 2022-01-18 京东鲲鹏(江苏)科技有限公司 Monocular 3D target detection-based data enhancement method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUANG GUO-SHING, YU-YONG TSENG: "Application of Stereo Vision 3D Target Recognition Using Camera Calibration Algorithm", PROCEEDINGS OF THE 2015 AASRI INTERNATIONAL CONFERENCE ON CIRCUITS AND SYSTEMS, 31 August 2015 (2015-08-31), pages 381 - 385, XP093102170 *
R. TSAI: "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses", IEEE JOURNAL ON ROBOTICS AND AUTOMATION, IEEE, USA, vol. 3, no. 4, 1 August 1987 (1987-08-01), USA , pages 323 - 344, XP011217413, ISSN: 0882-4967, DOI: 10.1109/JRA.1987.1087109 *

Also Published As

Publication number Publication date
CN117280385A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
KR102126724B1 (en) Method and apparatus for restoring point cloud data
JP6830139B2 (en) 3D data generation method, 3D data generation device, computer equipment and computer readable storage medium
CN107230225B (en) Method and apparatus for three-dimensional reconstruction
WO2020206708A1 (en) Obstacle recognition method and apparatus, computer device, and storage medium
CN110348454B (en) Matching local image feature descriptors
JP6902122B2 (en) Double viewing angle Image calibration and image processing methods, equipment, storage media and electronics
US20120177284A1 (en) Forming 3d models using multiple images
US20120177283A1 (en) Forming 3d models using two images
CN104616278B (en) Three-dimensional point cloud interest point detection method and system
EP3182369B1 (en) Stereo matching method, controller and system
US20190080464A1 (en) Stereo matching method and apparatus
JP7173285B2 (en) Camera calibration device, camera calibration method, and program
US20210144357A1 (en) Method and apparatus with depth image generation
WO2020215257A1 (en) Image stereo matching method and assisted driving apparatus
WO2022110862A1 (en) Method and apparatus for constructing road direction arrow, electronic device, and storage medium
CN113447923A (en) Target detection method, device, system, electronic equipment and storage medium
WO2022205663A1 (en) Neural network training method and apparatus, target object detecting method and apparatus, and driving control method and apparatus
JP2020042503A (en) Three-dimensional symbol generation system
GB2567245A (en) Methods and apparatuses for depth rectification processing
Ahmadabadian et al. Stereo‐imaging network design for precise and dense 3D reconstruction
TWI571099B (en) Device and method for depth estimation
WO2023201723A1 (en) Object detection model training method, and object detection method and apparatus
CN117726747A (en) Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene
EP4187483A1 (en) Apparatus and method with image processing
Lin et al. Real-time low-cost omni-directional stereo vision via bi-polar spherical cameras

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 17759278

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 202280005788.8

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22937959

Country of ref document: EP

Kind code of ref document: A1