CN113947768A - Monocular 3D target detection-based data enhancement method and device - Google Patents

Monocular 3D target detection-based data enhancement method and device

Info

Publication number
CN113947768A
Authority
CN
China
Prior art keywords
image
pixel
monocular
new
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111205373.9A
Other languages
Chinese (zh)
Inventor
李耀波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Kunpeng Jiangsu Technology Co Ltd
Original Assignee
Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Kunpeng Jiangsu Technology Co Ltd filed Critical Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority to CN202111205373.9A priority Critical patent/CN113947768A/en
Publication of CN113947768A publication Critical patent/CN113947768A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/593 Depth or shape recovery from multiple images from stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras

Abstract

The invention discloses a data enhancement method and device based on monocular 3D target detection, and relates to the field of computer technology. One embodiment of the method comprises: receiving a plurality of input original images, and performing random cropping and scaling on each original image to obtain a plurality of new images; for each new image, re-determining an equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image; and inputting the plurality of new images, together with the equivalent camera intrinsic matrix corresponding to each new image, into a monocular 3D target detection model for training, so as to update the model parameters and obtain a trained monocular 3D target detection model. By using a random cropping and scaling strategy, this embodiment alleviates the imbalanced scale distribution of target samples and can greatly improve detection performance for close-range obstacle targets.

Description

Monocular 3D target detection-based data enhancement method and device
Technical Field
The invention relates to the field of automatic driving within computer technology, and in particular to a data enhancement method and device based on monocular 3D target detection.
Background
Monocular 3D target detection is an image-based perception task in automatic driving: it identifies foreground objects in a 2D visual image and gives each object's category, position and pose. Compared with lidar and ultrasonic recognition technologies, monocular 3D target detection perceives the environment at a much lower sensor cost, but its depth-value perception is less reliable.
Data enhancement is one of the most effective ways to improve a model's detection performance, and the original images can be augmented in order to improve monocular 3D target detection. However, constrained by the geometric projection relations of 3D perception in physical space, a large number of augmentation methods that are effective in 2D, such as rotation, deformation, random cropping and random scaling, cannot be used in monocular 3D target detection.
Monocular 3D target detection methods such as Mono3D, SMOKE and Mono3D++ mostly adopt only a simple random-flip augmentation strategy. Such methods cannot provide multi-scale augmentation of target samples, so the imbalance of samples across scale intervals remains severe, which strongly limits the detection performance of a monocular 3D target detection model on targets of different scales.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data enhancement method and apparatus based on monocular 3D target detection, which can at least solve the problem that existing monocular 3D target detection cannot achieve multi-scale enhancement of target samples.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a data enhancement method based on monocular 3D target detection, comprising:
receiving a plurality of input original images, and performing random cropping and scaling on each original image to obtain a plurality of new images;
for each new image, re-determining an equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and inputting the plurality of new images, together with the equivalent camera intrinsic matrix corresponding to each new image, into a monocular 3D target detection model for training, so as to update the model parameters and obtain a trained monocular 3D target detection model.
Optionally, the performing random cropping and scaling on each original image to obtain a plurality of new images comprises:
performing a cutout operation at a random position in the original image using a cutout height and a cutout width to obtain a cutout region; wherein the cutout region does not exceed the boundary of the original image;
and scaling the cutout region to the preset image height and preset image width of the monocular 3D target detection model to obtain the plurality of new images.
Optionally, before the using of the cutout height and the cutout width, the method further comprises:
receiving a value selected from a scale coefficient range, and using the value as the scale coefficient of the cutout;
and taking the product of the preset image height and the scale coefficient as the cutout height, and the product of the preset image width and the scale coefficient as the cutout width.
Optionally, the re-determining of the equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image comprises:
giving a mathematical description of the equal-probability random cropping and scaling of the coordinates of each pixel in the cutout region, and converting the description into matrix form to obtain the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and substituting the affine transformation into the projection relation between the pixel coordinate system of the original image and the camera coordinate system, so as to re-establish a projection relation between the pixel coordinate system and the camera coordinate system for each pixel in the new image, and acquiring the equivalent camera intrinsic matrix from the new projection relation.
Optionally, the mathematical description of the equal-probability random cropping and scaling of the coordinates of each pixel in the cutout region comprises:
taking the reciprocal of the scale coefficient as the scaling coefficient for resizing the cutout region;
determining a first coordinate, that of the top-left vertex of each cutout region in the original image, and a second coordinate, that of each pixel of the cutout region in the original image, and subtracting the first coordinate from the second coordinate to obtain each pixel's coordinates within the cutout region;
and scaling each pixel's coordinates within the cutout region by the scaling coefficient to obtain the pixel coordinates of each pixel in the new image.
Optionally, the re-determining of the equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image comprises:
taking the reciprocal of the scale coefficient as the scaling coefficient for resizing the cutout regions, and computing the equivalent camera intrinsic matrix from it together with the coordinates of the top-left vertex of each cutout region in the original image.
Optionally, the targets in the original images are labeled with 3D information;
the method further comprises:
acquiring a reference focal length value preset for focal length normalization, and calculating, with the monocular 3D target detection model, a reference depth value of the target in each new image at that reference focal length;
determining the normalized focal length corresponding to each new image, and calculating the ratio of each normalized focal length to the reference focal length value to obtain the focal length normalization coefficient corresponding to each new image;
and multiplying each reference depth value by the corresponding focal length normalization coefficient to obtain the inferred depth value of the target in each new image; the inferred depth value is compared by the monocular 3D target detection model with the target's labeled true depth value during training to calculate the loss cost.
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a data enhancement apparatus based on monocular 3D target detection, comprising:
a cropping and scaling module, configured to receive a plurality of input original images and perform random cropping and scaling on each original image to obtain a plurality of new images;
a matrix determination module, configured to re-determine, for each new image, an equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and a model training module, configured to input the plurality of new images and the equivalent camera intrinsic matrix corresponding to each new image together into a monocular 3D target detection model for training, so as to update the model parameters and obtain a trained monocular 3D target detection model.
Optionally, the cropping and scaling module is configured to:
perform a cutout operation at a random position in the original image using a cutout height and a cutout width to obtain a cutout region; wherein the cutout region does not exceed the boundary of the original image;
and scale the cutout region to the preset image height and preset image width of the monocular 3D target detection model to obtain the plurality of new images.
Optionally, the cropping and scaling module is further configured to:
receive a value selected from a scale coefficient range, and use the value as the scale coefficient of the cutout;
and take the product of the preset image height and the scale coefficient as the cutout height, and the product of the preset image width and the scale coefficient as the cutout width.
Optionally, the matrix determination module is configured to:
give a mathematical description of the equal-probability random cropping and scaling of the coordinates of each pixel in the cutout region, and convert the description into matrix form to obtain the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and substitute the affine transformation into the projection relation between the pixel coordinate system of the original image and the camera coordinate system, so as to re-establish a projection relation between the pixel coordinate system and the camera coordinate system for each pixel in the new image, and obtain the equivalent camera intrinsic matrix from the new projection relation.
Optionally, the matrix determination module is configured to:
take the reciprocal of the scale coefficient as the scaling coefficient for resizing the cutout region;
determine a first coordinate, that of the top-left vertex of each cutout region in the original image, and a second coordinate, that of each pixel of the cutout region in the original image, and subtract the first coordinate from the second coordinate to obtain each pixel's coordinates within the cutout region;
and scale each pixel's coordinates within the cutout region by the scaling coefficient to obtain the pixel coordinates of each pixel in the new image.
Optionally, the matrix determination module is configured to:
take the reciprocal of the scale coefficient as the scaling coefficient for resizing the cutout regions, and compute the equivalent camera intrinsic matrix from it together with the coordinates of the top-left vertex of each cutout region in the original image.
Optionally, the targets in the original images are labeled with 3D information;
the apparatus further comprises a focal length normalization module configured to:
acquire a reference focal length value preset for focal length normalization, and calculate, with the monocular 3D target detection model, a reference depth value of the target in each new image at that reference focal length;
determine the normalized focal length corresponding to each new image, and calculate the ratio of each normalized focal length to the reference focal length value to obtain the focal length normalization coefficient corresponding to each new image;
and multiply each reference depth value by the corresponding focal length normalization coefficient to obtain the inferred depth value of the target in each new image; the inferred depth value is compared by the monocular 3D target detection model with the target's labeled true depth value during training to calculate the loss cost.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided an electronic device for data enhancement based on monocular 3D target detection.
The electronic device of an embodiment of the invention comprises: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the above data enhancement methods based on monocular 3D target detection.
To achieve the above object, according to a further aspect of the embodiments of the present invention, there is provided a computer-readable medium having a computer program stored thereon, the computer program, when executed by a processor, implementing any of the above data enhancement methods based on monocular 3D target detection.
According to the scheme provided by the invention, an embodiment of the invention has the following advantages or beneficial effects: the monocular 3D target detection model requires two inputs, the image and the camera intrinsic matrix; the method applies a random cropping and scaling strategy to the original image and updates the equivalent intrinsic matrix accordingly, and adopts a focal length normalization method at the depth-decoding stage, after the network model's 3D box regression, to solve the depth inference problem across images of different focal lengths, improving the overall detection performance of the subsequent monocular 3D target detection model.
Further effects of the above optional implementations will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
fig. 1 is a schematic main flow chart of a data enhancement method based on monocular 3D object detection according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of performing random_crop_resize processing on an original image according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart diagram illustrating an alternative monocular 3D object detection-based data enhancement method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the main modules of a monocular 3D object detection based data enhancement device according to an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 6 is a schematic block diagram of a computer system suitable for use with a mobile device or server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Data enhancement is one of the most effective ways to improve a model's detection performance, and it incurs no additional computational cost at inference time. Random scaling, random cropping, color distortion and other geometric and color augmentation techniques are widely used in 2D detection models. Beyond these common augmentations, copy-paste augmentation is also widely applied in 2D detection and segmentation tasks.
However, because they violate the geometric constraint rules, monocular 3D target detection methods adopt none of these augmentations, leaving horizontal flipping and color distortion as the only data enhancement methods used. A monocular image is formed by perspective imaging, so the scale distribution of targets in it is severely imbalanced, which limits the training effect of the model. Without multi-scale augmentation of target samples, the scale imbalance of the samples remains severe and strongly limits the detection performance of a monocular 3D target detection model on targets of different scales.
Referring to fig. 1, a main flowchart of a data enhancement method based on monocular 3D target detection according to an embodiment of the present invention is shown, comprising the following steps:
s101: receiving a plurality of input original images, and performing random cropping and scaling on each original image to obtain a plurality of new images;
s102: for each new image, re-determining an equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
s103: inputting the plurality of new images and the equivalent camera intrinsic matrix corresponding to each new image together into a monocular 3D target detection model for training, so as to update the model parameters and obtain a trained monocular 3D target detection model.
In the above embodiment, regarding step S101: monocular 3D target detection is mainly applied to obstacle 3D detection scenarios in the automatic driving field, and this scheme applies to the data preprocessing stage when training a monocular 3D target detection model.
A monocular camera can only acquire a planar view. The prior-art monocular 3D target detection method SMOKE, a one-stage algorithm based on keypoint detection, takes an original image and a camera intrinsic matrix as input and detects, in the camera coordinate system, the physical position (x, y, z), the physical dimensions width, height and length (w, h, l), and the attitude angle yaw of each obstacle target (such as a vehicle or a person), forming the 3D information (x, y, z, w, h, l, yaw). Accordingly, this scheme labels the targets in the original images with such 3D information.
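For concreteness, the 3D label carried by each obstacle target can be pictured as a small record. The following is a minimal sketch; the class name and field layout are illustrative assumptions, not SMOKE's actual interface:

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Labeled or predicted 3D information of one obstacle target."""
    x: float; y: float; z: float   # physical position in the camera coordinate system
    w: float; h: float; l: float   # physical width, height and length
    yaw: float                     # attitude (yaw) angle

# A monocular 3D detector consumes an image together with the camera
# intrinsic matrix K and returns one Box3D per detected obstacle, e.g.:
#   boxes = model(image, K)
```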
Conventionally, original images are fed into the monocular 3D target detection model in batches of batch_size for training; for example, 10000 original images may be divided into 10 batches, with 1000 original images input per batch. Because the same pictures can recur across batches (and a test image may even coincide with a training image), and each original image presents an obstacle target at only a single scale, the monocular 3D target detection model cannot be trained well.
The monocular 3D target detection model performs 3D target detection on the original image. To address the above problem, this scheme applies random_crop_resize (random cropping and scaling) to the original image, as shown in fig. 2:
s201: receiving a value selected from a scale coefficient range, and using the value as the scale coefficient of the cutout;
s202: acquiring the preset image height and preset image width of the monocular 3D target detection model, taking the product of the preset image height and the scale coefficient as the cutout height and the product of the preset image width and the scale coefficient as the cutout width;
s203: performing a cutout operation at a random position in each original image using the cutout height and the cutout width to obtain a cutout region; wherein the cutout region does not exceed the boundary of the original image;
s204: scaling the cutout region to the preset image height and preset image width to obtain a plurality of new images.
1) Set the minimum cutout proportion min_scale (e.g., 0.7) and the maximum proportion max_scale (e.g., 1.0), and randomly select a value from the scale coefficient range [min_scale, max_scale] as the cutout scale coefficient scale. The selection can be manual, or a program can be written to sample with equal probability; since this scheme is meant to run automatically, programmatic equal-probability sampling is preferred.
2) The preset input image width and height of the monocular 3D target detection model are input_w and input_h respectively, from which the cutout width crop_w = input_w × scale and the cutout height crop_h = input_h × scale are calculated.
3) Based on crop_w and crop_h, a cutout is taken at a random position within the input single-frame original image, and the cutout region must not exceed the original image region. There is therefore no zero-padding caused by the cutout crossing the original image boundary, the data preprocessing at the model training stage stays consistent with that at deployment, and performance at the model deployment stage benefits.
4) The cropped image has size crop_w × crop_h; performing resize processing on it generates a new image suitable for the monocular 3D target detection model, with width input_w and height input_h respectively.
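The four steps above can be condensed into a short sketch. The following minimal Python illustration assumes a NumPy image and OpenCV's cv2.resize for the resize step; the function and variable names follow the description but are otherwise assumptions:

```python
import random
import cv2  # used only for the resize step; any equivalent routine would do

def random_crop_resize(img, input_w, input_h, min_scale=0.7, max_scale=1.0):
    """Cut out a region of size (scale*input_w, scale*input_h) at a random
    position that stays inside the image, then resize it to the model's
    preset input size. Returns the new image plus the parameters needed
    later to re-determine the equivalent intrinsic matrix."""
    h0, w0 = img.shape[:2]
    # 1) equal-probability choice of the cutout scale coefficient
    scale = random.uniform(min_scale, max_scale)
    # 2) cutout width/height derived from the preset input size
    crop_w, crop_h = int(input_w * scale), int(input_h * scale)
    assert crop_w <= w0 and crop_h <= h0, "cutout must fit inside the original"
    # 3) random top-left vertex (u1, v1); the cutout never crosses the border
    u1 = random.randint(0, w0 - crop_w)
    v1 = random.randint(0, h0 - crop_h)
    patch = img[v1:v1 + crop_h, u1:u1 + crop_w]
    # 4) resize the cutout back to the preset input size
    new_img = cv2.resize(patch, (input_w, input_h))
    s = 1.0 / scale  # scaling coefficient, used below to compute K'
    return new_img, s, u1, v1
```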
In step S102, assume that initially the projection relation between the pixel coordinate system of the original image and the camera coordinate system is as follows:
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}$$
where u and v are the horizontal and vertical coordinates (i.e., the second coordinate) of a pixel of the original image in the pixel coordinate system, K is the camera intrinsic matrix, and X_c, Y_c and Z_c are the coordinates of the corresponding point in the camera coordinate system, i.e., its position information.
After an original image undergoes random_crop_resize to produce several new images, the coordinates of each pixel change. If points in the 3D space of the camera coordinate system were still projected onto the new images with the original intrinsic projection matrix, the coordinates would no longer correspond; that is, the pixel coordinates in the new image would not form the perspective geometric relation that held between the original pixel coordinate system and the camera coordinate system. This scheme therefore re-derives and re-establishes the projection relation between the new pixel coordinate system and the camera coordinate system, as described below:
1. Mathematical description of the equal-probability random cropping and scaling of the coordinates of each pixel in the cutout region:

$$u' = s\,(u - u_1)$$
$$v' = s\,(v - v_1)$$

where u' and v' are the horizontal and vertical coordinates of the pixel in the new image under the pixel coordinate system; s is the scaling coefficient, the reciprocal of the scale coefficient scale; and (u_1, v_1) is the first coordinate, that of the top-left vertex of the cutout region in the original image, taken with respect to the original image's origin.
Writing the two formulas in matrix form yields the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image:

$$\begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} = \begin{bmatrix} s & 0 & -s\,u_1 \\ 0 & s & -s\,v_1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$
2. Substituting the above affine transformation into the projection relation between the original image's pixel coordinate system and the camera coordinate system re-derives the perspective projection relation between the pixel coordinate system of each pixel in the new image and the camera coordinate system:

$$Z_c \begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} = \begin{bmatrix} s & 0 & -s\,u_1 \\ 0 & s & -s\,v_1 \\ 0 & 0 & 1 \end{bmatrix} K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = K' \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}$$

Writing the original intrinsic matrix as $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, this gives

$$K' = \begin{bmatrix} s\,f_x & 0 & s\,(c_x - u_1) \\ 0 & s\,f_y & s\,(c_y - v_1) \\ 0 & 0 & 1 \end{bmatrix}$$

where K' is the equivalent camera intrinsic matrix corresponding to the new image. Performing crop_resize on the original image is thus equivalent to applying an affine transformation to the pixel coordinates, and the pixel coordinate system of the new image, the new equivalent intrinsic matrix and the camera coordinate system together re-establish a geometric perspective relation.
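In code, re-determining the equivalent intrinsic matrix is a single 3 × 3 matrix product. A minimal sketch under the notation above follows; the function name is an illustrative assumption:

```python
import numpy as np

def equivalent_intrinsics(K, s, u1, v1):
    """Return K' = A @ K, the equivalent camera intrinsic matrix for a new
    image whose cutout has top-left vertex (u1, v1) in the original image
    and which was resized by the scaling coefficient s."""
    A = np.array([[s,   0.0, -s * u1],
                  [0.0, s,   -s * v1],
                  [0.0, 0.0, 1.0]])
    return A @ K
```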
Regarding step S103, the data-enhanced new images and the equivalent camera intrinsic matrix corresponding to each new image are input together into the monocular 3D target detection model for training, so as to update the model parameters. At the inference stage, the trained monocular 3D target detection model can then accurately infer from a single frame the obstacle target's position, dimensions and yaw angle, i.e., the 3D information (x, y, z, w, h, l, yaw).
A specific example follows:
Assume the input original image has 640 × 480 pixels, and the original camera intrinsic matrix is

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

The selected cutout scale coefficient is scale = 0.8, so the cutout size 512 × 384 is smaller than the original image size. The parameters produced by random_crop_resize are: the scaling coefficient s = 1/scale = 1.25, and the coordinates of the top-left vertex of the cutout region in the original image, u_1 = 24 and v_1 = 32.
Substituting these parameters into the equivalent camera intrinsic matrix formula yields K':

$$K' = \begin{bmatrix} 1.25\,f_x & 0 & 1.25\,(c_x - 24) \\ 0 & 1.25\,f_y & 1.25\,(c_y - 32) \\ 0 & 0 & 1 \end{bmatrix}$$

A geometric perspective relation is then re-established for the new image according to K', achieving data enhancement of both pixel scale and pixel position; that is, the generated new image is equivalent to an image shot by a new camera at the same position as the original one.
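Plugging the example's parameters into the sketch given after the derivation, and assuming purely illustrative intrinsic values f_x = f_y = 500, c_x = 320, c_y = 240 (the patent's concrete entries are not reproduced in this text), gives:

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],   # f_x, c_x: assumed for illustration
              [  0.0, 500.0, 240.0],   # f_y, c_y: assumed for illustration
              [  0.0,   0.0,   1.0]])

K_new = equivalent_intrinsics(K, s=1.25, u1=24, v1=32)
# K_new:
# [[625.   0. 370.]
#  [  0. 625. 260.]
#  [  0.   0.   1.]]
```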
The method provided by this embodiment mainly targets the data preprocessing scenario during model training. Applying a random_crop_resize strategy that never exceeds the image boundary to the input original images generates multiple new images in which the obstacle targets are shifted in position and scaled in size, and whose distribution across scale intervals is balanced; meanwhile, the equivalent camera intrinsic matrix of each new image is updated according to K' so that the projection relation is re-established.
Referring to fig. 3, a schematic flowchart of an optional data enhancement method based on monocular 3D target detection according to an embodiment of the present invention is shown, comprising the following steps:
s301: acquiring a reference focal length value preset for focal length normalization, and calculating, with the monocular 3D target detection model, a reference depth value of the target in each new image at that reference focal length; wherein the targets in the original images are labeled with 3D information;
s302: determining the normalized focal length corresponding to each new image, and calculating the ratio of each normalized focal length to the reference focal length value to obtain the focal length normalization coefficient corresponding to each new image;
s303: multiplying each reference depth value by the corresponding focal length normalization coefficient to obtain the inferred depth value of the target in each new image; the inferred depth value is compared by the monocular 3D target detection model with the target's labeled true depth value during training to calculate the loss cost.
In the above embodiment, regarding steps S301 to S303: the obstacle targets in the original images input to this scheme are already labeled with 3D information (x, y, z, w, h, l, yaw). Perspective imaging makes near objects appear large and far objects small, so the pixel height of a target in the original image actually reflects its depth value z; under the supervision of this information (the targets' 3D information), the model can therefore infer each target's depth value z and the rest of its 3D information.
Some of the new images correspond to different focal lengths, for example 6 mm for image 1 and 12 mm for image 2. When such images are mixed in training, targets with the same pixel height have different actual depth values in images of different focal lengths, so focal length normalization is required, using the proportional formula of the perspective principle:

$$\frac{n \cdot d_y}{f} = \frac{h}{d}$$

where f is the focal length, n is the number of pixels spanned by the target in the vertical (height) direction, d_y is the pixel size of the camera sensor, h is the physical height of the target, and d is the distance from the target to the camera, i.e., the depth value.
Rearranging this formula yields the new formula:

$$d = \frac{f}{d_y} \cdot \frac{h}{n} = \frac{f_y}{f_{ref}} \cdot F_{regress}, \qquad f_y = \frac{f}{d_y}, \quad F_{regress} = f_{ref} \cdot \frac{h}{n}$$

where f_ref is a constant reference focal length value introduced for focal length normalization, e.g. f_ref = 720.0; f_y is the normalized focal length from the camera intrinsic matrix K, and f_y/f_ref is the focal length normalization coefficient, whose value differs from image to image; F_regress is the reference depth value obtained by the SMOKE algorithm model through regression at the reference focal length f_ref when the pixel height of the obstacle target in the image is n.
The inferred depth value of the target in each new image is thus the focal length normalization coefficient f_y/f_ref multiplied by F_regress; during training, the monocular 3D target detection model compares this inferred depth value with the target's labeled true depth value to calculate the loss cost.
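As a sketch, this focal-length-normalized depth decoding amounts to one multiplication per target; the constant 720.0 comes from the example above and the names are illustrative. Note that for an augmented image the intrinsic matrix passed in is the equivalent K', so its f_y entry already reflects the cutout scaling:

```python
F_REF = 720.0  # reference focal length value f_ref introduced for normalization

def decode_depth(f_regress, K):
    """Inferred depth value = (f_y / f_ref) * F_regress, where f_y = K[1][1]
    is the (pixel-unit) vertical focal length of the image's intrinsic matrix
    and f_regress is the reference depth regressed at the reference focal length."""
    f_y = K[1][1]
    return (f_y / F_REF) * f_regress
```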
The method provided by this embodiment addresses a technical problem introduced by obtaining multiple new images through data enhancement: the images used in training have different focal lengths. Adopting the focal length normalization method solves training and inference on images of different focal lengths and avoids the depth-value regression ambiguity caused by the same target having different pixel heights under cameras of different focal lengths.
Compared with the prior art, the method provided by the embodiments of the present invention has at least the following beneficial effects:
1. Applying a random cropping and scaling strategy that never exceeds the image boundary to the input original images generates multiple new images in which the obstacle targets are shifted in position and scaled in size, enlarging the sample set and greatly alleviating the imbalance of target samples across scale intervals;
2. By updating the camera intrinsic matrix, the perspective geometric constraint relation still holds even though the scale of the target samples has changed, which greatly improves the performance of the monocular 3D target detection model;
3. By adopting the focal-length-normalized depth estimation method, the conflict of supervision information during mixed training on images of various focal lengths is effectively resolved despite the inconsistent focal lengths of the different new images, so the deployed model supports cameras of different focal lengths.
Referring to fig. 4, a schematic diagram of the main modules of a data enhancement apparatus 400 based on monocular 3D target detection according to an embodiment of the present invention is shown, comprising:
a cropping and scaling module 401, configured to receive multiple input original images and perform random cropping and scaling on each original image to obtain multiple new images;
a matrix determination module 402, configured to re-determine, for each new image, an equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and a model training module 403, configured to input the multiple new images and the equivalent camera intrinsic matrix corresponding to each new image together into a monocular 3D target detection model for training, so as to update the model parameters and obtain a trained monocular 3D target detection model.
In the apparatus of an embodiment of the present invention, the cropping and scaling module 401 is configured to:
perform a cutout operation at a random position in the original image using a cutout height and a cutout width to obtain a cutout region; wherein the cutout region does not exceed the boundary of the original image;
and scale the cutout region to the preset image height and preset image width of the monocular 3D target detection model to obtain the multiple new images.
In the apparatus of an embodiment of the present invention, the cropping and scaling module 401 is further configured to:
receive a value selected from a scale coefficient range, and use the value as the scale coefficient of the cutout;
and take the product of the preset image height and the scale coefficient as the cutout height, and the product of the preset image width and the scale coefficient as the cutout width.
In the apparatus of an embodiment of the present invention, the matrix determination module 402 is configured to:
give a mathematical description of the equal-probability random cropping and scaling of the coordinates of each pixel in the cutout region, and convert the description into matrix form to obtain the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and substitute the affine transformation into the projection relation between the pixel coordinate system of the original image and the camera coordinate system, so as to re-establish a projection relation between the pixel coordinate system and the camera coordinate system for each pixel in the new image, and obtain the equivalent camera intrinsic matrix from the new projection relation.
In the apparatus of an embodiment of the present invention, the matrix determination module 402 is configured to:
take the reciprocal of the scale coefficient as the scaling coefficient for resizing the cutout region;
determine a first coordinate, that of the top-left vertex of each cutout region in the original image, and a second coordinate, that of each pixel of the cutout region in the original image, and subtract the first coordinate from the second coordinate to obtain each pixel's coordinates within the cutout region;
and scale each pixel's coordinates within the cutout region by the scaling coefficient to obtain the pixel coordinates of each pixel in the new image.
In the apparatus of an embodiment of the present invention, the matrix determination module 402 is configured to:
take the reciprocal of the scale coefficient as the scaling coefficient for resizing the cutout regions, and compute the equivalent camera intrinsic matrix from it together with the coordinates of the top-left vertex of each cutout region in the original image.
In the apparatus of an embodiment of the present invention, the targets in the original images are labeled with 3D information;
the apparatus further comprises a focal length normalization module configured to:
acquire a reference focal length value preset for focal length normalization, and calculate, with the monocular 3D target detection model, a reference depth value of the target in each new image at that reference focal length;
determine the normalized focal length corresponding to each new image, and calculate the ratio of each normalized focal length to the reference focal length value to obtain the focal length normalization coefficient corresponding to each new image;
and multiply each reference depth value by the corresponding focal length normalization coefficient to obtain the inferred depth value of the target in each new image; the inferred depth value is compared by the monocular 3D target detection model with the target's labeled true depth value during training to calculate the loss cost.
In addition, the detailed implementation of the device in the embodiment of the present invention has been described in detail in the above method, so that the repeated description is not repeated here.
Fig. 5 shows an exemplary system architecture 500 to which embodiments of the invention may be applied, including terminal devices 501, 502, 503, a network 504 and a server 505 (by way of example only).
The terminal devices 501, 502, 503 may be various electronic devices having display screens and supporting web browsing, with various communication client applications installed; users may use them to interact with the server 505 through the network 504, for example to receive or send messages.
The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The server 505 may be a server providing various services, for example one that performs the operations of acquiring new images using the random cropping and scaling strategy, determining an equivalent camera intrinsic matrix for each new image, and training the monocular 3D target detection model with both.
It should be noted that the method provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the apparatus is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or by hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a cropping and scaling module, a matrix determination module and a model training module. The names of these modules do not in some cases limit the modules themselves; for example, the model training module may also be described as "a module for training the monocular 3D target detection model".
As another aspect, the present invention further provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to:
receive a plurality of input original images, and perform random cropping and scaling on each original image to obtain a plurality of new images;
for each new image, re-determine an equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and input the plurality of new images and the equivalent camera intrinsic matrix corresponding to each new image together into a monocular 3D target detection model for training, so as to update the model parameters and obtain a trained monocular 3D target detection model.
Compared with the prior art, the technical scheme of the embodiments of the present invention has the following beneficial effects:
1. Applying a random cropping and scaling strategy that never exceeds the image boundary to the input original images generates multiple new images in which the obstacle targets are shifted in position and scaled in size, enlarging the sample set and greatly alleviating the imbalance of target samples across scale intervals;
2. By updating the camera intrinsic matrix, the perspective geometric constraint relation still holds even though the scale of the target samples has changed, which greatly improves the performance of the monocular 3D target detection model;
3. By adopting the focal-length-normalized depth estimation method, the conflict of supervision information during mixed training on images of various focal lengths is effectively resolved despite the inconsistent focal lengths of the different new images, so the deployed model supports cameras of different focal lengths.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A data enhancement method based on monocular 3D target detection, characterized by comprising:
receiving a plurality of input original images, and performing random cropping and scaling on each original image to obtain a plurality of new images;
for each new image, re-determining an equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and inputting the plurality of new images and the equivalent camera intrinsic matrix corresponding to each new image together into a monocular 3D target detection model for training, so as to update the model parameters and obtain a trained monocular 3D target detection model.
2. The method of claim 1, wherein the performing random cropping and scaling on each original image to obtain a plurality of new images comprises:
performing a cutout operation at a random position in each original image using a cutout height and a cutout width to obtain a cutout region; wherein the cutout region does not exceed the boundary of the original image;
and scaling the cutout region to the preset image height and preset image width of the monocular 3D target detection model to obtain the plurality of new images.
3. The method of claim 2, further comprising, before the using of the cutout height and the cutout width:
receiving a value selected from a scale coefficient range, and using the value as the scale coefficient of the cutout;
and taking the product of the preset image height and the scale coefficient as the cutout height, and the product of the preset image width and the scale coefficient as the cutout width.
4. The method of claim 2 or 3, wherein the re-determining of the equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image comprises:
giving a mathematical description of the equal-probability random cropping and scaling of the coordinates of each pixel in the cutout region, and converting the description into matrix form to obtain the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and substituting the affine transformation into the projection relation between the pixel coordinate system of the original image and the camera coordinate system, so as to re-establish a projection relation between the pixel coordinate system and the camera coordinate system for each pixel in the new image, and acquiring the equivalent camera intrinsic matrix from the new projection relation.
5. The method of claim 4, wherein the mathematical description of the equal-probability random cropping and scaling of the coordinates of each pixel in the cutout region comprises:
taking the reciprocal of the scale coefficient as the scaling coefficient for resizing the cutout region;
determining a first coordinate, that of the top-left vertex of the cutout region in the original image, and a second coordinate, that of each pixel of the cutout region in the original image, and subtracting the first coordinate from the second coordinate to obtain each pixel's coordinates within the cutout region;
and scaling each pixel's coordinates within the cutout region by the scaling coefficient to obtain the coordinates of each pixel in the new image.
6. The method of claim 2 or 3, wherein the re-determining of the equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image comprises:
taking the reciprocal of the scale coefficient as the scaling coefficient for resizing the cutout regions, and computing the equivalent camera intrinsic matrix from it together with the coordinates of the top-left vertex of each cutout region in the original image.
7. The method of claim 1, wherein the targets in the original images are labeled with 3D information;
the method further comprising:
acquiring a reference focal length value preset for focal length normalization, and calculating, with the monocular 3D target detection model, a reference depth value of the target in each new image at that reference focal length;
determining the normalized focal length corresponding to each new image, and calculating the ratio of each normalized focal length to the reference focal length value to obtain the focal length normalization coefficient corresponding to each new image;
and multiplying each reference depth value by the corresponding focal length normalization coefficient to obtain the inferred depth value of the target in each new image; the inferred depth value is compared by the monocular 3D target detection model with the target's labeled true depth value during training to calculate the loss cost.
8. A data enhancement apparatus based on monocular 3D target detection, characterized by comprising:
a cropping and scaling module, configured to receive a plurality of input original images and perform random cropping and scaling on each original image to obtain a plurality of new images;
a matrix determination module, configured to re-determine, for each new image, an equivalent camera intrinsic matrix based on the affine transformation relating each pixel's coordinates in the original image to its coordinates in the new image;
and a model training module, configured to input the plurality of new images and the equivalent camera intrinsic matrix corresponding to each new image together into a monocular 3D target detection model for training, so as to update the model parameters and obtain a trained monocular 3D target detection model.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202111205373.9A 2021-10-15 2021-10-15 Monocular 3D target detection-based data enhancement method and device Pending CN113947768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111205373.9A CN113947768A (en) 2021-10-15 2021-10-15 Monocular 3D target detection-based data enhancement method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111205373.9A CN113947768A (en) 2021-10-15 2021-10-15 Monocular 3D target detection-based data enhancement method and device

Publications (1)

Publication Number Publication Date
CN113947768A 2022-01-18

Family

ID=79330933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111205373.9A Pending CN113947768A (en) 2021-10-15 2021-10-15 Monocular 3D target detection-based data enhancement method and device

Country Status (1)

Country Link
CN (1) CN113947768A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359880A (en) * 2022-03-18 2022-04-15 北京理工大学前沿技术研究院 Riding experience enhancement method and device based on intelligent learning model and cloud
CN114359880B (en) * 2022-03-18 2022-05-24 北京理工大学前沿技术研究院 Riding experience enhancement method and device based on intelligent learning model and cloud
WO2023201723A1 (en) * 2022-04-22 2023-10-26 华为技术有限公司 Object detection model training method, and object detection method and apparatus
CN116912621A (en) * 2023-07-14 2023-10-20 浙江大华技术股份有限公司 Image sample construction method, training method of target recognition model and related device
CN116912621B (en) * 2023-07-14 2024-02-20 浙江大华技术股份有限公司 Image sample construction method, training method of target recognition model and related device

Similar Documents

Publication Publication Date Title
CN113947768A (en) Monocular 3D target detection-based data enhancement method and device
CN109191512B (en) Binocular image depth estimation method, binocular image depth estimation device, binocular image depth estimation apparatus, program, and medium
US10970821B2 (en) Image blurring methods and apparatuses, storage media, and electronic devices
CN109753971B (en) Correction method and device for distorted text lines, character recognition method and device
EP4064176A1 (en) Image processing method and apparatus, storage medium and electronic device
US20180359415A1 (en) Panoramic video processing method and device and non-transitory computer-readable medium
US11004179B2 (en) Image blurring methods and apparatuses, storage media, and electronic devices
CN112733820B (en) Obstacle information generation method and device, electronic equipment and computer readable medium
CN110796664B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN114511041B (en) Model training method, image processing method, device, equipment and storage medium
CN108597034B (en) Method and apparatus for generating information
CN113592706B (en) Method and device for adjusting homography matrix parameters
CN110047126B (en) Method, apparatus, electronic device, and computer-readable storage medium for rendering image
CN112085842B (en) Depth value determining method and device, electronic equipment and storage medium
CN115104126A (en) Image processing method, apparatus, device and medium
CN114723640B (en) Obstacle information generation method and device, electronic equipment and computer readable medium
CN112929562B (en) Video jitter processing method, device, equipment and storage medium
JPH05256613A (en) Method and device for parallax computing from stereo picture and measuring device for depth
CN111260544A (en) Data processing method and device, electronic equipment and computer storage medium
CN114255493A (en) Image detection method, face detection device, face detection equipment and storage medium
CN111563956A (en) Three-dimensional display method, device, equipment and medium for two-dimensional picture
CN113724129B (en) Image blurring method, storage medium and terminal equipment
CN114845055B (en) Shooting parameter determining method and device of image acquisition equipment and electronic equipment
CN112991179B (en) Method, apparatus, device and storage medium for outputting information
CN113312979B (en) Image processing method and device, electronic equipment, road side equipment and cloud control platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination